User:Yaron Koren/AutoCats

From Wikimedia Commons, the free media repository
AutoCats display (light green box)

AutoCats is a user script that displays automatically generated "categories" for each file on Wikimedia Commons, based on the structured data stored for that file. The auto-categories appear in a box at the bottom of the page, under the regular categories box. To enable it, add the following line to your common.js page, i.e. the page "User:YourUsernameHere/common.js":

importScript("User:Yaron Koren/AutoCats.js");

You can view the script here.
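
If you would rather not run the loader on every page, one option is to wrap that line in a namespace check. This is only a sketch: AutoCats may well perform an equivalent check itself, and the helper name `isFilePage` is made up for illustration. It relies on `mw.config.get("wgNamespaceNumber")`, which MediaWiki makes available to user scripts, and on 6 being the number of the "File:" namespace.

```javascript
// 6 is the number of MediaWiki's "File:" namespace.
const FILE_NAMESPACE = 6;

// Hypothetical helper: true when the given namespace number is the File namespace.
function isFilePage(namespaceNumber) {
    return namespaceNumber === FILE_NAMESPACE;
}

// The `mw` global only exists inside a MediaWiki page, so guard against its absence.
if (typeof mw !== "undefined" && isFilePage(mw.config.get("wgNamespaceNumber"))) {
    importScript("User:Yaron Koren/AutoCats.js");
}
```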

AutoCats works by cycling through the structured data stored for a file and finding combinations of one or more attributes of that file that together can form a meaningful "category". A "category" here means a set of similar (or at least similarly annotated) files that is of a reasonable size (ideally between 50 and 100 files). At most 10 auto-categories are displayed for any one file.
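
The selection process described above can be sketched roughly as follows. This is not the actual AutoCats code: the function names, the `countFiles` callback, the limit of two properties per combination, and the strict use of 50 and 100 as cut-offs (the text only calls that range "ideal") are all assumptions made for illustration.

```javascript
const IDEAL_MIN_FILES = 50;      // assumed lower bound for a "reasonable" set
const IDEAL_MAX_FILES = 100;     // assumed upper bound
const MAX_AUTO_CATEGORIES = 10;  // stated maximum per file

// Returns every subset of `items` that has exactly `size` elements.
function combinations(items, size) {
    if (size === 0) return [[]];
    if (items.length < size) return [];
    const [first, ...rest] = items;
    const withFirst = combinations(rest, size - 1).map((c) => [first, ...c]);
    return withFirst.concat(combinations(rest, size));
}

// `statements` is a list of { property, value } pairs for one file.
// `countFiles(combo)` is a hypothetical callback that returns how many files
// share every statement in `combo` (in practice this would be a query).
function pickAutoCategories(statements, countFiles, maxCombinationSize = 2) {
    const picked = [];
    for (let size = 1; size <= maxCombinationSize; size++) {
        for (const combo of combinations(statements, size)) {
            const fileCount = countFiles(combo);
            if (fileCount >= IDEAL_MIN_FILES && fileCount <= IDEAL_MAX_FILES) {
                picked.push({ statements: combo, fileCount });
                if (picked.length === MAX_AUTO_CATEGORIES) return picked;
            }
        }
    }
    return picked;
}

// Usage with made-up statements and made-up counts (not real Commons numbers).
const exampleStatements = [
    { property: "depicts", value: "town hall" },
    { property: "location", value: "Haltern am See" },
    { property: "creator", value: "some photographer" },
];
const fakeCounts = {
    "depicts=town hall": 5000,
    "location=Haltern am See": 80,
    "depicts=town hall|location=Haltern am See": 60,
};
const fakeCountFiles = (combo) =>
    fakeCounts[combo.map((s) => s.property + "=" + s.value).join("|")] ?? 0;

console.log(pickAutoCategories(exampleStatements, fakeCountFiles));
```

With these invented numbers, "town hall" alone is too large a set to be useful, while the location on its own and the combination of the two both fall within the assumed range.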

Rationale and structured data

Wikimedia Commons has an unusual setup in which information about each file is stored in three places: directly on the page, in text form; in its categories; and as structured data, visible in each page's "structured data" tab (and in the "caption" field on the main screen, which really also counts as structured data). These three sets of information do not overlap completely, but they do overlap: for a photograph, for example, the date and location at which it was taken can often be found in some form in all three places. This represents duplication of both data and effort, which is not ideal. It is especially unusual for Wikimedia Commons, because Commons is intended to be a global resource, accessible to users who speak any language, yet the text data and the categories can each be stored in only one language (in theory this can be any language, though seemingly 99% of the time it is English). Whatever the language of a category or descriptive text, it is guaranteed that a majority of the world's potential readers will not be able to read it.

The "structured data" is really the ideal place to store data about a file, not just because it is fully internationalizable, but also because it is extremely powerful: once you specify the basic attributes of a file, like what is shown ("depicts"), when it was created ("inception"), what style it is in ("genre"), and so on, that data can then be queried in any number of ways, to group files together to fit any sort of criteria, without the need to do any more work in terms of creating categories or lists.
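
To make the idea of "basic attributes" concrete, here is a hedged sketch of how a script could read a file's structured data through the Commons API. It assumes the `wbgetentities` API module, that a file's structured data lives in an entity whose ID is "M" followed by the page ID, and that the entity JSON keeps its statements under a `statements` key. The function names are invented for this example, and this is not necessarily how AutoCats itself reads the data.

```javascript
const COMMONS_API = "https://commons.wikimedia.org/w/api.php";

// Builds the API URL for the structured data entity of the page with this ID.
function buildEntityUrl(pageId) {
    const params = new URLSearchParams({
        action: "wbgetentities",
        ids: "M" + pageId,
        format: "json",
        origin: "*", // permits anonymous cross-origin requests
    });
    return COMMONS_API + "?" + params.toString();
}

// Flattens an entity into { property, value } pairs, keeping only statements
// whose value is another entity (for example "depicts", P180, or "genre", P136).
// Dates such as "inception" (P571) carry no entity ID and are skipped here.
function extractItemStatements(entity) {
    const pairs = [];
    for (const [property, claims] of Object.entries(entity.statements || {})) {
        for (const claim of claims) {
            const value = claim.mainsnak?.datavalue?.value;
            if (value && value.id) {
                pairs.push({ property, value: value.id });
            }
        }
    }
    return pairs;
}

// In a browser, usage might look like this (pageId is whatever page you are on):
// fetch(buildEntityUrl(pageId))
//     .then((response) => response.json())
//     .then((data) => extractItemStatements(data.entities["M" + pageId]));
```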

What has been missing, however, is a way for users to make use of this structured data, short of writing their own queries in SPARQL, the language used by Commons' structured data query service. This is hopefully where AutoCats and other tools can come into play.
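
For readers who have not seen SPARQL, here is a small sketch of the kind of query a tool would otherwise have to write by hand. The helper `buildFileQuery` is hypothetical, and the query assumes that the `wdt:` and `wd:` prefixes are predefined by the query service, as they are on Wikidata's. The ID `Q123` in the usage line is a placeholder, not a meaningful item.

```javascript
// Builds a SPARQL query for files that match every given statement.
// Each statement is { property: "P...", value: "Q..." }.
function buildFileQuery(statements, limit = 100) {
    const patterns = statements.map(({ property, value }) => {
        // IDs are interpolated into the query text, so check their shape first.
        if (!/^P\d+$/.test(property) || !/^Q\d+$/.test(value)) {
            throw new Error("Unexpected ID: " + property + " / " + value);
        }
        return "  ?file wdt:" + property + " wd:" + value + " .";
    });
    return "SELECT ?file WHERE {\n" + patterns.join("\n") + "\n} LIMIT " + limit;
}

// Files whose "depicts" (P180) statement points at the placeholder item Q123.
console.log(buildFileQuery([{ property: "P180", value: "Q123" }], 10));
```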

Example

Let's take an example of a file on Commons with a well-populated set of structured data: this photo of the luminous Altes Rathaus, or former town hall, of the German town of Haltern am See:


If you have AutoCats installed, and English set as your language, you will see something like this near the bottom of the page, with the new "auto-category" box being placed directly under the real category box (the display of "hidden categories" was turned off here, for the sake of readability):


You can see that each auto-category is formed by combining one or more properties (in this case no more than two, though some files will have auto-categories combining three or more). Values for different properties are separated by a comma, while values for the same property are separated by an ampersand ("&").
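
The naming rule just described can be sketched in a few lines. The function name, the input shape, and the exact spacing around the ampersand are assumptions; the labels in the usage example are invented rather than taken from the screenshot.

```javascript
// Builds a display name from a list of { property, label } pairs:
// values of the same property are joined with " & ",
// and the groups for different properties are joined with ", ".
function formatAutoCategoryName(statements) {
    const labelsByProperty = new Map(); // a Map keeps insertion order
    for (const { property, label } of statements) {
        if (!labelsByProperty.has(property)) {
            labelsByProperty.set(property, []);
        }
        labelsByProperty.get(property).push(label);
    }
    return [...labelsByProperty.values()]
        .map((labels) => labels.join(" & "))
        .join(", ");
}

console.log(formatAutoCategoryName([
    { property: "depicts", label: "town hall" },
    { property: "depicts", label: "night" },
    { property: "location", label: "Haltern am See" },
]));
// prints "town hall & night, Haltern am See"
```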

There is - perhaps surprisingly - no indication of the meaning of each value: whether each word or phrase is the location, or what is being depicted, or the photographer, etc. It was a conscious choice to not include these property names, because they seemed to be making the overall auto-category names too long, and in most cases it was clear from context what the meaning of each value was. However, you can see the property names included if you hover over any link:


Comparing the auto-category names to the real category names, you can see that they are not altogether different: the real category names are generally worded more nicely, and tend to be more specific (in some cases oddly so); but the sets of files that you would expect to see by clicking on the real categories are, at least in some cases, probably not that different from the set you would expect to see from the auto-categories.

The big advantage of auto-categories (besides the fact that they took a lot less work to generate) can be seen when viewing Commons in a language other than English, since the real categories are written in English (apart from the first one listed, which is in German). Here is the display for that same page, when viewing the site in Japanese:


The main category box is included here to highlight the marked difference of the AutoCats display from the existing one. For a user with no knowledge of English (or German), the AutoCats display is not perfect, as many names have not been fully translated on Wikidata; but it is far more comprehensible than the standard categories display.
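
One plausible way to handle values that lack a label in the viewer's language is a simple fallback chain, sketched below. The `labels` shape follows the Wikibase JSON convention of `{ languageCode: { language, value } }`. The function name and the particular fallback order are assumptions, and AutoCats may handle missing translations differently.

```javascript
// Returns the first available label from `preferredLanguages`,
// falling back to the raw entity ID when no label exists in any of them.
function pickLabel(entityId, labels, preferredLanguages) {
    for (const language of preferredLanguages) {
        const entry = labels[language];
        if (entry && entry.value) {
            return entry.value;
        }
    }
    return entityId;
}

// A Japanese-speaking viewer, with English as an assumed second choice.
// "Q123" is a placeholder ID.
const exampleLabels = { en: { language: "en", value: "town hall" } };
console.log(pickLabel("Q123", exampleLabels, ["ja", "en"]));
// prints "town hall", because no Japanese label is present
```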

You may also notice that the set of auto-categories shown here is not quite the same as the one shown for English, even setting aside the language difference. That is because each load of the file page will actually produce a slightly different set: the categories are generated and displayed in real time based on queries, so the exact speed of each query on each run will determine the order (and, if there can be more than 10, the inclusion) of auto-categories.
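
The effect of query timing on ordering can be illustrated with a small simulation. This is a sketch of the general "first finished, first shown" pattern, not the actual AutoCats code: the function names are invented, and timers stand in for real queries.

```javascript
// Starts every query at once and records results in the order they finish,
// keeping at most `limit` of them. Each entry in `queries` is a function
// that returns a Promise resolving to a category name (or null for "no match").
function collectInCompletionOrder(queries, limit = 10) {
    return new Promise((resolve) => {
        const shown = [];
        let pending = queries.length;
        if (pending === 0) {
            resolve(shown);
            return;
        }
        for (const runQuery of queries) {
            runQuery()
                .then((category) => {
                    if (category !== null && shown.length < limit) {
                        shown.push(category);
                    }
                })
                .catch(() => { /* a failed query simply contributes nothing */ })
                .finally(() => {
                    pending -= 1;
                    if (pending === 0) resolve(shown);
                });
        }
    });
}

// Simulated query: resolves to `name` after `delayMs` milliseconds.
function fakeQuery(name, delayMs) {
    return () => new Promise((resolve) => setTimeout(() => resolve(name), delayMs));
}

// "A" is started first but finishes last, so with a limit of 2 it is left out.
collectInCompletionOrder(
    [fakeQuery("A", 60), fakeQuery("B", 10), fakeQuery("C", 30)],
    2
).then((shown) => console.log(shown));
```

Because real query times vary from one page load to the next, the same logic would give a different order, and possibly a different selection, on different loads.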

The "category" pages and Commons Walkabout

By now you may also be wondering where all these links go. They go to the site Commons Walkabout, which provides an interface for viewing the data on Wikimedia Commons, filtered by property. Commons Walkabout is really the "secret weapon" behind AutoCats: it provides an easy viewing interface for seeing the full set of files in each of these auto-categories, and for returning to Commons by clicking on any one thumbnail. Here is the display on Commons Walkabout if you click on the auto-category that was hovered over before; you can also view the screenshot here:


The interface provides ways to add and remove filters, shrinking or enlarging the set of results shown - which is not identical, but hopefully analogous, to the interface provided by MediaWiki's category pages.

Like AutoCats, Commons Walkabout is, at least in theory, fully internationalized. You can click here to see what the user viewing Commons in Japanese would see if they clicked on that same link; or see this screenshot:


Sadly, not everything is translated into every language: not the main values, not the property names, not the Commons Walkabout interface, and certainly not the caption for every photo; but the mechanism at least exists to translate all of these things.

License and credits

AutoCats is an open-source script written in JavaScript, available under the MIT license. It was written by Yaron Koren, and first released in March 2025.

Translation and feedback

It is rather ironic that this tool, which has as one of its main benefits readability in multiple languages, is currently documented only in English! Translations are welcome.

Comments are also welcome, on the talk page.