wikiclip

a tool to find Wikipedia articles based on an image
background

I still remember my first Wikipedia search. After spending years reading physical encyclopedias and hours in Encarta, I stumbled upon Wikipedia. I was mesmerized, but didn't tell my parents out of the fear that this "wiki" was somehow derived from Wicca (the religion).

I love the process of stumbling upon new articles. A few years ago, I made a site called Wikitrip where you can find articles that are nearby. I love using it on a road trip or in a new city to find interesting places around me.

But there are only so many articles with a location. Often, you want to know more about what you see around you. I wanted a tool that could look at an image, and return the most relevant articles on Wikipedia.

methods

The basic premise is to use CLIP to embed the text of a number of articles from Wikipedia. I'd then embed the query image and find the nearest matches. CLIP is naturally suited for this task, as it embeds both images and text into a shared space.
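The shared embedding space is what makes this work: CLIP trains its text and image encoders so that matching pairs land close together, which turns retrieval into a dot product. A toy sketch of the idea with hand-written stand-in vectors (real ones would come from CLIP's encoders, e.g. `encode_text`/`encode_image` in open_clip):

```python
import numpy as np

# Stand-in embeddings: real ones would come from CLIP's text and image
# encoders. The titles and numbers here are purely illustrative.
article_texts = ["Golden Gate Bridge", "Domestic cat", "Espresso"]
text_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
])
image_vec = np.array([0.1, 0.8, 0.2])  # pretend this image shows a cat

# CLIP trains both encoders so matching pairs have high cosine
# similarity; on normalized vectors that is just a dot product.
text_vecs = text_vecs / np.linalg.norm(text_vecs, axis=1, keepdims=True)
image_vec = image_vec / np.linalg.norm(image_vec)
best = article_texts[int(np.argmax(text_vecs @ image_vec))]  # "Domestic cat"
```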

data

The first challenge was scale. English Wikipedia has nearly seven million articles, and pulling and embedding all of them would be infeasible. I first tried pulling articles alphabetically, but that surfaced mostly obscure, uninteresting ones.

Instead, I took a few different approaches to pull a better sample. Wikipedia editors are meticulous. They maintain a list of "Vital Articles" at five different levels of importance. At level five, there are about 50,000 articles.

Wikipedians also classify articles into quality categories. I pulled all articles classified as Good Articles, and all articles classified as A-Class Articles. That added about 160,000 articles, though not all were new, as these categories overlap with the vital articles.
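Both the vital-article and quality lists can be pulled through the MediaWiki API's `categorymembers` list, following the API's continuation tokens to page through large categories. A sketch (the helper names and the exact category titles passed in are assumptions, not code from this project):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def build_params(category, cont=None, limit=500):
    """Query params for the MediaWiki categorymembers list."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmlimit": limit,
        "format": "json",
    }
    if cont:
        params.update(cont)  # continuation token from the previous response
    return params

def category_members(category):
    """Yield every page title in a category, following API continuation."""
    cont = None
    while True:
        data = json.load(urlopen(API + "?" + urlencode(build_params(category, cont))))
        for page in data["query"]["categorymembers"]:
            yield page["title"]
        cont = data.get("continue")
        if not cont:
            return
```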

Finally, I retrieved the 1,000 most popular articles from each month from 2016 through the present. This didn't add too much more data, but it did provide a few more interesting articles.
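The monthly top-1,000 lists are available from the Wikimedia Pageviews REST API, which reports the most-viewed pages per project per month. A sketch of how one might fetch them (function names are mine; a production version should also set a descriptive User-Agent header):

```python
import json
from urllib.request import urlopen

def top_articles_url(year, month):
    """Pageviews REST endpoint for a month's most-viewed English Wikipedia pages."""
    return ("https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
            f"en.wikipedia/all-access/{year}/{month:02d}/all-days")

def top_articles(year, month):
    """Return the ~1,000 most-viewed article titles for one month."""
    data = json.load(urlopen(top_articles_url(year, month)))
    return [item["article"] for item in data["items"][0]["articles"]]
```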

embeddings and database

CLIP can only process 77 tokens of text at a time, so I extracted either the page "extract" (roughly the first paragraph of the article) or the article title if an extract wasn't available. I then used CLIP to embed the text and stored the embedding, article title, page URL, and article numeric identifier in a SQLite database. In hindsight, I could have saved only the embedding and numeric identifier.
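Since SQLite has no native vector type, one straightforward approach is storing each embedding as raw float32 bytes in a BLOB column. A sketch of that schema and insert path (table and column names are illustrative, not necessarily what this project used):

```python
import sqlite3
import numpy as np

def create_db(path):
    """Open (or create) the article database with a simple flat schema."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS articles (
               page_id   INTEGER PRIMARY KEY,
               title     TEXT,
               url       TEXT,
               embedding BLOB  -- float32 vector stored as raw bytes
           )"""
    )
    return db

def store(db, page_id, title, url, vec):
    """Persist one article's metadata and its CLIP text embedding."""
    blob = np.asarray(vec, dtype=np.float32).tobytes()
    db.execute(
        "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?)",
        (page_id, title, url, blob),
    )
    db.commit()
```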

In the end, I added 104,686 articles to my database, and the database file comes to 432 MB. The whole process was reasonably quick: running on my MacBook, it took about 45 minutes to pull the text and embed the articles.

querying the database

Finding articles most similar to an image in the database is pretty simple. I use the same CLIP model to embed the image, then use a simple dot product on the normalized vectors to find the most similar articles. This takes a second or so per query.
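A minimal sketch of that query path, assuming article embeddings are stored as float32 BLOBs in an `articles` table with `title` and `embedding` columns (those names are assumptions), and that the image has already been embedded by CLIP:

```python
import sqlite3
import numpy as np

def nearest_articles(db, image_vec, k=5):
    """Return the k article titles whose text embedding best matches the image."""
    rows = db.execute("SELECT title, embedding FROM articles").fetchall()
    titles = [r[0] for r in rows]
    mat = np.stack([np.frombuffer(r[1], dtype=np.float32) for r in rows])

    # Normalize both sides so the dot product is cosine similarity.
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    q = np.asarray(image_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)

    order = np.argsort(mat @ q)[::-1][:k]  # highest similarity first
    return [titles[i] for i in order]
```

A brute-force scan like this is plenty fast at ~100k vectors; an approximate-nearest-neighbor index only becomes worthwhile at much larger scales.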

results

I'm quite happy with how it all works! Given the limitations of CLIP and the limited number of articles, it does a delightful job of surfacing articles that are similar, if not an exact match. Here are a few examples:

looking at objects in the real world
querying with an illustration
querying with a photo
conclusion

Overall, I'm pretty happy with the results and with being able to put it together in a weekend. My biggest regret is not hosting this on a server so anyone can use it. Since this site lives on GitHub Pages, I'd need somewhere else to run a server. I may take this on, but it's a heavier lift. Maybe soon!

I also find this to be a really great way to probe how well CLIP embeds text. The results that come back give pretty good insight on what CLIP picks up on and what it misses.

what would i do next
  • Hosting this on a server so other people can try it is the most obvious next step.
  • I used the smallest CLIP model. I'm curious if the results would get better with a larger model.
  • I became a bit of a data hoarder with the Wikipedia scraping. Maybe I'd scrape all of the "B-Class" articles to get a lot more data.
  • I'll share the code soon. It might be useful for others to see, particularly the data scraping portions.