I still remember my first Wikipedia search. After spending years reading physical encyclopedias and hours in Encarta, I stumbled upon Wikipedia. I was mesmerized, but didn't tell my parents out of the fear that this "wiki" was somehow derived from Wicca (the religion).
I love the process of stumbling upon new articles. A few years ago, I made a site called Wikitrip that surfaces articles about places near you. I love using it on a road trip or in a new city to find interesting places around me.
But only so many articles have a location. Often, you want to know more about what you see around you. I wanted a tool that could look at an image and return the most relevant Wikipedia articles.
The basic premise is to use CLIP to embed the text of a sample of Wikipedia articles. I'd then embed the query image and find the nearest matches. CLIP is naturally suited to this task because it embeds both images and text into a shared vector space.
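To make that concrete, here's a minimal sketch of the idea using OpenAI's `clip` package. The checkpoint, file name, and example sentence are placeholders of mine, not necessarily what I used:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Embed a snippet of article text into the shared space.
tokens = clip.tokenize(
    ["The Eiffel Tower is a wrought-iron lattice tower in Paris."]
).to(device)
with torch.no_grad():
    text_vec = model.encode_text(tokens)

# Embed an image into the same space.
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_vec = model.encode_image(image)

# Both vectors live in one space, so cosine similarity is meaningful.
text_vec = text_vec / text_vec.norm(dim=-1, keepdim=True)
image_vec = image_vec / image_vec.norm(dim=-1, keepdim=True)
similarity = (image_vec @ text_vec.T).item()
```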
The first challenge was scale. English Wikipedia has nearly seven million articles, and pulling all of them would be infeasible. I first tried pulling articles alphabetically, but that surfaced mostly obscure, uninteresting pages.
Instead, I took a few different approaches to pull a better sample. Wikipedia editors are meticulous: they maintain a list of "Vital Articles" at five levels of importance, and at level five there are about 50,000 articles.
Wikipedians also classify articles by quality. I pulled every article classified as a Good Article or an A-Class Article, roughly 160,000 in total, though not all unique since the set overlaps with the Vital Articles. A sketch of that pull is below.
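Here's roughly how pulling a classification category might look with the MediaWiki API. The category name and the paging details are assumptions on my part:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def category_members(category, limit=500):
    """Yield page titles from a Wikipedia category, following continuation tokens."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": limit,
        "cmnamespace": 0,  # article namespace only
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params).json()
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        params.update(data["continue"])  # resume where the last page left off

# e.g. all Good Articles (the exact category name is an assumption)
good_articles = list(category_members("Category:Good articles"))
```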
Finally, I retrieved the 1,000 most popular articles from each month from 2016 through the present. This didn't add much new data, but it did surface a few more interesting articles.
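The Wikimedia Pageviews API exposes the monthly top articles directly. A sketch of pulling one month; the user agent string and the filtering of non-article pages are my own choices:

```python
import requests

def top_articles(year, month):
    """Return the ~1,000 most-viewed English Wikipedia articles for a month."""
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/top/"
        f"en.wikipedia/all-access/{year}/{month:02d}/all-days"
    )
    # Wikimedia asks API clients to identify themselves.
    resp = requests.get(url, headers={"User-Agent": "wiki-clip-demo/0.1"})
    resp.raise_for_status()
    articles = resp.json()["items"][0]["articles"]
    # Skip non-article pages like Main_Page and Special: pages.
    return [
        a["article"]
        for a in articles
        if ":" not in a["article"] and a["article"] != "Main_Page"
    ]

march_2016 = top_articles(2016, 3)
```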
CLIP can only process 77 tokens of text at a time, so for each article I used the page "extract" (roughly the first paragraph) or fell back to the article title when no extract was available. I then embedded that text with CLIP and stored the embedding, article title, page URL, and numeric page ID in a SQLite database. In hindsight, I could have saved only the embedding and the numeric ID.
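Sketching that pipeline end to end, the schema, helper names, and example title below are mine, not necessarily the original:

```python
import sqlite3
import requests
import clip
import torch

device = "cpu"
model, _ = clip.load("ViT-B/32", device=device)

db = sqlite3.connect("articles.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS articles "
    "(page_id INTEGER PRIMARY KEY, title TEXT, url TEXT, embedding BLOB)"
)

def fetch_extract(title):
    """Fetch the intro extract (and URL/ID) for a page via the MediaWiki API."""
    params = {
        "action": "query",
        "prop": "extracts|info",
        "exintro": 1,
        "explaintext": 1,
        "inprop": "url",
        "titles": title,
        "format": "json",
    }
    data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    # Fall back to the title when no extract is available.
    return page["pageid"], page.get("extract") or page["title"], page["fullurl"]

def embed_text(text):
    # truncate=True clips the input to CLIP's 77-token context window.
    tokens = clip.tokenize([text], truncate=True).to(device)
    with torch.no_grad():
        vec = model.encode_text(tokens).squeeze(0)
    return (vec / vec.norm()).numpy()  # store normalized for dot-product search

page_id, text, url = fetch_extract("Eiffel Tower")
db.execute(
    "INSERT OR REPLACE INTO articles VALUES (?, ?, ?, ?)",
    (page_id, "Eiffel Tower", url, embed_text(text).tobytes()),
)
db.commit()
```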
In the end, I added 104,686 articles to my database, and the file comes to 432 MB. The whole process was reasonably quick: running on my MacBook, it took about 45 minutes to pull the text and embed the articles.
Finding the articles most similar to an image is pretty simple. I use the same CLIP model to embed the image, then take a simple dot product against the normalized vectors to find the most similar articles. This takes a second or so per query.
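A sketch of that query path, assuming the schema from the earlier snippet and that the stored embeddings are float32 and already normalized:

```python
import sqlite3
import numpy as np
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Load every stored embedding into one (N, D) matrix.
db = sqlite3.connect("articles.db")
rows = db.execute("SELECT title, embedding FROM articles").fetchall()
titles = [r[0] for r in rows]
embeddings = np.vstack([np.frombuffer(r[1], dtype=np.float32) for r in rows])

def search(image_path, k=5):
    """Rank stored article embeddings against a query image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():
        q = model.encode_image(image).squeeze(0)
    q = (q / q.norm()).numpy()

    # With normalized vectors, a dot product is cosine similarity.
    scores = embeddings @ q
    best = np.argsort(scores)[::-1][:k]
    return [(titles[i], float(scores[i])) for i in best]

print(search("photo.jpg"))
```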
I'm quite happy with how it all works! Given CLIP's constraints and the relatively small article pool, it's delightful at surfacing articles that are similar, if not an exact match. Here are a few examples:
Overall, I'm pretty happy with the results, and with putting the whole thing together in a weekend. My biggest regret is not hosting this on a server so anyone can use it. Since this site lives on GitHub Pages, I'd need somewhere else to run the server. That's a heavier lift, but I may take it on. Maybe soon!
I also find this a really great way to probe how well CLIP embeds text. The results that come back give pretty good insight into what CLIP picks up on and what it misses.