
What do Embedding Models See?

Exploring how the CLIP and DINO models classify and organize the world

We’re going to explore a bit about how AIs see, organize, and make sense of the visual world by doing a few interactive experiments.

Organizing the World

To be able to understand the world, vision models need to convert an image into a language that they understand. For an AI model, that means a list of numbers. These numbers work a bit like the Dewey Decimal System at the library. Books are assigned a number based on their topic, and similar topics have shared numerical values. This means that all cookbooks start with numbers in the 640s, and astronomy books are listed in the 520s.

It’s very important that these numbers are well structured. Suppose that every book in the library were assigned a random number and laid out on the shelves in numerical order. Instead of finding a group of books of a similar topic on one shelf, you’d have to hunt through the entire library searching for books on one topic of interest.

AI models need to act as librarians. They take in images and organize them by transforming each one into a list of numbers. As with the library, images containing similar content should have numbers that are closer in value to each other. This list of numbers is called an “embedding vector”, or an “embedding” for short.

Embedding Models

AI models create these powerful embeddings to organize the world, but they don't reveal how those embeddings are structured or what the models can actually see. We're going to explore this!

Of course we should be curious about how an AI sees the world, but it is also very important to understand a model's capabilities. We'd like to know where these models are sensitive and where they are blind. What do AI models see as similar, and what do they consider to be very far apart?

We're going to evaluate this through a set of explorations. Each exploration gives an AI model two inputs and asks "how similar do you think these are?"

Measuring Similarity

Given that embedding vectors have many hundreds of values, how can we measure how similar one is to another? We have a nice technique called cosine similarity. Here’s how it works.

We’ll first treat the list of numbers in an embedding vector as coordinates of a point in a space. Each number describes a position along a different axis. To make it simple, we’ll start with an embedding that has only two dimensions.

In our example, we’ll give the first embedding vector the values 3 and 1, and the second the values 1 and 2. We’ll plot these two positions on the graph as the blue and green points.

You’ll notice that we can draw the angle between the lines that extend from the origin of the graph to the points. In this case, that angle is 45˚. This angle indicates how similar the points are to one another. A smaller angle indicates that the two points are similar, whereas a larger angle indicates that they’re further apart. This is our similarity score.

To make this number nicer to work with, we’ll take the cosine of this angle. We do this to convert the angle to a scale between -1 and 1. This is the cosine similarity score between two embeddings. You can see this score in blue above the plot.

Drag the points around to see how the score changes. A cosine similarity value close to 1 indicates that the points are nearly the same, 0 means they’re perpendicular, and -1 means that they’re pointing in exactly opposite directions.

This technique, which is easy to picture in two dimensions, translates directly to higher-dimensional vectors, and it’s what we’ll use to compare embeddings created by AI models.
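If you want to see the arithmetic, here's a minimal sketch in Python using NumPy. The cosine of the angle between two vectors is their dot product divided by the product of their lengths, and the same formula applies unchanged whether the vectors hold two values or many hundreds. (The 768-dimensional vectors at the end are just random stand-ins for real embeddings.)

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product / product of lengths."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two-dimensional example from above: the angle between these vectors is 45 degrees.
print(cosine_similarity([3, 1], [1, 2]))  # ~0.707, i.e. cos(45°)

# The same function works for the hundreds of dimensions in a real embedding.
rng = np.random.default_rng(0)
print(cosine_similarity(rng.normal(size=768), rng.normal(size=768)))
```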

Image Similarity

How can we examine how these embedding models see the world? Given that the embeddings themselves are mostly inscrutable, we’ll look instead at the similarity of embeddings produced for two different images. Through this process, we’ll build an intuition for how a model sees.

For each experiment, we're going to use two popular embedding models. One is called CLIP and the other is called DINO. In each test, the models compute embeddings for the images on the left and right, and we then measure the similarity of the two embeddings. The similarity scores for each model are shown in the scoreboard above the images.
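As a rough sketch of how a scoreboard like this could be computed, the snippet below loads one CLIP and one DINO checkpoint from the Hugging Face transformers library, embeds two images, and compares them with cosine similarity. The specific checkpoints ("openai/clip-vit-base-patch32" and "facebook/dinov2-base") and the file names are assumptions of mine; the article doesn't say which model variants it uses.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, CLIPModel, CLIPProcessor

# Assumed checkpoints -- the article doesn't specify which variants it uses.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
dino_model = AutoModel.from_pretrained("facebook/dinov2-base")
dino_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")

def clip_embedding(image: Image.Image) -> torch.Tensor:
    inputs = clip_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return clip_model.get_image_features(**inputs)  # shape (1, 512)

def dino_embedding(image: Image.Image) -> torch.Tensor:
    inputs = dino_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        # Use the CLS token as the image-level embedding.
        return dino_model(**inputs).last_hidden_state[:, 0]  # shape (1, 768)

def similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b).item()

# Hypothetical file names, standing in for the image pickers on this page.
left, right = Image.open("left.jpg"), Image.open("right.jpg")
print("CLIP:", similarity(clip_embedding(left), clip_embedding(right)))
print("DINO:", similarity(dino_embedding(left), dino_embedding(right)))
```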

We'll start by comparing two photos of gray parrots. As you'd expect, the similarity score is very high: the CLIP model gives the pair a score of 0.92 and the DINO model gives it 0.95. Even though the images are a little different, the models see them as similar.

Now let's see what happens when we compare a slightly less similar pair. Once again, the scores are still quite high, but a little lower than before. Everything is behaving as we'd expect.

Now let's try a few more experiments. Let's compare two very different-looking dogs. The CLIP model thinks they are even more similar than the two gray parrots, but the DINO model thinks they are quite far apart. This is interesting! It means that the CLIP model groups them together as dogs, while the DINO model sees them as significantly different.

Make some comparisons of your own and see whether the models' scores line up with your intuition.

In doing these tests, we can start to build up an intuition for how the models perceive the images. It seems that the DINO model is more sensitive to differences between images, while the CLIP model groups things at a more conceptual level. CLIP might be saying something like "these are all animals to me," whereas DINO is saying "this dog looks nothing like that other dog!" Both are equally correct, but the difference gives us a clue about how these models represent the world.

Rotation

Now that we’ve looked at the similarity between different images, let’s see what happens if we transform the same image. We’ll start by looking at rotation. As a human, you’d likely say that a rotated image is similar to the original, but a little different. Do the models say the same?

We’ll start with a photo of a butterfly. You can see that CLIP finds the two images highly similar (0.98), while DINO sees them as further apart (0.91).

The chart on the bottom shows how the similarity score changes as a function of the rotation angle. You can see the rotation at which the DINO model thinks the images are most different.
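A curve like this can be produced with a short loop: rotate the image in fixed steps, embed each rotated copy, and compare it with the embedding of the original. The sketch below does this for CLIP only (the DINO loop is identical), and the file name is a placeholder for whichever image is selected.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        return model.get_image_features(**inputs)

# Hypothetical file name, standing in for the image chosen on this page.
original = Image.open("butterfly.jpg")
reference = embed(original)

for angle in range(0, 360, 15):
    # expand=False keeps the original frame; the exposed corners are filled with black.
    rotated = original.rotate(angle, expand=False)
    score = torch.nn.functional.cosine_similarity(reference, embed(rotated)).item()
    print(f"{angle:3d} deg  similarity {score:.3f}")
```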

A butterfly is a concrete semantic target. With a minimum similarity score of 0.975, it seems that CLIP consistently puts it into the same “butterfly” region of embedding space.

But let’s take a look at a different type of image: a less-concrete target, and one without a notion of “proper” orientation.

Now the CLIP model starts behaving more like what we’ve seen from the DINO model. It becomes more sensitive to rotation and finds some of the rotated versions to be highly dissimilar.

Explore some of the other images. You might see periodic similarities in some, a stronger reaction to uncommon scenarios in others, or the occasional plot that is simply puzzling.

3D Rotation

Turning an image upside-down and seeing how the models respond is interesting, but not always realistic. After all, does it matter what a model thinks of an upside-down elk? It would be more informative if we could view the same object from a different perspective and see what the models think.

We'll do exactly this by rendering a 3D model from different perspectives and comparing the similarity of the embeddings as the viewpoint changes.

These rotation plots are more complex to analyze. For one subject, the CLIP model maintains fairly high similarity across viewpoints, whereas DINO's similarity drops off abruptly as soon as the back of the head dominates the frame. It's also interesting to see how the models handle geography in another of the examples.

Color

Lastly, we’ll look at transforming the color of the image to understand whether color is an important feature to these models.

We’ll start by changing the color of an egg yolk. Though in most other cases the DINO model was much more sensitive than CLIP, the roles are reversed here. The DINO model finds all colors to be fairly similar, whereas the CLIP model has significantly more variation.

The same pattern holds for the other images. It’s quite interesting to see which color combinations CLIP finds to be most dissimilar.
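The article doesn’t spell out how the images are recolored; one simple stand-in is a hue rotation in HSV space, sketched below. The file name is a placeholder, and each recolored image can be embedded and compared with the original exactly as in the earlier sketches.

```python
import numpy as np
from PIL import Image

def shift_hue(image: Image.Image, degrees: float) -> Image.Image:
    """Rotate the hue of an RGB image by the given number of degrees."""
    hsv = np.array(image.convert("RGB").convert("HSV"))
    shift = int(round(degrees / 360.0 * 255))  # PIL stores hue as 0-255
    hsv[..., 0] = (hsv[..., 0].astype(int) + shift) % 256
    return Image.fromarray(hsv, mode="HSV").convert("RGB")

# Hypothetical file name, standing in for the egg-yolk image on this page.
original = Image.open("egg.jpg")
for degrees in range(0, 360, 30):
    recolored = shift_hue(original, degrees)
    recolored.save(f"egg_hue_{degrees:03d}.jpg")
    # Embedding each recolored image with CLIP and DINO (as in the earlier
    # sketch) and comparing it with the original produces the similarity curve.
```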

Conclusions

Generally, it seems as if CLIP is more semantic: it groups similar concepts together a little more tightly. This is in contrast to DINO, which seems to look at images more literally. Most of these results make sense when you consider how the models were trained: DINO was trained purely on images, while CLIP was trained with an objective that aligns text and image embedding spaces. Neither approach is necessarily better, but they have different strengths.

Similarity is a very crude metric for evaluating these incredibly rich embedding vectors, but it gives a glimpse of how the models organize the world by placing images within their embedding spaces. By playing with these examples, I have a better intuition for how these models behave, and I hope you do too.
