Learn Creative Coding (#102) - Embeddings and Similarity


Last episode I left you on a cliffhanger. We were doing audio, and I kept coming back to this one idea: once you turn a sound into a vector of numbers, you can ask how close two sounds are. Not "is this a clap, yes or no", but "this noise sits near that noise in some space of meaning". I said it works for sound, for images, for words, for anything you can turn into a vector, and that next time we would look straight at it.

So here we are. This is the episode about that one idea. It is called embeddings, and honestly it might be the single most useful ML concept for creative coding that nobody explains properly. Most tutorials either skip it or drown it in linear algebra. Let me try to do it the way I wish someone had done it for me :-).

What an embedding actually is

An embedding is just this: you take some messy, complicated thing - an image, a word, a sound, a whole document - and you turn it into a fixed-length list of numbers. A vector. Maybe 128 numbers, maybe 512, maybe 1024. And the magic is that this list isn't random. It is arranged so that similar things get similar numbers.

That is the whole trick. Two photos of beaches end up with vectors that are close together. A photo of a beach and a photo of a server room end up far apart. The model has squeezed "what this thing is like" down into coordinates.

Why do we care as artists? Because the moment something is a point in space, distance becomes meaningful. "How similar are these two images?" stops being a vague human question and becomes a number you can compute, sort by, animate, map to color. Similarity turns into a creative parameter, same as mouseX or frameCount. That is what makes this worth a whole episode.

Way back in episode 96 we used MobileNet to classify images - "this is a tabby cat, 92%". But classification throws away almost everything. It collapses a rich image down to one label. An embedding keeps the richness. Instead of the final "cat" answer, we grab the layer just before that - the 1024 numbers the network computed on its way to deciding. Those numbers are the image's fingerprint.

Distance is the whole game

Before any fancy models, let me show you the two functions you will use constantly. Given two vectors, how far apart are they? There are two common answers.

// Euclidean distance: straight-line distance, like measuring with a ruler
function euclidean(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    let d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

That is the obvious one, the Pythagoras-in-many-dimensions version. But for embeddings the more useful one is usually cosine similarity, which measures the angle between two vectors rather than the raw distance. It ignores how long the vectors are and only cares about which direction they point.

// Cosine similarity: 1 = identical direction, 0 = unrelated, -1 = opposite
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot  += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
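  const denom = Math.sqrt(magA) * Math.sqrt(magB);
  // an all-zero vector has no direction: call it "unrelated" instead of returning NaN
  return denom === 0 ? 0 : dot / denom;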
}

Why angle instead of length? Because in embedding space the direction is what carries the meaning. A bright photo and a dark photo of the same beach might have different vector lengths but point the same way. Cosine catches that "same way" and shrugs off the brightness. For most creative work, reach for cosine first. Makes sense, right?
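To make the "direction, not length" point concrete, here is a minimal sketch with made-up three-number vectors. The `normalize` and `dot` helpers are my own names for this example, not part of any library. Once two vectors are scaled to length 1, their dot product is the cosine similarity, so a vector and a three-times-longer copy of it score as identical.

```javascript
// Scale a vector to length 1 (hypothetical helper for this example).
function normalize(v) {
  const mag = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  if (mag === 0) return v.slice();   // an all-zero vector has no direction to keep
  return v.map(x => x / mag);
}

// Plain dot product. For unit-length vectors this equals cosine similarity.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

const dark = [1, 2, 2];      // toy "embedding", invented numbers
const bright = [3, 6, 6];    // same direction, three times longer
const other = [2, -1, 0];    // perpendicular to both

const sameDirection = dot(normalize(dark), normalize(bright));  // very close to 1
const unrelated = dot(normalize(dark), normalize(other));       // very close to 0
console.log(sameDirection, unrelated);
```

If you normalize every vector once when you embed it, later comparisons only need the dot product, which is handy when you compare against a large collection every frame.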

Turning an image into a vector

Now let us actually produce an embedding. In the browser, TensorFlow.js gives us MobileNet, and it has a quiet little option that does exactly what we want.

// load MobileNet once (tensorflow.js + the mobilenet model)
let mobilenetModel;

async function loadModel() {
  mobilenetModel = await mobilenet.load();
}

// the second argument `true` means: give me the embedding,
// NOT the classification. You get back a 1024-length vector.
function embedImage(imgElement) {
  const activation = mobilenetModel.infer(imgElement, true);
  const vector = activation.dataSync();   // Float32Array, length 1024
  activation.dispose();                   // free GPU memory, important!
  return Array.from(vector);
}

That true flag is the whole episode in one line. Without it MobileNet tells you "golden retriever". With it, MobileNet hands you the 1024 numbers it was about to turn into "golden retriever". Those numbers describe the image in a way the network finds meaningful, and that turns out to be incredibly useful raw material.

The dispose() matters more than it looks. In the browser tf.js normally runs on the GPU through WebGL, and tensors stored there are not garbage-collected for you. Forget to dispose and a sketch that embeds a few hundred images will quietly eat all your memory and crash the tab. Ask me how I know. At work I once left an embed loop running over a folder of images overnight and came back to a frozen machine and a very confused colleague :-).

Nearest neighbours: a visual recommender

Here is the first genuinely fun thing you can build. You have a collection of images, each turned into a vector. Someone points at one image and asks "show me the ones most like this". You just compute the similarity to every other image and sort.

// collection = [{ img: p5Image, vec: [...1024 numbers] }, ...]
// query = one of those entries
function nearestNeighbours(query, collection, count) {
  return collection
    .filter(item => item !== query)
    .map(item => ({
      item: item,
      score: cosineSimilarity(query.vec, item.vec)
    }))
    .sort((a, b) => b.score - a.score)   // highest similarity first
    .slice(0, count);
}

Wire that into a p5 sketch and you have a little visual search engine. Click an image, and the five most visually similar images line up next to it.

let collection = [];   // pre-embedded images
let query = null;
let neighbours = [];

function setup() {
  createCanvas(900, 400);
  // assume collection is already filled and embedded
  query = collection[0];
  neighbours = nearestNeighbours(query, collection, 5);
}

function draw() {
  background(18);

  // the query image, big, on the left
  image(query.img, 20, 20, 200, 200);
  noStroke();
  fill(180);
  textFont('monospace');
  text('query', 20, 240);

  // its nearest neighbours, smaller, on the right
  for (let i = 0; i < neighbours.length; i++) {
    let n = neighbours[i];
    let x = 260 + i * 130;
    image(n.item.img, x, 40, 110, 110);
    fill(140);
    // show the similarity score under each
    text(n.score.toFixed(3), x, 165);
  }
}

function mousePressed() {
  // for now any click picks a random new query
  // (picking the clicked thumbnail needs a hit test on the strip)
  query = collection[floor(random(collection.length))];
  neighbours = nearestNeighbours(query, collection, 5);
}
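The click handler above picks a random query. If you would rather promote the thumbnail that was actually clicked, a small hit test is enough. This is a sketch that assumes the exact layout numbers from the draw() loop (strip starts at x = 260, y = 40, thumbnails are 110 wide and spaced 130 apart); `neighbourIndexAt` is a name I made up.

```javascript
// Layout constants copied from the draw() loop in the sketch.
const STRIP_X = 260, STRIP_Y = 40, STEP = 130, SIZE = 110;

// Returns the index of the neighbour thumbnail under (mx, my), or -1 if none.
function neighbourIndexAt(mx, my, count) {
  if (my < STRIP_Y || my >= STRIP_Y + SIZE) return -1;   // above or below the strip
  const i = Math.floor((mx - STRIP_X) / STEP);
  if (i < 0 || i >= count) return -1;                     // left or right of the strip
  const left = STRIP_X + i * STEP;
  return mx < left + SIZE ? i : -1;                       // the 20px gaps are not clickable
}

// Possible use inside mousePressed() in the p5 sketch:
//   const i = neighbourIndexAt(mouseX, mouseY, neighbours.length);
//   if (i >= 0) {
//     query = neighbours[i].item;
//     neighbours = nearestNeighbours(query, collection, 5);
//   }
```

Clicking a neighbour then makes it the new query, so you can hop from image to image through the collection.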

Notice there is no "category" anywhere in this code. We never told it what a beach is, or a face, or a sunset. It just knows that these vectors are close to that vector, and closeness in MobileNet's space lines up shockingly well with "looks similar to a human". That alignment is the gift the pre-trained model gives us for free.

Seeing the space: 1024 dimensions on a flat screen

Okay, but a 1024-dimensional vector is impossible to picture. Our screens are stubbornly 2D. So how do you actually look at an embedding space?

You squash it. There are algorithms whose entire job is to take high-dimensional points and lay them out in 2D so that points that were close in 1024D stay close in 2D. The famous two are t-SNE and UMAP. You do not need to implement them - they exist as libraries - you just need to understand what they promise.

// conceptual: reducing 1024D embeddings down to 2D coordinates
//
// input:  [ [1024 numbers], [1024 numbers], ... ]   one per image
// output: [ [x, y], [x, y], ... ]                    one 2D point per image
//
// t-SNE  -> preserves LOCAL structure beautifully (tight clusters)
//           but distances between clusters are not meaningful
// UMAP   -> faster, and preserves more GLOBAL structure
//           (the layout between clusters means something too)
//
// both are "give me points that keep neighbours as neighbours"
// you feed in vectors, you get back a 2D scatter you can draw

Once you have 2D coordinates, drawing them is pure creative coding. And this is where it connects straight back to episode 85, where we did network and graph visualization - laying out nodes so related things sit near each other. This is the same instinct, except the "relatedness" now comes from a neural network instead of edges you defined by hand.

// points = [{ img: p5Image, x: 0..1, y: 0..1 }, ...]  from UMAP, rescaled to 0..1
function draw() {
  background(12);
  for (let p of points) {
    let px = map(p.x, 0, 1, 60, width - 60);
    let py = map(p.y, 0, 1, 60, height - 60);
    // draw each image small, as a thumbnail, at its embedded position
    image(p.img, px - 20, py - 20, 40, 40);
  }
}
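One practical detail: the draw loop above expects x and y between 0 and 1, but projection libraries generally hand back coordinates in whatever range the algorithm settled on. A minimal sketch of the rescaling step, assuming the projection comes back as an array of [x, y] pairs (`normalizePoints` is a hypothetical helper name):

```javascript
// Rescale raw 2D coordinates so both axes run from 0 to 1.
function normalizePoints(coords) {
  let minX = Infinity, maxX = -Infinity, minY = Infinity, maxY = -Infinity;
  for (const [x, y] of coords) {
    if (x < minX) minX = x;
    if (x > maxX) maxX = x;
    if (y < minY) minY = y;
    if (y > maxY) maxY = y;
  }
  // if every point shares a coordinate, avoid dividing by zero
  const spanX = (maxX - minX) || 1;
  const spanY = (maxY - minY) || 1;
  return coords.map(([x, y]) => ({
    x: (x - minX) / spanX,
    y: (y - minY) / spanY
  }));
}

const raw = [[-4, 10], [6, 12], [1, 11]];   // made-up projection output
const unit = normalizePoints(raw);
console.log(unit);
```

Each axis is stretched independently here, which fills the canvas nicely but distorts the aspect ratio of the layout. Dividing both axes by the larger of the two spans keeps the shape intact instead.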

Run that over a thousand images and something lovely happens. Beaches drift into one corner. Faces clump together. Dark moody photos pool on one side, bright airy ones on the other. Nobody sorted them. The arrangement emerged from the embeddings. You built a gallery where physical position equals visual similarity - an image landscape you can wander through.

Words are vectors too

Here is the part that broke my brain a little when I first met it. The same trick works for words. Models like Word2Vec and GloVe assign every word a vector, learned from reading enormous piles of text. And because of how they learn - words that show up in similar contexts get similar vectors - the geometry ends up encoding meaning.

ml5 wraps this up so we can play with it in the browser.

let wordModel;

function preload() {
  // a pre-trained word vector file (words -> vectors)
  wordModel = ml5.word2vec('data/wordvecs10000.json');
}

function setup() {
  noCanvas();
  // ask: what words are nearest to "ocean"?
  wordModel.nearest('ocean', 8, (err, results) => {
    if (err) return console.log(err);
    // results = [{ word: 'sea', distance: ... }, { word: 'waves' }, ...]
    results.forEach(r => console.log(r.word, r.distance.toFixed(3)));
  });
}

Ask for the nearest words to "ocean" and you get back "sea", "waves", "coastal", "shore". The model never read a dictionary. It just noticed that those words keep the same company, and that statistical fact landed them in the same neighbourhood of the space.

Arithmetic on meaning

And now the famous one, the example that made word embeddings go viral. You can do algebra with words.

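// king - man + woman = ?
// ml5 hands each callback an ARRAY of nearest matches, so take the first one
wordModel.subtract(['king', 'man'], (err, kingMinusMan) => {
  if (err) return console.log(err);
  wordModel.add([kingMinusMan[0].word, 'woman'], (err, result) => {
    if (err) return console.log(err);
    console.log(result[0].word);   // -> "queen" (approximately!)
  });
});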

Read that again. "King" minus "man" plus "woman" lands you near "queen". The vector difference between king and man captures something like "royalty without the maleness", and adding "woman" back in moves you to the female version. The model learned a direction in space that means "gender", another that means "royalty", without anyone ever defining those concepts. They fell out of the statistics.

It is not perfect - sometimes you get "queen", sometimes "monarch", sometimes something charmingly wrong. But that it works at all is one of those facts that should feel like science fiction and somehow doesn't anymore.
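The snippet above feeds a word (the nearest match to king minus man) into the second call. The textbook version does the arithmetic on the raw vectors first and only looks up a word at the very end, skipping the input words themselves. Here is a minimal sketch of that, with a tiny hand-made vocabulary. The five 2D vectors are invented for illustration (roughly "royalty" on one axis, "gender" on the other); real word vectors have hundreds of dimensions and come from training, not from me typing numbers.

```javascript
// Invented toy vectors: [royalty-ish, gender-ish]
const vectors = {
  king:  [1.0,  1.0],
  queen: [1.0, -1.0],
  man:   [0.1,  1.0],
  woman: [0.1, -1.0],
  apple: [-1.0, 0.1]
};

function cosine(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Find the vocabulary word whose vector points most nearly the same way as target,
// ignoring the words listed in exclude.
function nearestWord(target, exclude) {
  let best = null, bestScore = -Infinity;
  for (const [word, vec] of Object.entries(vectors)) {
    if (exclude.includes(word)) continue;
    const score = cosine(target, vec);
    if (score > bestScore) { best = word; bestScore = score; }
  }
  return best;
}

const { king, man, woman } = vectors;
const target = king.map((v, i) => v - man[i] + woman[i]);   // king - man + woman
const answer = nearestWord(target, ['king', 'man', 'woman']);
console.log(answer);   // prints "queen"
```

Excluding the input words matters: with real embeddings the nearest vector to king - man + woman is often "king" itself, so most implementations filter the inputs out before picking a winner.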

Walking through the space

For us the real prize isn't the trivia, it is navigation. If a word is a point and meaning is direction, then you can take a walk. Start at one concept, head toward another, and sample the meaning along the way.

// walk from one word's vector to another in N steps,
// printing (or rendering) what each step is nearest to
function semanticWalk(startVec, endVec, steps, cb) {
  for (let i = 0; i <= steps; i++) {
    let t = i / steps;
    // lerp every dimension of the vector
    let mid = startVec.map((v, idx) => lerp(v, endVec[idx], t));
    cb(t, mid);   // hand back the interpolated point to use
  }
}

This is interpolation, exactly like the easing and lerp we did way back in episode 16 - except we are lerping through a space of meaning instead of a space of pixels. Start at the vector for "ocean", end at the vector for "fire", take twelve steps, and ask what each step is nearest to. You might trace ocean, water, steam, heat, flame, fire. The transition isn't random fading - it is a meaningful journey through related concepts. Feed those words to a generative system and your visuals morph semantically, not just visually.
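To see a walk produce an actual list of words, here is a self-contained sketch with an invented six-word vocabulary in 2D. The vectors are hand-placed along an arc so the result is easy to follow; they are not real embeddings. `mix` is a plain-JS stand-in for p5's lerp() so the example also runs outside a sketch, and `nearestWord` is a hypothetical helper.

```javascript
// Toy 2D "word vectors", invented for this example.
const vocab = {
  ocean: [1.0, 0.0],
  water: [0.95, 0.3],
  steam: [0.7, 0.7],
  heat:  [0.4, 0.9],
  flame: [0.15, 1.0],
  fire:  [0.0, 1.0]
};

// Same maths as p5's lerp(), under a different name to avoid clashing with it.
const mix = (a, b, t) => a + (b - a) * t;

function cosine(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// The vocabulary word whose vector is most similar to vec.
function nearestWord(vec) {
  let best = null, bestScore = -Infinity;
  for (const [word, v] of Object.entries(vocab)) {
    const score = cosine(vec, v);
    if (score > bestScore) { best = word; bestScore = score; }
  }
  return best;
}

function semanticWalk(startVec, endVec, steps, cb) {
  for (let i = 0; i <= steps; i++) {
    const t = i / steps;
    cb(t, startVec.map((v, idx) => mix(v, endVec[idx], t)));
  }
}

const path = [];
semanticWalk(vocab.ocean, vocab.fire, 12, (t, mid) => {
  const word = nearestWord(mid);
  if (path[path.length - 1] !== word) path.push(word);   // skip consecutive repeats
});
console.log(path.join(' -> '));   // ocean -> water -> steam -> heat -> flame -> fire
```

With real word vectors the path is rarely this tidy. Expect repeats, detours and the odd surprising word, which is often where the interesting material is.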

And remember the latent-space walks from the GAN episodes (97 through 99)? Same idea, different space. There we glided through a space of generated faces. Here we glide through a space of words. Once you see "smooth movement through a learned space" as a tool, you start spotting places to use it everywhere.

Arithmetic on images

If words have a "gender direction", do images have directions too? They do. This is exactly what we were sneaking up on in the GAN episodes when we added a "smile vector" to a face. The general recipe:

// find a "concept direction" from example images, then apply it
// e.g. the "add glasses" direction:
//   average(faces WITH glasses) - average(faces WITHOUT glasses)
function conceptVector(withExamples, withoutExamples) {
  let dim = withExamples[0].length;
  let avgWith = new Array(dim).fill(0);
  let avgWithout = new Array(dim).fill(0);

  for (let v of withExamples)
    for (let i = 0; i < dim; i++) avgWith[i] += v[i] / withExamples.length;
  for (let v of withoutExamples)
    for (let i = 0; i < dim; i++) avgWithout[i] += v[i] / withoutExamples.length;

  // the direction that means "the thing that's different between these groups"
  return avgWith.map((v, i) => v - avgWithout[i]);
}

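Compute the "glasses direction" once from a handful of examples, and then you can push any face vector along it to add glasses, or pull it the other way to remove them. The same works for "more sunset", "more snow", "older", whatever distinction your example sets capture. You are sculpting in meaning-space. In a generative model like the GANs from earlier, the generator turns that sculpting back into pixels. With MobileNet embeddings there is no way back to an image, so you use the shifted vector as a search query instead: "find me the image nearest to this face, plus glasses".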
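The "push along the direction" step is one line of vector maths. A minimal sketch, with invented three-number vectors standing in for real embeddings (`applyConcept` and the `strength` parameter are names I chose for this example):

```javascript
// Move a vector along a concept direction.
// strength > 0 adds the concept, strength < 0 takes it away, 0 leaves vec unchanged.
function applyConcept(vec, direction, strength) {
  return vec.map((v, i) => v + strength * direction[i]);
}

const face = [0.2, 0.5, 0.1];         // made-up "face embedding"
const glasses = [0.0, 0.25, -0.5];    // made-up "glasses direction"

const withGlasses = applyConcept(face, glasses, 1);      // roughly [0.2, 0.75, -0.4]
const withoutGlasses = applyConcept(face, glasses, -1);  // roughly [0.2, 0.25, 0.6]
console.log(withGlasses, withoutGlasses);
```

Because strength is just a number, you can map it to mouseX or animate it over time and slide smoothly between "less" and "more" of the concept.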

One space to rule them: CLIP

I have saved my favourite for last. Everything so far kept words and images in separate spaces. But there is a model from OpenAI called CLIP that embeds both text and images into the same space. The phrase "a red circle on a blue background" gets a vector that sits right next to the vector of an actual image of a red circle on blue.

Think about what that unlocks.

// conceptual: CLIP gives text and images vectors in ONE shared space
//
//   clipEmbedText("a stormy seascape at dusk")   -> [512 numbers]
//   clipEmbedImage(someImage)                     -> [512 numbers]
//
// because they live in the same space, you can compare across types:
//   cosineSimilarity(textVec, imageVec)
//
// search a folder of images by typing a sentence:
function searchByText(phrase, imageCollection) {
  let q = clipEmbedText(phrase);
  return imageCollection
    .map(item => ({ item, score: cosineSimilarity(q, item.vec) }))
    .sort((a, b) => b.score - a.score);
}

You can search your own images by describing them. Type "lonely", get back the images that feel lonely. You can also point it the other way and ask CLIP to score how well a generated image matches a target phrase - which means you can even use it as a judge, nudging a generative system toward "more like the words I wrote". Text becomes a steering wheel for visuals. We will not wire up the full thing here (CLIP is a chunky model), but I wanted you to know the door exists, because it changes how you think about the whole pipeline.

The exercise: a map of your own work

Allez, let us pull it all together into one piece, and make it personal. Over this whole series you have been making sketches. Take screenshots of them - the more the better, dig back through your folders. Embed every screenshot with MobileNet, project them to 2D with UMAP, and render them as an interactive scatter plot where every point is a thumbnail of your own work. Click a thumbnail to blow it up. It becomes a map of your creative coding journey, where similar pieces sit together and the odd experiments stand alone as outliers.

let works = [];   // [{ img, vec, x, y }] -- embedded + UMAP-projected
let hovered = null;

function setup() {
  createCanvas(900, 650);
  imageMode(CENTER);
  // assume works[] is already embedded and projected offline
}

function draw() {
  background(10);
  hovered = null;

  for (let w of works) {
    let px = map(w.x, 0, 1, 70, width - 70);
    let py = map(w.y, 0, 1, 70, height - 70);

    // is the mouse over this thumbnail?
    let over = dist(mouseX, mouseY, px, py) < 22;
    if (over) hovered = { w, px, py };

    // fade non-hovered ones slightly so the cluster reads
    tint(255, over ? 255 : 150);
    image(w.img, px, py, over ? 54 : 36, over ? 54 : 36);
  }

  // pop the hovered work up large near the cursor
  if (hovered) {
    noTint();
    let s = 220;
    let hx = constrain(hovered.px, s/2, width - s/2);
    let hy = constrain(hovered.py - 140, s/2, height - s/2);
    image(hovered.w.img, hx, hy, s, s);
    noFill();
    stroke(200);
    rect(hx - s/2, hy - s/2, s, s);
  }

  noStroke();
  fill(120);
  textFont('monospace');
  textAlign(LEFT, BASELINE);
  text('your work, arranged by visual similarity', 16, height - 16);
}

The offline part - embedding the screenshots and running UMAP - you do once and save the coordinates to a JSON file, so the sketch itself just loads points and stays smooth. When I did this with my own pile of sketches I genuinely sat staring at it for ages. You can see your phases. The shader stuff from episodes 32 onward forms a glowing knot in one corner. The early p5 doodles scatter loosely on the other side. The map knew things about my own work that I hadn't put into words. That is the quiet power of embeddings - they surface structure you didn't know was there.
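For the "save the coordinates to a JSON file" step, something like the following would do. It is a sketch under assumptions: `files` and `coords` are hypothetical outputs of your offline run (screenshot file names, and matching [x, y] pairs already rescaled to 0..1), and the file names in the example are placeholders. The sketch would then load each image by its file name.

```javascript
// Build the JSON text the sketch loads at startup.
function buildPointsJson(files, coords) {
  if (files.length !== coords.length) {
    throw new Error('files and coords must have the same length');
  }
  const points = files.map((file, i) => ({
    file,
    // four decimals is plenty for screen positions and keeps the file small
    x: Number(coords[i][0].toFixed(4)),
    y: Number(coords[i][1].toFixed(4))
  }));
  return JSON.stringify(points);
}

const json = buildPointsJson(
  ['sketch-001.png', 'sketch-002.png'],     // placeholder file names
  [[0.123456, 0.9], [0.5, 0.000049]]        // made-up coordinates
);
console.log(json);

// In Node you could then write it to disk with:
//   require('fs').writeFileSync('works.json', json);
```

The 1024-number vectors are left out of the file on purpose here. The map only needs positions, and dropping the vectors keeps the download tiny. Keep them in a separate file if you also want nearest-neighbour search in the same sketch.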

What it comes down to...

An embedding is a fixed-length vector (often 128 to 1024 numbers) that represents a complex object - image, word, sound - so that similar things get similar vectors. The moment something is a point in space, similarity becomes a number you can compute and design with
Distance is the core operation. Euclidean measures straight-line distance; cosine similarity measures the angle between vectors and ignores their length. For embeddings, reach for cosine first - direction carries the meaning, length usually doesn't
Image embeddings come almost for free from MobileNet: call infer(img, true) in tf.js and instead of a classification label you get the 1024-number vector from the layer just before the label. Remember to dispose() the tensor or you will leak GPU memory
Nearest-neighbour search builds a visual recommender with no categories at all: embed a collection, compute similarity to a query, sort, show the top few. Closeness in MobileNet's space lines up beautifully with "looks similar" to a human
High-dimensional spaces get flattened to 2D with t-SNE (great local clusters) or UMAP (faster, better global structure). Feed in vectors, get back x,y coordinates, and draw thumbnails at those positions - an "image landscape" where physical proximity equals visual similarity. Same instinct as the graph layouts in episode 85
Words are vectors too (Word2Vec, GloVe, ml5's word2vec). Words used in similar contexts get similar vectors, so the geometry encodes meaning - nearest('ocean') returns sea, waves, shore without any dictionary
Embedding arithmetic is real: king - man + woman lands near queen. The model learned directions that mean "gender", "royalty", etc., purely from statistics. The same works on image embeddings - compute a "glasses direction" from examples and push any face along it
Semantic walking is interpolation through meaning-space: lerp from one vector to another and sample what you pass through (ocean to fire might trace water, steam, heat, flame). Same lerp from episode 16, same idea as the GAN latent walks in episodes 97-99, different space
CLIP embeds text AND images into one shared space, so you can compare across types - search images by typing a sentence, or score how well an image matches a phrase. Text becomes a steering wheel for visuals
The exercise: embed screenshots of your own series of sketches, project with UMAP, render as an interactive thumbnail scatter. A map of your creative journey where clusters and outliers reveal structure in your work you didn't consciously put there

Ten episodes into the ML arc now. The network learned to watch (92), read bodies (93-95), classify deeply (96), paint (97), translate (98), generate from nothing (99), be trained by us (100), and listen (101). And this episode it learned to measure meaning - to say not just "what is this" but "what is this near". Classification gives you a label. Embeddings give you a whole space to move around in, and movement is what we do as creative coders.

Everything in this arc has been building one idea on top of another, and we are close to the point where we put the whole toolbox down on one table and build something real with all of it at once. Bring your webcam, your mic, and an idea for a room you'd like to bring to life.

Sallukes! Thanks for reading.

@femdev
