Learn Creative Coding (#100) - Training Custom Models with Teachable Machine


Last episode we generated images from nothing. GANs took random noise vectors and produced photographs of people who never existed, landscapes nobody ever painted, cats that never purred. The generator learned to fool the discriminator, and after enough rounds of that adversarial game, it could synthesize images from pure randomness. We walked through latent space, did vector arithmetic on faces ("add smile, subtract glasses"), decomposed GAN output into particle systems, and curated galleries of impossible portraits. The power was real. But every model we used was someone else's model, trained on someone else's data, reflecting someone else's vision.

This episode we flip that. Instead of using pre-trained models that classify the world according to ImageNet's 1000 categories or generate faces from celebrity photo datasets, we train our own. From scratch. With our own data. On our own categories. The model learns YOUR visual vocabulary and responds to YOUR specific inputs. A gesture that means nothing to MobileNet becomes the trigger for a visual mode you designed. A sound that no pre-trained audio classifier would recognize becomes the activation signal for a generative system you built. The ML becomes personal.

And the tool that makes this accessible is Google's Teachable Machine -- a browser-based interface that lets you train image, audio, and pose classifiers without writing a single line of ML code. You capture training examples from your webcam, click "train," and get a model you can export to ml5.js or TensorFlow.js and use directly in your p5 sketches. The whole workflow happens in the browser. No Python. No pip install. No GPU rental. Just a webcam and a few minutes of pointing it at things.

Transfer learning: why 20 images is enough

Before we touch Teachable Machine, there's one thing worth understanding about why this works at all. Training a neural network from scratch to recognize images takes on the order of a million labeled examples and days of GPU compute. MobileNet, for instance, was trained on the 1000-class ImageNet subset, roughly 1.3 million labeled images (the full ImageNet collection is about 14 million). There's no way 20 webcam snapshots could teach a network anything useful from zero.

Teachable Machine doesn't train from zero. It uses transfer learning. It takes a pre-trained model (MobileNet, which already knows edges, textures, shapes, objects -- the full hierarchy of visual features from thousands of hours of training) and freezes everything except the last layer. Your 20 images only need to teach that final layer how to map MobileNet's existing features to YOUR categories. The network already knows what a circle looks like, what fur texture looks like, what a human hand shape looks like. Your training data just says "when you see this combination of features, call it class A; when you see that combination, call it class B."

// conceptual: what transfer learning does
//
// MobileNet full pipeline:
//   image -> [conv layers] -> [feature extraction] -> [classifier] -> "tabby cat"
//          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^
//          FROZEN (pre-trained, not modified)           REPLACED (your classes)
//
// your 20 images train ONLY the classifier layer
// the convolutional features stay as-is from ImageNet training
// that's why 20 images works -- you're training one small layer, not the whole network
//
// it's like having a translator who already speaks the language fluently
// and you just teach them 5 new vocabulary words
// they don't need to relearn grammar

This is why transfer learning is the practical path for creative coding. You get the benefit of a model trained on over a million images, but you customize the output categories to match your project. MobileNet's feature extraction is your foundation. Your training data is the specialization. The result is a model that sees the world through MobileNet's eyes but classifies it using YOUR labels.

The tradeoff is that your model inherits whatever MobileNet learned and didn't learn. If MobileNet's features don't capture the distinction between your classes -- say, two objects that look identical to MobileNet but are different to you -- then no amount of training data will fix it. Transfer learning is powerful but bounded by the base model's understanding.
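
To make the frozen-versus-trained split concrete, here is a minimal sketch in plain JavaScript. It is not Teachable Machine's actual implementation: `frozenFeatures` is a hypothetical stand-in for MobileNet's feature extractor (just a fixed function here), the four training examples are made up, and `trainHead` fits only a small softmax layer on top using plain gradient descent.

```javascript
// Hypothetical stand-in for a frozen, pre-trained feature extractor.
// A real one (MobileNet) maps an image to a long feature vector;
// this one maps a 2-number "image" to 3 fixed features and is never updated.
function frozenFeatures(input) {
  const [a, b] = input;
  return [a, b, a * b];
}

function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map(v => Math.exp(v - max));
  const sum = exps.reduce((s, v) => s + v, 0);
  return exps.map(v => v / sum);
}

// class probabilities from the head's weights applied to fixed features
function classProbs(weights, biases, features) {
  const logits = weights.map((w, c) =>
    w.reduce((sum, wi, i) => sum + wi * features[i], biases[c]));
  return softmax(logits);
}

// Train ONLY the final layer: one weight row and one bias per class.
function trainHead(examples, numClasses, epochs = 200, lr = 0.5) {
  const dim = frozenFeatures(examples[0].input).length;
  const weights = Array.from({ length: numClasses }, () => new Array(dim).fill(0));
  const biases = new Array(numClasses).fill(0);

  for (let e = 0; e < epochs; e++) {
    for (const { input, label } of examples) {
      const f = frozenFeatures(input);
      const probs = classProbs(weights, biases, f);
      // cross-entropy gradient for a softmax layer: (prob - target) * feature
      for (let c = 0; c < numClasses; c++) {
        const grad = probs[c] - (c === label ? 1 : 0);
        for (let i = 0; i < dim; i++) weights[c][i] -= lr * grad * f[i];
        biases[c] -= lr * grad;
      }
    }
  }
  return { weights, biases };
}

function predict(head, input) {
  const probs = classProbs(head.weights, head.biases, frozenFeatures(input));
  return probs.indexOf(Math.max(...probs));
}

// made-up data: two examples per class is enough for such a small head
const examples = [
  { input: [0.9, 0.1], label: 0 }, { input: [0.8, 0.2], label: 0 },
  { input: [0.1, 0.9], label: 1 }, { input: [0.2, 0.8], label: 1 }
];
const head = trainHead(examples, 2);
console.log(predict(head, [0.85, 0.15]), predict(head, [0.15, 0.85]));
```

Notice that `frozenFeatures` never changes during training. If two of your classes produce the same features, no amount of extra examples will help the head separate them, which is exactly the limitation described above.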

Teachable Machine walkthrough

Go to teachablemachine.withgoogle.com. Click "Get Started." Choose "Image Project." You'll see an interface with two default classes (Class 1 and Class 2) and a "Training" panel.

Here's the workflow:

Add classes. Rename "Class 1" to something meaningful -- "peace sign," "fist," "open palm." Add more classes with the "Add a class" button. 3-5 classes is a good starting point.
Capture examples. Click "Webcam" under each class. Hold up your gesture and click "Record." Capture 20-30 examples per class. Move your hand slightly between captures -- different angles, different positions, different distances from the camera. This variation teaches the model to generalize.
Train. Click "Train Model." This takes 10-30 seconds. The training happens in your browser using TensorFlow.js. No data leaves your machine.
Test. After training, the preview panel shows live classification results. Hold up each gesture and check the confidence scores. If a class is weak (low confidence, frequent misclassification), capture more examples for it and retrain.
Export. Click "Export Model." Choose "TensorFlow.js." Upload the model to get a sharable URL, or download the files. The exported model consists of a model.json file and a set of weight files (.bin).

// the export gives you a model URL like:
// https://teachablemachine.withgoogle.com/models/YOUR_MODEL_ID/
//
// this URL points to:
//   model.json     -- architecture and metadata
//   metadata.json  -- class labels
//   weights.bin    -- learned parameters
//
// ml5 can load this directly
// tensorflow.js can load this directly
// the model runs entirely in the browser -- no server needed

The whole process -- from opening the website to having a working custom classifier -- takes about 5 minutes. That's the magic :-). Five minutes and you have a model that recognizes whatever you pointed your webcam at.

Loading your model in p5

Once you've exported from Teachable Machine, loading the model in p5 is straightforward. ml5's imageClassifier can take a Teachable Machine model URL directly.

let classifier;
let video;
let currentLabel = 'waiting...';
let currentConfidence = 0;

function preload() {
  // replace with your own Teachable Machine model URL
  classifier = ml5.imageClassifier(
    'https://teachablemachine.withgoogle.com/models/YOUR_MODEL_ID/model.json'
  );
}

function setup() {
  createCanvas(640, 520);
  video = createCapture(VIDEO);
  video.size(640, 480);
  video.hide();

  classifyVideo();
}

function classifyVideo() {
  classifier.classify(video, function(results) {
    currentLabel = results[0].label;
    currentConfidence = results[0].confidence;
    classifyVideo();
  });
}

function draw() {
  background(10, 12, 18);
  image(video, 0, 0);

  // display classification
  fill(0, 0, 0, 160);
  noStroke();
  rect(0, 480, 640, 40);

  fill(180, 190, 210);
  textSize(14);
  textFont('monospace');
  text(currentLabel + '  ' +
       (currentConfidence * 100).toFixed(1) + '%', 10, 506);
}

Hold up a gesture from your training set. The label updates in real time. Switch gestures and the label changes. The model you trained in Teachable Machine is now running live in your p5 sketch. Every gesture class you defined is available as a string label and a confidence score. From here it's all about mapping those labels and scores to creative output.
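
One thing the sketch above does not do is ignore weak predictions: it always shows the top label, even at 40% confidence. A small helper can gate on confidence before you map labels to visuals. This is a sketch under assumptions: `pickLabel`, the 0.8 threshold and the `'idle'` fallback are names and values I made up, and `results` is assumed to be sorted best-first like the arrays used above.

```javascript
// Returns the top label only when its confidence clears the threshold,
// otherwise the fallback label.
function pickLabel(results, threshold = 0.8, fallback = 'idle') {
  if (!results || results.length === 0) return fallback;
  const top = results[0];
  return top.confidence >= threshold ? top.label : fallback;
}

// usage with hand-written results
const confident = [
  { label: 'fist', confidence: 0.93 },
  { label: 'open palm', confidence: 0.07 }
];
const unsure = [
  { label: 'fist', confidence: 0.52 },
  { label: 'open palm', confidence: 0.48 }
];
console.log(pickLabel(confident)); // "fist"
console.log(pickLabel(unsure));    // "idle"
```

In the classify callback you would write `currentLabel = pickLabel(results);` instead of reading `results[0].label` directly.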

Gesture-driven generative modes

The real payoff: each classification result triggers a different generative behavior. Train 4-5 gesture classes, and each one activates a different visual world. Your hand becomes a mode selector for your art.

let classifier;
let video;
let currentLabel = 'none';
let currentConf = 0;
let particles = [];
let noiseOff = 0;

function preload() {
  classifier = ml5.imageClassifier(
    'https://teachablemachine.withgoogle.com/models/YOUR_MODEL_ID/model.json'
  );
}

function setup() {
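  // drawModeNoise() indexes pixels[] as width * height,
  // which only holds at a pixel density of 1 (not on retina displays)
  pixelDensity(1);
  createCanvas(800, 600);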
  video = createCapture(VIDEO);
  video.size(160, 120);
  video.hide();

  for (let i = 0; i < 200; i++) {
    particles.push({
      x: random(width),
      y: random(height),
      vx: 0, vy: 0,
      size: random(2, 6)
    });
  }

  classifyLoop();
}

function classifyLoop() {
  classifier.classify(video, function(results) {
    currentLabel = results[0].label;
    currentConf = results[0].confidence;
    classifyLoop();
  });
}

function draw() {
  noiseOff += 0.005;

  // each gesture triggers a different visual mode
  if (currentLabel === 'peace sign') {
    drawModeParticles();
  } else if (currentLabel === 'fist') {
    drawModeNoise();
  } else if (currentLabel === 'open palm') {
    drawModeGeometry();
  } else if (currentLabel === 'thumbs up') {
    drawModeText();
  } else {
    // idle / unrecognized
    background(10, 12, 18, 30);
  }

  // tiny video preview in corner
  image(video, width - 165, 5, 160, 120);

  // label
  fill(0, 0, 0, 150);
  noStroke();
  rect(width - 165, 125, 160, 22);
  fill(170, 180, 200);
  textSize(10);
  textFont('monospace');
  text(currentLabel + ' ' + (currentConf * 100).toFixed(0) + '%',
       width - 160, 140);
}

function drawModeParticles() {
  background(10, 12, 18, 15);
  for (let p of particles) {
    let angle = noise(p.x * 0.005, p.y * 0.005, noiseOff) * TWO_PI * 2;
    p.vx += cos(angle) * 0.3;
    p.vy += sin(angle) * 0.3;
    p.vx *= 0.98;
    p.vy *= 0.98;
    p.x += p.vx;
    p.y += p.vy;

    if (p.x < 0) p.x = width;
    if (p.x > width) p.x = 0;
    if (p.y < 0) p.y = height;
    if (p.y > height) p.y = 0;

    noStroke();
    fill(120, 180, 220, 80);
    circle(p.x, p.y, p.size);
  }
}

function drawModeNoise() {
  loadPixels();
  for (let y = 0; y < height; y += 4) {
    for (let x = 0; x < width; x += 4) {
      let n = noise(x * 0.01, y * 0.01, noiseOff * 2);
      let c = n * 255;
      for (let dy = 0; dy < 4; dy++) {
        for (let dx = 0; dx < 4; dx++) {
          let idx = ((y + dy) * width + (x + dx)) * 4;
          pixels[idx] = c * 0.4;
          pixels[idx + 1] = c * 0.6;
          pixels[idx + 2] = c;
          pixels[idx + 3] = 255;
        }
      }
    }
  }
  updatePixels();
}

function drawModeGeometry() {
  background(10, 12, 18, 20);
  let t = frameCount * 0.02;
  for (let i = 0; i < 12; i++) {
    let x = width / 2 + cos(t + i * 0.5) * 200;
    let y = height / 2 + sin(t * 0.7 + i * 0.3) * 150;
    let sz = 30 + sin(t + i) * 20;

    noFill();
    stroke(180 + sin(t + i) * 40, 140, 100 + cos(t) * 60, 120);
    strokeWeight(1.5);

    if (i % 3 === 0) {
      rect(x - sz / 2, y - sz / 2, sz, sz);
    } else if (i % 3 === 1) {
      circle(x, y, sz);
    } else {
      triangle(x, y - sz / 2, x - sz / 2, y + sz / 2, x + sz / 2, y + sz / 2);
    }
  }
}

function drawModeText() {
  background(10, 12, 18, 10);
  let words = ['pixel', 'noise', 'flow', 'mesh', 'wave', 'pulse', 'drift'];
  for (let i = 0; i < 3; i++) {
    let w = words[Math.floor(random(words.length))];
    let x = random(width);
    let y = random(height);
    let sz = random(10, 40);

    fill(random(100, 200), random(100, 180), random(120, 220), random(30, 80));
    noStroke();
    textSize(sz);
    textFont('monospace');
    text(w, x, y);
  }
}

Peace sign = flowing particles on a noise field. Fist = full-screen Perlin noise texture. Open palm = orbiting geometric shapes. Thumbs up = scattered typography. Each gesture activates a different creative system. Switch between them by changing your hand position. Perform a 60-second composition by flowing through gestures -- particles for 10 seconds, cut to geometry, hold noise for a beat, finish on text. Your body conducts the code.

The transition between modes is abrupt here -- one frame you're in particle mode, the next you're in geometry mode. For smoother transitions you'd crossfade by tracking a blend value that drifts toward the target mode:

let targetMode = 'particles';
let modeBlend = { particles: 1, noise: 0, geometry: 0, text: 0 };

function updateBlend() {
  for (let mode in modeBlend) {
    if (mode === targetMode) {
      modeBlend[mode] = lerp(modeBlend[mode], 1.0, 0.05);
    } else {
      modeBlend[mode] = lerp(modeBlend[mode], 0.0, 0.05);
    }
  }
}

// then in draw, render each mode at its blend opacity
// modes with blend near 0 can be skipped entirely
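
The blend snippet uses mode names ('particles', 'noise') while the classifier returns gesture labels ('peace sign', 'fist'), so something has to translate between them. Here is one possible way to wire that up, written in plain JavaScript so it runs outside p5. `LABEL_TO_MODE`, `mix` and `stepBlend` are hypothetical names, and `mix` is the same linear interpolation formula as p5's `lerp`.

```javascript
// Hypothetical mapping from Teachable Machine labels to mode names.
const LABEL_TO_MODE = {
  'peace sign': 'particles',
  'fist': 'noise',
  'open palm': 'geometry',
  'thumbs up': 'text'
};

// plain linear interpolation
function mix(a, b, t) {
  return a + (b - a) * t;
}

// Move every blend weight a fraction `rate` toward its target:
// 1 for the mode matching `label`, 0 for all the others.
function stepBlend(blend, label, rate = 0.05) {
  const target = LABEL_TO_MODE[label]; // undefined for unknown labels
  const next = {};
  for (const mode in blend) {
    next[mode] = mix(blend[mode], mode === target ? 1 : 0, rate);
  }
  return next;
}

// simulate 60 frames of holding a fist, starting in particle mode
let blend = { particles: 1, noise: 0, geometry: 0, text: 0 };
for (let frame = 0; frame < 60; frame++) {
  blend = stepBlend(blend, 'fist');
}
console.log(blend.particles.toFixed(3), blend.noise.toFixed(3));
```

At a rate of 0.05 the old mode has mostly faded after about a second at 60 fps. An unknown label maps to no mode at all, so every weight drifts to 0 and the canvas settles into the idle state.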

Training a sound classifier

Teachable Machine isn't just for images. Switch to "Audio Project" and you can train a classifier that listens to your microphone and recognizes specific sounds. Claps, snaps, whistles, spoken words, percussive hits -- anything you can produce consistently, the model can learn to distinguish.

The workflow is the same: define classes, record examples, train, export. The difference is input. Instead of webcam snapshots, you're recording 1-second audio clips for each class. Capture 20-30 examples of each sound, varying the intensity and timing slightly. The model learns the spectral pattern (the frequency distribution over time) that characterizes each sound.

let soundClassifier;
let currentSound = 'silence';
let soundConf = 0;

function preload() {
  let options = { probabilityThreshold: 0.7 };
  soundClassifier = ml5.soundClassifier(
    'https://teachablemachine.withgoogle.com/models/YOUR_AUDIO_MODEL/model.json',
    options
  );
}

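function setup() {
  createCanvas(800, 600);
  // continuous classification on microphone input
  // (recent ml5 releases name this classifyStart(); check the docs for your version)
  soundClassifier.classify(gotSoundResult);
}

function gotSoundResult(error, results) {
  // older ml5 releases call back with (error, results);
  // newer ones pass only the results array, so accept both
  if (Array.isArray(error)) {
    results = error;
    error = null;
  }
  if (error) {
    console.log(error);
    return;
  }
  if (!results || results.length === 0) return;
  currentSound = results[0].label;
  soundConf = results[0].confidence;
}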

function draw() {
  // different visual response per sound class
  if (currentSound === 'clap') {
    // burst of particles
    background(10, 12, 18);
    for (let i = 0; i < 50; i++) {
      let angle = random(TWO_PI);
      let r = random(50, 250);
      let x = width / 2 + cos(angle) * r;
      let y = height / 2 + sin(angle) * r;
      noStroke();
      fill(220, 180, 80, random(60, 180));
      circle(x, y, random(3, 12));
    }
  } else if (currentSound === 'snap') {
    // sharp lines
    background(10, 12, 18, 40);
    stroke(160, 200, 220, 150);
    strokeWeight(1);
    for (let i = 0; i < 8; i++) {
      let x1 = random(width);
      let y1 = random(height);
      let x2 = x1 + random(-200, 200);
      let y2 = y1 + random(-200, 200);
      line(x1, y1, x2, y2);
    }
  } else if (currentSound === 'whistle') {
    // smooth waves
    background(10, 12, 18, 10);
    noFill();
    stroke(100, 180, 200, 60);
    strokeWeight(2);
    for (let w = 0; w < 5; w++) {
      beginShape();
      for (let x = 0; x < width; x += 10) {
        let y = height / 2 + sin(x * 0.02 + frameCount * 0.05 + w * 0.5) *
                (100 + w * 30);
        vertex(x, y);
      }
      endShape();
    }
  } else {
    // silence / background noise
    background(10, 12, 18, 5);
  }

  // sound label
  fill(0, 0, 0, 150);
  noStroke();
  rect(0, height - 30, 200, 30);
  fill(160);
  textSize(10);
  textFont('monospace');
  text(currentSound + ' ' + (soundConf * 100).toFixed(0) + '%', 10, height - 12);
}

Clap and particles burst from the center. Snap and sharp lines crack across the canvas. Whistle and smooth sine waves drift. Each sound class triggers a distinct visual response. The mapping is immediate -- the sound classifier runs continuously on microphone input, no button presses needed. Sound-reactive art without any audio analysis code. The ML model handles all the pattern recognition. You just define the classes and the visual responses.

The probabilityThreshold option filters out weak predictions. At 0.7, the model only reports a classification when it's at least 70% confident. Below that threshold, it reports nothing (or "background noise" if you trained a background class). This prevents the visuals from flickering between classes when the input is ambiguous.
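
A threshold is one way to fight flicker. Another, which is my own addition and not an ml5 feature, is to smooth the label stream itself: keep the last few labels and report the most frequent one. `LabelSmoother` and its window size of 5 are hypothetical choices.

```javascript
// Keeps the last `size` labels and reports the most frequent one.
class LabelSmoother {
  constructor(size = 5) {
    this.size = size;
    this.history = [];
  }

  // add the newest label and return the current majority label
  push(label) {
    this.history.push(label);
    if (this.history.length > this.size) this.history.shift();

    const counts = {};
    for (const l of this.history) counts[l] = (counts[l] || 0) + 1;

    // on a tie, the label that entered the window first wins
    let best = this.history[0];
    for (const l in counts) {
      if (counts[l] > counts[best]) best = l;
    }
    return best;
  }
}

// usage: a single stray 'snap' in a run of claps gets voted out
const smoother = new LabelSmoother(5);
const stream = ['clap', 'clap', 'snap', 'clap', 'clap', 'whistle', 'whistle', 'whistle'];
const smoothed = stream.map(label => smoother.push(label));
console.log(smoothed);
```

The price is latency: a real change of sound only shows up once it holds the majority of the window, so keep the window small for percussive sounds.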

Training a pose classifier

Teachable Machine's third project type is pose. Instead of raw images or audio, it uses PoseNet to detect body keypoints and classifies based on the pose itself. Train on body positions: "standing," "sitting," "arms up," "T-pose," "one hand raised." The model doesn't see what you're wearing or what's behind you -- it only sees the skeleton. That makes it more robust to different environments and appearances.

let poseClassifier;
let video;
let currentPose = 'unknown';
let poseConf = 0;
let bgHue = 0;
let targetHue = 0;
let trailPoints = [];

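function preload() {
  // NOTE: a Teachable Machine pose export classifies PoseNet features, not raw pixels.
  // If this loader cannot run your pose model, fall back to the code snippet that
  // Teachable Machine shows in its export dialog for pose projects.
  poseClassifier = ml5.imageClassifier(
    'https://teachablemachine.withgoogle.com/models/YOUR_POSE_MODEL/model.json'
  );
}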

function setup() {
  createCanvas(800, 600);
  colorMode(HSB, 360, 100, 100, 100);
  video = createCapture(VIDEO);
  video.size(160, 120);
  video.hide();

  classifyPose();
}

function classifyPose() {
  poseClassifier.classify(video, function(results) {
    currentPose = results[0].label;
    poseConf = results[0].confidence;

    // map pose to target hue
    if (currentPose === 'arms up') targetHue = 200;
    else if (currentPose === 'T-pose') targetHue = 40;
    else if (currentPose === 'standing') targetHue = 120;
    else if (currentPose === 'sitting') targetHue = 280;
    else targetHue = 0;

    classifyPose();
  });
}

function draw() {
  bgHue = lerp(bgHue, targetHue, 0.03);

  background(bgHue, 20, 8, 8);

  // trail system driven by current pose
  if (currentPose === 'arms up') {
    // particles rise
    for (let i = 0; i < 3; i++) {
      trailPoints.push({
        x: random(width), y: height,
        vx: random(-1, 1), vy: random(-3, -1),
        life: 120, hue: bgHue
      });
    }
  } else if (currentPose === 'T-pose') {
    // particles expand from center
    for (let i = 0; i < 3; i++) {
      let angle = random(TWO_PI);
      trailPoints.push({
        x: width / 2, y: height / 2,
        vx: cos(angle) * random(1, 4),
        vy: sin(angle) * random(1, 4),
        life: 90, hue: bgHue
      });
    }
  }

  // update and draw trails
  for (let i = trailPoints.length - 1; i >= 0; i--) {
    let p = trailPoints[i];
    p.x += p.vx;
    p.y += p.vy;
    p.life--;

    let alpha = map(p.life, 0, 120, 0, 60);
    noStroke();
    fill(p.hue, 60, 80, alpha);
    circle(p.x, p.y, map(p.life, 0, 120, 1, 6));

    if (p.life <= 0) {
      trailPoints.splice(i, 1);
    }
  }

  // limit trail size
  while (trailPoints.length > 500) {
    trailPoints.shift();
  }

  // HUD
  image(video, 5, 5, 120, 90);
  fill(0, 0, 0, 60);
  rect(0, height - 30, 250, 30);
  fill(0, 0, 90);
  textSize(10);
  textFont('monospace');
  text('pose: ' + currentPose + ' ' + (poseConf * 100).toFixed(0) + '%',
       10, height - 12);
}

Raise both arms and particles float upward, the background shifting toward blue. Stand in T-pose and particles explode outward from the center in warm gold. Stand normally and the background settles to green with no particle emission. Sit down and it goes purple. Your body position drives the color, the particle behavior, and the overall mood. It's an installation piece -- the viewer IS the controller, and their pose shapes the visual environment around them.
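
One detail in that sketch: `lerp(bgHue, targetHue, 0.03)` treats hue as a plain number, so a jump from purple (280) to the fallback hue (0) sweeps back through blue, green and gold on the way. If you would rather take the short way around the color wheel, a helper along these lines could replace the plain lerp. `lerpHue` is a hypothetical name, and it assumes both hues are in the 0-360 range.

```javascript
// Interpolate between two hues (degrees, 0-360) along the shorter arc.
function lerpHue(from, to, t) {
  // signed difference in the range [-180, 180)
  const delta = ((to - from + 540) % 360) - 180;
  // step toward the target and wrap back into [0, 360)
  return (from + delta * t + 360) % 360;
}

console.log(lerpHue(280, 0, 0.5));   // halfway from purple to red via magenta
console.log(lerpHue(120, 200, 0.5)); // no wrap needed, behaves like a plain lerp
```

In the pose sketch that would be `bgHue = lerpHue(bgHue, targetHue, 0.03);` in `draw()`.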

Data quality: what makes a good training set

The quality of your custom model depends entirely on the training data. Garbage in, garbage out. A few practical things I've learned from training these:

Variety in examples matters more than quantity. 20 diverse examples beats 50 identical ones. For a gesture class, vary the angle, the distance from camera, the background, the lighting. The model needs to learn the GESTURE, not "hand at exactly this position in exactly this lighting."

Include what you DON'T want. Train a "background" or "nothing" class with examples of empty space, random objects, neutral poses. Without this, the model has to choose between your defined classes for every frame, even when none of them match. A background class gives it an "other" option.

Test with unseen examples. After training, test with poses and conditions the model hasn't seen. If accuracy drops, you need more variety in your training data, not just more of the same.

// practical training strategy for a 5-class gesture model:
//
// class 1: "peace sign"    -- 25 examples, varied angle + distance
// class 2: "fist"          -- 25 examples, varied angle + distance
// class 3: "open palm"     -- 25 examples, varied angle + distance
// class 4: "thumbs up"     -- 25 examples, varied angle + distance
// class 5: "background"    -- 30 examples, empty desk, random objects, face without gesture
//
// total: ~130 examples, 5 minutes of capture
// training time: ~15 seconds
// result: model that works in your room under your lighting
//
// to make it work in OTHER rooms:
// capture examples in different locations, different lighting
// the more variety, the more robust the model

Overfitting is real. With too few examples or too similar examples, the model memorizes the training data instead of learning the pattern. Classic sign: perfect accuracy on training data, terrible accuracy on new inputs. The fix is always more variety. Different backgrounds. Different distances. Different lighting. The model generalizes from variety.
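
To put a number on "test with unseen examples", you can note down what you actually showed the camera and what the model answered, then tally it. A minimal sketch, with made-up sample data and hypothetical function names:

```javascript
// Build a confusion matrix from labeled test results.
// Each entry looks like { actual: 'fist', predicted: 'open palm' }.
function confusionMatrix(samples) {
  const matrix = {};
  for (const { actual, predicted } of samples) {
    if (!matrix[actual]) matrix[actual] = {};
    matrix[actual][predicted] = (matrix[actual][predicted] || 0) + 1;
  }
  return matrix;
}

// fraction of samples where the prediction matched the real class
function accuracy(samples) {
  if (samples.length === 0) return 0;
  const correct = samples.filter(s => s.actual === s.predicted).length;
  return correct / samples.length;
}

// made-up test run: four gestures the model never saw during training
const testRun = [
  { actual: 'fist', predicted: 'fist' },
  { actual: 'fist', predicted: 'open palm' },
  { actual: 'open palm', predicted: 'open palm' },
  { actual: 'peace sign', predicted: 'peace sign' }
];
console.log(accuracy(testRun));
console.log(confusionMatrix(testRun));
```

A high score on the examples you trained with and a much lower one on a test run like this is the overfitting signature described above. The matrix tells you which class to go re-record.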

Model as creative instrument

Here's the conceptual shift that matters. A custom-trained model isn't just a classifier. It's a creative instrument you designed. You chose what the categories are. You chose what gestures or objects or sounds to recognize. You defined the input language. And you defined what each recognized input triggers visually. The entire pipeline -- from physical input to visual output, with ML in the middle -- is your creative design.

Compare this to using MobileNet directly. MobileNet's 1000 categories are someone else's taxonomy of the world. "Tabby cat" and "laptop" and "coffee mug" are useful labels but they're generic. Your custom model's categories are specific to YOUR project. "Dance move A" and "dance move B" and "audience cheering" and "silence" -- categories that only make sense in the context of your piece. The model becomes a custom sensor designed for a specific artwork.

// the model as instrument concept:
//
// traditional approach:
//   physical input -> code analysis -> visual output
//   (e.g., microphone -> FFT -> frequency bars)
//
// ML approach:
//   physical input -> trained model -> semantic label -> visual output
//   (e.g., microphone -> sound classifier -> "clap" -> particle burst)
//
// the difference: the ML approach understands WHAT the input means
// not just its raw signal properties
// an FFT can tell you the frequency spectrum of a clap
// a sound classifier can tell you it's A CLAP
// the semantic understanding enables higher-level creative mappings

This distinction -- raw signal vs semantic understanding -- is what makes ML models useful for creative coding beyond what we could do with pure signal analysis. Episode 19 did sound-reactive visuals using FFT (amplitude and frequency). That worked for continuous mappings: louder = bigger, higher pitch = different color. But FFT can't tell a clap from a snap from a whistle. A trained classifier can. It operates at the level of meaning, not signal. And meaning-level control opens up a different kind of creative interaction.

Continuous classification vs discrete events

The examples above treat classification as continuous -- every frame gets a label. But sometimes you want discrete events: trigger something ONCE when a gesture is detected, not continuously while it's held. You need to detect the transition from "not this class" to "this class."

let previousLabel = '';
let currentLabel = '';
let eventLog = [];

function gotResult(results) {
  previousLabel = currentLabel;
  currentLabel = results[0].label;

  // detect transitions
  if (currentLabel !== previousLabel) {
    onClassChange(currentLabel, previousLabel);
  }
}

function onClassChange(newClass, oldClass) {
  // this fires ONCE when the classification changes
  let timestamp = millis();
  eventLog.push({ time: timestamp, from: oldClass, to: newClass });

  // trigger one-shot effects
  if (newClass === 'clap') {
    spawnExplosion(width / 2, height / 2);
  } else if (newClass === 'snap') {
    invertColors();
  } else if (newClass === 'whistle') {
    startWaveAnimation();
  }
}

function spawnExplosion(cx, cy) {
  for (let i = 0; i < 100; i++) {
    let angle = random(TWO_PI);
    let speed = random(2, 8);
    particles.push({
      x: cx, y: cy,
      vx: cos(angle) * speed,
      vy: sin(angle) * speed,
      life: random(30, 80),
      r: random(200, 255),
      g: random(120, 200),
      b: random(60, 120)
    });
  }
}

Now a clap spawns one explosion per detection, not a continuous stream. A snap inverts colors once. A whistle starts one animation. The classification still runs continuously, but the creative response only fires on transitions. This gives you both modes: continuous control (the current label drives ongoing behavior) and discrete events (transitions trigger one-shot effects). Most interesting interactive pieces use both.
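
A caveat with plain transition detection: if the classifier flickers between two labels for a moment, every flicker counts as a transition and fires an effect. One possible fix is to accept a new label only after it has shown up several times in a row. This is a sketch, not part of ml5. `createStableDetector` and the default of 3 consecutive results are assumptions to tune for your model.

```javascript
// Calls onChange(newLabel, oldLabel) only after a new label has been
// seen `minCount` times in a row.
function createStableDetector(onChange, minCount = 3) {
  let stable = '';    // last label that was accepted
  let candidate = ''; // label currently trying to take over
  let streak = 0;     // consecutive results seen for the candidate

  return function update(label) {
    if (label === stable) {
      // back on the accepted label: forget the candidate
      candidate = '';
      streak = 0;
      return;
    }
    if (label === candidate) {
      streak++;
    } else {
      candidate = label;
      streak = 1;
    }
    if (streak >= minCount) {
      const old = stable;
      stable = label;
      candidate = '';
      streak = 0;
      onChange(stable, old);
    }
  };
}

// usage: feed it a noisy label stream and record what fires
const events = [];
const update = createStableDetector((to, from) => events.push(from + '->' + to));
['clap', 'snap', 'clap', 'clap', 'clap', 'clap', 'snap', 'clap'].forEach(label => update(label));
console.log(events);
```

In the sketch above you would call `update(results[0].label)` inside `gotResult` and pass `onClassChange` as the callback.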

Iterative improvement: the training loop

Your first model won't be perfect. It'll confuse similar gestures, misfire on background noise, lose accuracy when the lighting changes. That's normal. The real workflow is iterative:

Train a first version with 20-30 examples per class
Deploy it in your sketch
Use it. Watch where it fails
Go back to Teachable Machine. Add more examples specifically for the failure cases
Retrain
Repeat

// logging model behavior for improvement
let confusionLog = [];

function gotResult(results) {
  currentLabel = results[0].label;
  currentConf = results[0].confidence;

  // log low-confidence predictions
  if (currentConf < 0.7) {
    confusionLog.push({
      time: millis(),
      predicted: currentLabel,
      confidence: currentConf,
      allResults: results.map(function(r) {
        return { label: r.label, conf: r.confidence.toFixed(3) };
      })
    });
  }
}

function keyPressed() {
  if (key === 'l') {
    // dump confusion log to console
    console.log('Low-confidence predictions:');
    for (let entry of confusionLog) {
      console.log(entry.time.toFixed(0) + 'ms: ' +
                  entry.predicted + ' (' + (entry.confidence * 100).toFixed(1) + '%)');
      for (let r of entry.allResults) {
        console.log('  ' + r.label + ': ' + r.conf);
      }
    }
  }
}

Press L and you get a log of every low-confidence prediction. If the model consistently confuses "peace sign" with "open palm" at 55% vs 45% confidence, you know exactly what to fix: capture more examples that clearly distinguish those two gestures. Maybe exaggerate the difference -- wider fingers for peace sign, flatter palm for open palm. The confusion log tells you what the model struggles with, and that guides your next round of training data.
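
Scrolling through a long console dump gets old quickly. A small summary that counts which pairs of labels compete most often points straight at the classes that need new examples. This sketch assumes each log entry has the `allResults` array built above, sorted best-first. `summarizeConfusions` and the sample log are made up for illustration.

```javascript
// Count how often each (top label, runner-up) pair shows up in the log.
// The two labels are sorted so "A vs B" and "B vs A" land in the same bucket.
function summarizeConfusions(log) {
  const counts = {};
  for (const entry of log) {
    if (entry.allResults.length < 2) continue;
    const [first, second] = entry.allResults;
    const key = [first.label, second.label].sort().join(' vs ');
    counts[key] = (counts[key] || 0) + 1;
  }
  // most frequent pair first
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

// hand-written sample log in the same shape as confusionLog
const sampleLog = [
  { predicted: 'peace sign', allResults: [
    { label: 'peace sign', conf: '0.550' }, { label: 'open palm', conf: '0.450' }] },
  { predicted: 'open palm', allResults: [
    { label: 'open palm', conf: '0.520' }, { label: 'peace sign', conf: '0.480' }] },
  { predicted: 'fist', allResults: [
    { label: 'fist', conf: '0.600' }, { label: 'thumbs up', conf: '0.400' }] }
];
const summary = summarizeConfusions(sampleLog);
console.log(summary);
```

The first entry of the summary is the pair to re-record first.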

The creative exercise: performable ML composition

Alright, time for the big one. Train a 5-class gesture classifier in Teachable Machine. Export it to ml5. Build a p5 sketch where each gesture triggers a different generative mode with smooth crossfade transitions. Then perform a 60-second composition by switching between gestures. Your body movements conduct an AI you trained yourself.

let classifier, video;
let modes = {};
let activeMode = 'idle';
let modeAlpha = {};
let particles = [];
let wavePhase = 0;

function preload() {
  classifier = ml5.imageClassifier(
    'https://teachablemachine.withgoogle.com/models/YOUR_MODEL/model.json'
  );
}

function setup() {
  createCanvas(800, 600);
  video = createCapture(VIDEO);
  video.size(160, 120);
  video.hide();

  modeAlpha = {
    'peace sign': 0,
    'fist': 0,
    'open palm': 0,
    'thumbs up': 0,
    'point': 0,
    'idle': 0
  };

  // pre-populate particles
  for (let i = 0; i < 150; i++) {
    particles.push({
      x: random(width), y: random(height),
      vx: 0, vy: 0,
      homeX: random(width), homeY: random(height),
      size: random(2, 5)
    });
  }

  classifyLoop();
}

function classifyLoop() {
  classifier.classify(video, function(results) {
    activeMode = results[0].label;
    classifyLoop();
  });
}

function draw() {
  // crossfade mode alphas
  for (let mode in modeAlpha) {
    if (mode === activeMode) {
      modeAlpha[mode] = lerp(modeAlpha[mode], 255, 0.06);
    } else {
      modeAlpha[mode] = lerp(modeAlpha[mode], 0, 0.06);
    }
  }

  background(10, 12, 18);
  wavePhase += 0.02;

  // render each mode at its current alpha
  if (modeAlpha['peace sign'] > 5) {
    drawPeaceMode(modeAlpha['peace sign']);
  }
  if (modeAlpha['fist'] > 5) {
    drawFistMode(modeAlpha['fist']);
  }
  if (modeAlpha['open palm'] > 5) {
    drawPalmMode(modeAlpha['open palm']);
  }
  if (modeAlpha['thumbs up'] > 5) {
    drawThumbsMode(modeAlpha['thumbs up']);
  }
  if (modeAlpha['point'] > 5) {
    drawPointMode(modeAlpha['point']);
  }

  // video preview
  tint(255, 120);
  image(video, width - 170, 10, 160, 120);
  noTint();

  fill(0, 0, 0, 140);
  noStroke();
  rect(width - 170, 130, 160, 18);
  fill(170, 180, 200);
  textFont('monospace');
  textSize(9);
  text(activeMode, width - 165, 143);
}

function drawPeaceMode(alpha) {
  // flowing particles in a noise field
  for (let p of particles) {
    let angle = noise(p.x * 0.003, p.y * 0.003, wavePhase) * TWO_PI * 2;
    p.vx += cos(angle) * 0.4;
    p.vy += sin(angle) * 0.4;
    p.vx *= 0.96;
    p.vy *= 0.96;
    p.x += p.vx;
    p.y += p.vy;
    if (p.x < 0) p.x += width;
    if (p.x > width) p.x -= width;
    if (p.y < 0) p.y += height;
    if (p.y > height) p.y -= height;

    noStroke();
    fill(100, 170, 230, alpha * 0.3);
    circle(p.x, p.y, p.size);
  }
}

function drawFistMode(alpha) {
  // dense noise grid
  let step = 8;
  for (let y = 0; y < height; y += step) {
    for (let x = 0; x < width; x += step) {
      let n = noise(x * 0.008, y * 0.008, wavePhase * 3);
      noStroke();
      fill(n * 200, n * 100, 20, alpha * n * 0.8);
      rect(x, y, step, step);
    }
  }
}

function drawPalmMode(alpha) {
  // concentric expanding rings
  noFill();
  for (let i = 0; i < 15; i++) {
    let r = ((frameCount * 2 + i * 40) % 500);
    let a = map(r, 0, 500, alpha * 0.8, 0);
    stroke(180, 220, 200, a);
    strokeWeight(1.5);
    circle(width / 2, height / 2, r * 2);
  }
}

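// words currently on screen; each one keeps its own position and size
let risingWords = [];

function drawThumbsMode(alpha) {
  // rising text fragments: spawn a word every 8 frames, then float it upward
  let words = ['create', 'train', 'learn', 'build', 'see', 'hear', 'move'];
  if (frameCount % 8 === 0) {
    risingWords.push({
      word: words[(frameCount / 8) % words.length],
      x: random(width), y: height + 20,
      size: random(12, 36)
    });
  }

  noStroke();
  textFont('monospace');
  for (let i = risingWords.length - 1; i >= 0; i--) {
    let w = risingWords[i];
    w.y -= 1.5;
    fill(200, 180, 120, alpha * 0.4);
    textSize(w.size);
    text(w.word, w.x, w.y);
    // drop words that have left the top of the canvas
    if (w.y < -40) risingWords.splice(i, 1);
  }
}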

function drawPointMode(alpha) {
  // laser lines from center
  let numLines = 24;
  for (let i = 0; i < numLines; i++) {
    let angle = (TWO_PI / numLines) * i + wavePhase;
    let len = 200 + noise(i * 0.5, wavePhase) * 200;
    stroke(220, 100, 100, alpha * 0.5);
    strokeWeight(1);
    let cx = width / 2;
    let cy = height / 2;
    line(cx, cy, cx + cos(angle) * len, cy + sin(angle) * len);
  }
}

Five gestures, five visual worlds, smooth crossfades between them. Hold peace sign and particles flow across a noise field. Switch to fist and a fiery noise grid fades in while the particles fade out. Open your palm and concentric rings pulse from the center. Thumbs up fills the screen with rising words. Point and laser lines radiate outward. The transitions blend because each mode renders at its own alpha level, and the alphas smoothly interpolate toward the active mode.

Perform it like a musical instrument. Hold each gesture for 5-10 seconds, then transition. Some transitions look better than others -- particles fading into rings is smooth, noise grid fading into laser lines is dramatic. Discover the good transitions through practice. Record the screen (episode 20) and you have a video piece conducted entirely by hand gestures, interpreted by a model you trained in five minutes.

Where this leads

Everything we've done with classification so far takes a single input stream and maps it to labels. One camera, one microphone, one body. But the real power of custom models emerges when you combine them. An image classifier reading the webcam simultaneously with a sound classifier listening to the microphone gives you two independent classification channels. Gesture A + sound B triggers a different response than gesture A + sound C. The combination space grows multiplicatively.
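To make that multiplication concrete, here's a minimal sketch of a combination lookup. It assumes you already have the current label from each classifier as a plain string. The label names, the mode names and the helper functions (comboKey, pickMode, comboCount) are all made up for illustration, so swap in whatever classes you actually trained.

```javascript
// Hypothetical mapping from (gesture, sound) pairs to visual modes.
// These labels and mode names are placeholders, not part of any library.
const COMBO_MODES = {
  'fist+clap': 'explode',
  'fist+snap': 'freeze',
  'palm+clap': 'ripple',
  'palm+snap': 'invert'
};

// Build a lookup key from the two independent classifier outputs.
function comboKey(gestureLabel, soundLabel) {
  return gestureLabel + '+' + soundLabel;
}

// Return the mode mapped to this pair, or a fallback when the pair is unmapped.
function pickMode(gestureLabel, soundLabel, fallback = 'idle') {
  const mode = COMBO_MODES[comboKey(gestureLabel, soundLabel)];
  return mode === undefined ? fallback : mode;
}

// Number of distinct pairs two classifiers can produce.
function comboCount(numGestures, numSounds) {
  return numGestures * numSounds;
}
```

With 5 gestures and 3 sounds, comboCount(5, 3) gives 15 possible pairs. You don't have to map all of them: anything missing from the table just falls through to the fallback mode.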

And we haven't talked about what happens when ML models process the audio signal itself -- not classifying sounds but understanding speech, recognizing musical patterns, decomposing audio into components. Sound is a rich domain for creative coding, and ML models that understand it go beyond the simple clap/snap/whistle classifier we built today. That's fertile territory.

There's also the question of what happens when you move beyond classification entirely. Classification maps inputs to discrete labels. But what if you want to map inputs to continuous vectors -- to find similarity, to cluster, to navigate a space of meanings? That's embeddings, and it's a different way of thinking about ML for creative purposes. Instead of "this is a clap" you get "this sound is close to that sound in meaning-space." Proximity rather than category.
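A tiny sketch of what "proximity rather than category" can look like in code. It assumes each sound (or image) has already been turned into an array of numbers by some embedding model, which isn't shown here. Cosine similarity is one common way to compare such vectors, not the only one, and the function names and the toy 2D vectors in the usage note are purely illustrative.

```javascript
// Cosine similarity between two equal-length vectors.
// Returns a value from -1 (opposite) to 1 (same direction).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('vectors must have the same length');
  }
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as "not similar"
  if (magA === 0 || magB === 0) return 0;
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Find the stored item whose vector points closest to the query vector.
// items is an array of { name, vector } objects (a hypothetical shape).
function nearest(query, items) {
  let best = null;
  let bestScore = -Infinity;
  for (const item of items) {
    const score = cosineSimilarity(query, item.vector);
    if (score > bestScore) {
      bestScore = score;
      best = item;
    }
  }
  return best;
}
```

Given a stored 'clap' at [1, 0] and a stored 'whistle' at [0, 1], a query of [0.9, 0.1] comes back as 'clap'. There's no classifier involved, the query simply sits closer to that vector. Real embeddings have hundreds of dimensions, but the math is the same.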

It comes down to this...

Teachable Machine is a browser-based tool from Google that lets you train custom image, audio, and pose classifiers without writing ML code. Capture 20-30 examples per class from your webcam or microphone, click train, and export to ml5.js or TensorFlow.js. The whole workflow takes about 5 minutes. No Python, no GPU, no dataset preparation.
Transfer learning is why this works. Teachable Machine doesn't train from scratch -- it takes a pre-trained MobileNet (trained on millions of images) and only retrains the final classification layer. Your 20 images teach the model how to map existing visual features to YOUR categories. The heavy lifting was done during MobileNet's original training.
Custom gesture classifiers turn your body into a creative controller. Train 4-5 gesture classes ("peace sign", "fist", "open palm", "thumbs up", "point") and map each to a different generative mode in p5. Switch gestures to switch visual worlds. Smooth crossfade transitions between modes make it performable -- your hand movements conduct the code.
Sound classifiers respond to specific audio events: claps, snaps, whistles, words, percussive hits. The model classifies microphone input in real time based on spectral patterns learned from your training examples. Sound-triggered art without manual FFT analysis or amplitude thresholding -- the ML handles the pattern recognition.
Pose classifiers analyze body position rather than raw images. Train on "standing," "arms up," "T-pose," "sitting" and the model classifies based on skeleton keypoints, making it robust to different clothing, backgrounds, and lighting. Each pose drives different particle behaviors, colors, and visual modes.
Data quality matters more than quantity. 20 diverse examples (varied angles, distances, lighting, backgrounds) beat 50 identical ones. Include a "background" or "nothing" class to give the model an "other" option. Test with unseen conditions. If accuracy drops, add more variety to the training data.
The iterative training loop: train first version, deploy in your sketch, observe failures, add targeted training examples for weak cases, retrain, repeat. Log low-confidence predictions to identify what the model struggles with. Confusion between classes guides your next round of data capture.
Continuous classification (every frame gets a label) vs discrete events (trigger once on class transition). Most interactive pieces use both: the current label drives ongoing visual behavior while transitions trigger one-shot effects like particle explosions or color inversions. Track the previous label and compare to detect transitions.
The model-as-instrument concept: instead of raw signal analysis (FFT gives frequencies, amplitude gives loudness), a trained classifier gives semantic understanding (this is A CLAP, not just "a transient spike at 2kHz"). Semantic control enables higher-level creative mappings than raw signal analysis alone.
Overfitting is the main failure mode -- the model memorizes training data instead of learning patterns. Classic symptom: perfect on training examples, terrible on new inputs. The fix is always more variety in training data, not more quantity of similar examples. Different backgrounds, distances, lighting conditions force the model to learn the actual pattern.

Eight episodes into the ML arc now. Classification watches (episode 92). Body tracking follows (93-95). Deep classification branches (96). Style transfer paints (97). Pix2Pix translates (98). GANs generate from nothing (99). And now we've trained our own models -- classification systems custom-built for our specific creative vision. Each episode the ML gets more personal. We went from using pre-built models as black boxes to training our own models with our own data for our own purposes. The model isn't something we use anymore. It's something we make.

Sallukes! Thanks for reading.

@femdev
