Learn Creative Coding (#101) - ML Audio: Speech and Sound Recognition

Last episode we trained our own models in Teachable Machine. We pointed a webcam at our hands and taught a network to tell a peace sign from a fist. We even built a little sound classifier that knew a clap from a snap from a whistle. That sound model was fun, but it was also a bit of a teaser. We only scratched the surface of what ML can do with audio, because we trained it ourselves on three or four crude sounds. There is a whole world below that.

So this episode is all about the ear. Not the eye, the ear. We are going to make sketches that listen -- that recognize spoken words, that feel the texture of a sound, that fire exactly on the beat. And the nice part is, most of it runs right in the browser with no Python and no server, same as everything else in this arc.

Here is the thing I want you to hold onto while we go. Way back in episode 19 we did sound-reactive visuals with an FFT. That gave us amplitude and frequency bins. Loud meant big, high pitch meant a different color. It worked, but it was dumb in a specific way -- the FFT could tell you "there is energy at 2 kHz" but it could never tell you "someone just said the word stop." It heard the signal but not the meaning. ML closes that gap. That is the whole story of this episode, really :-).

A model that already knows 18 words

ml5 ships with a pre-trained speech model called SpeechCommands18w. It was trained on Google's Speech Commands dataset, many thousands of short recordings of people saying single words, and it recognizes eighteen of them: "up", "down", "left", "right", "go", "stop", "yes", "no" and the digits zero through nine. On top of that it has two filler classes, one for background noise and one for unknown words. You do not train anything. You just load it and it works.

let soundModel;
let heardLabel = 'listening...';
let heardConfidence = 0;

function preload() {
  // pre-trained, no model URL of our own needed
  soundModel = ml5.soundClassifier('SpeechCommands18w');
}

function setup() {
  createCanvas(640, 360);
  textFont('monospace');
  // start the continuous listening loop
  soundModel.classify(gotCommand);
}

function gotCommand(error, results) {
  if (error) {
    console.log(error);
    return;
  }
  heardLabel = results[0].label;
  heardConfidence = results[0].confidence;
}

function draw() {
  background(10, 12, 18);
  fill(180, 200, 230);
  textSize(40);
  textAlign(CENTER, CENTER);
  text(heardLabel, width / 2, height / 2 - 20);
  textSize(13);
  fill(120, 140, 170);
  text((heardConfidence * 100).toFixed(0) + '%', width / 2, height / 2 + 30);
}

Run that, allow microphone access, and say "up". Say "stop". The word lands on the canvas in big letters with a confidence score. No speech-to-text engine, no cloud API, no API key. The model is small enough to live in your page and run on the audio coming off your mic. That is genuinely wild when you sit with it -- a neural network recognizing human speech, in JavaScript, on a laptop.

It is limited to those eighteen words, sure. But for creative coding that is plenty. We do not need it to transcribe a sentence. We need a handful of reliable triggers, and "up/down/left/right/go/stop/yes/no" is a surprisingly rich little vocabulary to build an interaction around.

Voice as a controller

So let us actually do something with those words. The most direct idea: your voice drives the animation. Say "up" and the particles rise. Say "down" and they fall. "Stop" freezes everything, "go" sets it moving again. No keyboard, no mouse -- you talk to the sketch and it listens.

let soundModel;
let particles = [];
let flowDir = -1;   // -1 = up, +1 = down
let running = true;

function preload() {
  soundModel = ml5.soundClassifier('SpeechCommands18w', { probabilityThreshold: 0.7 });
}

function setup() {
  createCanvas(720, 540);
  for (let i = 0; i < 300; i++) {
    particles.push({ x: random(width), y: random(height), spd: random(0.5, 2.5) });
  }
  soundModel.classify(gotCommand);
}

function gotCommand(error, results) {
  if (error) return;
  let word = results[0].label;

  // only the words we care about, ignore the rest
  if (word === 'up')   flowDir = -1;
  if (word === 'down') flowDir = 1;
  if (word === 'stop') running = false;
  if (word === 'go')   running = true;
}

function draw() {
  background(10, 12, 18, 40);

  for (let p of particles) {
    if (running) {
      p.y += flowDir * p.spd;
      // wrap around the edges
      if (p.y < 0) p.y = height;
      if (p.y > height) p.y = 0;
    }
    noStroke();
    fill(120, 180, 230, 150);
    circle(p.x, p.y, 3);
  }

  fill(160, 180, 210);
  textFont('monospace');
  textSize(12);
  text(running ? 'flowing ' + (flowDir < 0 ? 'up' : 'down') : 'frozen', 12, 24);
}

That probabilityThreshold in the options is doing real work here. At 0.7 the model only calls a word when it is at least 70% sure. Below that it stays quiet. Without it, the model fires constantly on background noise and you get particles flipping direction every time a chair creaks. With it, the sketch only reacts when you clearly speak a command. Tune that number to your room -- noisy space, push it higher.
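To make the thresholding idea concrete, here is a minimal sketch in plain JavaScript. ml5 applies probabilityThreshold for you, so you do not need this in the sketch above; it only mimics the decision so you can see it. The names pickCommand and COMMANDS are made up for this illustration, and the result objects just copy the label/confidence shape used in gotCommand.

```javascript
// Hypothetical helper (not part of ml5): decide whether a result is worth acting on.
// `results` mirrors the shape used above: [{ label, confidence }, ...], best guess first.
function pickCommand(results, allowed, minConfidence) {
  if (!results || results.length === 0) return null;
  const top = results[0];
  if (top.confidence < minConfidence) return null;   // not sure enough: stay quiet
  if (!allowed.includes(top.label)) return null;     // sure, but not a word we use
  return top.label;
}

const COMMANDS = ['up', 'down', 'stop', 'go'];

// a clearly spoken "stop" passes
const clearStop = pickCommand([{ label: 'stop', confidence: 0.93 }], COMMANDS, 0.7);
// a creaking chair that vaguely resembles "up" does not
const creak = pickCommand([{ label: 'up', confidence: 0.41 }], COMMANDS, 0.7);
// a confident filler class is ignored too
const noise = pickCommand([{ label: '_background_noise_', confidence: 0.98 }], COMMANDS, 0.7);
```

The second check is the part the threshold alone does not give you: even a very confident result gets dropped when it is not one of the words the sketch actually uses.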

See where this is going? You have just built a hands-free controller. For an installation that matters a lot. Nobody wants to touch a shared keyboard in a gallery. But everybody is willing to say "go" to a screen.

Below words: the continuous feel of sound

Speech commands are discrete. A word either lands or it does not. But a lot of the most beautiful sound-reactive work lives in the continuous layer -- the stuff that is always changing, the brightness and roughness and loudness of whatever the mic picks up. This is where episode 19 lives, and ML does not replace it so much as sit on top of it.

Let me bring back a little of that FFT world, because we need it as raw material. p5.sound gives us amplitude and a frequency spectrum without any ML at all.

let mic, fft, amp;

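function setup() {
  createCanvas(720, 400);
  mic = new p5.AudioIn();
  mic.start();

  fft = new p5.FFT();
  fft.setInput(mic);

  amp = new p5.Amplitude();
  amp.setInput(mic);
}

// most browsers keep audio suspended until a user gesture,
// so click the canvas once if nothing moves
function mousePressed() {
  userStartAudio();
}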

function draw() {
  background(10, 12, 18);

  let spectrum = fft.analyze();      // 1024 frequency bins, 0-255
  let level = amp.getLevel();        // overall loudness, ~0 to 1

  noFill();
  stroke(100, 180, 220, 160);
  beginShape();
  for (let i = 0; i < spectrum.length; i += 4) {
    let x = map(i, 0, spectrum.length, 0, width);
    let y = map(spectrum[i], 0, 255, height, 0);
    vertex(x, y);
  }
  endShape();

  // loudness as a pulse
  fill(220, 160, 90, 120);
  noStroke();
  circle(width / 2, height / 2, level * 800);
}

Good. That is the signal. Now, from that same spectrum you can pull out a couple of descriptive numbers that say something about the quality of a sound, not just its raw shape. The most useful one for art is the spectral centroid -- basically the center of mass of the frequency spectrum. It tells you whether a sound is "bright" (lots of high frequency energy, like a cymbal or an "sss") or "dark" (mostly low energy, like a thud or an "ooo").

// spectral centroid: where is the "weight" of the spectrum?
// low centroid = dark/muffled, high centroid = bright/sharp
function spectralCentroid(spectrum) {
  let weightedSum = 0;
  let total = 0;
  for (let i = 0; i < spectrum.length; i++) {
    weightedSum += i * spectrum[i];
    total += spectrum[i];
  }
  if (total === 0) return 0;
  return weightedSum / total;   // a bin index, higher = brighter
}

// map brightness to color, loudness to size
function draw() {
  // colorMode() persists between frames, so switch back to RGB
  // before the background call, which uses RGB values
  colorMode(RGB, 255);
  background(10, 12, 18, 30);
  let spectrum = fft.analyze();
  let level = amp.getLevel();

  let centroid = spectralCentroid(spectrum);
  let brightness = map(centroid, 0, spectrum.length, 0, 1);

  let hue = lerp(220, 40, brightness);   // dark sound = blue, bright = gold
  colorMode(HSB, 360, 100, 100, 100);
  noStroke();
  fill(hue, 70, 90, 40);
  circle(width / 2, height / 2, 40 + level * 600);
}

A whistle pushes the centroid up and the circle goes gold. A low hum keeps it down and the circle goes blue. The mapping is continuous and smooth -- exactly the opposite of the discrete word triggers from before. You want both tools in the box.
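One practical wrinkle: the centroid comes back as a bin index, and the 1024 bins run all the way up to half the sample rate. Everyday sounds sit low in that range, so mapping against the full bin count can leave the color barely moving. Below is a small sketch of the arithmetic, assuming a 44100 Hz sample rate (check getAudioContext().sampleRate in your own sketch) and a hand-picked 200 to 3000 Hz window. The helper names and the example bin are made up for illustration.

```javascript
// Assumptions: 1024 bins (what fft.analyze() returned above) spread evenly
// from 0 Hz up to half the sample rate; 44100 Hz is a common default, not a given.
function binToHz(bin, sampleRate, binCount) {
  return bin * (sampleRate / 2) / binCount;
}

// Hypothetical variant of the brightness mapping: clamp to a narrower,
// hand-picked frequency window instead of the full 0..binCount range.
function brightnessFromCentroid(centroidBin, sampleRate, binCount, loHz, hiHz) {
  const hz = binToHz(centroidBin, sampleRate, binCount);
  const t = (hz - loHz) / (hiHz - loHz);
  return Math.min(1, Math.max(0, t));
}

const whistleBin = 93;                                  // made-up centroid for illustration
const whistleHz = binToHz(whistleBin, 44100, 1024);     // roughly 2 kHz
const fullRange = whistleBin / 1024;                    // what map(centroid, 0, 1024, 0, 1) gives
const narrow = brightnessFromCentroid(whistleBin, 44100, 1024, 200, 3000);
```

With the full range this example lands below 0.1, which is still firmly blue. With the narrower window it lands around 0.64, well on its way to gold. The window itself is a taste decision, so tune it by ear and by eye.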

MFCCs: the fingerprint of a sound

Now we go one layer deeper, into my favorite bit. There is a feature called the MFCC -- Mel-Frequency Cepstral Coefficients, what a mouthful -- and it is the workhorse behind basically all speech and audio ML. You do not need the full math to use it creatively. Here is the intuition.

A single number like the centroid describes one property of a sound. An MFCC describes the whole shape of its spectral envelope as a small vector, usually 13 numbers. Think of it as a compact fingerprint. Two different vowels, "ah" and "ee", have clearly different MFCC vectors even at the same pitch and loudness. The model that recognized "up" and "down" earlier? Under the hood it is comparing MFCC-like features. We are just going to grab those 13 numbers and paint with them.

// conceptual: extracting MFCCs with Meyda, a small audio feature library
// (Meyda runs in the browser alongside p5.sound)

let analyzer;
let mfcc = new Array(13).fill(0);

function setupMeyda(audioContext, sourceNode) {
  analyzer = Meyda.createMeydaAnalyzer({
    audioContext: audioContext,
    source: sourceNode,
    bufferSize: 512,
    featureExtractors: ['mfcc'],
    callback: function(features) {
      // features.mfcc is a 13-element array; keep the old values if a frame has none
      if (features && features.mfcc) mfcc = features.mfcc;
    }
  });
  analyzer.start();
}

Once you have those 13 numbers updating continuously (Meyda calls back once per audio buffer rather than once per draw() frame), you have a 13-parameter generative control surface that responds to the texture of sound. The trick is to map each coefficient to a visual property. One drives color, one drives size, one drives rotation, and so on. Each distinct sound produces a distinct visual because each distinct sound has a distinct fingerprint.

// map the 13 MFCC coefficients to 13 visual properties
// each sound becomes a unique visual signature
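function draw() {
  // colorMode() persists between frames, so reset to RGB for the background
  colorMode(RGB, 255);
  background(10, 12, 18, 25);
  translate(width / 2, height / 2);

  // normalize roughly -- clamp each coefficient from about -40..40 into 0..1
  // (the real range varies by coefficient and microphone, so tune these bounds)
  let m = mfcc.map(v => constrain(map(v, -40, 40, 0, 1), 0, 1));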

  let petals = floor(map(m[1], 0, 1, 5, 16));
  let radius = map(m[2], 0, 1, 40, 200);
  let hue = map(m[3], 0, 1, 0, 360);
  let spin = map(m[4], 0, 1, -0.05, 0.05);

  colorMode(HSB, 360, 100, 100, 100);
  rotate(frameCount * spin);

  for (let i = 0; i < petals; i++) {
    let a = (TWO_PI / petals) * i;
    let r = radius * (0.6 + m[5 + (i % 6)] * 0.6);
    let x = cos(a) * r;
    let y = sin(a) * r;
    noStroke();
    fill(hue, 70, 90, 60);
    circle(x, y, 10 + m[6] * 40);
  }
}

Hum a low note and you get a small, slow, blue flower. Hiss like a snake and it flips to a fast bright many-petalled thing. Say "ooo" then "eee" and watch it morph between two shapes. You are not reacting to volume anymore -- you are reacting to the actual character of the sound. Honestly the first time this clicked for me at work I sat there making silly noises at my screen for like ten minutes. Worth it :-).
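The fixed -40..40 range in that sketch is a guess, and different coefficients really do live in different ranges. One alternative is to let each coefficient learn its own range as you go. This is a minimal sketch of that idea, not something Meyda provides; createRangeTracker and normalizeMfcc are made-up names, and the demo uses 3 numbers instead of 13 to stay readable.

```javascript
// Hypothetical helper: remembers the smallest and largest value seen per
// coefficient and rescales each new vector into 0..1 against that range.
function createRangeTracker(size) {
  const lo = new Array(size).fill(Infinity);
  const hi = new Array(size).fill(-Infinity);
  return function normalize(vec) {
    return vec.map((v, i) => {
      if (v < lo[i]) lo[i] = v;
      if (v > hi[i]) hi[i] = v;
      const span = hi[i] - lo[i];
      return span === 0 ? 0.5 : (v - lo[i]) / span;  // no range yet: sit in the middle
    });
  };
}

const normalizeMfcc = createRangeTracker(3);   // 3 instead of 13 to keep the demo short
const first  = normalizeMfcc([-10, 0, 5]);     // nothing to compare against yet
const second = normalizeMfcc([10, 0, -5]);     // now coefficients 0 and 2 have a range
const third  = normalizeMfcc([0, 0, 0]);       // halfway between the extremes seen so far
```

In the sketch you would build the tracker once with size 13 and call it on mfcc inside draw() instead of the map/constrain line. The catch is that one freak loud frame stretches the range for good, so for a long-running installation you may want the extremes to slowly relax back.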

Onsets: catching the exact moment a sound starts

Here is a problem the continuous features do not solve. Say you want a flash exactly when a drum hits, or a particle burst the instant someone claps. Amplitude rising is a clue, but a naive "if loud, flash" check is jittery -- it fires many frames in a row across one beat. What you actually want is onset detection: the precise moment a new sound begins, fired once.

The simplest robust version watches the spectral energy and triggers when it jumps sharply above a running average.

let energyHistory = [];
let historyLen = 30;

function detectOnset(spectrum) {
  // current frame energy
  let energy = 0;
  for (let i = 0; i < spectrum.length; i++) {
    energy += spectrum[i] * spectrum[i];
  }

  // running average of recent energy
  let avg = energyHistory.length
    ? energyHistory.reduce((a, b) => a + b, 0) / energyHistory.length
    : energy;

  // keep history at fixed length
  energyHistory.push(energy);
  if (energyHistory.length > historyLen) energyHistory.shift();

  // onset when this frame jumps well above the recent average
  return energy > avg * 1.6 && energy > 5000;
}

The 1.6 multiplier is the sensitivity. A frame's energy has to be at least 60% above the recent average to count as an onset, which filters out the steady background and only catches the sharp starts -- the beat, the clap, the consonant. The second condition, energy > 5000, is an absolute floor so that tiny wobbles in a near-silent room never count. Drop the multiplier toward 1.3 for a hair trigger, push it to 2.0 if it is firing on noise.
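One caveat on "fired once": the running average needs a few frames to catch up, so a long, loud hit can keep the raw check true for several frames in a row. A common fix is a short cooldown after each accepted onset. This is a sketch of that idea under my own assumptions; createOnsetGate is a made-up name and the three-frame hold is arbitrary.

```javascript
// Hypothetical wrapper: turns a per-frame true/false into single events by
// ignoring further hits for `holdFrames` frames after each accepted one.
function createOnsetGate(holdFrames) {
  let cooldown = 0;
  return function gate(rawOnset) {
    if (cooldown > 0) {
      cooldown--;
      return false;
    }
    if (rawOnset) {
      cooldown = holdFrames;
      return true;
    }
    return false;
  };
}

const onsetGate = createOnsetGate(3);
// one clap that keeps the raw detector high for three frames, then a second clap later
const rawHits = [false, true, true, true, false, false, true, false];
const fired = rawHits.map(hit => onsetGate(hit));
```

Here fired comes out true only at the first frame of each clap. In a sketch you would wrap the call, as in onsetGate(detectOnset(spectrum)), and pick the hold length to be a bit shorter than the fastest rhythm you want to catch.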

Now wire it to a visual event.

let bursts = [];

function draw() {
  background(10, 12, 18, 35);
  let spectrum = fft.analyze();

  if (detectOnset(spectrum)) {
    // one burst per onset, at a random spot
    bursts.push({
      x: random(width), y: random(height),
      r: 0, life: 1.0
    });
  }

  for (let i = bursts.length - 1; i >= 0; i--) {
    let b = bursts[i];
    b.r += 6;
    b.life -= 0.03;
    noFill();
    stroke(220, 180, 90, b.life * 200);
    strokeWeight(2);
    circle(b.x, b.y, b.r);
    if (b.life <= 0) bursts.splice(i, 1);
  }
}

Every clap drops a ring that expands and fades. Clap a rhythm and you get a visual rhythm to match, locked to the actual onsets instead of smeared across them. This is the discrete, event-based counterpart to the continuous MFCC painting -- and like I said in episode 100 about classification, the most interesting pieces almost always use both at once. Continuous features set the mood; onsets punctuate it.

Spoken words as accumulating data

Let me show you one more idea that I find quietly lovely. Take the speech model from the top, and instead of using each word as a one-shot trigger, collect the words. Let them pile up on screen as text. The machine listens, and what it thinks it hears becomes the artwork -- a slow conversation between you and a model that only knows eighteen words.

let soundModel;
let saidWords = [];

function preload() {
  soundModel = ml5.soundClassifier('SpeechCommands18w', { probabilityThreshold: 0.75 });
}

function setup() {
  createCanvas(720, 540);
  textFont('monospace');
  soundModel.classify(gotWord);
}

function gotWord(error, results) {
  if (error) return;
  let w = results[0].label;
  // skip the model's filler classes
  if (w === '_background_noise_' || w === '_unknown_') return;

  saidWords.push({
    text: w,
    x: random(width),
    y: random(height),
    size: random(16, 54),
    life: 1.0
  });
  // keep the canvas from overflowing
  if (saidWords.length > 40) saidWords.shift();
}

function draw() {
  background(10, 12, 18, 20);
  noStroke();
  for (let word of saidWords) {
    word.life -= 0.002;
    fill(180, 200, 230, word.life * 200);
    textSize(word.size);
    text(word.text, word.x, word.y);
  }
}

Each word fades out over roughly eight seconds (life drops by 0.002 per frame, so about 500 frames at 60 fps), and the canvas becomes a drifting record of the last few moments of conversation -- "yes yes no stop go up up" hanging in the dark like overheard fragments. It reads as poetry even though the vocabulary is tiny, because the rhythm of what you choose to say carries meaning the model never understands. The gap between what the machine hears and what you mean is the whole emotional content of the piece.
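If you want to tune how long the words linger, the arithmetic is simple enough to pull into a helper, and you can drop fully faded words while you are at it. A minimal sketch, with ageWords and framesToFade as made-up names and 60 fps as an assumption about your frame rate:

```javascript
// Hypothetical helper: age every word by `decay` and keep only the ones still visible.
// Returns a new array; the word objects themselves are updated in place.
function ageWords(words, decay) {
  for (const w of words) w.life -= decay;
  return words.filter(w => w.life > 0);
}

// roughly how many frames a word survives at a given decay, starting from life = 1.0
function framesToFade(decay) {
  return Math.round(1 / decay);
}

let demoWords = [
  { text: 'yes', life: 1.0 },
  { text: 'stop', life: 0.001 }   // almost gone
];
demoWords = ageWords(demoWords, 0.002);

const fadeFrames = framesToFade(0.002);       // 500 frames
const fadeSeconds = fadeFrames / 60;          // about 8.3 s if draw() runs at 60 fps
```

Turn it around to design the fade: for a word that should hang around for a full minute at 60 fps you would want a decay of about 1 / 3600 per frame.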

Two ears are better: combining sound with sight

We spent episodes 92 through 96 teaching sketches to see -- pose, hands, classification. Now they can hear. Nothing stops you from doing both at once. An image classifier on the webcam plus a sound classifier on the mic gives you two independent input channels, and the combinations multiply.

let imgModel, soundModel, video;
let seenLabel = '', heardLabel = '';

function preload() {
  imgModel = ml5.imageClassifier('MobileNet');
  soundModel = ml5.soundClassifier('SpeechCommands18w', { probabilityThreshold: 0.7 });
}

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.size(640, 480);
  video.hide();

  classifyImage();
  soundModel.classify((err, r) => { if (!err) heardLabel = r[0].label; });
}

function classifyImage() {
  imgModel.classify(video, (err, r) => {
    if (!err) seenLabel = r[0].label;
    classifyImage();
  });
}

function draw() {
  image(video, 0, 0);
  // react to the COMBINATION of what it sees and hears
  if (heardLabel === 'go') {
    filter(INVERT);   // voice command transforms the seen image
  }
  fill(0, 0, 0, 150);
  noStroke();
  rect(0, height - 40, width, 40);
  fill(200);
  textFont('monospace');
  textSize(12);
  text('see: ' + seenLabel + '   hear: ' + heardLabel, 10, height - 16);
}

Show the camera an object and say a word, and the piece responds to both. Multi-sensory input for multi-modal art. The model that sees does not know about the model that hears -- you are the one wiring their outputs together into a single response. That wiring is the creative act.
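The sketch above only checks the heard word, so here is one way the wiring could look when both channels really matter. Treat it as a sketch under my own assumptions: the rule table, the effect names and the example labels are all invented, and what MobileNet actually calls your object will differ.

```javascript
// Hypothetical rule table: each rule looks at BOTH channels before it fires.
// The effect names are placeholders for whatever your sketch actually does.
const RULES = [
  { heard: 'go',   seenIncludes: 'cup', effect: 'invert' },
  { heard: 'go',   seenIncludes: '',    effect: 'pulse'  },   // "go" with anything else in view
  { heard: 'stop', seenIncludes: '',    effect: 'freeze' }
];

// Returns the effect of the first matching rule, or 'none'.
function chooseEffect(seenLabel, heardLabel) {
  const seen = seenLabel.toLowerCase();
  for (const rule of RULES) {
    if (rule.heard === heardLabel && seen.includes(rule.seenIncludes)) {
      return rule.effect;
    }
  }
  return 'none';
}

const cupAndGo    = chooseEffect('coffee mug, cup', 'go');   // both channels match the first rule
const bananaAndGo = chooseEffect('banana', 'go');            // same word, different object
const bananaAndYes = chooseEffect('banana', 'yes');          // no rule for this word
```

Order matters, since the first matching rule wins, so put the specific combinations above the catch-all ones. In draw() you would call chooseEffect(seenLabel, heardLabel) and switch on the result.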

The boring-but-important bits: latency and privacy

Two practical things before the exercise, because they bite you in real installations.

First, latency. Audio analysis is not instant, because every layer needs a buffer of sound before it can decide. The speech model works on roughly a second of audio at a time; FFT-based onset work needs far less. So there is always a delay between you making a sound and the visual reacting, somewhere around 100 to 300 milliseconds for the fast features, and it can feel longer for whole spoken words. For ambient, exploratory pieces nobody notices. For anything rhythmic, where you are trying to lock visuals to a beat, that delay is real and you have to design around it -- sometimes by predicting ahead, sometimes by just embracing the lag as part of the feel.

// cheap trick to smooth over latency on continuous features:
// ease toward the newest reading with lerp, so the visual glides
// into changes instead of snapping late (this hides the lag, it does not remove it)
let smoothLevel = 0;

function draw() {
  let level = amp.getLevel();
  smoothLevel = lerp(smoothLevel, level, 0.2);  // 0.2 = how fast it catches up
  // use smoothLevel instead of level for size/motion
  circle(width / 2, height / 2, 40 + smoothLevel * 600);
}
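A fixed lerp amount like 0.2 behaves differently at 30 fps than at 60 fps, which matters when the installation machine is slower than your laptop. This is a small sketch of the arithmetic behind it, assuming plain exponential smoothing; the helper names are made up, and 44100 Hz is an assumed sample rate.

```javascript
// Time for one audio buffer to fill, in milliseconds.
function bufferMs(samples, sampleRate) {
  return (samples / sampleRate) * 1000;
}

// Per-frame lerp amount that gives the same glide regardless of frame rate.
// tauMs is the time it takes to close about 63% of the gap to the target.
function lerpAmount(dtMs, tauMs) {
  return 1 - Math.exp(-dtMs / tauMs);
}

// The time constant implied by a fixed lerp amount at a given frame time.
function impliedTauMs(amount, dtMs) {
  return -dtMs / Math.log(1 - amount);
}

const frameMs = 1000 / 60;
const tau = impliedTauMs(0.2, frameMs);        // about 75 ms for lerp(..., 0.2) at 60 fps
const at30fps = lerpAmount(1000 / 30, tau);    // about 0.36 keeps the same feel at 30 fps
const meydaBuffer = bufferMs(512, 44100);      // about 11.6 ms for the 512-sample buffer above
```

In a p5 sketch you could feed it the built-in deltaTime, as in lerp(smoothLevel, level, lerpAmount(deltaTime, 75)). Keep in mind that the smoothing adds its own lag on top of the buffer time, so a longer tau looks calmer but reacts later.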

Second, privacy, and this one is not optional. The microphone is a microphone. Asking for it means asking to listen to a room full of people. So: get consent (the browser prompt helps, but in an installation put up a sign too), process everything locally -- ml5 and Meyda both run in the browser, nothing leaves the machine -- and never, ever store the raw audio. Recognize, react, discard. If you would not want it recorded, do not record it. People relax around a listening artwork once they understand it is not keeping anything.

The exercise: a room that you can hear

Allez, let us put the whole episode into one piece. Build an installation that responds to the ambient sound of a room across all the layers we covered. Silence is calm -- slow drifting particles. Talking makes them cluster and flow. A clap, caught by onset detection, bursts them outward. The spectral brightness drives the color. It is the acoustic life of the room made visible.

let mic, fft, amp;
let particles = [];
let energyHistory = [];

function setup() {
  createCanvas(800, 600);
  mic = new p5.AudioIn();
  mic.start();
  fft = new p5.FFT();
  fft.setInput(mic);
  amp = new p5.Amplitude();
  amp.setInput(mic);

  for (let i = 0; i < 400; i++) {
    particles.push({
      x: random(width), y: random(height),
      vx: 0, vy: 0
    });
  }
  colorMode(HSB, 360, 100, 100, 100);
}

function draw() {
  background(220, 30, 6, 12);

  let spectrum = fft.analyze();
  let level = amp.getLevel();
  let centroid = spectralCentroid(spectrum);
  let hue = map(centroid, 0, spectrum.length, 220, 40);

  // onset = clap = outward shove from the center
  let shove = detectOnset(spectrum);

  for (let p of particles) {
    if (level < 0.02) {
      // near silence: gentle drift
      p.vx += random(-0.05, 0.05);
      p.vy += random(-0.05, 0.05);
    } else {
      // talking: pull toward center, with a sideways push for the swirl
      let a = atan2(height / 2 - p.y, width / 2 - p.x);
      p.vx += cos(a) * level * 2 - sin(a) * level;
      p.vy += sin(a) * level * 2 + cos(a) * level;
    }
    if (shove) {
      let a = atan2(p.y - height / 2, p.x - width / 2);
      p.vx += cos(a) * 6;
      p.vy += sin(a) * 6;
    }

    p.vx *= 0.94;
    p.vy *= 0.94;
    p.x += p.vx;
    p.y += p.vy;

    // wrap
    if (p.x < 0) p.x += width;
    if (p.x > width) p.x -= width;
    if (p.y < 0) p.y += height;
    if (p.y > height) p.y -= height;

    noStroke();
    fill(hue, 60, 90, 50);
    circle(p.x, p.y, 3);
  }
}

function spectralCentroid(spectrum) {
  let ws = 0, tot = 0;
  for (let i = 0; i < spectrum.length; i++) { ws += i * spectrum[i]; tot += spectrum[i]; }
  return tot === 0 ? 0 : ws / tot;
}

function detectOnset(spectrum) {
  let e = 0;
  for (let i = 0; i < spectrum.length; i++) e += spectrum[i] * spectrum[i];
  let avg = energyHistory.length ? energyHistory.reduce((a, b) => a + b, 0) / energyHistory.length : e;
  energyHistory.push(e);
  if (energyHistory.length > 30) energyHistory.shift();
  return e > avg * 1.6 && e > 5000;
}

Set it up in a room and just leave it. When the room is quiet the particles breathe slowly in a cool blue. When people start talking they pull into a warm swirling knot. When someone claps, bang, the whole field blows apart and settles again. The room performs the artwork without anyone touching anything. That is the dream of ambient interactive work -- the audience does not operate it, they just exist near it, and their existence is enough.

Where this is heading

Notice what every example today had in common. The sound came in, the model gave us a label or a vector, and we mapped that to visuals. Classification gives you a category -- "this is the word stop." Features like MFCCs give you a vector -- 13 numbers describing a fingerprint. And that second idea, the vector, turns out to be the bigger one.

Because once a sound is a vector, you can ask how close two sounds are. Not "is this a clap, yes or no" but "this noise is near that noise in some space of meaning." Two sounds that feel similar end up near each other; two that feel different end up far apart. That notion -- meaning as position, similarity as distance -- is a completely different lens on ML, and it does not stop at sound. It works for images, for words, for anything you can turn into a vector. We have brushed against it a few times now with latent spaces and fingerprints. Next time we look straight at it.
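As a tiny taste of what "similarity as distance" means in code, here is a minimal sketch using plain straight-line distance. The vectors are invented 4-number stand-ins, not real MFCC output, and Euclidean distance is just the simplest choice, not necessarily the best one for sound.

```javascript
// Straight-line distance between two equal-length vectors.
function euclideanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Made-up 4-number "fingerprints" (real MFCC vectors would have 13 entries).
const humA = [12, -3, 1, 0];
const humB = [11, -2, 1, 1];
const hiss = [-8, 9, -6, 4];

const humToHum  = euclideanDistance(humA, humB);   // small: the two hums sit close together
const humToHiss = euclideanDistance(humA, hiss);   // large: the hiss lives somewhere else
```

The two hums end up under 2 units apart while the hiss is more than 24 away, and that single number is something you can map to visuals just like level or centroid.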

What it comes down to...

ml5's SpeechCommands18w is a pre-trained, in-browser speech model that recognizes about 18 short words ("up", "down", "left", "right", "go", "stop", "yes", "no", digits 0-9) with no training and no server. Load it, call classify, and you have a hands-free voice controller -- perfect for installations where nobody wants to touch a shared keyboard
The probabilityThreshold option is essential: it makes the model stay quiet unless it is confident (e.g. 0.7 = only fire above 70%). Without it the model misfires on background noise constantly. Tune it to your room -- noisier space means higher threshold
Episode 19's FFT world still matters. Amplitude and the frequency spectrum give you continuous signal data, and from the spectrum you can compute the spectral centroid -- the "brightness" of a sound (high = sharp/sss, low = dark/ooo). Map it to color for smooth continuous response
MFCCs (Mel-Frequency Cepstral Coefficients) are the real workhorse: a 13-number vector that fingerprints the texture of a sound. Different vowels and timbres produce different vectors even at the same pitch. Map the 13 coefficients to 13 visual properties and every distinct sound gets a distinct visual signature. Meyda extracts them in the browser
Onset detection catches the exact moment a sound starts -- fired once, not smeared across frames. Watch spectral energy and trigger when it jumps sharply above a running average (e.g. 1.6x). This is the discrete, event-based counterpart to continuous features. Great for locking bursts and flashes to claps and beats
The best pieces use both layers at once: continuous features (MFCCs, centroid, level) set the ongoing mood, while discrete onsets and word triggers punctuate it with events. Same lesson as classification in episode 100 -- continuous control plus one-shot events
Spoken words can accumulate as data instead of acting as one-shot triggers. Let recognized words pile up and fade on the canvas and you get a drifting visual poem. The gap between what the model hears and what you mean is the emotional content
Combine ears and eyes: an image classifier on the webcam (episodes 92-96) plus a sound classifier on the mic gives two independent channels whose combinations multiply. The models do not know about each other -- wiring their outputs into one response is the creative act
Practical reality: audio classification has ~100-300ms latency (it needs a buffer of sound to decide). Fine for ambient work, tricky for rhythmic work -- design around it or ease into changes with lerp. And privacy is non-negotiable: get consent, process locally (ml5 and Meyda never send audio anywhere), never store raw audio. Recognize, react, discard
The ambient room installation ties it together: silence drifts cool and slow, talking pulls particles into a warm swirl, claps blow them apart via onset detection, brightness drives the hue. The room performs the piece with nobody touching anything

Nine episodes into the ML arc now. The network learned to watch (episode 92), to read bodies (93-95), to classify deeply (96), to paint (97), to translate (98), to generate from nothing (99), to be trained by us (100). And now it listens. Each step the machine takes on another human sense. We started by giving it eyes. This episode we gave it ears. And under both, quietly, the same idea keeps showing up -- turning messy real-world input into a vector of numbers we can compute with. That idea is about to take center stage.

Sallukes! Thanks for reading.

Hive account: @femdev