Learn Creative Coding (#93) - Body as Input: Pose Detection

Last episode we loaded our first ML model in the browser -- MobileNet image classification through ml5.js. We fed webcam frames to a neural network and got back labels and confidence scores. "Tabby cat, 87%." "Coffee cup, 72%." The model saw the world as flat categories. Useful, but limited. It knows WHAT is in front of the camera but not WHERE anything is in the frame, and it has no concept of the human body as a structure. A person is just another category alongside "bookshop" and "volcano."

Pose detection changes that completely. Instead of a single label per frame, you get 17 specific points on the human body -- nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles -- each with x, y coordinates and a confidence score. In real time. From a webcam. The model doesn't just say "there's a person." It says "the left wrist is at pixel (342, 218) with 94% confidence." That spatial precision turns the body into an input device. Your arms become sliders. Your posture becomes a parameter. Dance becomes data.

This is where creative coding gets physical. Everything we've built -- particle systems (episode 11), noise fields (episode 12), color palettes (episode 28), the confidence-driven aesthetics from last episode -- all of it can now respond to your body in real time. The mouse and keyboard were always abstract intermediaries. Pose detection removes them. You move, the art moves. Your body IS the controller.

Setting up pose detection with ml5

ml5 wraps the MoveNet model (Google's successor to PoseNet) in the same friendly API we used last episode. The setup pattern is almost identical to image classification -- load a model, feed it video, get results in a callback. The difference is what comes back: instead of labels, you get an array of keypoints.

let video;
let bodyPose;
let poses = [];

function preload() {
  // load the MoveNet model before setup runs
  bodyPose = ml5.bodyPose('MoveNet', { flipped: true });
}

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO, { flipped: true });
  video.size(640, 480);
  video.hide();

  // start detecting poses from the video
  bodyPose.detectStart(video, function(results) {
    poses = results;
  });
}

function draw() {
  image(video, 0, 0);
}

The flipped: true option mirrors the video horizontally so you see yourself as in a mirror -- raise your right hand and the right side of the screen responds. Without it, left and right are swapped, which feels unnatural when you're standing in front of a webcam.

The detectStart function creates a continuous detection loop, similar to the classifyFrame recursion from last episode but handled internally by ml5. Every time the model finishes processing a frame, it calls your callback with the latest results and immediately starts processing the next frame. Detection runs slower than the sketch's own frame rate: depending on hardware, expect somewhere in the 10-25 fps range.

What the model returns

Each element in the poses array represents one detected person. Within each pose, there's a keypoints array with 17 entries. Here's what the structure looks like:

// what poses[0] contains
{
  keypoints: [
    { x: 320, y: 95,  confidence: 0.95, name: 'nose' },
    { x: 308, y: 82,  confidence: 0.91, name: 'left_eye' },
    { x: 335, y: 83,  confidence: 0.88, name: 'right_eye' },
    { x: 290, y: 92,  confidence: 0.72, name: 'left_ear' },
    { x: 352, y: 90,  confidence: 0.68, name: 'right_ear' },
    { x: 265, y: 175, confidence: 0.92, name: 'left_shoulder' },
    { x: 380, y: 180, confidence: 0.93, name: 'right_shoulder' },
    { x: 230, y: 265, confidence: 0.89, name: 'left_elbow' },
    { x: 410, y: 270, confidence: 0.85, name: 'right_elbow' },
    { x: 215, y: 340, confidence: 0.82, name: 'left_wrist' },
    { x: 430, y: 350, confidence: 0.78, name: 'right_wrist' },
    { x: 285, y: 365, confidence: 0.90, name: 'left_hip' },
    { x: 360, y: 368, confidence: 0.88, name: 'right_hip' },
    { x: 275, y: 450, confidence: 0.45, name: 'left_knee' },
    { x: 370, y: 455, confidence: 0.42, name: 'right_knee' },
    { x: 270, y: 540, confidence: 0.30, name: 'left_ankle' },
    { x: 365, y: 545, confidence: 0.28, name: 'right_ankle' }
  ]
}

Notice how confidence drops for lower-body keypoints. That's typical when sitting at a desk -- the webcam can see your face and torso clearly but your legs are partially or fully occluded. The model is guessing where your knees and ankles are based on what it knows about human body proportions. This matters for creative coding: don't blindly use all 17 keypoints. Check confidence first.
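
One way to make that check routine is to collect the trustworthy keypoints once per frame and only work with those. This is a minimal sketch in plain JavaScript, assuming the pose shape shown above; `reliableKeypoints` and its 0.3 default are illustrative choices of mine, not part of ml5.

```javascript
// Hypothetical helper (not part of ml5): keep only the keypoints the model
// is reasonably sure about, indexed by name for easy lookup.
function reliableKeypoints(pose, minConfidence = 0.3) {
  const byName = {};
  for (const kp of pose.keypoints) {
    if (kp.confidence >= minConfidence) {
      byName[kp.name] = kp;
    }
  }
  return byName;
}

// Trimmed-down sample pose using values from the listing above
const samplePose = {
  keypoints: [
    { x: 320, y: 95,  confidence: 0.95, name: 'nose' },
    { x: 215, y: 340, confidence: 0.82, name: 'left_wrist' },
    { x: 275, y: 450, confidence: 0.45, name: 'left_knee' },
    { x: 365, y: 545, confidence: 0.28, name: 'right_ankle' }
  ]
};

const visible = reliableKeypoints(samplePose);
console.log(Object.keys(visible)); // the right ankle (0.28) is dropped
```

Anything missing from the returned object is a joint you should not drive visuals with on that frame.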

Drawing the skeleton

The "hello world" of pose detection is the stick figure -- connect keypoints with lines to draw a skeleton that follows your body. It's simple, it's satisfying, and it shows you instantly whether the detection is working.

// skeleton connection pairs
const connections = [
  ['left_shoulder', 'right_shoulder'],
  ['left_shoulder', 'left_elbow'],
  ['left_elbow', 'left_wrist'],
  ['right_shoulder', 'right_elbow'],
  ['right_elbow', 'right_wrist'],
  ['left_shoulder', 'left_hip'],
  ['right_shoulder', 'right_hip'],
  ['left_hip', 'right_hip'],
  ['left_hip', 'left_knee'],
  ['left_knee', 'left_ankle'],
  ['right_hip', 'right_knee'],
  ['right_knee', 'right_ankle']
];

function getKeypoint(pose, name) {
  return pose.keypoints.find(function(kp) { return kp.name === name; });
}

function draw() {
  image(video, 0, 0);

  if (poses.length === 0) return;

  const pose = poses[0];

  // draw bones
  stroke(100, 220, 180, 180);
  strokeWeight(3);

  for (const pair of connections) {
    const a = getKeypoint(pose, pair[0]);
    const b = getKeypoint(pose, pair[1]);

    // only draw if both points have decent confidence
    if (a.confidence > 0.3 && b.confidence > 0.3) {
      line(a.x, a.y, b.x, b.y);
    }
  }

  // draw joints
  for (const kp of pose.keypoints) {
    if (kp.confidence > 0.3) {
      noStroke();
      fill(220, 120, 100, 200);
      circle(kp.x, kp.y, 8);
    }
  }
}

The 0.3 confidence threshold filters out noisy keypoints. If a joint is occluded (your arm behind your back, your legs under a desk), the model assigns low confidence, and we skip it rather than drawing a wildly wrong position. You'll see the stick figure appear for your upper body and fade out where the camera can't see. That selective rendering is more honest than pretending the model knows where your hidden joints are.
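
The listing above is a hard cut: a bone is either drawn or skipped. If you want the skeleton to fade gradually as confidence drops, one option is to turn confidence into line alpha. This is a sketch under my own assumptions; `boneAlpha`, the linear ramp and the 180 maximum (matching the stroke alpha used above) are illustrative, not something ml5 or p5 provides.

```javascript
// Hypothetical helper: turn the weaker of two joint confidences into a
// line alpha (0-255). At or below `threshold` the bone is hidden entirely;
// above it the alpha ramps up linearly to `maxAlpha` at confidence 1.
function boneAlpha(confA, confB, threshold = 0.3, maxAlpha = 180) {
  const weakest = Math.min(confA, confB);
  if (weakest <= threshold) return 0;
  const t = (weakest - threshold) / (1 - threshold);
  return Math.round(Math.min(t, 1) * maxAlpha);
}

console.log(boneAlpha(0.9, 0.2));  // 0: one end is below the threshold
console.log(boneAlpha(0.65, 0.9)); // 90: halfway up the ramp
console.log(boneAlpha(1, 1));      // 180: fully confident
```

In the skeleton loop you could then call stroke(100, 220, 180, boneAlpha(a.confidence, b.confidence)) before each line() instead of using a fixed alpha.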

Body as controller: mapping joints to visuals

Here's where it gets interesting. Each keypoint is a continuous x, y value that updates every frame. That means every joint on your body is a pair of parameters you can map to anything. We've been mapping mouse position to visual parameters since episode 6. Now instead of one pointer, you have seventeen.

let bgHue = 0;

function draw() {
  if (poses.length === 0) {
    background(20);
    return;
  }

  const pose = poses[0];
  const leftWrist = getKeypoint(pose, 'left_wrist');
  const rightWrist = getKeypoint(pose, 'right_wrist');
  const nose = getKeypoint(pose, 'nose');

  // left wrist Y controls background hue
  // 0 (red) at the top of the frame, 240 (blue) at the bottom
  if (leftWrist.confidence > 0.4) {
    bgHue = map(leftWrist.y, 0, height, 0, 240);
  }

  // right wrist X controls saturation
  let sat = 50;
  if (rightWrist.confidence > 0.4) {
    sat = map(rightWrist.x, 0, width, 10, 90);
  }

  // nose Y controls brightness
  let bright = 30;
  if (nose.confidence > 0.5) {
    bright = map(nose.y, 0, height, 50, 15);
  }

  // fifth argument sets the alpha range to 0-100, so the text alpha below means 60%
  colorMode(HSB, 360, 100, 100, 100);
  background(bgHue, sat, bright);

  // show the current values
  fill(0, 0, 100, 60);
  noStroke();
  textSize(12);
  textFont('monospace');
  text('hue: ' + bgHue.toFixed(0), 20, 30);
  text('sat: ' + sat.toFixed(0), 20, 48);
  text('bright: ' + bright.toFixed(0), 20, 66);
}

Raise your left arm and the background shifts toward red. Lower it and it goes blue. Move your right hand left to right and saturation sweeps from gray to vivid. Duck your head and the scene darkens. Stand tall and it brightens. Three body parameters controlling three visual parameters. Simple, but the experience of it is genuinely surprising the first time -- your body is directly controlling color without touching anything. Makes sense, right? :-)
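
All three mappings are the same linear remap. Here is a plain-JavaScript sketch of that arithmetic with the output clamped to the target range; `mapClamped` is a name I made up for illustration. In a p5 sketch you would normally just call map(), which also accepts an optional sixth argument to constrain the result.

```javascript
// Plain-JS version of a linear remap like p5's map(), with the result
// clamped to the output range.
function mapClamped(value, inMin, inMax, outMin, outMax) {
  const t = (value - inMin) / (inMax - inMin);
  const clamped = Math.min(Math.max(t, 0), 1);
  return outMin + clamped * (outMax - outMin);
}

const canvasHeight = 480;

// nose halfway down the frame: brightness halfway between 50 and 15
console.log(mapClamped(240, 0, canvasHeight, 50, 15)); // 32.5

// a keypoint estimated below the frame (y = 540) stays at the darkest value
console.log(mapClamped(540, 0, canvasHeight, 50, 15)); // 15
```

Clamping matters because the model will happily estimate joints outside the frame: the sample data earlier lists ankles at y = 540 on a 480-pixel-tall canvas.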

Measuring distances between joints

Individual keypoint positions are useful, but derived measurements are often more expressive. The distance between your wrists tells you how wide your arms are spread. The distance from nose to mid-hip approximates how much of your body is visible. These are continuous values that change smoothly as you move.

function jointDist(pose, nameA, nameB) {
  const a = getKeypoint(pose, nameA);
  const b = getKeypoint(pose, nameB);
  if (a.confidence < 0.3 || b.confidence < 0.3) return -1;
  return dist(a.x, a.y, b.x, b.y);
}

function draw() {
  background(15, 20, 25);

  if (poses.length === 0) return;
  const pose = poses[0];

  // arm spread: distance between wrists
  const armSpread = jointDist(pose, 'left_wrist', 'right_wrist');

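  if (armSpread > 0) {
    // wider spread = more circles, bigger radius
    // (the final `true` clamps the result to the output range)
    const numCircles = Math.floor(map(armSpread, 50, 500, 3, 30, true));
    const maxR = map(armSpread, 50, 500, 20, 200, true);

    noFill();
    // ring colors are HSB values, so switch color mode for this loop only
    colorMode(HSB, 360, 100, 100, 100);

    for (let i = 0; i < numCircles; i++) {
      const t = i / numCircles;
      const r = t * maxR;
      const hue = (t * 180 + frameCount) % 360;
      stroke(hue, 60, 70, 30);
      strokeWeight(1.5);
      ellipse(width / 2, height / 2, r * 2, r * 2);
    }

    // back to RGB for background() and the text below
    colorMode(RGB, 255);
  }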

  // shoulder width for normalization
  const shoulderWidth = jointDist(pose, 'left_shoulder', 'right_shoulder');

  if (shoulderWidth > 0 && armSpread > 0) {
    // ratio: arm spread relative to shoulder width
    // > 2.5 means arms wide open, < 1.0 means arms close together
    const ratio = armSpread / shoulderWidth;

    fill(200, 210, 230, 40);
    noStroke();
    textSize(11);
    textFont('monospace');
    text('spread ratio: ' + ratio.toFixed(2), 20, height - 20);
  }
}

The shoulder width measurement is crucial for normalization. A person standing close to the camera has larger pixel distances between joints than someone far away. Dividing arm spread by shoulder width gives you a ratio that's independent of distance from camera. An arm spread ratio of 2.5 means "arms wide open" regardless of whether you're 1 meter or 3 meters from the webcam. Same principle as normalizing data before mapping it (episode 82).
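
To see the normalization at work, here is a small worked example in plain JavaScript. The joint coordinates are invented for illustration, and the "far" pose is simply the "near" pose scaled to half size, which is an idealization: a real camera adds perspective and lens distortion, and turning sideways shrinks the apparent shoulder width.

```javascript
// Euclidean distance between two points with x and y fields
function distance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Hypothetical helper: wrist-to-wrist distance divided by shoulder width.
// Returns null when the shoulders coincide, to avoid dividing by zero.
function spreadRatio(joints) {
  const shoulderWidth = distance(joints.leftShoulder, joints.rightShoulder);
  if (shoulderWidth === 0) return null;
  return distance(joints.leftWrist, joints.rightWrist) / shoulderWidth;
}

// Made-up pose close to the camera: shoulders 200 px apart, wrists 500 px apart
const near = {
  leftShoulder: { x: 220, y: 200 }, rightShoulder: { x: 420, y: 200 },
  leftWrist: { x: 70, y: 200 }, rightWrist: { x: 570, y: 200 }
};

// The same pose at half the size, as if the person stepped back
const far = {
  leftShoulder: { x: 270, y: 200 }, rightShoulder: { x: 370, y: 200 },
  leftWrist: { x: 195, y: 200 }, rightWrist: { x: 445, y: 200 }
};

console.log(spreadRatio(near)); // 2.5
console.log(spreadRatio(far));  // 2.5
```

The raw wrist distance halves (500 px versus 250 px), but the ratio stays put, which is exactly what you want to feed into map().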

Detecting gestures

A gesture is a relationship between keypoints, not a position. "Hands up" means both wrists are above both shoulders -- not at any specific pixel coordinate, but relatively higher. You define gestures as boolean conditions on keypoint relationships.

function detectGesture(pose) {
  const lw = getKeypoint(pose, 'left_wrist');
  const rw = getKeypoint(pose, 'right_wrist');
  const ls = getKeypoint(pose, 'left_shoulder');
  const rs = getKeypoint(pose, 'right_shoulder');
  const lh = getKeypoint(pose, 'left_hip');
  const rh = getKeypoint(pose, 'right_hip');

  // check confidence on all needed keypoints
  const upperOk = lw.confidence > 0.4 && rw.confidence > 0.4 &&
                  ls.confidence > 0.4 && rs.confidence > 0.4;
  if (!upperOk) return 'unknown';

  // hands up: both wrists above shoulders
  if (lw.y < ls.y && rw.y < rs.y) return 'hands_up';

  // t-pose: wrists roughly at shoulder height, spread wide
  // the height tolerance scales with shoulder width, so it stays relative
  const shoulderY = (ls.y + rs.y) / 2;
  const wristY = (lw.y + rw.y) / 2;
  const spread = dist(lw.x, lw.y, rw.x, rw.y);
  const shoulderW = dist(ls.x, ls.y, rs.x, rs.y);
  if (Math.abs(wristY - shoulderY) < shoulderW * 0.3 && spread > shoulderW * 2) {
    return 't_pose';
  }

  // arms crossed: each wrist near opposite shoulder
  const lwToRS = dist(lw.x, lw.y, rs.x, rs.y);
  const rwToLS = dist(rw.x, rw.y, ls.x, ls.y);
  if (lwToRS < shoulderW * 0.6 && rwToLS < shoulderW * 0.6) {
    return 'arms_crossed';
  }

  return 'neutral';
}

Each gesture triggers a different visual mode. "hands_up" could spawn firework particles. "t_pose" could freeze the canvas and start a radial pattern. "arms_crossed" could invert all colors. The gesture vocabulary is yours to design -- what matters is that the definitions are relative (wrist above shoulder) rather than absolute (wrist at y=100), so they work regardless of the person's size or distance from camera.
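
One practical wrinkle: because keypoints jitter, a per-frame gesture label can flicker right at the boundary between two gestures, and you probably don't want fireworks firing on a single noisy frame. A common remedy, which is my addition and not part of the code above, is to only accept a gesture after it has been reported for a few detections in a row. `createGestureLatch` and its `holdFrames` parameter are hypothetical names for this sketch.

```javascript
// Hypothetical helper: only report a gesture change after the same label
// has been seen for `holdFrames` consecutive updates.
function createGestureLatch(holdFrames = 5) {
  let candidate = 'unknown'; // label currently being counted
  let streak = 0;            // how many times in a row we have seen it
  let active = 'unknown';    // last label that survived the hold period

  return function update(label) {
    if (label === candidate) {
      streak++;
    } else {
      candidate = label;
      streak = 1;
    }
    if (streak >= holdFrames) active = candidate;
    return active;
  };
}

// Simulated per-frame output of detectGesture(), with one flickering frame
const latch = createGestureLatch(3);
const frames = [
  'neutral', 'neutral', 'neutral',
  'hands_up', 'neutral',
  'hands_up', 'hands_up', 'hands_up'
];
const stable = frames.map(function(label) { return latch(label); });
console.log(stable);
// the lone 'hands_up' is ignored; the mode only switches on the final frame
```

In a sketch you would call the latch once per draw() with the result of detectGesture(pose) and switch visual modes on the latched value instead of the raw one.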

Smoothing and latency

Raw keypoint data is noisy. Even when you're standing still, the detected positions jitter by a few pixels per frame. And there's a 50-100ms delay between your actual movement and the model's output. For slow, deliberate movements this is fine. For fast gestures -- waving, dancing, clapping -- the lag and jitter are visible.

The fix is lerp smoothing, same technique we used for easing animations in episode 16:

let smoothedKeypoints = {};

function smoothPose(pose) {
  for (const kp of pose.keypoints) {
    if (kp.confidence < 0.3) continue;

    if (!smoothedKeypoints[kp.name]) {
      smoothedKeypoints[kp.name] = { x: kp.x, y: kp.y };
    }

    // lerp toward detected position
    // lower = smoother but more lag. 0.3 is a good balance
    const amt = 0.3;
    smoothedKeypoints[kp.name].x = lerp(smoothedKeypoints[kp.name].x, kp.x, amt);
    smoothedKeypoints[kp.name].y = lerp(smoothedKeypoints[kp.name].y, kp.y, amt);
  }
  return smoothedKeypoints;
}

A lerp amount of 0.3 means each frame, the smoothed position moves 30% of the way toward the actual detected position. This eliminates most jitter while keeping the response snappy. Lower values (0.1) are smoother but feel sluggish. Higher values (0.6) are more responsive but show more jitter. For dance-driven art, I usually go with 0.2 -- a slight lag actually looks intentional, like the art is following the dancer rather than snapping to every twitch.
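
The trade-off can be put in numbers. If the detected position jumps and then stays put, each lerp step leaves (1 - amt) of the remaining gap, so after n steps the gap has shrunk to (1 - amt)^n. This plain-JavaScript sketch counts how many steps it takes to cover 90% of a jump; the function names and the 90% cutoff are arbitrary choices of mine.

```javascript
// Fraction of the original gap still remaining after n lerp steps
function remainingGap(amt, n) {
  return Math.pow(1 - amt, n);
}

// Number of lerp steps until the smoothed point has covered `covered`
// (a fraction between 0 and 1) of a sudden jump
function framesToCover(amt, covered = 0.9) {
  return Math.ceil(Math.log(1 - covered) / Math.log(1 - amt));
}

console.log(framesToCover(0.1)); // 22 steps: smooth but sluggish
console.log(framesToCover(0.3)); // 7 steps
console.log(framesToCover(0.6)); // 3 steps: snappy but jittery
```

A step here means one call to smoothPose(). If you call it once per draw() and the sketch really runs at 60 fps, 7 steps is a bit over a tenth of a second, on top of the model's own delay.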

Body trails: movement as drawing

Here's one of the most visually striking uses of pose detection. Instead of drawing the skeleton in its current position, draw lines from each joint's previous position to its current position. Over time, your movement traces paths on the canvas. Dance becomes drawing. The accumulated trails look like long-exposure photography, but generated in real time.

let prevPositions = {};
let trailCanvas;

function preload() {
  // load the model here, as in the first sketch, so it is ready before detectStart()
  bodyPose = ml5.bodyPose('MoveNet', { flipped: true });
}

function setup() {
  createCanvas(640, 480);
  trailCanvas = createGraphics(640, 480);
  trailCanvas.background(10, 12, 18);

  video = createCapture(VIDEO, { flipped: true });
  video.size(640, 480);
  video.hide();

  bodyPose.detectStart(video, function(results) {
    poses = results;
  });
}

function draw() {
  if (poses.length > 0) {
    const smooth = smoothPose(poses[0]);

    for (const name of Object.keys(smooth)) {
      const curr = smooth[name];

      if (prevPositions[name]) {
        const prev = prevPositions[name];
        const speed = dist(prev.x, prev.y, curr.x, curr.y);

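        if (speed > 1) {
          // color from speed: slow = cool, fast = warm (clamped at speed 40)
          const hue = map(speed, 1, 40, 200, 0, true);
          // the stroke values are HSB; the buffer's background and fade stay RGB
          trailCanvas.colorMode(HSB, 360, 100, 100, 100);
          trailCanvas.stroke(hue, 70, 60, 25);
          trailCanvas.colorMode(RGB, 255);
          trailCanvas.strokeWeight(map(speed, 1, 40, 1, 4, true));
          trailCanvas.line(prev.x, prev.y, curr.x, curr.y);
        }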
      }

      prevPositions[name] = { x: curr.x, y: curr.y };
    }
  }

  // very slow fade
  trailCanvas.fill(10, 12, 18, 3);
  trailCanvas.noStroke();
  trailCanvas.rect(0, 0, 640, 480);

  image(trailCanvas, 0, 0);
}

The separate trailCanvas (a p5.Graphics buffer) accumulates the lines over time. The slow fade (alpha: 3) means old trails linger for several seconds before disappearing. Fast movements leave thick, warm-colored strokes. Slow movements leave thin, cool trails. Stand still and the canvas gradually fades to dark. Dance for thirty seconds and you get a ghostly portrait of your movement -- swooping arcs from your arms, gentle curves from your torso, scattered jitters from your head nodding. The speed-to-color mapping means you can literally see which parts of the dance were energetic (red/orange) versus graceful (blue/teal).

The creative exercise: body-reactive particles

Allez, time to combine everything. We're building a particle system (episode 11) driven by pose detection. Particles emit from both wrists. Left hand spawns cool blue particles, right hand spawns warm orange ones. Movement speed controls emission rate -- move faster, more particles. Confidence controls opacity -- confident keypoints make vivid particles, uncertain ones make ghost particles.

let particles = [];
let video, bodyPose;
let poses = [];
let smoothed = {};

function preload() {
  bodyPose = ml5.bodyPose('MoveNet', { flipped: true });
}

function setup() {
  createCanvas(800, 600);
  colorMode(HSB, 360, 100, 100, 100);
  video = createCapture(VIDEO, { flipped: true });
  video.size(320, 240);
  video.hide();
  bodyPose.detectStart(video, function(r) { poses = r; });
}

function emitFrom(kp, baseHue, speed) {
  const count = Math.floor(map(speed, 0, 30, 0, 5));
  for (let i = 0; i < count; i++) {
    particles.push({
      x: map(kp.x, 0, 320, 0, width),
      y: map(kp.y, 0, 240, 0, height),
      vx: random(-2, 2),
      vy: random(-3, -0.5),
      hue: baseHue + random(-15, 15),
      size: random(3, 10),
      life: 1.0,
      alpha: kp.confidence * 80
    });
  }
}

function draw() {
  background(0, 0, 5, 20);

  if (poses.length > 0) {
    const pose = poses[0];
    const sm = smoothPose(pose);

    const lw = getKeypoint(pose, 'left_wrist');
    const rw = getKeypoint(pose, 'right_wrist');

    if (lw.confidence > 0.3 && smoothed.left_wrist) {
      const prevL = smoothed.left_wrist;
      const speedL = dist(prevL.x, prevL.y, sm.left_wrist.x, sm.left_wrist.y);
      emitFrom(lw, 210, speedL);  // cool blue
    }

    if (rw.confidence > 0.3 && smoothed.right_wrist) {
      const prevR = smoothed.right_wrist;
      const speedR = dist(prevR.x, prevR.y, sm.right_wrist.x, sm.right_wrist.y);
      emitFrom(rw, 25, speedR);  // warm orange
    }

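    // store copies, not the reference: smoothPose() returns the same object
    // every frame, so prev and current would be identical and speed always 0
    smoothed = {};
    for (const name of Object.keys(sm)) {
      smoothed[name] = { x: sm[name].x, y: sm[name].y };
    }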
  }

  // update and draw particles
  for (let i = particles.length - 1; i >= 0; i--) {
    const p = particles[i];
    p.x += p.vx;
    p.vy += 0.03;  // gentle gravity
    p.y += p.vy;
    p.life -= 0.008;

    if (p.life <= 0) {
      particles.splice(i, 1);
      continue;
    }

    noStroke();
    fill(p.hue % 360, 60, 70, p.life * p.alpha);
    circle(p.x, p.y, p.size * p.life);
  }

  // particle count for debugging
  fill(0, 0, 60, 40);
  noStroke();
  textSize(10);
  textFont('monospace');
  text('particles: ' + particles.length, 10, height - 10);
}

Dance for thirty seconds. Wave your arms. Punch the air. Then hold still and watch the particles drift downward and fade. Screenshot the result. What you have is a generative portrait of your dance -- the spatial pattern of where your hands went, the color split between left (blue) and right (orange), the density showing where you lingered versus where you swept through quickly. No two dances produce the same image. The artwork is a record of your specific body moving through a specific moment.
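
One detail in emitFrom() is easy to miss: detection runs on a 320x240 video while the canvas is 800x600, so every keypoint has to be rescaled before drawing. The listing hard-codes 320 and 240 inside map(). If you plan to change resolutions, a small helper keeps the two sizes in one place. This is a sketch; `videoToCanvas` and its parameter shapes are hypothetical, not a p5 or ml5 function.

```javascript
// Hypothetical helper: convert a point from video pixels to canvas pixels.
// Both sizes are plain objects with `width` and `height` fields.
function videoToCanvas(point, videoSize, canvasSize) {
  return {
    x: point.x * (canvasSize.width / videoSize.width),
    y: point.y * (canvasSize.height / videoSize.height)
  };
}

const videoSize = { width: 320, height: 240 };
const canvasSize = { width: 800, height: 600 };

// a wrist detected at the center of the video lands at the center of the canvas
const wristOnCanvas = videoToCanvas({ x: 160, y: 120 }, videoSize, canvasSize);
console.log(wristOnCanvas); // { x: 400, y: 300 }
```

Note that speeds measured in video pixels scale the same way: a 10-pixel move in the 320-wide video is a 25-pixel move on the 800-wide canvas.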

Multiple people

MoveNet can detect multiple bodies in a single frame. The poses array simply contains more than one entry. For creative installations, this is gold -- each person drives their own visual element, and the interaction between people creates emergent patterns that no single person could produce alone.

function draw() {
  background(0, 0, 5, 15);

  for (let p = 0; p < poses.length; p++) {
    const pose = poses[p];
    // assign each person a unique hue
    const personHue = (p * 137.5) % 360;  // golden angle spacing

    for (const kp of pose.keypoints) {
      if (kp.confidence > 0.4) {
        noStroke();
        fill(personHue, 55, 65, 40);
        circle(
          map(kp.x, 0, 320, 0, width),
          map(kp.y, 0, 240, 0, height),
          12
        );
      }
    }
  }
}

The golden angle spacing (137.5 degrees) ensures that even with many people, each gets a visually distinct hue. Person 0 gets hue 0 (red), person 1 gets 137 (green-ish), person 2 gets 275 (violet), and they keep spreading evenly around the color wheel. We used the same golden angle trick in episode 28 for generating harmonious color palettes.
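
To check the spacing for yourself, here are the first six hues worked out in plain JavaScript. `personHue` is just a name for the one-liner from the sketch above.

```javascript
const GOLDEN_ANGLE = 137.5; // degrees, the value used in the sketch above

// Hue for the person at a given index in the poses array
function personHue(index) {
  return (index * GOLDEN_ANGLE) % 360;
}

const hues = [];
for (let i = 0; i < 6; i++) {
  hues.push(personHue(i));
}
console.log(hues); // [ 0, 137.5, 275, 52.5, 190, 327.5 ]
```

The fourth person lands at 52.5, in the gap between red and green, rather than next to an existing hue. Each new index tends to fill a gap left by the previous ones.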

What to watch out for

A few practical things worth knowing.

Performance: MoveNet is heavier than MobileNet classification. Expect 10-20 fps for pose detection, not 30. If your sketch does complex rendering on top of that, total framerate might drop to single digits. Keep particle counts reasonable. Use noSmooth() if you're not doing anti-aliased drawing. Lower the video resolution -- 320x240 is enough for pose detection and much lighter than 640x480.

Lighting: Pose detection works best with even lighting and contrast between the body and background. Dark clothes against a dark wall = bad confidence. Bright clothing against a neutral background = good results. Side lighting that casts strong shadows can confuse the model about where limbs actually are.

Multiple people overlap: When two people overlap in the frame, the model sometimes merges their keypoints into a single franken-skeleton. There's no perfect fix -- it's a limitation of the model. If you're building a multi-person installation, try to keep people spaced apart. Or embrace the glitches -- merged skeletons can be visually interesting in their own right.

The privacy awareness from last episode applies double here. Pose detection extracts detailed body structure from video. Even though ml5 processes everything locally, the data itself -- joint positions over time -- is biometric. It can reveal gait patterns (which are unique like fingerprints), physical disabilities, emotional states (slouching vs upright posture). For installations, be transparent about what the model is detecting and make sure people can opt out by simply walking away from the camera.

What it comes down to...

ml5's bodyPose model (MoveNet) detects 17 keypoints on the human body in real time: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles. Each keypoint has x, y coordinates and a confidence score. Setup is similar to the image classifier from last episode -- load model, feed it video, get results in a callback
Drawing a skeleton means connecting keypoint pairs with lines: shoulder to elbow to wrist, hip to knee to ankle. Filter by confidence (0.3 threshold) to avoid drawing noisy low-confidence joints. The stick figure follows your body in real time and is the "hello world" of pose detection
Joint positions are continuous parameters that map to anything: wrist height to background hue, arm spread to circle radius, head position to brightness. Seventeen keypoints give you thirty-four independent parameters (x and y per joint) to drive visual properties
Derived measurements -- distances between joints -- are often more useful than raw positions. Arm spread (wrist-to-wrist distance), body height (nose-to-hip), shoulder width. Normalizing measurements by shoulder width makes them independent of distance from camera
Gesture detection uses relative keypoint relationships, not absolute positions. "Hands up" = wrists above shoulders. "T-pose" = wrists at shoulder height with wide spread. "Arms crossed" = wrists near opposite shoulders. Relative definitions work at any distance and for any body size
Lerp smoothing reduces jitter and masks latency. Each frame, move the smoothed position 30% toward the detected position. Lower values (0.1) = smoother but laggier. Higher values (0.6) = responsive but jittery. 0.2-0.3 is usually the sweet spot
Body trails turn dance into drawing. Store previous positions, draw lines from old to new, accumulate on a persistent canvas with slow fade. Speed mapped to color (fast=warm, slow=cool) creates long-exposure-style movement portraits. The accumulated image is unique to your specific dance
Body-reactive particles combine pose detection with the particle systems from episode 11. Emit from wrists, color by hand (left=cool, right=warm), emission rate from movement speed, opacity from detection confidence. Dance, then screenshot the result
Multiple people are detected automatically -- the poses array just contains more entries. Assign each person a unique hue using golden angle spacing (137.5 degrees) for visually distinct colors
Performance: MoveNet runs at 10-20 fps. Lower the video resolution (320x240 is fine for detection). Keep particle counts reasonable. Lighting and contrast between body and background affect detection quality significantly
This episode's body data is just the beginning of body-as-input. The keypoints we got today are coarse -- 17 points for the whole body. There are models that go much finer: 21 points per hand for individual finger tracking, 468 points on the face for expression detection. The finer the model's perception, the more nuanced your interactive art can be

Two episodes into the ML arc now. We went from flat category labels (episode 92) to spatial body tracking. The model doesn't just see "a person" -- it understands the structure of the body and gives us precise coordinates to work with. Every technique from the first ninety episodes applies. Particles, noise, color, trails, interaction -- we're just feeding them a different kind of input. And the inputs keep getting richer from here.

Sallukes! Thanks for reading.

@femdev
