Last episode we loaded our first ML model in the browser -- MobileNet image classification through ml5.js. We fed webcam frames to a neural network and got back labels and confidence scores. "Tabby cat, 87%." "Coffee cup, 72%." The model saw the world as flat categories. Useful, but limited. It knows WHAT is in front of the camera but not WHERE anything is in the frame, and it has no concept of the human body as a structure. A person is just another category alongside "bookshop" and "volcano."
Pose detection changes that completely. Instead of a single label per frame, you get 17 specific points on the human body -- nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles -- each with x, y coordinates and a confidence score. In real time. From a webcam. The model doesn't just say "there's a person." It says "the left wrist is at pixel (342, 218) with 94% confidence." That spatial precision turns the body into an input device. Your arms become sliders. Your posture becomes a parameter. Dance becomes data.
This is where creative coding gets physical. Everything we've built -- particle systems (episode 11), noise fields (episode 12), color palettes (episode 28), the confidence-driven aesthetics from last episode -- all of it can now respond to your body in real time. The mouse and keyboard were always abstract intermediaries. Pose detection removes them. You move, the art moves. Your body IS the controller.
ml5 wraps the MoveNet model (Google's successor to PoseNet) in the same friendly API we used last episode. The setup pattern is almost identical to image classification -- load a model, feed it video, get results in a callback. The difference is what comes back: instead of labels, you get an array of keypoints.
let video;
let bodyPose;
let poses = [];
function preload() {
// load the MoveNet model before setup runs
bodyPose = ml5.bodyPose('MoveNet', { flipped: true });
}
function setup() {
createCanvas(640, 480);
video = createCapture(VIDEO, { flipped: true });
video.size(640, 480);
video.hide();
// start detecting poses from the video
bodyPose.detectStart(video, function(results) {
poses = results;
});
}
function draw() {
image(video, 0, 0);
}
The flipped: true option mirrors the video horizontally so you see yourself as in a mirror -- raise your right hand and the right side of the screen responds. Without it, left and right are swapped, which feels unnatural when you're standing in front of a webcam.
The detectStart function creates a continuous detection loop, similar to the classifyFrame recursion from last episode but handled internally by ml5. Every time the model finishes processing a frame, it calls your callback with the latest results and immediately starts processing the next frame. On decent hardware, MoveNet runs at about 15-25 fps.
Each element in the poses array represents one detected person. Within each pose, there's a keypoints array with 17 entries. Here's what the structure looks like:
// what poses[0] contains
{
keypoints: [
{ x: 320, y: 95, confidence: 0.95, name: 'nose' },
{ x: 308, y: 82, confidence: 0.91, name: 'left_eye' },
{ x: 335, y: 83, confidence: 0.88, name: 'right_eye' },
{ x: 290, y: 92, confidence: 0.72, name: 'left_ear' },
{ x: 352, y: 90, confidence: 0.68, name: 'right_ear' },
{ x: 265, y: 175, confidence: 0.92, name: 'left_shoulder' },
{ x: 380, y: 180, confidence: 0.93, name: 'right_shoulder' },
{ x: 230, y: 265, confidence: 0.89, name: 'left_elbow' },
{ x: 410, y: 270, confidence: 0.85, name: 'right_elbow' },
{ x: 215, y: 340, confidence: 0.82, name: 'left_wrist' },
{ x: 430, y: 350, confidence: 0.78, name: 'right_wrist' },
{ x: 285, y: 365, confidence: 0.90, name: 'left_hip' },
{ x: 360, y: 368, confidence: 0.88, name: 'right_hip' },
{ x: 275, y: 450, confidence: 0.45, name: 'left_knee' },
{ x: 370, y: 455, confidence: 0.42, name: 'right_knee' },
{ x: 270, y: 540, confidence: 0.30, name: 'left_ankle' },
{ x: 365, y: 545, confidence: 0.28, name: 'right_ankle' }
]
}
Notice how confidence drops for lower-body keypoints. That's typical when sitting at a desk -- the webcam can see your face and torso clearly but your legs are partially or fully occluded. The model is guessing where your knees and ankles are based on what it knows about human body proportions. This matters for creative coding: don't blindly use all 17 keypoints. Check confidence first.
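One way to make "check confidence first" a habit is to filter once per frame and only work with the keypoints that survive. The snippet below is a minimal sketch in plain JavaScript. The helper name `confidentKeypoints` and the default threshold are assumptions for illustration, not part of ml5.

```javascript
// Hypothetical helper (not part of ml5): keep only keypoints at or above
// a confidence threshold and return them as a name -> keypoint lookup.
function confidentKeypoints(pose, minConfidence = 0.3) {
  const lookup = {};
  for (const kp of pose.keypoints) {
    if (kp.confidence >= minConfidence) {
      lookup[kp.name] = kp;
    }
  }
  return lookup;
}

// A trimmed-down pose in the same shape as the structure above
const samplePose = {
  keypoints: [
    { x: 320, y: 95, confidence: 0.95, name: 'nose' },
    { x: 215, y: 340, confidence: 0.82, name: 'left_wrist' },
    { x: 275, y: 450, confidence: 0.45, name: 'left_knee' },
    { x: 270, y: 540, confidence: 0.30, name: 'left_ankle' }
  ]
};

// With a 0.4 threshold the ankle (0.30) is dropped
const usable = confidentKeypoints(samplePose, 0.4);
console.log(Object.keys(usable));
```

Code that needs a joint can then test `if (usable.left_wrist)` instead of repeating the confidence comparison everywhere.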
The "hello world" of pose detection is the stick figure -- connect keypoints with lines to draw a skeleton that follows your body. It's simple, it's satisfying, and it shows you instantly whether the detection is working.
// skeleton connection pairs
const connections = [
['left_shoulder', 'right_shoulder'],
['left_shoulder', 'left_elbow'],
['left_elbow', 'left_wrist'],
['right_shoulder', 'right_elbow'],
['right_elbow', 'right_wrist'],
['left_shoulder', 'left_hip'],
['right_shoulder', 'right_hip'],
['left_hip', 'right_hip'],
['left_hip', 'left_knee'],
['left_knee', 'left_ankle'],
['right_hip', 'right_knee'],
['right_knee', 'right_ankle']
];
function getKeypoint(pose, name) {
return pose.keypoints.find(function(kp) { return kp.name === name; });
}
function draw() {
image(video, 0, 0);
if (poses.length === 0) return;
const pose = poses[0];
// draw bones
stroke(100, 220, 180, 180);
strokeWeight(3);
for (const pair of connections) {
const a = getKeypoint(pose, pair[0]);
const b = getKeypoint(pose, pair[1]);
// only draw if both points have decent confidence
if (a.confidence > 0.3 && b.confidence > 0.3) {
line(a.x, a.y, b.x, b.y);
}
}
// draw joints
for (const kp of pose.keypoints) {
if (kp.confidence > 0.3) {
noStroke();
fill(220, 120, 100, 200);
circle(kp.x, kp.y, 8);
}
}
}
The 0.3 confidence threshold filters out noisy keypoints. If a joint is occluded (your arm behind your back, your legs under a desk), the model assigns low confidence, and we skip it rather than drawing a wildly wrong position. You'll see the stick figure appear for your upper body and drop out where the camera can't see. That selective rendering is more honest than pretending the model knows where your hidden joints are.
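The threshold in the code above is a hard cut: a joint is either drawn or not. If you would rather have joints ease in and out as confidence changes, one option is to turn confidence into opacity. This is a minimal sketch in plain JavaScript. `confidenceToAlpha` and its `floor` and `full` parameters are illustrative names and values, not something from ml5 or p5.

```javascript
// Hypothetical helper: turn a confidence score into an opacity.
// Below `floor` the joint is hidden, at or above `full` it is fully opaque,
// and in between the opacity ramps up linearly.
function confidenceToAlpha(confidence, floor = 0.3, full = 0.7, maxAlpha = 255) {
  const t = (confidence - floor) / (full - floor);
  const clamped = Math.min(1, Math.max(0, t));
  return clamped * maxAlpha;
}

console.log(confidenceToAlpha(0.2)); // hidden
console.log(confidenceToAlpha(0.5)); // roughly half opacity
console.log(confidenceToAlpha(0.9)); // fully opaque
```

In a sketch you would pass the result as the alpha argument of stroke() or fill(), with `maxAlpha` set to whatever alpha range your color mode uses.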
Here's where it gets interesting. Each keypoint is a continuous x, y value that updates every frame. That means every joint on your body is a pair of parameters you can map to anything. We've been mapping mouse position to visual parameters since episode 6. Now instead of one pointer, you have seventeen.
let bgHue = 0;
function draw() {
if (poses.length === 0) {
background(20);
return;
}
const pose = poses[0];
const leftWrist = getKeypoint(pose, 'left_wrist');
const rightWrist = getKeypoint(pose, 'right_wrist');
const nose = getKeypoint(pose, 'nose');
// left wrist Y controls background hue
if (leftWrist.confidence > 0.4) {
bgHue = map(leftWrist.y, 0, height, 0, 360);
}
// right wrist X controls saturation
let sat = 50;
if (rightWrist.confidence > 0.4) {
sat = map(rightWrist.x, 0, width, 10, 90);
}
// nose Y controls brightness
let bright = 30;
if (nose.confidence > 0.5) {
bright = map(nose.y, 0, height, 50, 15);
}
// fourth max sets the alpha range to 0-100, so the text alpha of 60 below is translucent
colorMode(HSB, 360, 100, 100, 100);
background(bgHue, sat, bright);
// show the current values
fill(0, 0, 100, 60);
noStroke();
textSize(12);
textFont('monospace');
text('hue: ' + bgHue.toFixed(0), 20, 30);
text('sat: ' + sat.toFixed(0), 20, 48);
text('bright: ' + bright.toFixed(0), 20, 66);
}
Raise your left arm and the background shifts toward red. Lower it and the hue runs through yellow, green, and blue, then wraps back to red at the very bottom of the frame (hue 360 is the same color as hue 0). Move your right hand left to right and saturation sweeps from gray to vivid. Duck your head and the scene darkens. Stand tall and it brightens. Three body parameters controlling three visual parameters. Simple, but the experience of it is genuinely surprising the first time -- your body is directly controlling color without touching anything. Makes sense, right? :-)
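Because hue is a wheel, mapping onto the full 0 to 360 range puts red at both ends of the arm's travel. If you would rather sweep from red at the top to blue at the bottom, one option is to map onto 0 to 240 and clamp. A minimal sketch in plain JavaScript follows. `mapClamped` and `wristToHue` are illustrative names, not p5 functions. Inside a sketch, p5's own map() with its optional sixth argument set to true does the same clamping.

```javascript
// Illustrative helper: linear remap with the result clamped to the output range.
function mapClamped(value, inMin, inMax, outMin, outMax) {
  const t = (value - inMin) / (inMax - inMin);
  const clamped = Math.min(1, Math.max(0, t));
  return outMin + clamped * (outMax - outMin);
}

// Wrist height -> hue: red (0) at the top of the frame, blue (240) at the bottom.
function wristToHue(wristY, frameHeight) {
  return mapClamped(wristY, 0, frameHeight, 0, 240);
}

console.log(wristToHue(0, 480));   // 0   (red)
console.log(wristToHue(240, 480)); // 120 (green)
console.log(wristToHue(480, 480)); // 240 (blue)
console.log(wristToHue(600, 480)); // 240 (clamped: wrist reported below the frame)
```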
Individual keypoint positions are useful, but derived measurements are often more expressive. The distance between your wrists tells you how wide your arms are spread. The distance from nose to mid-hip approximates how much of your body is visible. These are continuous values that change smoothly as you move.
function jointDist(pose, nameA, nameB) {
const a = getKeypoint(pose, nameA);
const b = getKeypoint(pose, nameB);
if (a.confidence < 0.3 || b.confidence < 0.3) return -1;
return dist(a.x, a.y, b.x, b.y);
}
function draw() {
background(15, 20, 25);
if (poses.length === 0) return;
const pose = poses[0];
// arm spread: distance between wrists
const armSpread = jointDist(pose, 'left_wrist', 'right_wrist');
if (armSpread > 0) {
// wider spread = more circles, bigger radius
const numCircles = Math.floor(map(armSpread, 50, 500, 3, 30));
const maxR = map(armSpread, 50, 500, 20, 200);
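noFill();
// the ring colors are hues, so switch to HSB for this loop only
colorMode(HSB, 360, 100, 100, 100);
for (let i = 0; i < numCircles; i++) {
const t = i / numCircles;
const r = t * maxR;
const hue = (t * 180 + frameCount) % 360;
stroke(hue, 60, 70, 30);
strokeWeight(1.5);
ellipse(width / 2, height / 2, r * 2, r * 2);
}
// back to RGB for the background and text colors
colorMode(RGB, 255);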
}
// shoulder width for normalization
const shoulderWidth = jointDist(pose, 'left_shoulder', 'right_shoulder');
if (shoulderWidth > 0 && armSpread > 0) {
// ratio: arm spread relative to shoulder width
// > 2.5 means arms wide open, < 1.0 means arms close together
const ratio = armSpread / shoulderWidth;
fill(200, 210, 230, 40);
noStroke();
textSize(11);
textFont('monospace');
text('spread ratio: ' + ratio.toFixed(2), 20, height - 20);
}
}
The shoulder width measurement is crucial for normalization. A person standing close to the camera has larger pixel distances between joints than someone far away. Dividing arm spread by shoulder width gives you a ratio that's independent of distance from camera. An arm spread ratio of 2.5 means "arms wide open" regardless of whether you're 1 meter or 3 meters from the webcam. Same principle as normalizing data before mapping it (episode 82).
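To see the normalization at work, here is a small worked example in plain JavaScript. The coordinates are made up: the second set is the first one scaled by 0.5, standing in for the same pose seen from farther away. `distance` and `spreadRatio` are illustrative helpers so the example runs outside p5.

```javascript
// Plain-JS distance so the example runs outside p5
function distance(a, b) {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Arm spread divided by shoulder width: a scale-free measure of the gesture
function spreadRatio(joints) {
  const armSpread = distance(joints.leftWrist, joints.rightWrist);
  const shoulderWidth = distance(joints.leftShoulder, joints.rightShoulder);
  return armSpread / shoulderWidth;
}

// Made-up coordinates: a pose seen close to the camera...
const near = {
  leftShoulder: { x: 200, y: 200 }, rightShoulder: { x: 400, y: 200 },
  leftWrist: { x: 50, y: 210 }, rightWrist: { x: 550, y: 210 }
};
// ...and the same pose from farther away (every coordinate scaled by 0.5)
const far = {
  leftShoulder: { x: 100, y: 100 }, rightShoulder: { x: 200, y: 100 },
  leftWrist: { x: 25, y: 105 }, rightWrist: { x: 275, y: 105 }
};

console.log(distance(near.leftWrist, near.rightWrist)); // 500 pixels
console.log(distance(far.leftWrist, far.rightWrist));   // 250 pixels
console.log(spreadRatio(near), spreadRatio(far));       // 2.5 and 2.5
```

The raw pixel distance halves, but the ratio stays put, which is what lets a single threshold work at any distance from the camera.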
A gesture is a relationship between keypoints, not a position. "Hands up" means both wrists are above both shoulders -- not at any specific pixel coordinate, but relatively higher. You define gestures as boolean conditions on keypoint relationships.
function detectGesture(pose) {
const lw = getKeypoint(pose, 'left_wrist');
const rw = getKeypoint(pose, 'right_wrist');
const ls = getKeypoint(pose, 'left_shoulder');
const rs = getKeypoint(pose, 'right_shoulder');
const lh = getKeypoint(pose, 'left_hip');
const rh = getKeypoint(pose, 'right_hip');
// check confidence on all needed keypoints
const upperOk = lw.confidence > 0.4 && rw.confidence > 0.4 &&
ls.confidence > 0.4 && rs.confidence > 0.4;
if (!upperOk) return 'unknown';
// hands up: both wrists above shoulders
if (lw.y < ls.y && rw.y < rs.y) return 'hands_up';
// t-pose: wrists roughly at shoulder height, spread wide
const shoulderY = (ls.y + rs.y) / 2;
const wristY = (lw.y + rw.y) / 2;
const spread = dist(lw.x, lw.y, rw.x, rw.y);
const shoulderW = dist(ls.x, ls.y, rs.x, rs.y);
// height tolerance scales with shoulder width, so it stays relative to body size
if (Math.abs(wristY - shoulderY) < shoulderW * 0.3 && spread > shoulderW * 2) {
return 't_pose';
}
// arms crossed: each wrist near opposite shoulder
const lwToRS = dist(lw.x, lw.y, rs.x, rs.y);
const rwToLS = dist(rw.x, rw.y, ls.x, ls.y);
if (lwToRS < shoulderW * 0.6 && rwToLS < shoulderW * 0.6) {
return 'arms_crossed';
}
return 'neutral';
}
Each gesture triggers a different visual mode. "hands_up" could spawn firework particles. "t_pose" could freeze the canvas and start a radial pattern. "arms_crossed" could invert all colors. The gesture vocabulary is yours to design -- what matters is that the definitions are relative (wrist above shoulder) rather than absolute (wrist at y=100), so they work regardless of the person's size or distance from camera.
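One way to wire gestures to visual modes is a lookup table plus a rule for what happens when no gesture is recognized. The sketch below is one possible design, not the only one: it keeps the current mode whenever the gesture is 'neutral' or 'unknown', so the visuals don't reset the moment you relax. The mode names and the `nextMode` helper are hypothetical. The gesture strings match the ones returned by detectGesture().

```javascript
// Hypothetical mode names keyed by the gesture strings from detectGesture()
const MODE_FOR_GESTURE = new Map([
  ['hands_up', 'fireworks'],
  ['t_pose', 'radial'],
  ['arms_crossed', 'invert']
]);

// Returns the mode to use this frame. Gestures without an entry
// ('neutral', 'unknown') keep whatever mode was already active.
function nextMode(currentMode, gesture) {
  return MODE_FOR_GESTURE.has(gesture) ? MODE_FOR_GESTURE.get(gesture) : currentMode;
}

// Simulated sequence of per-frame gesture results
const frames = ['neutral', 'hands_up', 'unknown', 'neutral', 't_pose'];
let mode = 'idle';
const history = [];
for (const gesture of frames) {
  mode = nextMode(mode, gesture);
  history.push(mode);
}
console.log(history.join(' > '));
// idle > fireworks > fireworks > fireworks > radial
```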
Raw keypoint data is noisy. Even when you're standing still, the detected positions jitter by a few pixels per frame. And there's a 50-100ms delay between your actual movement and the model's output. For slow, deliberate movements this is fine. For fast gestures -- waving, dancing, clapping -- the lag and jitter are visible.
The fix is lerp smoothing, same technique we used for easing animations in episode 16:
let smoothedKeypoints = {};
function smoothPose(pose) {
for (const kp of pose.keypoints) {
if (kp.confidence < 0.3) continue;
if (!smoothedKeypoints[kp.name]) {
smoothedKeypoints[kp.name] = { x: kp.x, y: kp.y };
}
// lerp toward detected position
// lower = smoother but more lag. 0.3 is a good balance
const amt = 0.3;
smoothedKeypoints[kp.name].x = lerp(smoothedKeypoints[kp.name].x, kp.x, amt);
smoothedKeypoints[kp.name].y = lerp(smoothedKeypoints[kp.name].y, kp.y, amt);
}
return smoothedKeypoints;
}
A lerp amount of 0.3 means each frame, the smoothed position moves 30% of the way toward the actual detected position. This eliminates most jitter while keeping the response snappy. Lower values (0.1) are smoother but feel sluggish. Higher values (0.6) are more responsive but show more jitter. For dance-driven art, I usually go with 0.2 -- a slight lag actually looks intentional, like the art is following the dancer rather than snapping to every twitch.
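To put numbers on that trade-off: with lerp smoothing, the remaining gap to the target shrinks by a factor of (1 - amt) every frame. The snippet below, plain JavaScript with illustrative helper names, counts how many frames it takes to get within 5% of a target after a sudden jump.

```javascript
// Same formula as p5's lerp(), written out so this runs outside a sketch
function lerpValue(start, stop, amt) {
  return start + (stop - start) * amt;
}

// Count frames until a smoothed value has closed all but `tolerance`
// of the gap to a target that jumped from 0 to 1.
function framesToSettle(amt, tolerance = 0.05) {
  let value = 0;
  let frames = 0;
  while (1 - value > tolerance) {
    value = lerpValue(value, 1, amt);
    frames++;
  }
  return frames;
}

for (const amt of [0.1, 0.2, 0.3, 0.6]) {
  console.log('amt ' + amt + ': ' + framesToSettle(amt) + ' frames to get within 5%');
}
```

This gives 29, 14, 9 and 4 frames respectively. If smoothPose() runs once per draw() call at 60 fps, that is roughly half a second of catch-up at 0.1 versus about 150 ms at 0.3.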
Here's one of the most visually striking uses of pose detection. Instead of drawing the skeleton in its current position, draw lines from each joint's previous position to its current position. Over time, your movement traces paths on the canvas. Dance becomes drawing. The accumulated trails look like long-exposure photography, but generated in real time.
let prevPositions = {};
let trailCanvas;
function setup() {
createCanvas(640, 480);
trailCanvas = createGraphics(640, 480);
trailCanvas.background(10, 12, 18);
video = createCapture(VIDEO, { flipped: true });
video.size(640, 480);
video.hide();
bodyPose = ml5.bodyPose('MoveNet', { flipped: true });
bodyPose.detectStart(video, function(results) {
poses = results;
});
}
function draw() {
if (poses.length > 0) {
const smooth = smoothPose(poses[0]);
for (const name of Object.keys(smooth)) {
const curr = smooth[name];
if (prevPositions[name]) {
const prev = prevPositions[name];
const speed = dist(prev.x, prev.y, curr.x, curr.y);
if (speed > 1) {
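// color from speed: slow = cool, fast = warm
// the buffer has its own color mode, so use HSB for the stroke only
trailCanvas.colorMode(HSB, 360, 100, 100, 100);
const hue = map(speed, 1, 40, 200, 0, true);
trailCanvas.stroke(hue, 70, 60, 25);
trailCanvas.strokeWeight(map(speed, 1, 40, 1, 4, true));
trailCanvas.line(prev.x, prev.y, curr.x, curr.y);
// back to RGB so the fade color below is read as RGB
trailCanvas.colorMode(RGB, 255);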
}
}
prevPositions[name] = { x: curr.x, y: curr.y };
}
}
// very slow fade
trailCanvas.fill(10, 12, 18, 3);
trailCanvas.noStroke();
trailCanvas.rect(0, 0, 640, 480);
image(trailCanvas, 0, 0);
}
The separate trailCanvas (a p5.Graphics buffer) accumulates the lines over time. The slow fade (alpha: 3) means old trails linger for several seconds before disappearing. Fast movements leave thick, warm-colored strokes. Slow movements leave thin, cool trails. Stand still and the canvas gradually fades to dark. Dance for thirty seconds and you get a ghostly portrait of your movement -- swooping arcs from your arms, gentle curves from your torso, scattered jitters from your head nodding. The speed-to-color mapping means you can literally see which parts of the dance were energetic (red/orange) versus graceful (blue/teal).
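How long is "several seconds"? A rough estimate, assuming the buffer's fade fill uses the default 0 to 255 alpha range and draw() runs at 60 fps: each frame covers the old trails with a rectangle of opacity 3/255, so a stroke keeps about 98.8% of its visibility per frame. The helper below (an illustrative name, plain JavaScript) counts frames until a given fraction is left. It is an idealized model that ignores the rounding of 8-bit color channels, which in practice can leave very faint trails behind.

```javascript
// Each frame, a rect with opacity alpha/alphaMax is drawn over the buffer,
// so an old stroke keeps (1 - alpha/alphaMax) of its visibility per frame.
// Idealized: ignores the rounding of 8-bit color channels.
function framesUntilFaded(alpha, alphaMax, remaining) {
  const keepPerFrame = 1 - alpha / alphaMax;
  let visibility = 1;
  let frames = 0;
  while (visibility > remaining) {
    visibility *= keepPerFrame;
    frames++;
  }
  return frames;
}

const fps = 60; // assumed draw rate
const half = framesUntilFaded(3, 255, 0.5);
const tenth = framesUntilFaded(3, 255, 0.1);
console.log('half faded after ' + half + ' frames, about ' + (half / fps).toFixed(1) + ' s');
console.log('90% faded after ' + tenth + ' frames, about ' + (tenth / fps).toFixed(1) + ' s');
```

Under those assumptions a trail is half gone after 59 frames (about a second) and 90% gone after 195 frames (a little over three seconds). Raising the alpha shortens the memory of the canvas, lowering it stretches it.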
Allez, time to combine everything. We're building a particle system (episode 11) driven by pose detection. Particles emit from both wrists. Left hand spawns cool blue particles, right hand spawns warm orange ones. Movement speed controls emission rate -- move faster, more particles. Confidence controls opacity -- confident keypoints make vivid particles, uncertain ones make ghost particles.
let particles = [];
let video, bodyPose;
let poses = [];
let smoothed = {};
function preload() {
bodyPose = ml5.bodyPose('MoveNet', { flipped: true });
}
function setup() {
createCanvas(800, 600);
colorMode(HSB, 360, 100, 100, 100);
video = createCapture(VIDEO, { flipped: true });
video.size(320, 240);
video.hide();
bodyPose.detectStart(video, function(r) { poses = r; });
}
function emitFrom(kp, baseHue, speed) {
const count = Math.floor(map(speed, 0, 30, 0, 5));
for (let i = 0; i < count; i++) {
particles.push({
x: map(kp.x, 0, 320, 0, width),
y: map(kp.y, 0, 240, 0, height),
vx: random(-2, 2),
vy: random(-3, -0.5),
hue: baseHue + random(-15, 15),
size: random(3, 10),
life: 1.0,
alpha: kp.confidence * 80
});
}
}
function draw() {
background(0, 0, 5, 20);
if (poses.length > 0) {
const pose = poses[0];
const sm = smoothPose(pose);
const lw = getKeypoint(pose, 'left_wrist');
const rw = getKeypoint(pose, 'right_wrist');
if (lw.confidence > 0.3 && smoothed.left_wrist) {
const prevL = smoothed.left_wrist;
const speedL = dist(prevL.x, prevL.y, sm.left_wrist.x, sm.left_wrist.y);
emitFrom(lw, 210, speedL); // cool blue
}
if (rw.confidence > 0.3 && smoothed.right_wrist) {
const prevR = smoothed.right_wrist;
const speedR = dist(prevR.x, prevR.y, sm.right_wrist.x, sm.right_wrist.y);
emitFrom(rw, 25, speedR); // warm orange
}
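// smoothPose() returns the same object every frame and updates it in place,
// so store copies here; otherwise "previous" and "current" are the same
// object, the measured speed is always 0, and no particles are emitted
smoothed = {};
for (const name of Object.keys(sm)) {
smoothed[name] = { x: sm[name].x, y: sm[name].y };
}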
}
// update and draw particles
for (let i = particles.length - 1; i >= 0; i--) {
const p = particles[i];
p.x += p.vx;
p.vy += 0.03; // gentle gravity
p.y += p.vy;
p.life -= 0.008;
if (p.life <= 0) {
particles.splice(i, 1);
continue;
}
noStroke();
fill(p.hue % 360, 60, 70, p.life * p.alpha);
circle(p.x, p.y, p.size * p.life);
}
// particle count for debugging
fill(0, 0, 60, 40);
noStroke();
textSize(10);
textFont('monospace');
text('particles: ' + particles.length, 10, height - 10);
}
Dance for thirty seconds. Wave your arms. Punch the air. Then hold still and watch the particles drift downward and fade. Screenshot the result. What you have is a generative portrait of your dance -- the spatial pattern of where your hands went, the color split between left (blue) and right (orange), the density showing where you lingered versus where you swept through quickly. No two dances produce the same image. The artwork is a record of your specific body moving through a specific moment.
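A sketch like this can pile up particles quickly during energetic movement. One optional safeguard, not part of the code above, is a hard cap that throws away the oldest particles first. The helper name `capParticles` and the limit are assumptions for illustration. It relies only on new particles being pushed to the end of the array, as emitFrom() does.

```javascript
// Hypothetical helper: keep at most `maxCount` particles by discarding
// the oldest ones (new particles are pushed to the end of the array).
function capParticles(particles, maxCount) {
  const excess = particles.length - maxCount;
  if (excess > 0) {
    particles.splice(0, excess);
  }
  return particles;
}

// Stand-in particles, numbered in the order they were emitted
const demo = [];
for (let i = 0; i < 10; i++) {
  demo.push({ id: i, life: 1.0 });
}
capParticles(demo, 4);
console.log(demo.map(function (p) { return p.id; })); // the four newest: 6, 7, 8, 9
```

Calling something like `capParticles(particles, 2000)` once per frame, after emitting, would keep the update loop bounded. The right limit depends on your hardware.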
MoveNet can detect multiple bodies in a single frame. The poses array simply contains more than one entry. For creative installations, this is gold -- each person drives their own visual element, and the interaction between people creates emergent patterns that no single person could produce alone.
function draw() {
background(0, 0, 5, 15);
for (let p = 0; p < poses.length; p++) {
const pose = poses[p];
// assign each person a unique hue
const personHue = (p * 137.5) % 360; // golden angle spacing
for (const kp of pose.keypoints) {
if (kp.confidence > 0.4) {
noStroke();
fill(personHue, 55, 65, 40);
circle(
map(kp.x, 0, 320, 0, width),
map(kp.y, 0, 240, 0, height),
12
);
}
}
}
}
The golden angle spacing (137.5 degrees) ensures that even with many people, each gets a visually distinct hue. Person 0 gets hue 0 (red), person 1 gets 137 (green-ish), person 2 gets 275 (violet), and they keep spreading evenly around the color wheel. We used the same golden angle trick in episode 28 for generating harmonious color palettes.
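The hue assignment is easy to check by hand. The snippet below, plain JavaScript with illustrative helper names, prints the hues for the first six people and measures the smallest gap between any two of them on the color wheel.

```javascript
const GOLDEN_ANGLE = 137.5; // degrees, rounded as in the sketch above

function personHue(index) {
  return (index * GOLDEN_ANGLE) % 360;
}

// Smallest gap on the color wheel between any two of the first `count` hues
function smallestHueGap(count) {
  let smallest = 360;
  for (let a = 0; a < count; a++) {
    for (let b = a + 1; b < count; b++) {
      const diff = Math.abs(personHue(a) - personHue(b));
      smallest = Math.min(smallest, diff, 360 - diff);
    }
  }
  return smallest;
}

for (let i = 0; i < 6; i++) {
  console.log('person ' + i + ': hue ' + personHue(i));
}
console.log('smallest gap for 6 people: ' + smallestHueGap(6) + ' degrees');
```

The first six hues come out as 0, 137.5, 275, 52.5, 190 and 327.5, and the closest pair is still 32.5 degrees apart. Keep in mind that the index in the poses array is not a stable identity: if the model reorders people between frames, their colors swap too.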
A few practical things worth knowing.
Performance: MoveNet is heavier than MobileNet classification. Expect pose detection to run well below 30 fps: roughly 10-25 fps depending on hardware. If your sketch does complex rendering on top of that, total framerate might drop to single digits. Keep particle counts reasonable. Use noSmooth() if you're not doing anti-aliased drawing. Lower the video resolution -- 320x240 is enough for pose detection and much lighter than 640x480.
Lighting: Pose detection works best with even lighting and contrast between the body and background. Dark clothes against a dark wall = bad confidence. Bright clothing against a neutral background = good results. Side lighting that casts strong shadows can confuse the model about where limbs actually are.
Multiple people overlap: When two people overlap in the frame, the model sometimes merges their keypoints into a single franken-skeleton. There's no perfect fix -- it's a limitation of the model. If you're building a multi-person installation, try to keep people spaced apart. Or embrace the glitches -- merged skeletons can be visually interesting in their own right.
The privacy awareness from last episode applies double here. Pose detection extracts detailed body structure from video. Even though ml5 processes everything locally, the data itself -- joint positions over time -- is biometric. It can reveal gait patterns (which are unique like fingerprints), physical disabilities, emotional states (slouching vs upright posture). For installations, be transparent about what the model is detecting and make sure people can opt out by simply walking away from the camera.
Two episodes into the ML arc now. We went from flat category labels (episode 92) to spatial body tracking. The model doesn't just see "a person" -- it understands the structure of the body and gives us precise coordinates to work with. Every technique from the first ninety episodes applies. Particles, noise, color, trails, interaction -- we're just feeding them a different kind of input. And the inputs keep getting richer from here.
Sallukes! Thanks for reading.
X