Last episode we tracked the body as a whole -- 17 keypoints from nose to ankles, mapped to particles, trails, colors, gestures. Pose detection gave us spatial awareness: where the shoulders are, how wide the arms are spread, whether hands are up or down. Useful. But coarse. The whole hand is a single keypoint: "left_wrist." Everything below that wrist -- five fingers, fourteen joints, a thumb that opposes in ways that took evolution millions of years to figure out -- compressed into one dot.
Hand tracking decompresses that dot into 21 keypoints per hand. Four joints per finger plus the wrist. Fingertips, knuckles, the base of each finger, the heel of the palm. Enough resolution to tell whether your index finger is extended or curled, whether you're pinching your thumb and forefinger together, whether you're showing three fingers or four. The jump from 1 point to 21 is the difference between knowing where someone's hand IS and knowing what their hand is DOING. And what hands do -- gestures, grips, pointing, signing, counting -- is one of the richest interaction vocabularies humans have.
ml5 wraps MediaPipe's hand model in the same API pattern we've been using. Load model, feed video, get results. The difference is what comes back: not body keypoints but finger-level landmarks, two hands independently tracked, with enough spatial detail to build interactions that feel genuinely physical. Your fingers become individual controllers. Pinch to grab. Point to direct. Spread to expand. Close fist to stop. No mouse, no keyboard, no touchscreen. Just your hands in front of a camera.
The setup follows the exact pattern from the last two episodes. Load the model in preload, create a video capture, start continuous detection. The API call is ml5.handPose instead of ml5.bodyPose, but the structure is identical.
let video;
let handPose;
let hands = [];
function preload() {
handPose = ml5.handPose({ flipped: true });
}
function setup() {
createCanvas(640, 480);
video = createCapture(VIDEO, { flipped: true });
video.size(640, 480);
video.hide();
handPose.detectStart(video, function(results) {
hands = results;
});
}
function draw() {
image(video, 0, 0);
// draw all detected keypoints
for (const hand of hands) {
for (const kp of hand.keypoints) {
noStroke();
fill(100, 220, 180, 180);
circle(kp.x, kp.y, 6);
}
}
}
Run this, hold your hand up to the webcam, and you see 21 green dots appear on your fingers and palm. Wiggle your fingers -- the dots follow. The model detects both hands independently, so hold up two hands and you get 42 keypoints total. The detection runs at roughly 15-25 fps depending on your hardware, same ballpark as pose detection.
One thing you'll notice immediately: hand tracking is more sensitive to lighting than body pose was. Your hand is smaller, fingers are thinner, and the model needs to distinguish between joints that are only a few pixels apart. Good frontal lighting helps. A hand silhouetted against a bright window confuses the model badly.
Each hand returns 21 keypoints in a consistent order. The naming follows a pattern: wrist first, then each finger from thumb to pinky, four joints each moving from base to tip.
// the 21 keypoints and their indices
// 0: wrist
// thumb: 1 (CMC), 2 (MCP), 3 (IP), 4 (tip)
// index: 5 (MCP), 6 (PIP), 7 (DIP), 8 (tip)
// middle: 9 (MCP), 10 (PIP), 11 (DIP), 12 (tip)
// ring: 13 (MCP), 14 (PIP), 15 (DIP), 16 (tip)
// pinky: 17 (MCP), 18 (PIP), 19 (DIP), 20 (tip)
// fingertip indices for quick access
const TIPS = {
thumb: 4,
index: 8,
middle: 12,
ring: 16,
pinky: 20
};
// base joint indices (MCP for the four fingers, CMC for the thumb)
const BASES = {
thumb: 1,
index: 5,
middle: 9,
ring: 13,
pinky: 17
};
The letters stand for joint names from anatomy: CMC (carpometacarpal), MCP (metacarpophalangeal), PIP (proximal interphalangeal), DIP (distal interphalangeal), and IP (interphalangeal, the thumb's single middle joint). You don't need to memorize those -- just know that within each finger the lowest index is the joint closest to the palm and the highest index is the tip. The tip is what you'll use most for interaction. The base is what you compare against to figure out if a finger is extended or curled.
Let me draw the skeleton properly so you can see the structure:
// connections between keypoints for drawing the hand skeleton
const FINGER_CONNECTIONS = [
// thumb
[0, 1], [1, 2], [2, 3], [3, 4],
// index
[0, 5], [5, 6], [6, 7], [7, 8],
// middle
[0, 9], [9, 10], [10, 11], [11, 12],
// ring
[0, 13], [13, 14], [14, 15], [15, 16],
// pinky
[0, 17], [17, 18], [18, 19], [19, 20],
// palm base connections
[5, 9], [9, 13], [13, 17]
];
function drawHandSkeleton(hand) {
// draw bones
stroke(100, 200, 170, 150);
strokeWeight(2);
for (const pair of FINGER_CONNECTIONS) {
const a = hand.keypoints[pair[0]];
const b = hand.keypoints[pair[1]];
line(a.x, a.y, b.x, b.y);
}
// draw joints -- tips bigger than knuckles
for (let i = 0; i < hand.keypoints.length; i++) {
const kp = hand.keypoints[i];
noStroke();
if (i === 4 || i === 8 || i === 12 || i === 16 || i === 20) {
// fingertips: larger, warmer color
fill(240, 130, 100, 200);
circle(kp.x, kp.y, 10);
} else if (i === 0) {
// wrist
fill(180, 180, 220, 200);
circle(kp.x, kp.y, 10);
} else {
// joints
fill(150, 200, 180, 180);
circle(kp.x, kp.y, 6);
}
}
}
The palm base connections (5-9, 9-13, 13-17) connect the MCP joints across the top of the palm. Without them, the hand looks like five disconnected sticks radiating from the wrist. With them, you get a recognizable hand shape -- a palm with five fingers attached. Small detail, big difference in how the skeleton reads visually.
The most basic question you can ask about a finger: is it sticking out or curled in? This is what lets you count to five, detect a fist, recognize a peace sign, or read sign language hand shapes. The principle is simple geometry -- compare the fingertip position to the finger's base joint.
function isFingerExtended(hand, fingerName) {
const tipIdx = TIPS[fingerName];
const baseIdx = BASES[fingerName];
const tip = hand.keypoints[tipIdx];
const base = hand.keypoints[baseIdx];
if (fingerName === 'thumb') {
// thumb extends sideways, not up/down
// compare horizontal distance from wrist
const wrist = hand.keypoints[0];
return Math.abs(tip.x - wrist.x) > Math.abs(base.x - wrist.x) * 1.2;
}
// for other fingers: tip above base = extended (y decreases upward)
return tip.y < base.y - 20;
}
function countExtendedFingers(hand) {
let count = 0;
for (const name of Object.keys(TIPS)) {
if (isFingerExtended(hand, name)) count++;
}
return count;
}
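For the four main fingers (index through pinky), "extended" means the tip is above the base -- the finger is sticking up. "Curled" means the tip is below or at roughly the same level as the base -- the finger is folded down. The 20-pixel margin means the tip has to clear the base by a visible amount before the finger counts as extended, which cuts down on flicker when a finger sits right at the boundary. Note that this check assumes an upright hand with the fingers pointing toward the top of the frame; a hand held sideways or pointing down will read wrong.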
The thumb is special. It extends sideways rather than upward, so you compare horizontal distance from the wrist instead of vertical position. An extended thumb sticks out to the side; a curled thumb tucks against the palm. The 1.2 multiplier requires the tip to be at least 20% further from the wrist than the base is, which prevents false positives when the thumb is relaxed but not deliberately extended.
This gives you five boolean states -- one per finger -- which means 32 possible hand shapes (2 to the power of 5). That's already a huge interaction vocabulary. Let's visualize it:
function draw() {
background(15, 18, 25);
image(video, 0, 0, 320, 240);
if (hands.length === 0) return;
const hand = hands[0];
drawHandSkeleton(hand);
// show finger states
const fingers = ['thumb', 'index', 'middle', 'ring', 'pinky'];
const states = fingers.map(function(name) {
return isFingerExtended(hand, name);
});
const count = states.filter(function(s) { return s; }).length;
// display finger states as a row of indicators
for (let i = 0; i < 5; i++) {
const x = 360 + i * 50;
const y = 40;
fill(states[i] ? color(100, 220, 160) : color(80, 60, 70));
noStroke();
rect(x, y, 40, 40, 5);
fill(220);
textSize(9);
textFont('monospace');
textAlign(CENTER, CENTER);
text(fingers[i].substring(0, 3), x + 20, y + 20);
}
// number of extended fingers
fill(220);
textSize(48);
textAlign(CENTER);
text(count.toString(), 480, 140);
textSize(11);
fill(130);
text('extended fingers', 480, 165);
}
Hold up your hand and make a fist -- shows 0. Extend your index -- shows 1. Peace sign -- shows 2. Show your whole hand -- shows 5. The indicator boxes light up green for each extended finger. It's simple but satisfying -- the model is reading your hand shape in real time and responding with the correct count. This is the foundation for everything else in this episode.
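The vertical comparison above only works for an upright hand. One possible variant, sketched here as an assumption rather than anything ml5 provides, compares distances from the wrist instead: a finger counts as extended when its tip is clearly farther from the wrist than its PIP joint. The names `PIP_JOINTS`, `FINGERTIPS`, and `isFingerExtendedAnyAngle` are made up for this example, the 1.1 margin is a starting guess to tune, and the thumb is left out because it needs its own rule.

```javascript
// PIP joint and tip indices for the four non-thumb fingers,
// taken from the keypoint table earlier in this episode.
const PIP_JOINTS = { index: 6, middle: 10, ring: 14, pinky: 18 };
const FINGERTIPS = { index: 8, middle: 12, ring: 16, pinky: 20 };

// Hypothetical helper: true when the tip is farther from the wrist than the
// PIP joint by at least `margin`. Works regardless of how the hand is rotated
// in the image plane. Uses Math.hypot so it runs without p5's dist().
function isFingerExtendedAnyAngle(hand, fingerName, margin = 1.1) {
  const wrist = hand.keypoints[0];
  const tip = hand.keypoints[FINGERTIPS[fingerName]];
  const pip = hand.keypoints[PIP_JOINTS[fingerName]];
  const tipDist = Math.hypot(tip.x - wrist.x, tip.y - wrist.y);
  const pipDist = Math.hypot(pip.x - wrist.x, pip.y - wrist.y);
  return tipDist > pipDist * margin;
}
```

A curled finger folds its tip back toward the palm, so the tip ends up closer to the wrist than the PIP joint, whichever way the hand is turned. It still breaks down when the finger points straight at the camera, because then both distances collapse in 2D.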
The most natural fine-grained gesture is the pinch -- touching your thumb tip to your index fingertip. It's how humans pick up small objects, and it translates beautifully to digital interaction. "Pinch to grab" feels intuitive because it mirrors the physical action.
function getPinchDistance(hand) {
const thumbTip = hand.keypoints[TIPS.thumb];
const indexTip = hand.keypoints[TIPS.index];
return dist(thumbTip.x, thumbTip.y, indexTip.x, indexTip.y);
}
function isPinching(hand, threshold) {
threshold = threshold || 30;
return getPinchDistance(hand) < threshold;
}
The threshold of 30 pixels works well at normal webcam distance (arm's length). If you're closer, fingertips appear larger in the frame and you might need a higher threshold. If you're further away, lower it. You can also normalize by palm size (distance from wrist to middle finger base) to make it distance-independent -- same principle as normalizing by shoulder width in pose detection (episode 93).
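Here is a minimal sketch of that normalization. The function names are my own, and the 0.3 ratio is an assumption: it roughly matches a 30-pixel pinch on a palm about 100 pixels long, so tune it for your setup.

```javascript
// Pinch distance divided by palm size (wrist to middle-finger MCP).
// The result is a unitless ratio that should stay roughly constant as the
// hand moves toward or away from the camera.
function getNormalizedPinch(hand) {
  const wrist = hand.keypoints[0];
  const middleBase = hand.keypoints[9];
  const thumbTip = hand.keypoints[4];
  const indexTip = hand.keypoints[8];
  const palmSize = Math.hypot(middleBase.x - wrist.x, middleBase.y - wrist.y);
  if (palmSize === 0) return Infinity; // degenerate detection: never a pinch
  const pinch = Math.hypot(thumbTip.x - indexTip.x, thumbTip.y - indexTip.y);
  return pinch / palmSize;
}

// Hypothetical counterpart to isPinching(); `ratio` is an assumed default.
function isPinchingNormalized(hand, ratio = 0.3) {
  return getNormalizedPinch(hand) < ratio;
}
```

A hand twice as close to the camera doubles both the pinch distance and the palm size, so the ratio stays the same and one threshold covers both cases.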
Let's use pinch to control a drawing app:
let drawingCanvas;
let prevPinchPos = null;
function preload() {
// load the model in preload, same pattern as the first sketch,
// so it is ready before detectStart() is called in setup()
handPose = ml5.handPose({ flipped: true });
}
function setup() {
createCanvas(640, 480);
drawingCanvas = createGraphics(640, 480);
drawingCanvas.background(15, 18, 25);
video = createCapture(VIDEO, { flipped: true });
video.size(640, 480);
video.hide();
handPose.detectStart(video, function(r) { hands = r; });
}
function draw() {
if (hands.length > 0) {
const hand = hands[0];
const thumbTip = hand.keypoints[TIPS.thumb];
const indexTip = hand.keypoints[TIPS.index];
// pinch position is midpoint between thumb and index
const px = (thumbTip.x + indexTip.x) / 2;
const py = (thumbTip.y + indexTip.y) / 2;
if (isPinching(hand)) {
// draw on the canvas
if (prevPinchPos) {
drawingCanvas.stroke(180, 220, 255, 180);
drawingCanvas.strokeWeight(3);
drawingCanvas.line(prevPinchPos.x, prevPinchPos.y, px, py);
}
prevPinchPos = { x: px, y: py };
} else {
prevPinchPos = null;
}
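} else {
// no hand in frame: end the stroke so the next pinch
// doesn't connect back to the old position
prevPinchPos = null;
}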
image(drawingCanvas, 0, 0);
// show a small indicator for pinch state
if (hands.length > 0 && isPinching(hands[0])) {
fill(100, 255, 180, 60);
noStroke();
circle(30, 30, 20);
}
}
Pinch your thumb and index finger together and move your hand -- you draw a line. Release the pinch and the line stops. Pinch again somewhere else and you start a new stroke. It's finger painting, literally. The midpoint between thumb and index gives you a stable drawing position that doesn't jump when you first make contact. Without the midpoint, the drawing position would snap between thumb and index depending on which one moves first.
Counting extended fingers is nice, but specific combinations have meaning. Peace sign (index + middle extended), thumbs up (only thumb extended), rock sign (index + pinky extended), okay sign (thumb + index forming a circle, others extended). Let's build a hand shape classifier:
function classifyHandShape(hand) {
const ext = {};
for (const name of Object.keys(TIPS)) {
ext[name] = isFingerExtended(hand, name);
}
// thumbs up: only thumb extended
if (ext.thumb && !ext.index && !ext.middle && !ext.ring && !ext.pinky) {
return 'thumbs_up';
}
// peace / victory: index + middle only
if (!ext.thumb && ext.index && ext.middle && !ext.ring && !ext.pinky) {
return 'peace';
}
// rock: index + pinky only
if (!ext.thumb && ext.index && !ext.middle && !ext.ring && ext.pinky) {
return 'rock';
}
// open hand: all five extended
if (ext.thumb && ext.index && ext.middle && ext.ring && ext.pinky) {
return 'open';
}
// fist: nothing extended
if (!ext.thumb && !ext.index && !ext.middle && !ext.ring && !ext.pinky) {
return 'fist';
}
// pointing: only index extended
if (!ext.thumb && ext.index && !ext.middle && !ext.ring && !ext.pinky) {
return 'pointing';
}
// three: index + middle + ring
if (!ext.thumb && ext.index && ext.middle && ext.ring && !ext.pinky) {
return 'three';
}
return 'unknown';
}
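Seven named shapes plus a fallback. Each shape is a specific combination of which fingers are up and which are down. Because every check spells out the state of all five fingers, the conditions are mutually exclusive -- at most one can match, so the order of the checks doesn't affect the result. The okay sign from the list above isn't in here: thumb and index form a circle rather than being cleanly extended or curled, so it needs a pinch test on top of the booleans.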
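The okay sign mentioned earlier doesn't fit the pure up/down scheme, because thumb and index meet in a circle. One way to catch it, as a rough sketch with a made-up function name, is to combine the pinch test with the vertical extension test for the other three fingers. The defaults reuse the 30-pixel pinch threshold and 20-pixel margin from this episode, and like `isFingerExtended` it assumes an upright hand.

```javascript
// Hypothetical check for the okay sign: thumb tip and index tip touching,
// while middle, ring and pinky are extended (tip clearly above base).
function isOkaySign(hand, pinchThreshold = 30, margin = 20) {
  const kp = hand.keypoints;
  // thumb tip (4) to index tip (8)
  const pinch = Math.hypot(kp[4].x - kp[8].x, kp[4].y - kp[8].y);
  if (pinch >= pinchThreshold) return false;
  // [tip, base] index pairs for middle, ring and pinky
  const others = [[12, 9], [16, 13], [20, 17]];
  return others.every(function (pair) {
    return kp[pair[0]].y < kp[pair[1]].y - margin;
  });
}
```

If you use something like this, run it before `classifyHandShape` and return `'okay'` early, since the boolean pattern of an okay sign will most likely fall through to `'unknown'` otherwise.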
Now map each shape to a different visual response:
const shapeVisuals = {
thumbs_up: { bg: [40, 80, 60], particle: [60, 200, 120], msg: 'nice :-)' },
peace: { bg: [40, 50, 90], particle: [120, 150, 255], msg: 'peace' },
rock: { bg: [80, 30, 30], particle: [255, 100, 80], msg: 'rock on' },
open: { bg: [50, 50, 70], particle: [200, 200, 230], msg: 'hello' },
fist: { bg: [15, 15, 20], particle: [60, 60, 80], msg: 'closed' },
pointing: { bg: [60, 40, 70], particle: [200, 130, 255], msg: 'there' },
three: { bg: [50, 60, 40], particle: [150, 200, 100], msg: '3' },
unknown: { bg: [25, 25, 30], particle: [100, 100, 120], msg: '?' }
};
function draw() {
if (hands.length === 0) {
background(20);
return;
}
const hand = hands[0];
const shape = classifyHandShape(hand);
const vis = shapeVisuals[shape];
background(vis.bg[0], vis.bg[1], vis.bg[2], 30);
// emit particles from each fingertip
for (const tipIdx of Object.values(TIPS)) {
const tip = hand.keypoints[tipIdx];
noStroke();
fill(vis.particle[0], vis.particle[1], vis.particle[2], 40);
circle(tip.x, tip.y, random(5, 20));
}
// shape label
fill(200);
noStroke();
textSize(14);
textFont('monospace');
textAlign(LEFT);
text(vis.msg, 20, height - 20);
}
Show a peace sign and the background goes blue with blue particles trailing from your fingertips. Make a fist and everything goes dark. Thumbs up and green particles bloom. Rock sign paints in red. The visual response is immediate and shape-specific. Each gesture creates a different aesthetic world. See where this is going? :-)
Hand tracking is jittery -- more so than pose detection. The fingers are small targets, the joints are close together, and the model sometimes jumps between interpretations from frame to frame. Raw keypoint positions vibrate visibly, especially at the fingertips where small model errors translate to large pixel jumps.
The fix is the same lerp smoothing from episode 93, applied per-keypoint:
let smoothedHands = [{}, {}]; // up to 2 hands
function smoothHand(hand, handIdx) {
const sm = smoothedHands[handIdx];
for (let i = 0; i < hand.keypoints.length; i++) {
const kp = hand.keypoints[i];
const key = 'kp' + i;
if (!sm[key]) {
sm[key] = { x: kp.x, y: kp.y };
}
const amt = 0.35;
sm[key].x = lerp(sm[key].x, kp.x, amt);
sm[key].y = lerp(sm[key].y, kp.y, amt);
}
return sm;
}
function getSmoothedKeypoint(smoothed, idx) {
return smoothed['kp' + idx];
}
A lerp amount of 0.35 keeps things responsive while damping most of the jitter. For drawing apps you might go lower (0.2) for smoother strokes. For gesture detection you want higher (0.5) so the shape classification responds quickly to finger movement. I find 0.35 is a decent middle ground for most interactive pieces.
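To get a feel for what those numbers mean, here is a small standalone experiment. It is my own illustration, not part of the sketch, and `lerpValue` and `framesToSettle` are made-up names. It counts how many frames a lerped value needs to cover 90% of a sudden jump.

```javascript
// Same formula as p5's lerp(), written out so this runs without p5.
function lerpValue(start, stop, amt) {
  return start + (stop - start) * amt;
}

// Number of frames until a value lerped toward its target every frame
// has covered `fraction` of a jump from 0 to 1.
function framesToSettle(amt, fraction = 0.9) {
  if (amt <= 0) return Infinity; // never moves
  let value = 0;
  let frames = 0;
  while (value < fraction) {
    value = lerpValue(value, 1, amt);
    frames++;
  }
  return frames;
}

console.log(framesToSettle(0.2));  // 11
console.log(framesToSettle(0.35)); // 6
console.log(framesToSettle(0.5));  // 4
```

Each frame closes a fixed share of the remaining gap, so the leftover error after n frames is (1 - amt) to the power of n. Assuming the smoothing runs once per draw() call at 60 fps, that works out to roughly 0.18, 0.10 and 0.07 seconds of lag for the three values.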
The model tracks both hands independently. The hands array can contain zero, one, or two entries. Two-hand interaction opens up expressive possibilities that single-hand can't match: the distance between hands, whether they mirror each other, the combined shape of ten fingers instead of five.
function draw() {
background(12, 15, 22, 25);
if (hands.length < 2) {
fill(80);
textSize(12);
textFont('monospace');
text('show both hands', 20, height - 20);
return;
}
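// hands[0] and hands[1] are simply the first and second detection;
// "left" and "right" are labels here, not guaranteed handedness
const left = hands[0];
const right = hands[1];
// distance between wrists (standing in for palm centers)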
const lw = left.keypoints[0];
const rw = right.keypoints[0];
const handDist = dist(lw.x, lw.y, rw.x, rw.y);
// map hand distance to visual parameter
// close together = tight pattern, far apart = expanded
const spread = map(handDist, 30, 500, 0, 1, true);
const numRings = Math.floor(lerp(3, 25, spread));
const maxRadius = lerp(30, 250, spread);
// center point between hands
const cx = (lw.x + rw.x) / 2;
const cy = (lw.y + rw.y) / 2;
// count extended fingers on each hand
const leftCount = countExtendedFingers(left);
const rightCount = countExtendedFingers(right);
const totalFingers = leftCount + rightCount;
// draw concentric rings -- count determines color, distance determines size
for (let i = 0; i < numRings; i++) {
const t = i / numRings;
const r = t * maxRadius;
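// 30 degrees per finger: 0 and 10 fingers land on different hues
// (36 per finger would wrap 10 fingers back around to hue 0)
const hue = (totalFingers * 30 + t * 60 + frameCount * 0.5) % 360;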
noFill();
colorMode(HSB, 360, 100, 100, 100);
stroke(hue, 55, 65, 25);
strokeWeight(1.5);
ellipse(cx, cy, r * 2, r * 2);
colorMode(RGB);
}
// show finger count from each hand
fill(160);
noStroke();
textSize(10);
textFont('monospace');
text('L:' + leftCount + ' R:' + rightCount, 20, height - 20);
text('dist: ' + handDist.toFixed(0), 20, height - 35);
}
Hold both hands in front of the camera, fingers spread. Move them apart and the rings expand outward from the center. Bring them close together and the rings compress into a tight cluster. Change the number of extended fingers and the color shifts -- five fingers up gives one palette, all closed gives another. The two-hand interaction creates a control surface with multiple parameters: distance, position, and ten individual finger states. That's more expressive than a MIDI controller.
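One caveat about the `L:` and `R:` readout: I'm assuming the order of the `hands` array isn't tied to where the hands are on screen, so `hands[0]` may be either one. A simple workaround, sketched here with a made-up helper name, is to order the two hands by wrist x position.

```javascript
// Hypothetical helper: returns [leftOnScreen, rightOnScreen] ordered by the
// x position of each wrist (keypoint 0), or null with fewer than two hands.
function orderHandsByScreenPosition(hands) {
  if (hands.length < 2) return null;
  const a = hands[0];
  const b = hands[1];
  return a.keypoints[0].x <= b.keypoints[0].x ? [a, b] : [b, a];
}

// Usage inside draw(), replacing the two direct array lookups:
// const ordered = orderHandsByScreenPosition(hands);
// if (!ordered) return;
// const [left, right] = ordered;
```

This gives you screen-left and screen-right, which is usually what a visual piece cares about. It says nothing about which hand is anatomically left or right, and it will still flip if the hands cross.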
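Each keypoint includes a z coordinate -- depth relative to the camera. It's less precise than x and y (the model estimates depth from a 2D image, which is inherently ambiguous), but it's usable for rough depth gestures: push forward, pull back. Where the z value lives can depend on your ml5 version: if kp.z comes back undefined, console.log a hand object and look for a separate 3D keypoint array. The sketch below falls back to 0 when z is missing, so it won't crash, but the dots won't react to depth either.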
function draw() {
background(12, 15, 22);
if (hands.length === 0) return;
const hand = hands[0];
for (let i = 0; i < hand.keypoints.length; i++) {
const kp = hand.keypoints[i];
// z is typically negative (closer = more negative)
// normalize to a usable range
const depth = kp.z || 0;
const normDepth = map(depth, -100, 0, 1, 0, true);
// closer = bigger and brighter, further = smaller and dimmer
const size = lerp(4, 18, normDepth);
const alpha = lerp(40, 220, normDepth);
noStroke();
fill(150, 200, 230, alpha);
circle(kp.x, kp.y, size);
}
}
Push your hand toward the camera and the dots grow larger and brighter. Pull it back and they shrink and fade. The effect is subtle because z-precision is lower than x/y, but it adds a layer of depth that pure 2D tracking doesn't give you. Map z to zoom level, layer depth in a parallax scene, or intensity of an effect. Just don't rely on z for anything that needs pixel-accurate precision -- use it as a rough analog control rather than a precise position.
Allez, let's build something that pulls this all together. A visual instrument where each finger controls a different visual element. Pinch thumb to a finger to activate that element. The more fingers you activate, the more complex the visual becomes. Open hand resets everything. Fist mutes all.
let video, handPose;
let hands = [];
let channels = {
index: { active: false, hue: 0, particles: [] },
middle: { active: false, hue: 80, particles: [] },
ring: { active: false, hue: 160, particles: [] },
pinky: { active: false, hue: 240, particles: [] }
};
function preload() {
handPose = ml5.handPose({ flipped: true });
}
function setup() {
createCanvas(800, 600);
colorMode(HSB, 360, 100, 100, 100);
video = createCapture(VIDEO, { flipped: true });
video.size(320, 240);
video.hide();
handPose.detectStart(video, function(r) { hands = r; });
}
function isThumbTouching(hand, fingerTipIdx) {
const thumb = hand.keypoints[TIPS.thumb];
const finger = hand.keypoints[fingerTipIdx];
return dist(thumb.x, thumb.y, finger.x, finger.y) < 25;
}
function draw() {
background(0, 0, 5, 20);
if (hands.length > 0) {
const hand = hands[0];
// check thumb-to-finger pinches for each channel
channels.index.active = isThumbTouching(hand, TIPS.index);
channels.middle.active = isThumbTouching(hand, TIPS.middle);
channels.ring.active = isThumbTouching(hand, TIPS.ring);
channels.pinky.active = isThumbTouching(hand, TIPS.pinky);
// fist check: everything curled = mute all
if (countExtendedFingers(hand) === 0) {
for (const ch of Object.values(channels)) {
ch.active = false;
}
}
// emit particles from active channels
for (const name of Object.keys(channels)) {
const ch = channels[name];
if (!ch.active) continue;
const tipIdx = TIPS[name];
const tip = hand.keypoints[tipIdx];
// map tip position from video coords to canvas coords
const ex = map(tip.x, 0, 320, 0, width);
const ey = map(tip.y, 0, 240, 0, height);
for (let i = 0; i < 3; i++) {
ch.particles.push({
x: ex,
y: ey,
vx: random(-2, 2),
vy: random(-3, -0.5),
size: random(4, 12),
life: 1.0,
hue: ch.hue + random(-15, 15)
});
}
}
}
// update and draw all channel particles
for (const ch of Object.values(channels)) {
for (let i = ch.particles.length - 1; i >= 0; i--) {
const p = ch.particles[i];
p.x += p.vx;
p.vy += 0.02;
p.y += p.vy;
p.life -= 0.006;
if (p.life <= 0) {
ch.particles.splice(i, 1);
continue;
}
noStroke();
// wrap negative hues (e.g. 0 - 15) back into 0-360; JS % keeps the sign
fill((p.hue + 360) % 360, 60, 70, p.life * 60);
circle(p.x, p.y, p.size * p.life);
}
}
// channel indicators
let ix = 20;
for (const name of Object.keys(channels)) {
const ch = channels[name];
fill(ch.hue, ch.active ? 60 : 15, ch.active ? 65 : 25, 60);
noStroke();
rect(ix, height - 30, 50, 20, 3);
fill(0, 0, 90, 50);
textSize(9);
textFont('monospace');
textAlign(CENTER, CENTER);
text(name.substring(0, 3), ix + 25, height - 20);
ix += 60;
}
}
Touch your thumb to your index finger -- red particles start flowing from the index fingertip. Touch thumb to middle -- yellow-green particles from the middle finger. Touch thumb to ring and pinky and you get teal and blue streams too. Each finger is an independent channel. Activate one for a single stream, activate all four for a cascade of colours. Move your hand while pinching and the particle emitter follows your finger through space. Close your fist and everything goes quiet. Open your hand and all channels deactivate (because nothing is pinching anymore).
The result is a hand-controlled visual synthesizer. Like a theremin but for particles. Each finger is a key, and the spatial position of each fingertip is the parameter. You perform the visuals rather than configuring them. The artwork is the gesture itself, captured as cascading particles that drift and fade.
A few things worth knowing from experience.
Occlusion is the enemy. When fingers overlap from the camera's perspective -- which happens constantly in natural hand positions -- the model can't distinguish them. Curled fingers behind the palm become invisible. The model guesses, and it often guesses wrong. Flat hand toward the camera = best tracking. Edge-on hand = worst. Design your interactions around palm-facing-camera positions and you'll have a much better time.
Single hand is more reliable than two. Tracking two hands simultaneously requires the model to resolve which keypoint belongs to which hand. When hands cross or overlap, it sometimes swaps them -- your left hand's keypoints suddenly jump to the right hand. If your piece requires two-hand interaction, keep the hands spatially separated.
Performance budget. Hand tracking is heavier than pose detection. Expect 10-20 fps for detection. If your visual rendering is also demanding (hundreds of particles, multiple canvases, shader effects), the total frame rate drops further. Keep the video input small (320x240), run detection on every other frame if needed, and keep your particle counts under control.
The webcam constraint. Everything we build here requires the user to sit in front of a webcam with their hands visible. That's a limited interaction context compared to a touchscreen or mouse. But within that context, the expressive range is extraordinary -- 21 keypoints per hand, continuous positions, the ability to detect shapes, gestures, pinches, depth, two-hand relationships. It's a different kind of interface, not a worse one.
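On the "every other frame" advice: `detectStart` runs its own loop, so skipping frames means triggering detection yourself. Below is a rough sketch of the scheduling part only. `makeThrottledDetector` is a made-up helper, and the usage comment assumes ml5's one-shot `handPose.detect(media, callback)`, so check that against the ml5 version you're running.

```javascript
// Hypothetical scheduler: calls detectOnce(done) on every Nth tick, and never
// while a previous detection is still in flight. detectOnce must call done()
// when its result has arrived.
function makeThrottledDetector(detectOnce, everyN) {
  let frame = 0;
  let busy = false;
  return function tick() {
    const due = frame % everyN === 0 && !busy;
    frame++;
    if (!due) return false;
    busy = true;
    detectOnce(function done() {
      busy = false;
    });
    return true;
  };
}

// Possible usage in a sketch (assumption: handPose.detect exists as described):
// const tick = makeThrottledDetector(function (done) {
//   handPose.detect(video, function (r) { hands = r; done(); });
// }, 2);
// function draw() { tick(); /* ...render as usual... */ }
```

The busy flag matters more than the frame counter. If a detection takes longer than two frames, requests would otherwise pile up and make the frame rate worse instead of better.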
Three episodes into the ML arc now. We went from category labels (episode 92) to body skeletons (episode 93) to individual finger joints. Each step gave us finer control, more detailed data, richer interaction possibilities. The pattern is the same as always -- the model produces structured data, you map it to creative output using every technique we've built since episode 10. Hands gave us 21 points of control per hand. The face, with its 468 landmarks, gives us even more.
Sallukes! Thanks for reading.
X