How MediaPipe AI Detection Works in Stop Biting: A Technical Explainer

What is MediaPipe and why does it matter?

MediaPipe is an open-source machine learning framework from Google for real-time perception tasks: detecting, tracking, and interpreting objects in camera streams. It has been used in Google's own products, reportedly including Pixel's portrait mode and Google Meet's background blur. It was open-sourced in 2019 and has since become one of the most widely used frameworks for on-device computer vision in web and mobile applications.

Two properties make MediaPipe a good fit for a nail biting detection app: its speed and its local-first architecture. All models run on the user's device with no cloud dependency, typically at 30–60 frames per second on standard consumer hardware.

The two models used for detection

Stop Biting uses two MediaPipe models in combination. The hand landmark model detects and tracks the hand in the camera frame, identifying 21 keypoints (fingertips, finger joints, and wrist) with roughly sub-centimetre accuracy under good conditions. The face mesh model detects 468 facial landmarks, including the precise location of the lips, mouth corners, and chin.

The detection logic computes the geometric distance between the fingertip landmarks and the mouth landmark cluster in real time. When this distance falls below a calibrated threshold, meaning the fingers are within typical nail-biting proximity of the mouth, the detection event fires. The threshold is a trade-off: a looser one catches more bites but raises false positives, while a tighter one does the opposite, so it is calibrated to balance the two.
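
The proximity check described above can be sketched in a few lines. This is a minimal illustration, not the app's actual implementation: the fingertip indices (4, 8, 12, 16, 20) follow MediaPipe's 21-point hand model, but the choice of mouth landmarks (13, 14, 61, 291), the normalization by mouth width, and the threshold value are all assumptions made for the example.

```typescript
// A landmark in normalized image coordinates (0..1), the form MediaPipe reports.
interface Landmark {
  x: number;
  y: number;
}

// Fingertips in the 21-point hand model: thumb, index, middle, ring, pinky.
const FINGERTIP_INDICES = [4, 8, 12, 16, 20];

// Assumed mouth cluster in the 468-point face mesh:
// inner upper lip, inner lower lip, and the two mouth corners.
const MOUTH_INDICES = [13, 14, 61, 291];

// Hypothetical threshold, expressed in multiples of the mouth width.
const NEAR_MOUTH_THRESHOLD = 0.6;

// Simplification: ignores depth and the frame's aspect ratio.
function distance2D(a: Landmark, b: Landmark): number {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

function centroid(points: Landmark[]): Landmark {
  const sum = points.reduce(
    (acc, p) => ({ x: acc.x + p.x, y: acc.y + p.y }),
    { x: 0, y: 0 }
  );
  return { x: sum.x / points.length, y: sum.y / points.length };
}

// True when the nearest fingertip is within the threshold of the mouth centre.
// Dividing by mouth width keeps the check roughly independent of how far
// the user sits from the camera.
function isNearMouth(
  hand: Landmark[],
  face: Landmark[],
  threshold: number = NEAR_MOUTH_THRESHOLD
): boolean {
  const mouth = centroid(MOUTH_INDICES.map((i) => face[i]));
  const mouthWidth = distance2D(face[61], face[291]);
  if (mouthWidth === 0) return false; // degenerate face data, skip the frame
  const nearest = Math.min(
    ...FINGERTIP_INDICES.map((i) => distance2D(hand[i], mouth))
  );
  return nearest / mouthWidth < threshold;
}
```

Calling isNearMouth once per frame with the two landmark arrays yields the boolean that would drive the detection event.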

WebAssembly: why the models run so fast

MediaPipe's inference runtime is compiled to WebAssembly (WASM), a binary instruction format that executes at near-native speed in web browsers and Node.js environments; the models themselves are data files that this runtime loads. WASM delivers roughly 50–80% of native C++ performance in the browser, which is fast enough to run both landmark detection models simultaneously at 30+ frames per second on a standard laptop.

The WASM binary includes SIMD (Single Instruction, Multiple Data) optimizations. Where the browser and CPU support them, a single instruction operates on several data values at once, so multiple pixel operations are performed in parallel. The app ships both SIMD and non-SIMD WASM builds and selects the appropriate one based on the browser's capability detection.
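
One common way to perform this kind of capability detection is to ask the engine to validate a tiny WASM module that contains SIMD instructions. The sketch below assumes that approach; the probe bytes are the ones used by the open-source wasm-feature-detect project, and the two file names are hypothetical placeholders rather than the app's real build artifacts.

```typescript
// Minimal module whose single function uses v128 SIMD instructions.
// Engines without SIMD support reject it during validation.
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3, 2, 1, 0,
  10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,
]);

// Returns true when the current JavaScript engine accepts SIMD bytecode.
function supportsWasmSimd(): boolean {
  // Looked up via globalThis so the sketch compiles without DOM typings.
  const wasm = (globalThis as any).WebAssembly;
  if (!wasm || typeof wasm.validate !== "function") return false;
  return wasm.validate(SIMD_PROBE) === true;
}

// Hypothetical file names for the two builds bundled with the app.
function pickWasmBuild(simdSupported: boolean): string {
  return simdSupported ? "detector_simd.wasm" : "detector_nosimd.wasm";
}

const selectedBuild = pickWasmBuild(supportsWasmSimd());
```

Keeping the probe separate from the selection makes the choice easy to override, for example to force the non-SIMD build while debugging.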

Model size and loading time

The face landmark model is approximately 3.9 MB, and the hand landmark model is approximately 8.4 MB. Both are bundled with the app and read from local files rather than downloaded over the network each session. Model initialization typically takes 1–3 seconds the first time on modern hardware; subsequent loads reuse cached models and are near-instantaneous.
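
The load-once, reuse-afterwards behaviour can be illustrated with a small in-memory cache. This is only a sketch under assumptions: the ModelCache class, the loader callback, and the file names in the usage note are hypothetical, and the app may rely on a different caching layer entirely.

```typescript
// Anything that can turn a bundled file path into the model's raw bytes.
type ModelLoader = (path: string) => Promise<Uint8Array>;

// Caches the in-flight or completed load for each model path, so every
// model file is read at most once per session.
class ModelCache {
  private readonly loader: ModelLoader;
  private readonly entries = new Map<string, Promise<Uint8Array>>();

  constructor(loader: ModelLoader) {
    this.loader = loader;
  }

  load(path: string): Promise<Uint8Array> {
    let entry = this.entries.get(path);
    if (entry === undefined) {
      // First request: start the slow load and remember the promise.
      entry = this.loader(path);
      this.entries.set(path, entry);
    }
    // Later requests resolve from the stored promise without re-reading.
    return entry;
  }
}
```

A caller would construct one cache at startup and request, say, "hand_landmarker.task" and "face_landmarker.task" through it; a second request for either path returns the same promise without touching the disk again.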

The WASM runtime itself adds approximately 6 MB, bringing the total payload to roughly 18 MB. Cold-start overhead (loading the WASM runtime plus both models) is typically under 5 seconds, after which detection runs continuously at full frame rate.

Accuracy and detection limitations

MediaPipe's hand and face landmark models are trained on diverse datasets and generally perform well across skin tones, everyday lighting conditions, and camera angles. However, detection accuracy degrades in specific conditions: very low light, extreme camera angles (more than 45° off-axis from the face), and partial occlusion of the hand.

The most common source of false positives is a hand near the mouth for some reason other than biting: scratching the chin, resting a hand on the cheek, or eating. Users typically learn which gestures trigger the detector within the first few sessions. False negatives most commonly occur when the hand approaches from below the camera frame, where the fingertips may stay out of view, or when shadows push landmark confidence below the minimum the detector requires.
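
To show how low confidence turns into a missed detection, the sketch below classifies a single frame. Everything in it is an assumption for illustration: the FrameReading shape, the outcome labels, and both numeric limits are hypothetical and are not taken from the app.

```typescript
type FrameOutcome = "near-mouth" | "clear" | "low-confidence" | "not-visible";

interface FrameReading {
  handScore: number | null; // null when no hand was found in the frame
  faceScore: number | null; // null when no face was found in the frame
  fingertipToMouth: number; // distance in multiples of mouth width
}

// Hypothetical limits, for illustration only.
const MIN_LANDMARK_CONFIDENCE = 0.5;
const NEAR_MOUTH_LIMIT = 0.6;

function classifyFrame(reading: FrameReading): FrameOutcome {
  // A hand entering from below the frame never produces landmarks at all.
  if (reading.handScore === null || reading.faceScore === null) {
    return "not-visible";
  }
  // Shadowed or poorly lit frames are discarded before the distance check,
  // which is how a real bite can go unreported.
  if (
    reading.handScore < MIN_LANDMARK_CONFIDENCE ||
    reading.faceScore < MIN_LANDMARK_CONFIDENCE
  ) {
    return "low-confidence";
  }
  return reading.fingertipToMouth < NEAR_MOUTH_LIMIT ? "near-mouth" : "clear";
}
```

Only the "near-mouth" outcome would fire a detection event; the "low-confidence" and "not-visible" outcomes correspond to the two false-negative cases described above.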