Running ML models in the browser alongside a real-time AR pipeline sounds like a recipe for dropped frames. And it is, if you do it the obvious way.
I have shipped a dozen AR experiences that use on-device AI - face mesh tracking, hand pose estimation, object detection, even simple scene understanding. The key is treating the ML model as a constrained resource, not a magic box. Every model has a latency cost, and in AR, you are already spending most of your frame budget on rendering.
Here is the first rule: your total frame budget at 60fps is 16.6 milliseconds. Rendering typically takes 8-12ms. That leaves 4-6ms for everything else - input handling, state updates, and your ML inference. If your model takes 20ms per inference, you have already blown the budget.
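The arithmetic is worth wrapping in a helper so it shows up in your tooling instead of living in your head. This is a trivial sketch; the function name is mine, and the 10ms render cost below is just an example figure consistent with the range above.

```javascript
// How much of the frame is left for inference after rendering?
// targetFps: the frame rate you are committing to (60, 30, ...)
// renderMs:  measured render cost per frame
function inferenceHeadroomMs(targetFps, renderMs) {
  return 1000 / targetFps - renderMs;
}

// At 60fps with 10ms of rendering, roughly 6.7ms remain per frame.
// A 20ms model cannot fit in that window, which is the whole point.
```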
The fix is not to run inference every frame. Most tracking use cases do not need 60Hz updates. Face mesh at 30Hz looks perfectly smooth. Hand pose at 15-20Hz is fine for gesture recognition. I run inference on a separate requestAnimationFrame loop, decoupled from the render loop, and interpolate between results. The user sees smooth tracking. The GPU sees manageable load.
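A minimal sketch of that decoupling, assuming a `runInference` stand-in for the actual model call (in practice a TensorFlow.js `predict`). The scheduling and interpolation logic are the point; the 30Hz target and the keypoint shape ([x, y] pairs) are illustrative.

```javascript
const TARGET_HZ = 30;                 // inference rate, not render rate
const INTERVAL_MS = 1000 / TARGET_HZ;

let prevResult = null;                // previous inference result
let lastResult = null;                // newest inference result
let lastInferenceTime = 0;

// Linear interpolation between two keypoint arrays of [x, y] pairs.
function lerpKeypoints(a, b, t) {
  return a.map((p, i) => [
    p[0] + (b[i][0] - p[0]) * t,
    p[1] + (b[i][1] - p[1]) * t,
  ]);
}

// Called by the render loop every frame: returns smoothed keypoints
// without ever waiting on the model. Tracking lags by one inference
// interval, which is the price of smoothness.
function sampleTracking(now) {
  if (!prevResult || !lastResult) return lastResult;
  const t = Math.min((now - lastInferenceTime) / INTERVAL_MS, 1);
  return lerpKeypoints(prevResult, lastResult, t);
}

// Inference loop, throttled to TARGET_HZ, separate from rendering.
async function inferenceLoop(now, runInference) {
  if (now - lastInferenceTime >= INTERVAL_MS) {
    prevResult = lastResult;
    lastResult = await runInference();
    lastInferenceTime = now;
  }
  if (typeof requestAnimationFrame !== 'undefined') {
    requestAnimationFrame((t) => inferenceLoop(t, runInference));
  }
}
```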
Model selection matters more than model accuracy. TensorFlow.js offers multiple backends: WebGL, WASM, and now WebGPU. For most AR applications, the WebGL backend is the sweet spot. It runs inference on the GPU alongside your rendering, which sounds like a conflict but actually works well because inference and rendering rarely peak at the same time within a frame.
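In code, backend choice reduces to a preference order with fallbacks. A sketch, with the selection factored into a plain function so it is testable on its own; in real code you would hand the result to TensorFlow.js's `tf.setBackend` and await `tf.ready()`. Treating WebGPU as opt-in is my assumption, reflecting how new it still is.

```javascript
// Pick the best available backend. `supported` is whatever your
// capability probe found; the order encodes the advice above:
// WebGL first for AR, WASM as the broad fallback, CPU as last resort.
function pickBackend(supported, preferWebGPU = false) {
  const order = preferWebGPU
    ? ['webgpu', 'webgl', 'wasm', 'cpu']
    : ['webgl', 'wasm', 'cpu'];
  return order.find((b) => supported.includes(b)) ?? 'cpu';
}

// Usage with TensorFlow.js (assumed wiring):
//   await tf.setBackend(pickBackend(detectedBackends));
//   await tf.ready();
```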
Quantization is non-negotiable. A full-precision face mesh model might be 4MB and take 25ms per inference. The quantized int8 version is 1MB and runs in 8ms with negligible accuracy loss. TensorFlow.js makes this easy - most of their pre-trained models ship quantized variants. If you are training custom models, quantize before export. Always.
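To see why the size drops roughly 4x with little accuracy loss, here is a toy illustration of what int8 quantization does to a weight vector. To be clear: with TensorFlow.js the real quantization happens at conversion/export time, not at runtime; this snippet just makes the scale-and-offset idea concrete.

```javascript
// Map float weights onto 8-bit integers plus a scale and offset.
// One byte per weight instead of four; the maximum rounding error
// per weight is half the quantization step.
function quantizeInt8(weights) {
  const min = Math.min(...weights);
  const max = Math.max(...weights);
  const scale = (max - min) / 255 || 1; // guard against constant weights
  const q = weights.map((w) => Math.round((w - min) / scale));
  return { q, scale, min };
}

function dequantizeInt8({ q, scale, min }) {
  return q.map((v) => v * scale + min);
}
```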
Fallback strategies are what most teams skip and where most users suffer. Not every phone has the same GPU capabilities. I test on three tiers: flagship (iPhone 15 Pro, Galaxy S24), mid-range (Pixel 7a, Galaxy A54), and low-end (two-year-old budget Androids). If a device cannot hit 30fps with the full model, I fall back. Simpler model first. Static overlay second. Graceful message third. Never a white screen.
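The decision itself fits in a few lines. A sketch under my own naming; the field names and the 30fps threshold come from the paragraph above, and the inputs would be the sustained fps you measure on the device during a short probe.

```javascript
// Step down the ladder until something holds 30fps.
// fullModelFps / liteModelFps: measured sustained fps per candidate.
// cameraOk: whether the AR camera feed itself runs acceptably.
function chooseExperience({ fullModelFps, liteModelFps, cameraOk }) {
  if (fullModelFps >= 30) return 'full-model';
  if (liteModelFps >= 30) return 'lite-model';
  if (cameraOk) return 'static-overlay'; // camera works, models too slow
  return 'message';                      // explain gracefully, never a white screen
}
```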
The architecture I have settled on looks like this: a main render loop driving PlayCanvas or Three.js, a separate inference worker running TensorFlow.js on a requestAnimationFrame callback at a target frequency, a shared state buffer that the render loop reads from, and an interpolation layer that smooths between inference results. Clean separation. Predictable performance.
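The shared state buffer is the hinge of that architecture, so here is its minimal shape. The class and worker file names are illustrative; the invariant is what matters: the worker posts results, the main thread keeps only the latest two, and the render loop reads without ever blocking on the model.

```javascript
// Holds the last two inference results so the render loop can
// interpolate between them. Single producer (worker message handler),
// single consumer (render loop).
class TrackingBuffer {
  constructor() {
    this.prev = null;   // older result
    this.latest = null; // newest result
  }
  push(result, timestamp) {
    this.prev = this.latest;
    this.latest = { result, timestamp };
  }
  read() {
    return this.latest ? { prev: this.prev, latest: this.latest } : null;
  }
}

// Main-thread wiring (assumed file name and message shape):
//   const worker = new Worker('inference-worker.js');
//   const buffer = new TrackingBuffer();
//   worker.onmessage = (e) => buffer.push(e.data.keypoints, e.data.t);
```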
One pattern that saved me repeatedly is what I call eager warm-up. ML models in TensorFlow.js have a cold start - the first inference is 3-5x slower as the GPU compiles shaders and allocates memory. I run a dummy inference during the loading screen, before the AR camera even opens. By the time the user sees the experience, the model is warm and fast.
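The warm-up itself is a one-liner wrapped in timing. A sketch with an assumed `model` interface; with TensorFlow.js the dummy input would be something like `tf.zeros` of the model's input shape, fed through `predict` once and disposed.

```javascript
// Run one throwaway inference during the loading screen so shader
// compilation and memory allocation happen before the camera opens.
// makeDummyInput: e.g. () => tf.zeros([1, 192, 192, 3]) in tf.js.
async function warmUp(model, makeDummyInput) {
  const t0 = Date.now();
  const input = makeDummyInput();
  await model.predict(input); // first call pays the cold-start cost
  return Date.now() - t0;     // warm-up time, worth logging per device
}
```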
The last piece is measurement. I log inference time, render time, and dropped frames per session and per device tier. Not in development. In production. This data tells you exactly where your budget is being spent and which devices need fallback paths. Without it, you are guessing.
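A sketch of what that logging can look like. Class and field names are mine; the design choice worth copying is aggregating on-device and shipping one compact summary per session, rather than a log line per frame.

```javascript
// Per-session performance log, tagged with the device tier.
class PerfLog {
  constructor(deviceTier) {
    this.deviceTier = deviceTier;
    this.inferenceMs = [];
    this.renderMs = [];
    this.droppedFrames = 0;
  }
  // Call once per frame; inferMs is null on frames with no inference.
  frame(inferMs, renderMs, dropped) {
    if (inferMs !== null) this.inferenceMs.push(inferMs);
    this.renderMs.push(renderMs);
    if (dropped) this.droppedFrames++;
  }
  // Compact summary to upload at session end. p95 is approximate
  // (index-based), which is plenty for budget tracking.
  summary() {
    const p95 = (xs) => xs.length
      ? [...xs].sort((a, b) => a - b)[Math.floor(xs.length * 0.95)]
      : null;
    return {
      deviceTier: this.deviceTier,
      p95InferenceMs: p95(this.inferenceMs),
      p95RenderMs: p95(this.renderMs),
      droppedFrames: this.droppedFrames,
    };
  }
}
```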
