Computer Vision Nodes in Stitch: ML Pipelines and the Bounding Box Debugging Story

Published on May 25, 2026

Computer Vision Nodes in Stitch:

Building CoreML and Vision-powered detection nodes for a visual prototyping tool, and the investigation that fixed bounding box alignment.

What is Stitch?

Stitch is an open-source visual prototyping tool in the spirit of Origami Studio and Quartz Composer. Instead of writing code, designers and developers wire together nodes in a graph to build interactive prototypes. During my 2+ years on the project I built the computer vision pieces – CoreML and Vision-powered detection nodes, QR and face detection, and the coordinate-conversion plumbing that ties them to SwiftUI overlays. Active development on Stitch has since wound down; the code remains open source.


Why a Node-Based Tool is a Natural Fit for Computer Vision

A visual prototyping tool like Stitch is a natural fit for computer vision pipelines. The data flow maps directly: camera input goes in, ML inference happens in the middle, structured data comes out. I built a set of nodes so that a graph author can wire a camera feed into a CoreML node, fan the detection results out to overlay layers, and see bounding boxes or labels rendered on screen; allowing prototypers the ability to build a Computer Vision pipeline without writing code.

The challenge was making the coordinate math invisible. Vision frameworks output normalized coordinates, camera feeds arrive in pixel space, and SwiftUI layouts use their own coordinate system. The nodes need to handle all of that translation so the graph author only thinks about “camera in, detections out.”


Image Classification Node

The CoreMLClassificationNode accepts a Core ML model file plus a camera/image input and outputs detected labels, confidence scores, positions, and bounding boxes. Under the hood, a dedicated VisionOpActor runs VNCoreMLRequest on a background queue, reusing requests and keeping state actor-isolated for thread safety.

Here’s what it looks like in an actual graph - connected to a front-facing camera feed, running inference on a MobileNetV2 CoreML model, and displaying the inference results.

Object Detection Node

The object detection path uses the same camera-to-Core ML pipeline, but returns detections with class labels, confidence scores, locations, and bounding boxes. In this clip, the live camera feed detects objects on a table and renders labels and bounding boxes over the preview in full-screen mode. The default model the node ships with is YOLOv3Tiny.

QR Code Detection

The QRCodeDetectionNode wraps Vision’s barcode detection request (scoped to the .qr symbology) and returns a boolean detection flag, the decoded message, corner positions, and bounding box size. Like the CoreML node, it flips the Y-axis and translates bounding boxes to match Stitch’s coordinate system, making it trivial to drive highlights or UI prompts from QR scans.

The QR node shares the same coordinate conversion pipeline as the CoreML node, which was an early validation that the transformRect approach was general enough to serve as the standard conversion layer across all vision features.

Coordinate Conversion

Bounding boxes from Vision arrive in normalized coordinates (0-1 range, origin at bottom-left). Stitch’s UI uses SwiftUI layout coordinates (origin at top-left, sized to the view). The shared transformRect helper bridges this gap, flipping the Y-axis and scaling to the target view size in a single pass. This same math is reused across every vision node in the system.

The same conceptual bounding box plotted in Vision's normalized bottom-left space and SwiftUI's top-left view-pixel space, with the transformRect helper bridging them

It worked in my demo. Why was production drifting?

This was one of the more instructive debugging episodes I worked through. The setup: I had built a standalone demo app to prototype object detection, and it rendered bounding boxes perfectly aligned with detected objects. But Stitch’s overlay drifted – especially when tilting the device along the Y-axis.

The Three Differences

The tricky part was separating “what differed between the demo and Stitch” from “what actually fixed Stitch.” The reference demo and Stitch differed in three places, but only one of those variables changed in the final fix:

Aspect Reference demo Stitch before Stitch after Result
Rendering GeometryReader + Path Rectangle().position() GeometryReader + Path Root cause
Buffer source Direct pixel buffers CIImage conversion CIImage conversion Ruled out
Overlay target AVCaptureVideoPreviewLayer SwiftUI Image SwiftUI Image Ruled out

The last column is the important part. Buffer conversion and overlay target were differences between the known-good demo and Stitch, but they stayed the same when Stitch was fixed. That ruled them out. The drift disappeared only when Stitch stopped drawing the already-converted box with Rectangle().position() and started drawing it with explicit coordinates in a Path.

The Fix

The fix was to adopt the GeometryReader approach and the transformRect(fromRect:toViewSize:) helper that flips the Y-axis and computes size/position offsets in a single pass. This math became the standard coordinate conversion used throughout Stitch’s vision pipeline – the same helper that the CoreML and QR nodes now rely on.

The Value of a Reference App

Building the standalone demo app wasn’t just useful for prototyping – it became the debugging tool that cracked the alignment problem. By having a known-good implementation to compare against, I could systematically isolate each variable (rendering approach, buffer source, overlay target) until the root cause was clear. This is a pattern I’d use again: when production behavior diverges from expected behavior, build the simplest possible version that works, then diff the two.


This is one of four posts about my work at Stitch. See also: Building a 3D System Inside Stitch, Bringing AR to Stitch, and Building StitchAI.