Vision Node
The Vision node turns images into something a workflow can act on — text, structured data, captions, classifications, or a transformed image. Pick an operation (what you want done) and a provider (where it runs), then wire an image into the node.
Quick start
- Drop a Vision node onto the canvas.
- Wire an image upstream — from a File node, an Image node, a screenshot, a Webcam node, or a previous Vision step.
- Open the config panel and choose an operation. The dropdown is grouped:
  - Recognize: pull literal content out of pixels (text, tables, charts)
  - Reason: open-ended interpretation (describe, Q&A, extract, classify, moderate)
  - Modify: image-to-image transforms (background removal, enhance, filters)
- Pick a provider if the operation supports more than one. Apple's on-device provider is the default wherever it is supported.
Providers at a glance
| Provider | Where it runs | Cost | Strongest at |
|---|---|---|---|
| Apple Intelligence | On your Mac or iPad, on-device | Free | Background removal, image enhance, Core Image filters, printed-text OCR, barcodes |
| Cloud multimodal LLM | Routed through Circuitry cloud | Credits | Handwriting, math, complex layouts, reasoning, structured extraction |
| CLI multimodal | Claude Code / Codex / Gemini-CLI on your machine | Bundled with your CLI subscription | High-volume jobs, screenshot-to-code, repeatable batches without per-call billing |
Apple gives you privacy and zero-cost on-device processing. Cloud models give you reasoning and handwriting recognition, billed in credits. CLI tools sit in between: they run on your machine but are powered by frontier models.
Operations
Recognize
| Operation | What it does | Providers |
|---|---|---|
| Recognize Text (OCR) | Extracts printed text. Returns { text, lines, confidence } | Apple (default), cloud, CLI |
| Recognize Handwriting | Cursive and sketches — frontier models excel here | Cloud, CLI |
| Math → LaTeX | Converts an equation image into LaTeX source | Cloud, CLI |
| Screenshot → Code | Transcribes code from a screenshot back to text | CLI (default), cloud |
| Translate Text in Image | Reads text and translates it in one step | Cloud, CLI |
| Chart → Data | Reverse-engineers a chart back into { series, data } | Cloud, CLI |
| Table → JSON | Reads a tabular image into rows and headers | Cloud, CLI |
Reason
| Operation | What it does | Providers |
|---|---|---|
| Describe Image | Produces a two-sentence caption | Cloud, CLI |
| Generate Alt Text | Accessibility caption under 100 characters | Cloud, CLI |
| Visual Q&A | Answers a question you type ("What's the total on this receipt?") | Cloud, CLI |
| Extract Structured Data | Returns JSON matching a schema you provide | Cloud, CLI |
| Classify (Open Labels) | Picks the best-matching label from your own list | Cloud, CLI |
| Count Objects | Counts instances of whatever you specify | Cloud, CLI |
| Tag / Keyword | Suggests 5–10 short tags | Cloud, CLI |
| Content Moderation | Flags NSFW / violence / hate-symbols with per-category scores | Cloud, CLI |
| Mock → HTML | Turns a UI sketch into working markup | CLI (default), cloud |
| Mock → React Component | Same for a React/JSX component | CLI (default), cloud |
Modify
| Operation | What it does | Providers |
|---|---|---|
| Remove Background | Strips the background, returns a transparent image | Apple |
| Auto Enhance | Improves exposure, contrast, sharpness | Apple |
| Apply Filter | Sepia, mono, noir, vintage, blur, sharpen, vibrance | Apple |
Example workflows
1. Receipt scanner
Drop a photo of a receipt, get a row in your expense spreadsheet.
File ──▶ Vision (Extract, schema: { vendor, date, total, currency }) ──▶ Sheet
Set the schema field to { "vendor": "string", "date": "string", "total": "number", "currency": "string" }. The Vision node returns output.data ready for the Sheet to append.
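To make the contract concrete, here is a minimal Python sketch of checking a returned `output.data` object against the schema hint above. The `matches_schema` helper, the type mapping, and the sample values are all hypothetical; how the Vision node itself validates results is not documented here.

```python
# Hypothetical sketch: check that an Extract result fits the schema hint.
# The type names ("string", "number") come from the schema shown above;
# the helper is an illustration, not part of the product.

SCHEMA = {"vendor": "string", "date": "string", "total": "number", "currency": "string"}

# Map schema type names to the Python types they would accept (an assumption).
TYPE_MAP = {"string": (str,), "number": (int, float)}


def matches_schema(data, schema):
    """Return a list of problems; an empty list means the data fits the schema."""
    problems = []
    for field, type_name in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPE_MAP[type_name]):
            problems.append(f"wrong type for {field}: expected {type_name}")
    return problems


# Sample values are made up for illustration.
sample = {"vendor": "Corner Cafe", "date": "2024-03-01", "total": 12.5, "currency": "USD"}
print(matches_schema(sample, SCHEMA))  # []
```

A check like this could sit between the Vision node and the Sheet so that malformed rows are caught before they are appended.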
2. Translate a foreign menu
Tourist mode in three nodes.
File or Webcam ──▶ Vision (Translate Text in Image, target: English) ──▶ Note
The Note shows the translated text. Swap the target language to French, Japanese, anything.
3. Screenshot to React component
Paste a UI mockup, get a JSX component back.
Image (paste screenshot) ──▶ Vision (Mock → React Component, provider: CLI) ──▶ Code (write file)
The Code node drops the generated component into your project. Run it again with a tweaked mockup and iterate.
4. Auto-tag a photo library
Folder (iterate images) ──▶ Vision (Tag / Keyword) ──▶ Sheet (path + tags)
The Tag operation returns labels: [{ label, confidence }]. The Sheet node can flatten that into a comma-separated cell.
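The flattening step is simple enough to sketch in Python. The `tags_to_cell` helper and its optional `min_confidence` cutoff are hypothetical, and the sketch assumes confidence values on a 0 to 1 scale, which the text above does not state.

```python
# Hypothetical sketch: flatten Tag / Keyword labels into one spreadsheet cell.
# Assumes each item looks like {"label": str, "confidence": float in 0..1}.

def tags_to_cell(labels, min_confidence=0.0):
    """Join label names with commas, dropping any below the confidence cutoff."""
    kept = [item["label"] for item in labels if item["confidence"] >= min_confidence]
    return ", ".join(kept)


# Sample labels are made up for illustration.
labels = [
    {"label": "beach", "confidence": 0.94},
    {"label": "sunset", "confidence": 0.81},
    {"label": "dog", "confidence": 0.42},
]
print(tags_to_cell(labels))                      # beach, sunset, dog
print(tags_to_cell(labels, min_confidence=0.5))  # beach, sunset
```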
5. Free, local receipt OCR at scale
Layer Apple's on-device OCR with cloud reasoning. Apple does the heavy lifting for free and on-device; the cloud only sees the extracted plain text, never the image.
Folder ──▶ Vision (OCR, provider: Apple) ──▶ Agent (clean + dedupe) ──▶ Sheet
6. Content moderation gate
Upload ──▶ Vision (Content Moderation) ──▶ Condition (block if flagged) ──▶ Publish
Vision returns { flagged, categories: [{ category, score }] }. The Condition node routes based on flagged.
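The gate logic can be sketched in Python as follows. The `should_block` helper and the optional per-category `threshold` are hypothetical additions for illustration, and the 0 to 1 score scale is an assumption; the Condition node as described only needs `flagged`.

```python
# Hypothetical sketch: decide whether to block an upload from a moderation result.
# Assumes the shape { flagged, categories: [{ category, score }] } described above.

def should_block(result, threshold=None):
    """Block on the `flagged` field, or on any category score at or above `threshold`."""
    if threshold is None:
        return bool(result["flagged"])
    return any(entry["score"] >= threshold for entry in result["categories"])


# Sample result is made up for illustration.
result = {
    "flagged": False,
    "categories": [
        {"category": "violence", "score": 0.72},
        {"category": "nsfw", "score": 0.05},
    ],
}
print(should_block(result))                 # False
print(should_block(result, threshold=0.7))  # True
```

A stricter threshold is one way to be more conservative than the provider's own `flagged` decision.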
Inputs and outputs
Input
The Vision node looks for an image on its incoming edge, checking — in this order — value, image, __image__, __chart__, chart, imageData, processedImage, output. Any upstream node that publishes one of those fields with a base64 image (or data URL) will feed the Vision node correctly.
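The lookup order above can be pictured as a first-match scan. This Python sketch is an illustration only: the `find_image` helper is hypothetical, and skipping empty or non-string values is an assumption rather than documented behavior.

```python
# Hypothetical sketch: resolve the image from an upstream payload using the
# field order listed above. Returns the field that matched and its value.

FIELD_ORDER = (
    "value", "image", "__image__", "__chart__",
    "chart", "imageData", "processedImage", "output",
)


def find_image(payload):
    """Return (field, value) for the first populated field, or None if none match."""
    for field in FIELD_ORDER:
        candidate = payload.get(field)
        # Assumption: only non-empty strings (base64 or data URL) count as an image.
        if isinstance(candidate, str) and candidate:
            return field, candidate
    return None


# "image" wins over "output" because it comes earlier in the order.
print(find_image({"output": "bbbb", "image": "aaaa"}))  # ('image', 'aaaa')
```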
Output
Output shape depends on the operation:
- OCR / Translate / Caption / Q&A → { text, lines?, confidence? }
- Extract / Chart / Table → { data: <parsed JSON> }
- Classify / Tag → { labels: [{ label, confidence }] }
- Count → { count, target }
- Moderate → { flagged, categories: [{ category, score }] }
- Mock → HTML/JSX → { code, language }
- Modify ops → { image, imageData, processedImage }
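Because each shape listed above has a distinctive key, downstream logic can tell them apart by inspecting keys. The following Python sketch shows one way to do that; the `output_family` helper and the family names it returns are hypothetical.

```python
# Hypothetical sketch: identify which family of Vision output a dict belongs to
# by looking at its keys. The family names are illustrative only.

# Checked in order; each entry is (identifying key, family name).
SHAPE_KEYS = [
    ("text", "text"),
    ("data", "structured"),
    ("labels", "labels"),
    ("count", "count"),
    ("flagged", "moderation"),
    ("code", "code"),
    ("image", "image"),
]


def output_family(output):
    """Return the family name for the first identifying key found, else 'unknown'."""
    for key, family in SHAPE_KEYS:
        if key in output:
            return family
    return "unknown"


print(output_family({"count": 3, "target": "cars"}))  # count
```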
Output shape is the same regardless of provider for a given operation — switch from Apple OCR to GPT-4o OCR and downstream nodes don't need to change.
Forwarding images downstream
When you wire a Vision node into an Agent node, the Agent automatically picks up the image (if any) and forwards it to its multimodal model as a real attachment — not as base64 text in the prompt. This means a Vision node doing background removal can hand its result straight to an Agent for further analysis with no glue.
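To illustrate the difference between an attachment and base64 text, here is a minimal Python sketch that turns a data URL or bare base64 string into raw bytes plus a MIME type. The `decode_image` helper is hypothetical; how the Agent node actually builds its attachments is not described here.

```python
import base64

# Hypothetical sketch: convert a base64 image string (optionally wrapped in a
# data URL) into raw bytes, the form a real file attachment would carry.


def decode_image(value):
    """Return (mime type or None, raw bytes) for a data URL or bare base64 string."""
    mime = None
    if value.startswith("data:"):
        # A data URL looks like "data:image/png;base64,<payload>".
        header, _, payload = value.partition(",")
        mime = header[len("data:"):].split(";")[0] or None
    else:
        payload = value
    return mime, base64.b64decode(payload)


# Round-trip a few bytes to show both accepted input forms.
raw = b"\x89PNG\r\n"
encoded = base64.b64encode(raw).decode("ascii")
print(decode_image("data:image/png;base64," + encoded)[0])  # image/png
print(decode_image(encoded) == (None, raw))                 # True
```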
Tips
- Privacy-sensitive workloads → Apple. Faces, IDs, medical images: keep them on-device.
- Handwriting → cloud or CLI. Apple's OCR is print-leaning; modern VLMs handle cursive much better.
- High volume → CLI. If you're tagging thousands of photos, the CLI provider runs under your existing CLI subscription, so there is no per-call billing.
- Structured extraction → write a schema. A two-line schema hint reliably shapes the JSON.
- Multiple steps are cheap. It's common to chain two Vision steps — for example "auto-crop document" → "OCR the crop."
Related
- Image Node — feed images into the Vision node
- Agent Node — receives Vision output (text or images) for further reasoning
- Sheet Node — common destination for OCR / extract output
- Apple Intelligence — what's on-device, how to check availability