Vision Node

The Vision node turns images into something a workflow can act on — text, structured data, captions, classifications, or a transformed image. Pick an operation (what you want done) and a provider (where it runs), then wire an image into the node.

Quick start

  1. Drop a Vision node onto the canvas.
  2. Wire an image source upstream: a File node, an Image node, a screenshot, a Webcam node, or a previous Vision step.
  3. Open the config panel and choose an operation. The dropdown is grouped:
    • Recognize — pull literal content out of pixels (text, tables, charts)
    • Reason — open-ended interpretation (describe, Q&A, extract, classify, moderate)
    • Modify — image-to-image transforms (background removal, enhance, filters)
  4. Pick a provider if the operation supports more than one. Apple's on-device provider is the default for every operation it supports.

Providers at a glance

Provider | Where it runs | Cost | Strongest at
Apple Intelligence | On your Mac or iPad, on-device | Free | Background removal, image enhance, Core Image filters, printed-text OCR, barcodes
Cloud multimodal LLM | Routed through Circuitry cloud | Credits | Handwriting, math, complex layouts, reasoning, structured extraction
CLI multimodal | Claude Code / Codex / Gemini-CLI on your machine | Bundled with your CLI subscription | High-volume jobs, screenshot-to-code, repeatable batches without per-call billing

Apple gives you privacy and free on-device processing. Cloud models add reasoning and handwriting support, billed in credits. CLI tools sit in between: they run from your machine but are powered by frontier models.

Operations

Recognize

Operation | What it does | Providers
Recognize Text (OCR) | Extracts printed text. Returns { text, lines, confidence } | Apple (default), cloud, CLI
Recognize Handwriting | Reads cursive and sketches; frontier models excel here | Cloud, CLI
Math → LaTeX | Converts an equation image into LaTeX source | Cloud, CLI
Screenshot → Code | Transcribes code from a screenshot back to text | CLI (default), cloud
Translate Text in Image | Reads text and translates it in one step | Cloud, CLI
Chart → Data | Reverse-engineers a chart back into { series, data } | Cloud, CLI
Table → JSON | Reads a tabular image into rows and headers | Cloud, CLI

Reason

Operation | What it does | Providers
Describe Image | Produces a two-sentence caption | Cloud, CLI
Generate Alt Text | Produces an accessibility caption under 100 characters | Cloud, CLI
Visual Q&A | Answers a question you type ("What's the total on this receipt?") | Cloud, CLI
Extract Structured Data | Returns JSON matching a schema you provide | Cloud, CLI
Classify (Open Labels) | Picks the best-matching label from your own list | Cloud, CLI
Count Objects | Counts instances of whatever you specify | Cloud, CLI
Tag / Keyword | Suggests 5–10 short tags | Cloud, CLI
Content Moderation | Flags NSFW / violence / hate-symbols with per-category scores | Cloud, CLI
Mock → HTML | Turns a UI sketch into working markup | CLI (default), cloud
Mock → React Component | Does the same, but outputs a React/JSX component | CLI (default), cloud

Modify

Operation | What it does | Providers
Remove Background | Strips the background, returns a transparent image | Apple
Auto Enhance | Improves exposure, contrast, sharpness | Apple
Apply Filter | Sepia, mono, noir, vintage, blur, sharpen, vibrance | Apple
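
The Providers column in the tables above can be read as a lookup: each operation lists the providers it supports, and some mark a default. The Python sketch below shows one way to model that reading. The operation ids, the `PROVIDERS` table, and the `resolve_provider` helper are hypothetical names for illustration, not part of the Vision node itself, and only four operations are transcribed. Raising an error when no default is marked is an assumption; the node may behave differently.

```python
from typing import Optional

# operation id -> (supported providers, documented default or None).
# Ids are hypothetical; provider lists are copied from the tables above.
PROVIDERS = {
    "recognize_text": (["apple", "cloud", "cli"], "apple"),
    "recognize_handwriting": (["cloud", "cli"], None),
    "screenshot_to_code": (["cli", "cloud"], "cli"),
    "remove_background": (["apple"], None),
}


def resolve_provider(operation: str, requested: Optional[str] = None) -> str:
    """Pick a provider for an operation, honoring an explicit request."""
    supported, default = PROVIDERS[operation]
    if requested is not None:
        if requested not in supported:
            raise ValueError(f"{operation} does not support provider {requested!r}")
        return requested
    if default is not None:
        return default
    if len(supported) == 1:
        # Only one choice, so there is nothing to pick.
        return supported[0]
    # Assumption: with several providers and no marked default, ask the user.
    raise ValueError(f"{operation} needs an explicit provider: {supported}")


print(resolve_provider("recognize_text"))      # apple
print(resolve_provider("screenshot_to_code"))  # cli
```

A lookup like this makes it easy to catch an unsupported combination (for example, Remove Background on the cloud provider) before a workflow runs.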

Example workflows

1. Receipt scanner

Drop a photo of a receipt, get a row in your expense spreadsheet.

File ──▶ Vision (Extract, schema: { vendor, date, total, currency })  ──▶ Sheet

Set the schema field to { "vendor": "string", "date": "string", "total": "number", "currency": "string" }. The Vision node returns output.data ready for the Sheet to append.
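
If you want to guard the Sheet against a malformed row, a small check between the two nodes can compare output.data with the schema hint. The sketch below is one possible approach in Python, not something the Vision node does for you. The `check_against_schema` helper, the mapping from "string" and "number" to Python types, and the sample receipt values are all assumptions made up for illustration.

```python
SCHEMA = {"vendor": "string", "date": "string", "total": "number", "currency": "string"}

# Assumed mapping from the schema hint's type names to Python types.
TYPE_MAP = {"string": str, "number": (int, float)}


def check_against_schema(data: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the row looks safe to append."""
    problems = []
    for field, type_name in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPE_MAP[type_name]):
            problems.append(f"{field}: expected {type_name}, got {type(value).__name__}")
    return problems


# Made-up sample in the shape the Extract operation returns.
output = {"data": {"vendor": "Corner Cafe", "date": "2024-03-01", "total": 12.5, "currency": "USD"}}
print(check_against_schema(output["data"], SCHEMA))  # []
```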

2. Translate a foreign menu

Tourist mode in three nodes.

File or Webcam ──▶ Vision (Translate Text in Image, target: English) ──▶ Note

The Note shows the translated text. Swap the target language to French, Japanese, or anything else.

3. Screenshot to React component

Paste a UI mockup, get a JSX component back.

Image (paste screenshot) ──▶ Vision (Mock → React Component, provider: CLI) ──▶ Code (write file)

The Code node drops the generated component into your project. Run it again with a tweaked mockup and iterate.

4. Auto-tag a photo library

Folder (iterate images) ──▶ Vision (Tag / Keyword) ──▶ Sheet (path + tags)

The Tag operation returns { labels: [{ label, confidence }] }. The Sheet node can flatten that into a comma-separated cell.
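
For reference, the flattening step amounts to something like the Python sketch below. The `tags_to_cell` helper and its `min_confidence` cutoff are hypothetical, the sample tags are made up, and the sketch assumes confidence is a number between 0 and 1, which the output shape does not state.

```python
def tags_to_cell(output: dict, min_confidence: float = 0.0) -> str:
    """Join tag labels into one comma-separated string, dropping weak tags."""
    labels = output.get("labels", [])
    kept = [
        item["label"]
        for item in labels
        if item.get("confidence", 0.0) >= min_confidence
    ]
    return ", ".join(kept)


# Made-up sample in the { labels: [{ label, confidence }] } shape.
sample = {
    "labels": [
        {"label": "beach", "confidence": 0.94},
        {"label": "sunset", "confidence": 0.81},
        {"label": "dog", "confidence": 0.32},
    ]
}
print(tags_to_cell(sample))       # beach, sunset, dog
print(tags_to_cell(sample, 0.5))  # beach, sunset
```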

5. Free, local receipt OCR at scale

Layer Apple's on-device OCR with cloud reasoning. Apple does the heavy lifting; the cloud only sees plain text.

Folder ──▶ Vision (OCR, provider: Apple) ──▶ Agent (clean + dedupe) ──▶ Sheet

6. Content moderation gate

Upload ──▶ Vision (Content Moderation) ──▶ Condition (block if flagged) ──▶ Publish

Vision returns { flagged, categories: [{ category, score }] }. The Condition node routes based on flagged.
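
The Condition node's decision can be sketched in Python as follows. Routing on `flagged` comes straight from the output shape. The optional per-category `threshold` is an added assumption for cases where you want a stricter gate than the provider's own flag, and `should_block` is a hypothetical helper name.

```python
from typing import Optional


def should_block(result: dict, threshold: Optional[float] = None) -> bool:
    """Return True when the upload should not be published."""
    if result.get("flagged", False):
        return True
    if threshold is None:
        return False
    # Assumed stricter mode: block if any category score reaches the threshold.
    return any(c["score"] >= threshold for c in result.get("categories", []))


# Made-up sample in the { flagged, categories } shape.
borderline = {"flagged": False, "categories": [{"category": "violence", "score": 0.4}]}
print(should_block(borderline))       # False
print(should_block(borderline, 0.3))  # True
```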

Inputs and outputs

Input

The Vision node looks for an image on its incoming edge, checking — in this order — value, image, __image__, __chart__, chart, imageData, processedImage, output. Any upstream node that publishes one of those fields with a base64 image (or data URL) will feed the Vision node correctly.
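
The lookup order above behaves like a first-match scan over the upstream payload. Here is a minimal Python sketch of that idea. The `find_image` helper is hypothetical, and skipping fields that are missing or not non-empty strings is an assumption about how the node treats unusable values.

```python
# Field names in the documented lookup order.
IMAGE_FIELDS = (
    "value", "image", "__image__", "__chart__",
    "chart", "imageData", "processedImage", "output",
)


def find_image(payload: dict):
    """Return (field, image) for the first usable field, or None if there is none."""
    for field in IMAGE_FIELDS:
        candidate = payload.get(field)
        # Assumption: only non-empty strings (base64 or data URL) count.
        if isinstance(candidate, str) and candidate:
            return field, candidate
    return None


payload = {"processedImage": "data:image/png;base64,AAAA", "image": "BBBB"}
print(find_image(payload))  # ('image', 'BBBB')
```

Note that when a payload carries several of these fields, the earlier one in the list wins, so `image` takes priority over `processedImage` in the example.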

Output

Output shape depends on the operation:

  • OCR / Translate / Caption / Q&A: { text, lines?, confidence? }
  • Extract / Chart / Table: { data: <parsed JSON> }
  • Classify / Tag: { labels: [{ label, confidence }] }
  • Count: { count, target }
  • Moderate: { flagged, categories: [{ category, score }] }
  • Mock → HTML/JSX: { code, language }
  • Modify ops: { image, imageData, processedImage }

Output shape is the same regardless of provider for a given operation — switch from Apple OCR to GPT-4o OCR and downstream nodes don't need to change.
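
Because the shape depends only on the operation, a downstream step can check for the keys it needs without knowing which provider ran. The Python sketch below illustrates this. The family names, the `EXPECTED_KEYS` table, and `has_expected_shape` are hypothetical, the sample results are made up, and treating all three Modify fields as required is an assumption.

```python
# Required keys per operation family, taken from the output shapes listed above.
EXPECTED_KEYS = {
    "ocr": {"text"},  # lines and confidence are optional
    "extract": {"data"},
    "classify": {"labels"},
    "count": {"count", "target"},
    "moderate": {"flagged", "categories"},
    "mock": {"code", "language"},
    "modify": {"image", "imageData", "processedImage"},
}


def has_expected_shape(family: str, output: dict) -> bool:
    """Check that an output carries every required key for its family."""
    return EXPECTED_KEYS[family].issubset(output)


# Made-up results from two providers for the same OCR operation.
apple_result = {"text": "TOTAL 12.50", "lines": ["TOTAL 12.50"], "confidence": 0.98}
cloud_result = {"text": "TOTAL 12.50"}
print(has_expected_shape("ocr", apple_result))  # True
print(has_expected_shape("ocr", cloud_result))  # True
```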

Forwarding images downstream

When you wire a Vision node into an Agent node, the Agent automatically picks up the image (if any) and forwards it to its multimodal model as a real attachment — not as base64 text in the prompt. This means a Vision node doing background removal can hand its result straight to an Agent for further analysis with no glue.
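
To illustrate the difference between an attachment and base64 text, the Python sketch below splits an image string into a MIME type and raw bytes, which is roughly the form a multimodal model receives as an attachment. It is not the Agent node's actual code. The `to_attachment` helper, its return shape, and the PNG fallback for bare base64 strings are assumptions, and it only handles base64-encoded data URLs.

```python
import base64


def to_attachment(image: str, default_mime: str = "image/png") -> dict:
    """Turn a data URL or bare base64 string into {mime_type, bytes}."""
    if image.startswith("data:"):
        # A data URL looks like: data:image/jpeg;base64,<payload>
        header, _, payload = image.partition(",")
        mime = header[len("data:"):].split(";")[0] or default_mime
    else:
        # Assumption: bare base64 strings carry no type, so fall back to PNG.
        mime, payload = default_mime, image
    return {"mime_type": mime, "bytes": base64.b64decode(payload)}


raw = b"\x89PNG"
encoded = base64.b64encode(raw).decode("ascii")
attachment = to_attachment("data:image/jpeg;base64," + encoded)
print(attachment["mime_type"])  # image/jpeg
```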

Tips

  • Privacy-sensitive workloads → Apple. Faces, IDs, medical images: keep them on-device.
  • Handwriting → cloud or CLI. Apple's OCR is print-leaning; modern VLMs handle cursive much better.
  • High volume → CLI. If you're tagging thousands of photos, the CLI provider runs them under your existing CLI subscription instead of billing per call.
  • Structured extraction → write a schema. A two-line schema hint reliably shapes the JSON.
  • Multiple steps are cheap. It's common to chain two Vision steps — for example "auto-crop document" → "OCR the crop."

Related

  • Image Node — feed images into the Vision node
  • Agent Node — receives Vision output (text or images) for further reasoning
  • Sheet Node — common destination for OCR / extract output
  • Apple Intelligence — what's on-device, how to check availability