Model Vision

Overview

FlowDown supports visual understanding through local MLX or cloud vision models and can auto-generate descriptions, OCR, and QR results for text-only models. Images are resized to ~1024 px and stripped of EXIF by default when Settings → Editor → Compress Image is on.

Image compression setting

Quick Start

  1. Click the beside the composer, or drag/paste files into the input box.
  2. On iOS/iPadOS, choose Take Photo, Photo Library, or Files; on macOS, pick any system-supported file.
  3. Send your question/instruction and wait for the analysis result.

Thumbnails appear above the composer; tap or hover to preview, rename, or remove attachments.

Add image from composer

Configuration Checklist

  • Enable Vision capability: Turn on Vision under Settings → Model (local) or on the cloud model edit page. Forcing Vision on unsupported models will fail.
  • Auxiliary Visual Model: Settings → Inference → Visual Assessment → Auxiliary Visual Model. Used when the active chat model lacks Vision or when you need text fallbacks (descriptions/OCR/QR).
  • Skip Recognition If Possible: Settings → Inference → Visual Assessment, default on. When the chat model has Vision, skipping sends the raw image; disable if you need OCR/backups or may switch to a text-only model.
  • Compress Image: Settings → Editor → Compress Image, on by default to shorten uploads and remove EXIF data.

Model capability configuration Visual inference configuration

How It Works

Dual Paths

  • Vision models: receive the image plus any generated description.
  • Text-only models: FlowDown calls the Auxiliary Visual Model to turn images into text before sending.

When Preprocessing Runs

  • Triggered when an image lacks a manual note and either:
    • the chat model does not support Vision, or
    • Skip Recognition If Possible is turned off.
  • Skipped when the chat model supports Vision and the skip toggle is on.

Processing Steps

  • Generate a scene description
  • Multilingual OCR extraction
  • QR code detection and decoding

Results are written to the attachment’s text representation and persisted (non-ephemeral). Images are always compressed and EXIF-stripped. If no Vision-capable Auxiliary Visual Model is configured and the chat model is text-only, the image will be skipped.

Delivery to the Conversation

  • Vision chat model: sends the raw image plus description.
  • Text-only chat model: sends text representation only; if empty, the attachment is omitted.

Prompting & Verification

  • Be explicit: “Summarize this whiteboard,” “Compare these two screenshots,” or “Convert the table to CSV.”
  • Reference filenames, e.g., “In invoice.png, what is the payment due date?”
  • Use the message menu → Raw Data to confirm [Image Description], [Image Optical Character Recognition Result], and [QRCode Recognition].
  • If results look off, ask the model to re-check attachments or add more context and resend.
  • Delete attachments from the message menu if you don’t want them kept.

Rendering with attachments