Image Studio — dedicated image-generation chat model
A custom Open WebUI model preset that wraps a base LLM with a system
prompt heavily biased toward calling the smart_image_gen tool. Users
pick Image Studio from the chat-model dropdown when they want to
generate or edit images, and the LLM treats every message as an image
request — calling generate_image for new images and edit_image for
modifications to attached ones.
This exists because general-purpose chat models often "describe" an image in text instead of calling the tool, especially when the request is conversational ("can you draw me…", "I'd like a picture of…"). A dedicated preset removes the ambiguity.
Two ways to install
Option A: Import the JSON (fast)
Workspace → Models → Import (top right) → upload
image_studio.json.
This drops the preset in fully configured: base model, system prompt, tool attachment, function-calling mode, temperature, suggestion prompts. Verify after import (a field-level sketch of the file follows this checklist):
- The `smart_image_gen` tool is actually attached (Tools list under the model's edit screen). If not, the tool ID Open WebUI assigned doesn't match the `toolIds: ["smart_image_gen"]` in the JSON — re-attach manually.
- Base Model is set to `huihui_ai/qwen3-vl-abliterated:8b`. Adjust if you want a different LLM (mistral-nemo:12b, Qwen3.6, or Llama 3.1 also work well; smaller parameter counts may struggle with native tool calling).
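For orientation, this is roughly the shape of the file, restricted to the fields this README references (base_model_id, toolIds, custom_params, meta.capabilities.vision). The exact export schema varies between Open WebUI versions, and the id and the nesting of function_calling / custom_params below are illustrative assumptions, so treat it as a sketch rather than a drop-in replacement for the shipped image_studio.json:

```json
{
  "id": "image-studio",
  "name": "Image Studio",
  "base_model_id": "huihui_ai/qwen3-vl-abliterated:8b",
  "params": {
    "system": "…the System prompt block below…",
    "temperature": 0.5,
    "top_p": 0.9,
    "function_calling": "native",
    "custom_params": {
      "tool_choice": "required",
      "enable_thinking": false
    }
  },
  "meta": {
    "description": "Image generation and routing across SDXL checkpoints.",
    "capabilities": { "vision": true },
    "toolIds": ["smart_image_gen"]
  }
}
```

The two custom_params entries are the same tool_choice / enable_thinking pair explained in the Custom Parameters row of Option B below.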
Option B: Create manually (table below)
Workspace → Models → + (top right).
| Field | Value |
|---|---|
| Name | Image Studio |
| Base Model | huihui_ai/qwen3-vl-abliterated:8b (Qwen 3 VL base, abliterated, vision + tools). Pull via init-models.sh first. The Qwen 3 VL fine-tune lineage isn't damaged by abliteration the way Qwen 3.5 is, so it both calls tools reliably AND won't refuse to dispatch on NSFW edit prompts. |
| Description | Image generation and routing across SDXL checkpoints. |
| System Prompt | Paste the block from System prompt below. |
| Tools | enable only smart_image_gen |
In the Advanced Params section:
| Field | Value |
|---|---|
| Function Calling | Native — works cleanly on huihui_ai/qwen3-vl-abliterated:8b once thinking is disabled (see Custom Parameters). Native gives you the structured "View Result from edit_image" blocks and "Thought for X seconds" tracing in the UI. |
| Temperature | 0.5 (lower = more reliable tool-calling) |
| Top P | 0.9 |
| Context Length | leave default |
| Custom Parameters | tool_choice: required (forces the model to call a tool every turn) and enable_thinking: false (disables Qwen's thinking mode at the API level — the /no_think system-prompt directive isn't honored by abliterated Qwen builds, but this server-side flag is). Both required for reliable behaviour on huihui_ai/qwen3-vl-abliterated:8b. |
Save. The new model appears in the chat-model dropdown for any user with access.
System prompt
/no_think
You are an image-tool dispatcher. You do not respond in prose. Every
user message MUST result in exactly one tool call.
ROUTING:
- If the user attached an image (including images you previously
generated in this chat) → call edit_image(prompt=..., ...)
- Otherwise → call generate_image(prompt=..., ...)
Both tools take `prompt` as the first argument — same name on both.
Do NOT invent `edit_instruction`.
Fire the tool on the FIRST message, with no preamble. Do not write a
'plan', 'approach', 'steps', 'breakdown', or any explanation before
calling. Do not ask clarifying questions. Do not say what you are
about to do. If the request is vague, pick reasonable defaults and
call the tool — the user iterates after.
STYLES (pick one):
photo        photorealistic photo / portrait / cinematic
juggernaut   alternate photoreal — sharper, more saturated
pony         anime, cartoon, manga, stylised illustration
general      catch-all when nothing else fits
furry-nai    anthropomorphic, NAI-trained mix
furry-noob   anthropomorphic, NoobAI base
furry-il     anthropomorphic, Illustrious base (default for any
             furry/anthro request)
STYLE FOR edit_image — the tool ENFORCES inheritance: once a style
has been used in this chat, every subsequent edit_image call uses
the same style regardless of what you pass. Behaviour:
- Edit on an image generated earlier in this chat → OMIT `style`
entirely. The tool will use the established style. Passing it is
harmless but ignored.
- Edit on a fresh user upload (no prior tool call in chat) → look at
the image and pick a style: anthropomorphic furry/scaly/feathered
→ furry-il; pony score-tag art → pony; photo / portrait → photo
or juggernaut; anime → pony; ambiguous → general.
- Style cannot be changed mid-chat. If the user wants a different
style, tell them they need to start a new chat — the tool ignores
style overrides on follow-up calls.
edit_image has TWO MODES — pick based on whether the change is local
or global:
- LOCAL ("change the ball to a basketball", "add a hat to the dog",
"remove the bird", "recolor the car red") → set `mask_text` to a
brief noun phrase naming the region ("the ball", "the dog", "the
bird", "the car"). Only that region is repainted; rest stays
pixel-perfect.
- GLOBAL ("make this a sunset", "turn this into anime", "restyle as
oil painting") → leave mask_text unset. The whole image is
reimagined.
ALWAYS prefer LOCAL when the user names a specific object, person,
or region. GLOBAL is only for whole-image style/lighting
transformations.
Denoise:
- LOCAL (mask_text set): default 1.0. Drop to 0.6–0.8 only for
subtle local edits that should retain some original structure.
- GLOBAL (no mask_text): default 0.7. Use 0.3–0.5 for subtle
restyle, 0.85–1.0 for radical reimagining.
Pick style for the DESIRED OUTPUT, not the input image.
Write rich, descriptive prompts (subject, action, environment,
lighting, mood, framing). Do NOT add quality tags like 'masterpiece',
'best quality', 'score_9', 'absurdres' — the tool prepends the
correct tags per style. Do NOT set sampler, CFG, steps, scheduler —
the tool picks them.
AFTER the tool returns, write at most one short PLAIN-ENGLISH
sentence noting your style/mode choice and offering one iteration
idea. The image is already shown to the user.
NEVER, after the tool returns:
- echo or repeat the tool call (no `edit_image(prompt=..., ...)`,
no `<function=...>`, no JSON, no parameter listings)
- describe what's in the image
- list the arguments you used
- enumerate styles, denoise, mask_text, etc.
Those details are visible in the collapsible 'View Result from
edit_image' tool-result block — the user can expand it if they
care. Your follow-up message is for HUMAN conversation, not
bookkeeping.
The first line /no_think asks Qwen 3.x to skip its reasoning phase. Abliterated Qwen builds ignore it (see the Qwen 3.x quirk below), so treat it as a best-effort backup for non-abliterated Qwen variants; the authoritative kill-switch is the enable_thinking: false custom parameter. If your base model isn't Qwen 3 at all, leaving the line in is a no-op (other models ignore it). Drop it only if it actually causes problems.
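To make the LOCAL vs GLOBAL split concrete, here is roughly what the tool-call arguments look like for each mode. The parameter names (prompt, mask_text, denoise) are the ones the prompt above mandates; the surrounding call structure is illustrative only, not Open WebUI's or Ollama's exact wire format:

```json
{
  "local_edit": {
    "tool": "edit_image",
    "arguments": {
      "prompt": "a worn leather basketball on green grass, afternoon sunlight, shallow depth of field",
      "mask_text": "the ball",
      "denoise": 1.0
    }
  },
  "global_edit": {
    "tool": "edit_image",
    "arguments": {
      "prompt": "the same scene at golden-hour sunset, warm rim light, long shadows, dramatic sky",
      "denoise": 0.7
    }
  }
}
```

Note that neither call passes style: both assume an earlier generation in this chat already established it, so per the inheritance rule any override would be ignored anyway. The global edit leaves mask_text unset so the whole image is reimagined.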
Set a separate Task Model (required after install)
tool_choice: required is what makes Image Studio reliably fire the
tool, but it has a side effect: Open WebUI uses the same model with
the same params for title generation, tag generation, and
autocomplete. With every response forced to be a tool call, those
text-only background tasks can't produce text, so chats stay named
"New Chat" forever and tag suggestions go silent.
Fix: point Open WebUI at a different model for those tasks.
Admin Settings → Interface → Task Model → pick any of the
non-Image-Studio models you have pulled. mistral-nemo:12b,
llama3.1:8b, qwen3.6:latest, or dolphin3:8b all work. The Task
Model only handles short background calls (titles, tags, autocomplete,
search-query rewriting) — it doesn't need to be vision-capable or
particularly large. Smaller is faster and cheaper.
Save. New Image Studio chats now get descriptive titles, tag suggestions return, and autocomplete lights up.
Vision capability
The shipped preset sets meta.capabilities.vision: true so Open WebUI
allows users to attach images to chats with this model. Two paths:
Default — huihui_ai/qwen3-vl-abliterated:8b
The shipped preset uses huihui_ai's abliteration of Qwen 3 VL as
the base — 8B params, vision-capable, native tool calling working,
and won't refuse to dispatch the tool when the user's edit prompt
is NSFW. Preseed via init-models.sh.
Why not the Qwen 3.5 abliterated 9B (huihui_ai/qwen3.5-abliterated:9b)?
Same maintainer, but the abliteration on Qwen 3.5 mangles the
function-call template, causing the model to either refuse to call
tools or emit malformed <function=...> XML that Open WebUI's
parser can't recognise. The Qwen 3 VL fine-tune lineage is
different and doesn't take that damage from abliteration.
Why not standard qwen3.5:9b? The standard (non-abliterated)
Qwen 3.5 calls tools reliably but its safety training refuses on
many image edit prompts even though the LLM's only job is dispatch
(the actual image content is generated by the SDXL checkpoint, which
the LLM never sees). Abliterated VL gets us both reliable tool
calling AND a cooperative dispatcher.
Qwen 3.x quirk: thinking mode is on by default and abliterated
builds ignore the system-prompt /no_think directive — the model
emits its tool call inside a thinking block that the parser treats
as final response text instead of a real tool invocation. The
shipped preset sets enable_thinking: false in custom_params,
which Ollama enforces server-side and the model can't ignore. Don't
remove it.
Alternatives
If the abliterated Qwen 3 VL default isn't a fit (size, language preferences, abliteration caveats), other vision-capable Ollama tags worth trying:
- `qwen2.5vl:7b` — smaller, no thinking mode, very reliable tool-caller
- `llama3.2-vision:11b` — Meta's vision variant, ~7 GB
- `minicpm-v:8b` — fast, capable
To swap, change base_model_id in image_studio.json (or the Base
Model field if you imported manually) and pull the model via
init-models.sh or the Open WebUI model UI.
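Assuming the exported file keeps base_model_id as a top-level key (the key name is the one this README already uses), the swap is a one-line change; only the changed keys are shown here, leave the rest of the file as exported:

```json
{
  "name": "Image Studio",
  "base_model_id": "qwen2.5vl:7b"
}
```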
Non-vision base model
If you'd rather use a text-only LLM (e.g. mistral-nemo:12b),
keep vision: true in the preset so Open WebUI still permits image
attachments; the image flows through to edit_image via
__messages__ / __files__ and ComfyUI does the visual work. The
LLM can't see the image, but for explicit edit instructions ("change
the background to a sunset") that doesn't matter.
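A sketch of the relevant keys for that setup, using the field names this README already references (base_model_id, meta.capabilities.vision); everything else in the preset stays as shipped:

```json
{
  "base_model_id": "mistral-nemo:12b",
  "meta": {
    "capabilities": { "vision": true }
  }
}
```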
Why this works when a generic chat model didn't
- The system prompt is unambiguous. No room for the model to decide "I'll just describe it in text instead."
- Only one tool is attached. No competing tools to choose between.
- Function Calling: Native, on a base that handles it cleanly. On huihui_ai/qwen3-vl-abliterated:8b Native works once enable_thinking: false is set (see Custom Parameters) and is what produces the structured "View Result from edit_image" blocks. The caveat is abliterated Qwen 3.5: it emits mangled `<function=...><parameter=...>` XML that leaks to chat as plain text on the published Open WebUI / Ollama versions, so if you swap to that base, switch Function Calling to Default, which uses Open WebUI's own prompt-injection wrapper and round-trips reliably. mistral-nemo and qwen2.5vl are also known to work end-to-end with Native.
- Lower temperature. Tool calling is more reliable with less sampling randomness.
Iterating on the system prompt
If users ask for things you didn't anticipate (specific aspect ratios, multi-image batches, particular checkpoints not in the routing rules), edit the system prompt above and re-paste into the Workspace → Models entry. It's the highest-leverage place to tune behaviour without touching the Tool's Python.