5 Commits

Author SHA1 Message Date
2cecf77981 Pin transformers <5 — comfyui_segment_anything's GroundingDINO needs it
transformers 5.0 removed BertModel.get_head_mask (it was part of the
legacy 4.x API). comfyui_segment_anything's GroundingDINO bertwarper.py
still calls bert_model.get_head_mask in __init__, so the first inpaint
crashes with an AttributeError. Pinned transformers>=4.40,<5 in two
places:

  - Dockerfile: applied AFTER the custom node's requirements.txt
    install so it wins on a fresh image build.
  - install-custom-node-deps.sh entrypoint: re-applied at every
    container start so any future custom-node install (via
    ComfyUI-Manager or volume clone) that pulls a newer transformers
    transitively gets pinned back into the working range.
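
A hedged sketch of the failure mode, not part of the commit: a guard
one could drop in before loading GroundingDINO (assumes the `packaging`
package is available):

  # Illustrative only — mirrors the AttributeError the pin prevents.
  from packaging.version import Version
  import transformers

  if Version(transformers.__version__).major >= 5:
      raise RuntimeError(
          "comfyui_segment_anything needs transformers>=4.40,<5: "
          "BertModel.get_head_mask is gone in 5.x"
      )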

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:21:21 -05:00
f26dfbee02 smart_image_gen v0.7.2: chat-DB fallback + diagnostic 'no image' msg
If __messages__ doesn't include the assistant's prior file attachments
(as the screenshot shows), the new fallback queries the chat by id
via Chats.get_chat_by_id and walks every persisted
message for files. Open WebUI's socket handler always upserts files
onto the assistant message via {'files': files} so this path is
authoritative.

The 'No image found' return now includes diagnostic counts —
__files__, __messages__, messages_with_files, chat_id_present,
openwebui_runtime — so subsequent failures actually show what the
tool saw instead of being opaque.
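
A minimal sketch of the fallback, assuming Open WebUI's Chats model is
importable (the real code adds try/except and image-type checks):

  from open_webui.models.chats import Chats

  def persisted_files(chat_id: str) -> list[dict]:
      """All file dicts persisted on the chat, newest message first."""
      chat = Chats.get_chat_by_id(chat_id)
      data = getattr(chat, "chat", None) or {}
      return [
          f
          for msg in reversed(data.get("messages", []))
          for f in (msg.get("files") or [])
      ]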

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 15:07:55 -05:00
06433d3815 smart_image_gen v0.7.1: rename edit_image arg + parse file id from URL
Two bugs in one screenshot:

1. The LLM called edit_image(prompt=..., ...) but the signature was
   edit_image(edit_instruction=..., ...), so the call crashed with a
   missing-argument error. Renamed the first param to `prompt` so both
   tools have a matching, predictable name. The system prompt was
   updated with an explicit 'do not invent edit_instruction' line for
   stubborn models.

2. After fix #1, edit_image still couldn't find the prior generated
   image because Open WebUI assistant-message file attachments only
   carry {type, url} (no id, no path). _read_file_dict now also
   greps the file id out of /api/v1/files/<uuid>/content URLs and
   feeds it to Files.get_file_by_id. Verified that the pattern matches
   absolute URLs (https://llm-1.srvno.de/api/v1/files/.../content).

System prompt also now says 'including images you previously
generated in this chat' to nudge the LLM to pick up assistant
outputs as edit candidates.
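
The 0.7.1 id-parsing regex, exercised against both URL shapes (the
UUID below is made up for the demo):

  import re

  _FILE_URL_ID_RE = re.compile(r"/(?:api/v1/)?files/([0-9a-fA-F-]{8,})(?:/content)?")

  for url in (
      "/api/v1/files/0b9e2f4c-7a1d-4e58-9f30-6c2d8a1b5e77/content",
      "https://llm-1.srvno.de/api/v1/files/0b9e2f4c-7a1d-4e58-9f30-6c2d8a1b5e77/content",
  ):
      m = _FILE_URL_ID_RE.search(url)
      print(m and m.group(1))  # prints the file id for both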

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:58:40 -05:00
780ce42711 Image Studio: move tool_choice into params.custom_params (correct field)
Previous commit put tool_choice at the top level of params. Open WebUI
drops that silently — apply_model_params_to_body has a whitelist of
mapped param names (temperature, top_p, etc.) and tool_choice isn't
on it. The Custom Parameters UI section also only iterates
params.custom_params, which is why the value didn't appear there
after importing the preset.

Correct location is the custom_params sub-dict, where values go
through json.loads before being merged into the outgoing chat
completion body. 'required' stays a string after the failed
json.loads and ends up exactly where the OpenAI / Ollama tools spec
expects it.

Source: src/lib/components/chat/Settings/Advanced/AdvancedParams.svelte
(UI binding) and backend/open_webui/utils/payload.py (serialization).
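
A simplified sketch of why the string survives, assuming json.loads-based
coercion (the real logic lives in payload.py; names here are illustrative):

  import json

  def coerce(value):
      try:
          return json.loads(value)   # "0.5" -> 0.5, '{"a": 1}' -> dict
      except (json.JSONDecodeError, TypeError):
          return value               # "required" fails to parse, stays a str

  body = {k: coerce(v) for k, v in {"tool_choice": "required"}.items()}
  assert body["tool_choice"] == "required"  # lands verbatim in the request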

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:50:34 -05:00
f6f5690fcd smart_image_gen v0.7: edit_image finds previously-emitted images
Bug: after generate_image surfaced an image via the files event, the
next edit_image call returned 'No image found in the chat'. The image
was attached to the assistant's message, but _extract_attached_image
only scanned the user's __files__ param and image_url content blocks
on user messages — it never looked at messages.files for any role.

Fix: rewrite extraction to scan messages[].files in reverse for ALL
roles, so an assistant-emitted image from a prior tool call is found
the same way as a user-attached upload. Use Open WebUI's internal
Files.get_file_by_id when the file dict has an id, so we get raw
bytes from disk without going through the auth-protected
/api/v1/files/{id}/content endpoint. Old path-key and URL-fetch
paths kept as fallbacks.

Refactored shared helpers _file_dict_is_image and _read_file_dict
out of the loop to keep the search logic readable.
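
A condensed sketch of the new lookup, assuming Open WebUI's Files model
exposes a filesystem path (the tool's _read_file_dict also falls back
to meta.path, URL-parsed ids, and HTTP fetches):

  from open_webui.models.files import Files

  def read_newest_image(messages: list) -> bytes | None:
      for msg in reversed(messages or []):      # newest first, any role
          for f in msg.get("files") or []:
              if "image" not in (f.get("type") or "").lower():
                  continue
              model = Files.get_file_by_id(f["id"]) if f.get("id") else None
              path = getattr(model, "path", None) if model else f.get("path")
              if path:
                  with open(path, "rb") as fh:  # raw bytes, no auth endpoint
                      return fh.read()
      return None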

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 14:46:10 -05:00
5 changed files with 178 additions and 38 deletions

View File

@@ -57,9 +57,15 @@ RUN git clone --depth 1 https://github.com/ltdrdata/ComfyUI-Manager.git \
# mask_text parameter). Model weights auto-download on first use into
# /opt/comfyui/models/{sams,grounding-dino}/ — first inpaint takes ~3 GB of
# downloads, subsequent runs are instant.
+#
+# Transformers must stay <5: GroundingDINO inside this node calls
+# BertModel.get_head_mask, which transformers 5.0 silently removed. The pin
+# is applied AFTER the requirements install so it overrides anything the
+# upstream requirements.txt would have pulled.
RUN git clone --depth 1 https://github.com/storyicon/comfyui_segment_anything.git \
${COMFYUI_HOME}/custom_nodes/comfyui_segment_anything && \
-    pip install -q -r ${COMFYUI_HOME}/custom_nodes/comfyui_segment_anything/requirements.txt
+    pip install -q -r ${COMFYUI_HOME}/custom_nodes/comfyui_segment_anything/requirements.txt && \
+    pip install -q "transformers>=4.40,<5"
# Entrypoint wrapper — auto-installs requirements.txt for any custom_node
# present at startup (covers Manager-installed nodes and nodes cloned

View File

@@ -4,11 +4,13 @@
"base_model_id": "huihui_ai/qwen3.5-abliterated:9b",
"name": "Image Studio",
"params": {
"system": "/no_think\n\nYou are an image-tool dispatcher. You do not respond in prose. Every user message MUST result in exactly one tool call.\n\nROUTING:\n- If the user attached an image → call edit_image\n- Otherwise → call generate_image\n\nFire the tool on the FIRST message, with no preamble. Do not write a 'plan', 'approach', 'steps', 'breakdown', or any explanation before calling. Do not ask clarifying questions. Do not say what you are about to do. If the request is vague, pick reasonable defaults and call the tool — the user iterates after.\n\nSTYLES (pick one):\n photo photorealistic photo / portrait / cinematic\n juggernaut alternate photoreal — sharper, more saturated\n pony anime, cartoon, manga, stylised illustration\n general catch-all when nothing else fits\n furry-nai anthropomorphic, NAI-trained mix\n furry-noob anthropomorphic, NoobAI base\n furry-il anthropomorphic, Illustrious base (default for any furry/anthro request)\n\nedit_image has TWO MODES — pick based on whether the change is local or global:\n- LOCAL change (\"change the ball to a basketball\", \"add a hat to the dog\", \"remove the bird\", \"recolor the car red\") → set `mask_text` to a brief noun phrase naming the region (\"the ball\", \"the dog\", \"the bird\", \"the car\"). Only that region is repainted; rest stays pixel-perfect.\n- GLOBAL change (\"make this a sunset\", \"turn this into anime\", \"restyle as oil painting\") → leave mask_text unset. The whole image is reimagined.\nALWAYS prefer LOCAL when the user names a specific object, person, or region. GLOBAL is only for whole-image style/lighting transformations.\n\nDenoise:\n- LOCAL (mask_text set): default 1.0. Drop to 0.60.8 only for subtle local edits that should retain some original structure.\n- GLOBAL (no mask_text): default 0.7. Use 0.30.5 for subtle restyle, 0.851.0 for radical reimagining.\n\nPick style for the DESIRED OUTPUT, not the input image.\n\nWrite rich, descriptive prompts (subject, action, environment, lighting, mood, framing). Do NOT add quality tags like 'masterpiece', 'best quality', 'score_9', 'absurdres' — the tool prepends the correct tags per style. Do NOT set sampler, CFG, steps, scheduler — the tool picks them.\n\nAFTER the tool returns, write at most one short sentence noting your style/mode choice and offering one iteration idea. The image is already shown to the user; do not describe it.",
"system": "/no_think\n\nYou are an image-tool dispatcher. You do not respond in prose. Every user message MUST result in exactly one tool call.\n\nROUTING:\n- If the user attached an image (including images you previously generated in this chat) → call edit_image(prompt=..., ...)\n- Otherwise → call generate_image(prompt=..., ...)\nBoth tools take `prompt` as the first argument — same name on both. Do NOT invent `edit_instruction`.\n\nFire the tool on the FIRST message, with no preamble. Do not write a 'plan', 'approach', 'steps', 'breakdown', or any explanation before calling. Do not ask clarifying questions. Do not say what you are about to do. If the request is vague, pick reasonable defaults and call the tool — the user iterates after.\n\nSTYLES (pick one):\n photo photorealistic photo / portrait / cinematic\n juggernaut alternate photoreal — sharper, more saturated\n pony anime, cartoon, manga, stylised illustration\n general catch-all when nothing else fits\n furry-nai anthropomorphic, NAI-trained mix\n furry-noob anthropomorphic, NoobAI base\n furry-il anthropomorphic, Illustrious base (default for any furry/anthro request)\n\nedit_image has TWO MODES — pick based on whether the change is local or global:\n- LOCAL change (\"change the ball to a basketball\", \"add a hat to the dog\", \"remove the bird\", \"recolor the car red\") → set `mask_text` to a brief noun phrase naming the region (\"the ball\", \"the dog\", \"the bird\", \"the car\"). Only that region is repainted; rest stays pixel-perfect.\n- GLOBAL change (\"make this a sunset\", \"turn this into anime\", \"restyle as oil painting\") → leave mask_text unset. The whole image is reimagined.\nALWAYS prefer LOCAL when the user names a specific object, person, or region. GLOBAL is only for whole-image style/lighting transformations.\n\nDenoise:\n- LOCAL (mask_text set): default 1.0. Drop to 0.60.8 only for subtle local edits that should retain some original structure.\n- GLOBAL (no mask_text): default 0.7. Use 0.30.5 for subtle restyle, 0.851.0 for radical reimagining.\n\nPick style for the DESIRED OUTPUT, not the input image.\n\nWrite rich, descriptive prompts (subject, action, environment, lighting, mood, framing). Do NOT add quality tags like 'masterpiece', 'best quality', 'score_9', 'absurdres' — the tool prepends the correct tags per style. Do NOT set sampler, CFG, steps, scheduler — the tool picks them.\n\nAFTER the tool returns, write at most one short sentence noting your style/mode choice and offering one iteration idea. The image is already shown to the user; do not describe it.",
"temperature": 0.5,
"top_p": 0.9,
"function_calling": "native",
"tool_choice": "required"
"custom_params": {
"tool_choice": "required"
}
},
"meta": {
"profile_image_url": "/static/favicon.png",

View File

@@ -65,8 +65,11 @@ You are an image-tool dispatcher. You do not respond in prose. Every
user message MUST result in exactly one tool call.
ROUTING:
-- If the user attached an image → call edit_image
-- Otherwise → call generate_image
+- If the user attached an image (including images you previously
+  generated in this chat) → call edit_image(prompt=..., ...)
+- Otherwise → call generate_image(prompt=..., ...)
+Both tools take `prompt` as the first argument — same name on both.
+Do NOT invent `edit_instruction`.
Fire the tool on the FIRST message, with no preamble. Do not write a
'plan', 'approach', 'steps', 'breakdown', or any explanation before

View File

@@ -1,7 +1,7 @@
"""
title: Smart Image Generator & Editor (ComfyUI)
author: ai-stack
-version: 0.6.0
+version: 0.7.2
description: Generate or edit images via ComfyUI with automatic SDXL
checkpoint routing. Two methods — generate_image (txt2img) and
edit_image (img2img on the user's most recently attached image). The
@@ -34,6 +34,8 @@ from pydantic import BaseModel, Field
# falls back to emitting a markdown data-URI message.
try:
from fastapi import UploadFile
+    from open_webui.models.chats import Chats
+    from open_webui.models.files import Files
from open_webui.models.users import Users
from open_webui.routers.files import upload_file_handler
@@ -338,20 +340,93 @@ def _build_img2img(positive: str, negative: str, settings: dict,
}
+def _file_dict_is_image(f: dict) -> bool:
+    ftype = (f.get("type") or "").lower()
+    fname = (f.get("name") or f.get("filename") or "").lower()
+    return "image" in ftype or fname.endswith((".png", ".jpg", ".jpeg", ".webp"))
+
+
+_FILE_URL_ID_RE = re.compile(r"/(?:api/v1/)?files/([0-9a-fA-F-]{8,})(?:/content)?")
+
+
+def _read_file_dict(f: dict) -> Optional[bytes]:
+    """
+    Try to read raw bytes for one file dict. Tries in order:
+      1. Local filesystem path keys (covers user uploads with `path`).
+      2. Open WebUI's Files.get_file_by_id with f["id"] (covers files
+         the user uploaded via the file API).
+      3. Same lookup with the id parsed out of f["url"] (covers
+         assistant-emitted files where the message attachment is just
+         {"type":"image","url":"/api/v1/files/<uuid>/content"} —
+         no id field, no path field, but the URL has the id).
+    """
+    for path_key in ("path", "filepath", "file_path"):
+        path = f.get(path_key)
+        if path:
+            try:
+                with open(path, "rb") as fh:
+                    return fh.read()
+            except OSError:
+                pass
+    candidate_ids = []
+    if f.get("id"):
+        candidate_ids.append(f["id"])
+    url = f.get("url")
+    if url:
+        m = _FILE_URL_ID_RE.search(url)
+        if m:
+            candidate_ids.append(m.group(1))
+    if _OPENWEBUI_RUNTIME:
+        for fid in candidate_ids:
+            try:
+                file_model = Files.get_file_by_id(fid)
+                if file_model is None:
+                    continue
+                path = getattr(file_model, "path", None)
+                if not path:
+                    meta = getattr(file_model, "meta", None) or {}
+                    if isinstance(meta, dict):
+                        path = meta.get("path")
+                    else:
+                        path = getattr(meta, "path", None)
+                if path:
+                    try:
+                        with open(path, "rb") as fh:
+                            return fh.read()
+                    except OSError:
+                        pass
+            except Exception:
+                pass
+    return None
async def _extract_attached_image(
files: Optional[list],
messages: Optional[list],
+    metadata: Optional[dict],
session: aiohttp.ClientSession,
) -> Optional[bytes]:
"""
-    Find the most recent image the user attached to the chat. Tries three
-    sources in order: (1) base64 data URIs in `image_url` content blocks
-    of the recent messages (works for vision-capable models), (2) a local
-    filesystem path on the file dict (open-webui stores uploads under
-    /app/backend/data/uploads/), (3) the file's url field, fetched over
-    HTTP. Returns raw image bytes, or None if nothing matched.
+    Find the most recent image in the chat — including images previously
+    emitted by this tool itself. Search order (most recent first):
+      1. Inline base64 data URIs in `image_url` content blocks of recent
+         messages (vision-model uploads, paste-from-clipboard).
+      2. Files attached to messages in the conversation, scanned in
+         REVERSE so the newest image wins. This covers two cases:
+           a. Files the user just attached (current user message).
+           b. Files the assistant emitted via prior `generate_image` /
+              `edit_image` calls (attached to assistant messages by the
+              `files` event in _push_image_to_chat).
+      3. The __files__ tool param as a final fallback (some Open WebUI
+         versions pass user uploads here instead of on the message).
+      4. Best-effort URL fetch on any leftover file dict (likely fails
+         on auth-protected endpoints — last resort).
"""
-    # Messages: standard OpenAI image_url content blocks.
+    # 1. Inline data URIs on recent messages.
for msg in reversed(messages or []):
content = msg.get("content") if isinstance(msg, dict) else None
if isinstance(content, list):
@@ -365,27 +440,62 @@ async def _extract_attached_image(
except Exception:
pass
-    # Files: try local path, then URL.
+    # 2. Files on messages, newest first.
+    for msg in reversed(messages or []):
+        if not isinstance(msg, dict):
+            continue
+        msg_files = msg.get("files")
+        if not isinstance(msg_files, list):
+            continue
+        for f in msg_files:
+            if not isinstance(f, dict) or not _file_dict_is_image(f):
+                continue
+            data = _read_file_dict(f)
+            if data is not None:
+                return data
+
+    # 3. __files__ param (current user upload, sometimes only here).
for f in files or []:
-        if not isinstance(f, dict):
-            continue
-        ftype = (f.get("type") or "").lower()
-        fname = (f.get("name") or f.get("filename") or "").lower()
-        is_image = "image" in ftype or fname.endswith((".png", ".jpg", ".jpeg", ".webp"))
-        if not is_image:
+        if not isinstance(f, dict) or not _file_dict_is_image(f):
            continue
+        data = _read_file_dict(f)
+        if data is not None:
+            return data
-        for path_key in ("path", "filepath", "file_path"):
-            path = f.get(path_key)
-            if path:
-                try:
-                    with open(path, "rb") as fh:
-                        return fh.read()
-                except OSError:
-                    pass
+
+    # 4. Pull the chat from the database directly. Open WebUI persists
+    # `files` on every message via the upsert in socket/main.py — so even
+    # if __messages__ doesn't hydrate the assistant-emitted attachments,
+    # the chat record does. This is the strongest fallback.
+    if _OPENWEBUI_RUNTIME and metadata:
+        chat_id = metadata.get("chat_id")
+        if chat_id:
+            try:
+                chat = Chats.get_chat_by_id(chat_id)
+                chat_data = getattr(chat, "chat", None) if chat else None
+                chat_messages = (chat_data or {}).get("messages", []) if isinstance(chat_data, dict) else []
+                for msg in reversed(chat_messages):
+                    if not isinstance(msg, dict):
+                        continue
+                    msg_files = msg.get("files") or []
+                    for f in msg_files:
+                        if not isinstance(f, dict) or not _file_dict_is_image(f):
+                            continue
+                        data = _read_file_dict(f)
+                        if data is not None:
+                            return data
+            except Exception:
+                pass
url = f.get("url")
if url:
# 5. Last-resort URL fetch (no auth — only works for public endpoints).
for source in [files or []] + [
(msg.get("files") or []) for msg in reversed(messages or []) if isinstance(msg, dict)
]:
for f in source:
if not isinstance(f, dict) or not _file_dict_is_image(f):
continue
url = f.get("url")
if not url:
continue
full = url if url.startswith("http") else f"http://localhost:8080{url}"
try:
async with session.get(full) as resp:
@@ -645,7 +755,7 @@ class Tools:
async def edit_image(
self,
-        edit_instruction: str,
+        prompt: str,
style: Optional[StyleName] = None,
mask_text: Optional[str] = None,
denoise: Optional[float] = None,
@@ -689,7 +799,7 @@ class Tools:
Pick `style` for the DESIRED OUTPUT, not the input image.
-        :param edit_instruction: What the changed area should look like.
+        :param prompt: What the changed area should look like.
Tool auto-prepends quality tags — don't include those.
:param style: One of the StyleName values. Omit to auto-detect.
:param mask_text: Noun phrase describing the region to edit. Set
@@ -700,7 +810,7 @@ class Tools:
:param seed: 0 to randomize, otherwise specific.
:return: Markdown image of the result, or an error if no image is attached.
"""
-        chosen = style or _route_style(edit_instruction)
+        chosen = style or _route_style(prompt)
settings = STYLES.get(chosen)
if not settings:
return f"Unknown style '{chosen}'. Available: {', '.join(STYLES.keys())}"
@@ -722,12 +832,24 @@ class Tools:
async with aiohttp.ClientSession() as session:
await emit("Looking for attached image…")
-            raw_in = await _extract_attached_image(__files__, __messages__, session)
+            raw_in = await _extract_attached_image(
+                __files__, __messages__, __metadata__, session,
+            )
if raw_in is None:
+                msgs_with_files = sum(
+                    1 for m in (__messages__ or [])
+                    if isinstance(m, dict) and m.get("files")
+                )
+                chat_id_present = bool((__metadata__ or {}).get("chat_id"))
return (
"No image found in the chat. Ask the user to attach the "
"image they want edited (paperclip / drag-drop), or call "
"generate_image instead if they want a new image."
"No image found in the chat. Diagnostics: "
f"__files__={len(__files__ or [])}, "
f"__messages__={len(__messages__ or [])} "
f"(of which {msgs_with_files} had a files field), "
f"chat_id_present={chat_id_present}, "
f"openwebui_runtime={_OPENWEBUI_RUNTIME}. "
"Ask the user to attach the image they want edited "
"(paperclip / drag-drop), or call generate_image instead."
)
await emit("Uploading source to ComfyUI…")
@@ -741,7 +863,7 @@ class Tools:
+ (f", mask='{mask_text}'" if mask_text else "")
)
-            positive = f"{settings['prefix']}{edit_instruction}"
+            positive = f"{settings['prefix']}{prompt}"
negative = settings["negative"]
if negative_prompt:
negative = f"{negative}, {negative_prompt}"

View File

@@ -18,4 +18,11 @@ if [ -d /opt/comfyui/custom_nodes ]; then
done
fi
+# Force-pin known-incompatible packages back into a working range. Some
+# custom nodes bring transformers >=5 transitively, which removes
+# BertModel.get_head_mask and breaks comfyui_segment_anything's
+# GroundingDINO. Run last so it wins over anything the loop above
+# installed.
+pip install -q "transformers>=4.40,<5" || echo "[entrypoint] transformers pin failed — continuing"
exec "$@"