Building an AI Wrapper with Zero AI Knowledge
- Author: Haikal Tahar
Introduction
i started this project knowing nothing about AI.
just wanted a simple button that changes a brand logo on a product.
then i tried OpenAI, Gemini, and ComfyUI. it spiraled into an unexpected journey.
this doc covers what i learned, what failed, and what i plan to do next.
Table of Contents
- 1. Discovery Phase
- 2. Messing with Modalities
- 3. Real Progress
- 4. Retrospective
- 5. What to Improve Next
- 6. Future Plans
1. Discovery Phase
i went into OpenAI’s Model Docs.
found out models like gpt-4o with image generation aren't available via the public API.
so i switched to public models like gpt-4.1-mini. not ideal, but good enough to start. i didn't find any OpenAI model as good as gpt-4o at image generation.
i started by showing an image on the frontend. then experimented with text2text APIs to describe the product.
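a minimal sketch of the kind of text2text call i mean, using the OpenAI Python SDK with gpt-4.1-mini (the prompt wording and the product name here are made up, not my actual app code):

```python
# minimal text2text sketch with the OpenAI Python SDK
# (prompt wording and product name are placeholders; reads OPENAI_API_KEY from the environment)
from openai import OpenAI

client = OpenAI()

def describe_product(product_name: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": "You write short product descriptions."},
            {"role": "user", "content": f"Describe this product in one sentence: {product_name}"},
        ],
    )
    return response.choices[0].message.content

print(describe_product("plain white ceramic mug"))
```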
built a basic /image2text endpoint using Gemini 2.0 Flash.
gave it an image. got a caption back.
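roughly what that endpoint did under the hood, sketched with the google-generativeai SDK (the prompt and file path are placeholders):

```python
# rough sketch of the /image2text captioning call with the google-generativeai SDK
# (prompt and file path are placeholders)
import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

def caption_image(path: str) -> str:
    image = Image.open(path)
    response = model.generate_content(
        [image, "Describe the product in this photo in one short sentence."]
    )
    return response.text

print(caption_image("product.jpg"))
```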
captions weren’t enough. i needed more control.
read through the Gemini docs.
found out there's a big difference between "image understanding" and "image editing".
i also realized Gemini 2.5 Pro only existed in the chat UI. the API was still limited to 2.0 Flash.
tried using Gemini’s safety settings.
set everything to BLOCK_NONE, but it still filtered results.
not reliable or consistent.
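for reference, this is roughly how those settings get passed (a sketch with the google-generativeai SDK; as noted above, BLOCK_NONE still didn't stop all filtering for me):

```python
# sketch: passing safety settings with google-generativeai
# (BLOCK_NONE requests the loosest filtering, but some outputs were still blocked in practice)
import os

import google.generativeai as genai
from google.generativeai.types import HarmCategory, HarmBlockThreshold

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

safety_settings = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
}

response = model.generate_content(
    "Replace the logo on this shirt with the brand name 'ACME'.",  # placeholder prompt
    safety_settings=safety_settings,
)
print(response.text)
```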
2. Messing with Modalities
i tried different types of models and prompts.
- text2text: worked for basic generation
- image2text: gave limited captions
- text2image: generated images, but not specific enough
what i actually needed was image editing.
i wanted to upload an image and modify a specific region, like changing a logo.
turns out that’s an image-to-image task with prompt conditioning.
Gemini API couldn’t handle this directly.
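to make the distinction concrete, this is roughly what an image-to-image call with prompt conditioning looks like in a diffusion library like diffusers (a generic illustration, not something my wrapper used; the model id and prompt are placeholders):

```python
# generic image-to-image sketch with Hugging Face diffusers
# (illustrative only; model id and prompt are placeholders, and this needs a GPU)
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = Image.open("product.jpg").convert("RGB").resize((512, 512))

# the prompt conditions the edit; strength controls how far the output drifts from the input
edited = pipe(
    prompt="the same product photo, but the shirt logo replaced with a plain circular badge",
    image=source,
    strength=0.6,
    guidance_scale=7.5,
).images[0]

edited.save("edited.jpg")
```

note that plain img2img like this still re-renders the whole frame. editing only one region needs a mask on top of it, which is where the ideas in section 5 come in.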
3. Real Progress
i shared a sample output here:
https://x.com/haikaldev/status/1913178531380556164
the product image was edited with a new logo.
but the model placed a smiley face on the wrong part of the image.
that’s when i learned prompt quality and input cleanup really matter.
clearing the input text sometimes gave better results.
thanks to _zuhairsan for showing me ComfyUI and how to use Gemini 2.0 Flash more effectively.
4. Retrospective
what worked
- hands-on experiments
- building reusable backend API wrappers
- clean frontend/backend structure in one monorepo
- realizing that AI image editing isn’t magic → it’s trial, error, and prompt tuning
what didn’t
- i expected Gemini to behave consistently. it didn’t
- i thought safety settings would loosen up content filters. they didn’t
- i assumed the Gemini 2.5 Pro API was available. it wasn’t
- i believed text2image could handle specific edits. it couldn't
5. What to Improve Next
- try ComfyUI workflows to chain modules: mask → brand overlay → render
- test Gemini 2.0 Flash vs Stable Diffusion XL for editing
- run local inference using tools like InvokeAI or Fooocus
- add UI to draw bounding boxes and turn them into masks for edits (see the mask sketch after this list)
- build fallback logic to retry prompts when outputs are wrong
- make a dynamic prompt builder so users can say “add logo to shirt only, not face”
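as a starting point for the bounding-box item above, turning a drawn box into a mask is simple enough to sketch already (coordinates and image size are made-up examples; the mask would then feed an inpainting workflow):

```python
# sketch: turn a bounding box from the UI into a black/white mask image
# (coordinates and image size are made-up examples; white = region to edit)
from PIL import Image, ImageDraw

def box_to_mask(image_size: tuple[int, int], box: tuple[int, int, int, int]) -> Image.Image:
    """image_size is (width, height); box is (left, top, right, bottom) in pixels."""
    mask = Image.new("L", image_size, 0)           # start fully black = keep everything
    ImageDraw.Draw(mask).rectangle(box, fill=255)  # white rectangle = area to repaint
    return mask

# e.g. a box the user drew around the shirt logo on a 1024x1024 product photo
mask = box_to_mask((1024, 1024), (380, 420, 640, 610))
mask.save("logo_mask.png")
```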
6. Future Plans
- support uploading multiple images
- let users pick logos or brands from a preset
- build a draggable brand gallery
- add undo and redo support
- let users download the edited image
- log prompt and output history for debugging
- open-source the project once it’s stable