Building an AI Wrapper with Zero AI Knowledge

By Haikal Tahar

Introduction

i started this project knowing nothing about AI.
just wanted a simple button that changes a brand logo on a product.
then i tried OpenAI, Gemini, and ComfyUI. it spiraled into an unexpected journey.
this doc covers what i learned, what failed, and what i plan to do next.

Table of Contents

  1. Discovery Phase
  2. Messing with Modalities
  3. Real Progress
  4. Retrospective
  5. What to Improve Next
  6. Future Plans

1. Discovery Phase

i went through OpenAI’s model docs.
found out models like gpt-4o with image generation aren’t public via the API.
so i switched to publicly available models like gpt-4.1-mini. not ideal, but good enough to start. none of the OpenAI models i could actually call came close to gpt-4o’s image generation.
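
for reference, a text call like the ones i started with looks roughly like this using the official openai node SDK (the model name is real, the prompt is just an example):

```ts
import OpenAI from "openai";

// assumes OPENAI_API_KEY is set in the environment
const client = new OpenAI();

// ask a public text model to describe a product
async function describeProduct(productName: string): Promise<string> {
  const res = await client.chat.completions.create({
    model: "gpt-4.1-mini",
    messages: [
      { role: "user", content: `describe the product "${productName}" in one sentence` },
    ],
  });
  return res.choices[0].message.content ?? "";
}
```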

i started by showing an image on the frontend. then experimented with text2text APIs to describe the product.
built a basic /image2text endpoint using Gemini 2.0 Flash.
gave it an image. got a caption back.
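
the core of that endpoint was roughly this, a simplified sketch using the @google/generative-ai SDK (the express route and error handling are left out):

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";
import { readFileSync } from "node:fs";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });

// send an image plus a short instruction, get a caption back
async function image2text(imagePath: string): Promise<string> {
  const result = await model.generateContent([
    "describe the product in this image in one sentence",
    {
      inlineData: {
        data: readFileSync(imagePath).toString("base64"),
        mimeType: "image/png",
      },
    },
  ]);
  return result.response.text();
}
```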

captions weren’t enough. i needed more control.
read through the Gemini docs.
found out there's a big difference between "image understanding" and "image editing".

i also realized Gemini 2.5 Pro exists only in the chat UI.
API still limited to 2.0 Flash.

tried using Gemini’s safety settings.
set everything to BLOCK_NONE, but it still filtered results.
not reliable or consistent.
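
this is roughly what that configuration looks like in the SDK, if you want to try it yourself:

```ts
import {
  GoogleGenerativeAI,
  HarmCategory,
  HarmBlockThreshold,
} from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

// every adjustable category set to BLOCK_NONE. some responses
// still came back filtered anyway
const model = genAI.getGenerativeModel({
  model: "gemini-2.0-flash",
  safetySettings: [
    { category: HarmCategory.HARM_CATEGORY_HARASSMENT, threshold: HarmBlockThreshold.BLOCK_NONE },
    { category: HarmCategory.HARM_CATEGORY_HATE_SPEECH, threshold: HarmBlockThreshold.BLOCK_NONE },
    { category: HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold: HarmBlockThreshold.BLOCK_NONE },
    { category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold: HarmBlockThreshold.BLOCK_NONE },
  ],
});
```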

2. Messing with Modalities

i tried different types of models and prompts.

  • text2text worked for basic generation
  • image2text gave limited captions
  • text2image generated images, but not specific enough

what i actually needed was image editing.
i wanted to upload an image and modify a specific region, like changing a logo.
turns out that’s an image-to-image task with prompt conditioning.
Gemini API couldn’t handle this directly.
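
to make the gap concrete, this is the rough shape of request an image-editing model needs. note this is a hypothetical payload i wrote for illustration, not a real Gemini type:

```ts
// hypothetical img2img request shape, illustrating what "image editing
// with prompt conditioning" actually requires. not a real Gemini API type.
interface ImageEditRequest {
  image: string;      // base64 source image (the product photo)
  mask?: string;      // base64 mask marking the region to change (the logo area)
  prompt: string;     // e.g. "replace the logo with the new brand mark"
  strength?: number;  // 0..1, how far the output may drift from the source
}
```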

3. Real Progress

i shared a sample output here:
https://x.com/haikaldev/status/1913178531380556164

the product image was edited with a new logo.
but the model placed a smiley face on the wrong part of the image.
that’s when i learned prompt quality and input cleanup really matter.
clearing the input text sometimes gave better results.
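
the cleanup was nothing fancy, roughly this kind of normalization before sending the prompt (the exact rules here are illustrative):

```ts
// strip the noise that kept leaking into prompts: stray quotes,
// runs of whitespace, and leftover newlines from pasted text
function cleanPrompt(raw: string): string {
  return raw
    .replace(/["“”]/g, "") // stray quotes tended to confuse the model
    .replace(/\s+/g, " ")  // collapse whitespace and newlines
    .trim();
}
```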

thanks to _zuhairsan for showing me ComfyUI and how to use Gemini 2.0 Flash more effectively.

4. Retrospective

what worked

  • hands-on experiments
  • building reusable backend API wrappers
  • clean frontend/backend structure in one monorepo
  • realizing that AI image editing isn’t magic → it’s trial, error, and prompt tuning

what didn’t

  • i expected Gemini to behave consistently. it didn’t
  • i thought safety settings would loosen up content filters. they didn’t
  • i assumed the Gemini 2.5 Pro API was available. it wasn’t
  • i believed text2img could handle specific edits. it couldn’t

5. What to Improve Next

  • try ComfyUI workflows to chain modules: mask → brand overlay → render
  • test Gemini 2.0 Flash vs Stable Diffusion XL for editing
  • run local inference using tools like InvokeAI or Fooocus
  • add UI to draw bounding boxes and turn them into masks for edits
  • build fallback logic to retry prompts when outputs are wrong (see the sketch after this list)
  • make a dynamic prompt builder so users can say “add logo to shirt only, not face”
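
the fallback idea from that list, sketched out: generate, check the output, and retry with a stricter prompt. checkOutput here is a placeholder for whatever validation fits the task:

```ts
// try each prompt in order until the output passes validation,
// or give up after the last one
async function generateWithFallback(
  generate: (prompt: string) => Promise<string>,
  checkOutput: (output: string) => boolean, // placeholder validator
  prompts: string[],                        // base prompt plus stricter rewrites
): Promise<string | null> {
  for (const prompt of prompts) {
    const output = await generate(prompt);
    if (checkOutput(output)) return output;
  }
  return null; // every attempt failed the check; surface this in the UI
}
```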

6. Future Plans

  • support uploading multiple images
  • let users pick logos or brands from a preset
  • build a draggable brand gallery
  • add undo and redo support
  • let users download the edited image
  • log prompt and output history for debugging
  • open-source the project once it’s stable