Ingesting 50K+ Products with Golang: When Suppliers Break Your Schema
- Haikal Tahar
The Problem
i went in thinking this would be simple: fetch products, clean them, insert them. done.
then i realized different suppliers have wildly different data structures. PCNA formats data one way, S&S Activewear another, HIT yet another.
50,000+ products, multiple data sources, all conflicting schemas.
Understanding the Source
psrestful.com exposes 4 API endpoints:
- Sellables - list of product IDs
- Media - images and decoration info
- Pricing & Configuration - prices, FOB points, currencies
- Inventory - stock levels by warehouse
the problem: you need all 4 endpoints for each product. that's 200K+ API calls for 50K products.
and each supplier returns the data differently (sketched below):
- PCNA & HIT - decorations in an array, clean FOB structure
- S&S Activewear - decorations flattened, FOB format different, location names abbreviated
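to make that concrete, here's roughly how the two shapes differ, written as Go structs. the field names are invented for illustration - they're not the actual supplier payloads.

```go
// illustrative only: invented field names to show the shape difference,
// not the real PCNA / S&S Activewear payloads
package ingest

// PCNA / HIT style: decorations come as a nested array, FOB as a clean object
type arrayStyleProduct struct {
	ProductID   string       `json:"productId"`
	Decorations []decoration `json:"decorations"`
	FOB         fobPoint     `json:"fob"`
}

type decoration struct {
	Method   string `json:"method"`
	Location string `json:"location"`
}

type fobPoint struct {
	City    string `json:"city"`
	Country string `json:"country"`
}

// S&S Activewear style: the same information flattened into top-level fields,
// with abbreviated location names
type flatStyleProduct struct {
	ProductID          string `json:"productId"`
	DecorationMethod   string `json:"decorationMethod"`
	DecorationLocation string `json:"decorationLoc"` // e.g. "FRNT" instead of "Front"
	FOBCity            string `json:"fobCity"`
}
```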
The Real Challenge
i couldn't just map one schema to another. i needed to:
- fetch data from 4 separate endpoints (can't do it in one call)
- stitch them together into a single product record
- normalize across supplier differences (different field names, structures)
- batch insert 500+ products at a time
- handle 50K products without hanging the API (concurrency limits)
- see what's actually failing (observability)
the first attempt: sequential API calls. on pace for 12+ hours for 50K products. i killed it after an hour.
How I Built It
Parallel API Fetches
instead of:
fetch media → wait → fetch pricing → wait → fetch inventory
i did:
fetch media, pricing, inventory in parallel
per product, this made fetching roughly 3x faster.
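a minimal sketch of what that looks like with errgroup. fetchMedia, fetchPricing, fetchInventory and the trimmed types are stand-ins for illustration, not the actual client code:

```go
package ingest

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// trimmed-down stand-ins for the real record types
type Media struct{ URL string }
type Pricing struct{ ListPrice float64 }
type Inventory struct {
	Warehouse string
	Qty       int
}

type Product struct {
	ID        string
	Media     []Media
	Pricing   Pricing
	Inventory []Inventory
}

// stubs standing in for the real HTTP calls to the supplier endpoints
func fetchMedia(ctx context.Context, id string) ([]Media, error)         { return nil, nil }
func fetchPricing(ctx context.Context, id string) (Pricing, error)       { return Pricing{}, nil }
func fetchInventory(ctx context.Context, id string) ([]Inventory, error) { return nil, nil }

// fetchProduct hits media, pricing and inventory concurrently and stitches
// the results into one record; the first error cancels the other calls.
func fetchProduct(ctx context.Context, id string) (*Product, error) {
	p := &Product{ID: id}
	g, ctx := errgroup.WithContext(ctx)

	g.Go(func() error {
		m, err := fetchMedia(ctx, id)
		p.Media = m
		return err
	})
	g.Go(func() error {
		pr, err := fetchPricing(ctx, id)
		p.Pricing = pr
		return err
	})
	g.Go(func() error {
		inv, err := fetchInventory(ctx, id)
		p.Inventory = inv
		return err
	})

	if err := g.Wait(); err != nil {
		return nil, fmt.Errorf("fetch product %s: %w", id, err)
	}
	return p, nil
}
```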
10-Worker Concurrency Pool
used goroutines to process multiple products at once. with 50K products and 4 API endpoints per product, you're looking at 200K+ API calls. doing it sequentially would take 12+ hours.
10 workers balanced the load:
- fewer workers = too slow
- more workers = API rate limiting (429 errors)
- 10 = the sweet spot: good throughput, and the API doesn't complain
buffered channels prevented workers from blocking each other.
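roughly, the pool looks like this. a sketch, assuming product IDs come from the Sellables endpoint and a processProduct function that does the fetch → normalize → insert work for one product; the names are placeholders, not the real pipeline code:

```go
package ingest

import (
	"context"
	"log"
	"sync"
)

const numWorkers = 10 // the sweet spot from testing

// runPool fans productIDs out to a fixed set of workers over a buffered channel.
func runPool(ctx context.Context, productIDs []string, processProduct func(context.Context, string) error) {
	jobs := make(chan string, numWorkers*2) // buffered so the producer rarely blocks

	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for id := range jobs {
				if err := processProduct(ctx, id); err != nil {
					log.Printf("worker %d: product %s failed: %v", worker, id, err)
				}
			}
		}(i)
	}

	for _, id := range productIDs {
		jobs <- id
	}
	close(jobs)
	wg.Wait()
}
```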
note on delta sync: right now it's a full ingest every time (all 50K products). in the future, delta sync will only fetch products that changed since last run. fewer products = fewer workers needed = faster overall. but full ingest is simpler to reason about.
Supplier-Specific Normalizers
instead of complex conditionals everywhere:
normalizer := getNormalizer(supplierCode) // PCNA, HIT, SSA
normalized := normalizer.Normalize(rawData)
each supplier gets its own parser. keeps logic clean.
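here's a sketch of that interface, assuming raw payloads arrive as json.RawMessage and a trimmed-down unified record type. the normalizer types are illustrative, not the real implementations:

```go
package ingest

import "encoding/json"

// NormalizedProduct is the unified record, trimmed down for the sketch.
type NormalizedProduct struct {
	ID          string
	Decorations []string
	FOBPoint    string
}

// Normalizer turns one supplier's raw payload into the unified record.
type Normalizer interface {
	Normalize(raw json.RawMessage) (NormalizedProduct, error)
}

type pcnaNormalizer struct{}
type hitNormalizer struct{}
type ssaNormalizer struct{}

func (pcnaNormalizer) Normalize(raw json.RawMessage) (NormalizedProduct, error) {
	// parse the array-style decorations and clean FOB structure
	return NormalizedProduct{}, nil
}

func (hitNormalizer) Normalize(raw json.RawMessage) (NormalizedProduct, error) {
	return NormalizedProduct{}, nil
}

func (ssaNormalizer) Normalize(raw json.RawMessage) (NormalizedProduct, error) {
	// un-flatten decorations, expand abbreviated location names
	return NormalizedProduct{}, nil
}

// getNormalizer picks a parser per supplier; returns nil for unknown codes.
func getNormalizer(supplierCode string) Normalizer {
	switch supplierCode {
	case "PCNA":
		return pcnaNormalizer{}
	case "HIT":
		return hitNormalizer{}
	case "SSA":
		return ssaNormalizer{}
	default:
		return nil
	}
}
```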
Batch Inserts of 500
tested batch sizes:
- 1000: too many rows per transaction, lock contention
- 500: goldilocks - fast enough, memory efficient
each product has ~10+ related records (media, pricing, inventory), so 500 products = ~5000+ rows per batch.
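the chunking itself is simple. a sketch, assuming a hypothetical insertBatch helper that writes one batch (the products plus their related rows) inside a single transaction:

```go
package ingest

import (
	"context"
	"fmt"
)

const batchSize = 500

// Product stands in for the full stitched record (media, pricing, inventory rows).
type Product struct{ ID string }

// insertAll walks the products in chunks of batchSize and hands each chunk to
// insertBatch, which is assumed to wrap its writes in one transaction.
func insertAll(ctx context.Context, products []Product, insertBatch func(context.Context, []Product) error) error {
	for start := 0; start < len(products); start += batchSize {
		end := start + batchSize
		if end > len(products) {
			end = len(products)
		}
		if err := insertBatch(ctx, products[start:end]); err != nil {
			return fmt.Errorf("batch %d-%d: %w", start, end, err)
		}
	}
	return nil
}
```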
The Observability Problem
the fastest pipeline in the world is useless if you can't see what's breaking.
i ingested 50K products. everything seemed fine. then i noticed products missing pricing or inventory. how many? which ones? why?
no idea. the pipeline just swallowed failures.
added:
- structured logging at every step (fetch, normalize, insert)
- metrics by step - where are products failing?
- checkpoints after each batch (verify data in database)
now i can see:
- if media endpoint is returning errors
- which supplier has issues
- exactly which products failed and why
made debugging 100x faster.
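for reference, a sketch of what that structured logging can look like with the standard library's log/slog. the field names and step labels are illustrative, not the pipeline's actual log schema:

```go
package ingest

import (
	"log/slog"
	"os"
)

// one JSON logger for the whole pipeline so every line is machine-filterable
var logger = slog.New(slog.NewJSONHandler(os.Stdout, nil))

// logStep records the outcome of one step (fetch_media, normalize, insert, ...)
// for one product, so failures can be grouped by step and supplier later.
func logStep(step, supplier, productID string, err error) {
	if err != nil {
		logger.Error("step failed",
			"step", step,
			"supplier", supplier,
			"product_id", productID,
			"error", err.Error(),
		)
		return
	}
	logger.Info("step ok", "step", step, "supplier", supplier, "product_id", productID)
}
```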
What Worked
- parallel API fetches per product - 3x faster than sequential
- supplier-specific normalizers - isolated complexity
- 10-worker pool - the sweet spot under API rate limits
- batch size 500 - balanced speed and transaction load
- structured logging everywhere - caught failures immediately
- checkpoints after insert - verified data got in
Lessons Learned
don't do sequential API calls. with 50K products and 4 endpoints, you're looking at 12+ hours. parallelize.
test worker counts. more workers isn't always faster. API rate limits will hurt you. find the sweet spot (for us: 10).
batch size matters. if each record has related data, smaller batches might be faster than you think.
without observability, you're flying blind. structured logging + metrics + checkpoints turned "something failed" into "product ABC failed at media fetch due to 500 error."
supplier-specific logic should be isolated. don't scatter if supplier == "SSA" checks everywhere. use a normalizer interface.
Next
- implement retries with exponential backoff
- delta sync (right now it's full ingest every time)
- per-supplier rate limit handling
- data quality dashboard