AI Product Management Guru

AI Product Management Guru

Synthetic Data 101: When Real Data Isn't Enough, Isn't Safe, or Doesn't Exist Yet

What it is, when you need it, and the workflow to actually generate and validate it.

Shaili Guru's avatar
Shaili Guru
May 24, 2026
∙ Paid

I first heard of synthetic data in my Nike working days, and I also built synthetic avatar data with my team. Synthetic data came up again recently in my spring AI Product Foundations class at the University of Washington: a team of students working on a B2B project hit the data wall. Their product spec was solid. Their hypothetical customer were lined up. But the customers they could pilot with had only a few hundred data points each, not the thousands their model would need to build.

Their proposed fix: take whatever real data the businesses gave them, generate synthetic data on top to fill the gaps, and train the model on the combined set. Use synthetic data as a bridge from “not enough” to “enough to start.”

I told them that was exactly the right instinct. I also told them they were stepping into a bigger conversation than they realized.

Here’s what I dug into so they (and you) could think about it clearly, and actually act on it.

What you’ll learn (12-minute read):

  • What synthetic data actually is (and what it isn’t)

  • The three situations where it becomes your best option

  • How it gets generated, from rules to foundation models

  • How teams actually generate synthetic data today (with a few tools to look into)

  • Where it’s showing up in real products

  • The risks you need to design around, and how to validate against them

  • A practical framework for deciding if your product needs it

What Synthetic Data Actually Is

Synthetic data is artificially generated data that mimics the statistical properties of real data without intentionally including real records. (More on the “intentionally” qualifier later. Leakage is a real risk.)

That sentence is accurate. It’s also not very helpful.

Here’s what made it click for me: synthetic data is a stunt double. It looks enough like the real actor to film the dangerous scenes, but nobody’s actual body is at risk. The audience can’t tell the difference in the final cut. But if the stunt double doesn’t move like the real actor, the scene falls apart.

The same trade-off applies. Synthetic data is useful exactly to the degree that it faithfully represents the patterns in your real data. When it does, it’s genuinely powerful. When it doesn’t, you’ve trained your model on fiction.

Three Situations Where Synthetic Data Becomes Your Best Option

Not every team needs synthetic data. But these three situations keep showing up, and when you’re in one of them, real data alone won’t get you there.

1. You Can’t Use Real Data

Healthcare. Finance. Education. Any domain where the data you need is personally identifiable, legally protected, or ethically sensitive. HIPAA, GDPR, FERPA. The acronyms change, but the constraint is the same: you can’t just hand real patient records to your ML pipeline.

Healthcare AI is the textbook case. The Synthea project, an open-source generator of synthetic patient records, exists for exactly this reason. Real records can’t be shared across institutions; synthetic records can. Teams that need to test model behavior on edge cases (rare diseases, complex polypharmacy, demographic representation) generate test populations rather than wait for permission to access real patients’ data. Synthea is widely used in research and benchmarking.

The pattern repeats outside healthcare. Anywhere data is regulated, synthetic data is how teams prototype and test.

2. You Don’t Have Enough Real Data

This is the rare-event problem. Your model needs thousands of examples of something rare. Reality has given you a hundred. This is called class imbalance, and it’s one of the most common problems in applied ML.

The numbers are brutal. In fraud detection, legitimate transactions might outnumber fraudulent ones 10,000 to 1. In manufacturing defect detection, you might have millions of “good” images and fifty “defect” images. In autonomous driving, the scenarios that matter most (near-misses, edge cases, unusual weather conditions) are by definition rare.

You can’t train a model to catch something it’s barely seen. Synthetic data fills the gap by generating thousands of realistic examples of the rare class. In practice, teams often combine techniques such as SMOTE (Synthetic Minority Oversampling Technique) style oversampling or GAN-based tabular synthesis (e.g., CTGAN) to expand the representation of rare classes. More on these in the methods section below.

3. Your Product Doesn’t Exist Yet

This is the cold-start problem. You’re building a product for a market that doesn’t have existing data, because the product itself is new.

A robotics team building a warehouse navigation system for a new facility. An autonomous vehicle team preparing for road conditions in a country they haven’t launched in yet. An AI assistant being designed for a workflow that no one has done before.

And, at a smaller scale, the UW students from the opening of this post: a B2B product team that can pilot with a handful of customers but needs thousands of records to train. Same problem, different size. The product doesn’t exist yet, so the data doesn’t either.

You can’t collect user interaction data for a product that hasn’t shipped. But you need that data to build the product that ships. Synthetic data breaks the chicken-and-egg problem.

I touched on this in my World Models 101 post. The sim-to-real pattern: train heavily in simulation, then fine-tune on smaller real-world datasets once you deploy. Synthetic data is what makes that pattern possible.

That’s the “what” and the “when”: what synthetic data is, and the three situations where you actually need it.

The rest of this post is “the how” and “the decide”. The operational half you run on Monday morning:

  • How synthetic data actually gets made: the four generation methods, each with a PM checklist for when to use it

  • How teams generate it today: three concrete workflows, including a no-code LLM workflow you can run yourself, plus the vendor tools worth a look

  • Where it’s showing up in real products: autonomous vehicles, healthcare, finance, evals, and LLM training

  • The four risks, each with a validation check: how to catch distribution mismatch, artifacts, privacy leakage, and bias amplification before they ship

  • A decision framework: whether your product needs synthetic data at all, and which method fits your risk tolerance

  • Three interview questions: how to handle synthetic data when it comes up in an AI PM interview

User's avatar

Continue reading this post for free, courtesy of Shaili Guru.

Or purchase a paid subscription.
© 2026 Shaili Guru · Privacy ∙ Terms ∙ Collection notice
Start your SubstackGet the app
Substack is the home for great culture