Synthetic data that behaves: a practical guide to generating realistic health care-like data without violating privacy


Author(s): Abhishek Yadav

Originally published on Towards AI.

A practical guide to building synthetic data that looks, feels and behaves like the real world without risk to privacy

Photo by Luke Chesser on Unsplash

Health care organizations are sitting on a treasure trove of data: appointments, lab results, care visits, billing patterns, social determinants. Yet the regulations designed to protect patient privacy often make it nearly impossible for analysts and data scientists to experiment freely.

And this is a problem.

Because innovation does not start in production systems.
It begins with play: trying out ideas, building prototypes, and running experiments without the fear of leaking sensitive data.

This is where synthetic data becomes one of the most powerful tools in the data scientist’s toolkit.

In this guide, I’ll show you a practical, no-nonsense approach to producing healthcare-like synthetic data that behaves like the real thing, preserving shape, distributions, trends, and correlations, without exposing a single patient.

No GANs.
No deep learning.
No fragile black-box models.

Just a clear, transparent method you can explain to your compliance team in less than two minutes.

Why synthetic data matters (more than ever)

Most organizations face three common obstacles:

1. Privacy laws delay or prevent innovation.
HIPAA, GDPR, and internal policies often prevent teams from sharing even anonymized datasets internally.

2. Analysts cannot experiment rapidly.
Requesting access to PHI can take weeks or months, slowing down initial ideas.

3. Teams need realistic datasets to prototype and demo.
Working demos, model validation, and vendor evaluation all need data, but not real patient data.

Synthetic data solves all three problems when done correctly:

✔ Behaves like real data
✔ Contains no patient information
✔ Can be shared freely between teams
✔ Dramatically speeds up model development

But not all synthetic data is good synthetic data

If your synthetic data looks like uniform noise, you have created fake data, not useful synthetic data.

Good synthetic data has three properties:

1. Statistical fidelity
The distributions, seasonal trends, missingness patterns, and class proportions resemble real data.

2. Relational fidelity
Correlations persist. For example, older age is related to higher visit frequency, and chronic conditions are related to higher readmission risk.

3. Behavioral integrity
The shape of the data looks right over time: wait times, appointment lead times, cancellations, and so on.

Our goal is not to have a perfect replication of the real dataset.
We aim for practical realism.

Approach: A Transparent 3-Layer Synthetic Generator

Deep generative models (GANs, VAEs) can create beautiful synthetic datasets, but they are:

  • Hard to tune
  • Risk-prone (they can leak patterns from small datasets)
  • A black box for compliance teams

Instead, I use a simple, explainable three-layer generator that works for most healthcare operations datasets.

Layer 1: Distributions that match reality

Each variable gets a distribution based on real-world behavior.

Example:

This layer ensures your dataset looks real.
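
One way to pick a distribution without touching row-level records is to work backwards from aggregate statistics. The sketch below is an illustration under assumptions, not the author's method: it derives log-normal parameters from a target median and 90th percentile. The helper name `lognormal_params` and the target values (median of 10 days, 90th percentile of 19 days) are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def lognormal_params(median, p90):
    """Return (mu, sigma) of a log-normal with the given median and 90th percentile."""
    mu = np.log(median)                              # the median of a log-normal is exp(mu)
    sigma = (np.log(p90) - mu) / norm.ppf(0.90)      # the 90th percentile is exp(mu + z90 * sigma)
    return mu, sigma

rng = np.random.default_rng(42)

# Hypothetical aggregate targets, not taken from any real dataset
mu, sigma = lognormal_params(median=10.0, p90=19.0)
lead_time_days = rng.lognormal(mean=mu, sigma=sigma, size=50_000)

# The sampled median and 90th percentile should land close to the targets
print(round(float(np.median(lead_time_days)), 1))
print(round(float(np.percentile(lead_time_days, 90)), 1))
```

Because only two summary numbers are needed, the parameters can be agreed on with a data owner without anyone exporting patient-level rows.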

Layer 2: Respecting Correlations

Example relations we preserve:

  • Older patients → higher visit frequency
  • Chronic conditions → longer appointment times
  • New patients → higher no-show probability
  • Few features → High cancellation rates

We estimate correlations using:

  • Spearman rank correlation for heterogeneous health care variables
  • A copula model for generating correlated samples
  • Logical rules layered on top (for example, chronic_condition > 3 → high_visit_risk)
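
The copula step can be sketched as follows. This is a minimal Gaussian-copula example under assumptions: the target Spearman correlation of 0.4 between age and chronic conditions is an illustrative value, and the marginal parameters are borrowed from the generator later in this guide. It converts the Spearman target to the Pearson correlation of the underlying normal, samples correlated normals, maps them to uniforms, and pushes those through each variable's inverse CDF.

```python
import numpy as np
from scipy.stats import norm, lognorm, poisson, spearmanr

rng = np.random.default_rng(42)
N = 5000

# Assumed target Spearman correlation between age and chronic conditions
target_rho_s = 0.4
# For a Gaussian copula, Spearman rho_s and Pearson rho satisfy rho = 2*sin(pi*rho_s/6)
rho = 2 * np.sin(np.pi * target_rho_s / 6)

cov = np.array([[1.0, rho], [rho, 1.0]])
z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=N)
u = norm.cdf(z)  # uniform marginals that keep the dependence structure

# Push the uniforms through each variable's inverse CDF
age = np.clip(lognorm.ppf(u[:, 0], s=0.4, scale=np.exp(3.6)), 18, 95).astype(int)
chronic_conditions = poisson.ppf(u[:, 1], mu=1.2).astype(int)

# Logical rule layered on top of the sampled values
high_visit_risk = (chronic_conditions > 3).astype(int)

rho_s, _ = spearmanr(age, chronic_conditions)
print(f"achieved Spearman correlation: {rho_s:.2f}")
```

The achieved correlation comes out somewhat below the target because the discrete count variable produces many ties; if that matters, the target can be nudged upward and re-checked.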

Now the dataset behaves like the real world.

Layer 3: Realistic Time Behavior

Healthcare data is not static. It has patterns and seasonality:

  • Hourly cycle (for example, AM peak, lunch dip, PM ramp)
  • Weekly pattern (Monday load, weekend effect)
  • Seasonal effects (flu season increase, cancellations in December)
  • Simple, explainable curves and holiday flags, no black boxes

We add these through simple, explainable curves; no AI magic required.
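
As a rough illustration of what such curves can look like, the sketch below multiplies an hourly profile, a weekly profile, a seasonal cosine, and a holiday flag into an expected appointment volume per hour. Every number here (peak hours, multipliers, the base rate of 20, the closure date) is an assumption chosen for readability, not a value from a real system.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Four weeks of hourly slots, starting on a Monday
ts = pd.date_range(start="2023-01-02", periods=24 * 28, freq="h")
hour = ts.hour.to_numpy()
dow = ts.dayofweek.to_numpy()   # Monday=0 ... Sunday=6
doy = ts.dayofyear.to_numpy()

# Hourly cycle: AM peak near 10:00, PM ramp near 15:00, lunch dip in between
hourly = (np.exp(-0.5 * ((hour - 10) / 1.5) ** 2)
          + 0.8 * np.exp(-0.5 * ((hour - 15) / 1.5) ** 2))

# Weekly pattern: heavier Mondays, light weekends (illustrative multipliers)
weekly = np.array([1.3, 1.0, 1.0, 1.0, 0.9, 0.3, 0.2])[dow]

# Season: smooth annual cosine peaking in mid-January as a stand-in for flu season
seasonal = 1 + 0.2 * np.cos(2 * np.pi * (doy - 15) / 365.25)

# Holiday flag: hypothetical closure date, applied as a simple multiplier
holidays = pd.to_datetime(["2023-01-16"])
holiday_factor = np.where(ts.normalize().isin(holidays), 0.1, 1.0)

expected = 20 * hourly * weekly * seasonal * holiday_factor
appointments = rng.poisson(expected)  # integer counts scattered around the curve

demand = pd.DataFrame({"timestamp": ts, "expected": expected, "appointments": appointments})
by_hour = demand.groupby(demand["timestamp"].dt.hour)["expected"].mean()
by_dow = demand.groupby(demand["timestamp"].dt.dayofweek)["expected"].mean()
print(by_hour.loc[[10, 13, 15]].round(2))
print(by_dow.round(2))
```

Each factor is a plain array that can be plotted and reviewed on its own, which keeps the time behavior easy to explain.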

A Minimal Python Generator You Can Trust

Here’s a clean and readable version you can publish:

import numpy as np
import pandas as pd

np.random.seed(42)  # fixed seed so the output is reproducible

N = 5000  # dataset size

# ---- Layer 1: Distributions ----
# Right-skewed ages, clipped to an adult range
ages = np.random.lognormal(mean=3.6, sigma=0.4, size=N).astype(int)
ages = np.clip(ages, 18, 95)

# Appointment lead time in whole days, with a long right tail
lead_time = np.random.lognormal(mean=2.3, sigma=0.5, size=N).astype(int)

provider_type = np.random.choice(
    ['Primary Care', 'Cardiology', 'Dermatology', 'Neurology'],
    size=N,
    p=[0.55, 0.20, 0.15, 0.10]
)

# Baseline no-show probability per appointment (mean of Beta(2, 10) is about 0.17)
no_show_prob = np.random.beta(2, 10, size=N)

# ---- Layer 2: Relationships ----
# Visit frequency rises with age, plus noise; negative values are clipped to 0
visit_frequency = np.round((ages / 30) + np.random.normal(0, 1, N)).clip(0)
chronic_conditions = np.random.poisson(lam=1.2, size=N)

# Rule-based adjustment: each chronic condition adds 0.02 to the no-show probability
no_show_flag = (np.random.rand(N) < (no_show_prob + chronic_conditions * 0.02)).astype(int)

# ---- Layer 3: Time Behavior ----
# One row per hour; lowercase "h" is the current pandas alias ("H" is deprecated)
days = pd.date_range(start="2023-01-01", periods=N, freq="h")
# Three full sine cycles across the date range (about 10 weeks each for N=5000 hourly rows)
seasonality = np.sin(np.linspace(0, 6 * np.pi, N))

# Lead time scaled up or down by as much as 30% with the seasonal cycle
wait_time = (lead_time * (1 + 0.3 * seasonality)).clip(0).astype(int)

# ---- Output ----
df = pd.DataFrame({
    'age': ages,
    'lead_time_days': lead_time,
    'provider': provider_type,
    'chronic_conditions': chronic_conditions,
    'visit_frequency': visit_frequency,
    'no_show': no_show_flag,
    'appointment_date': days,
    'adjusted_wait_time': wait_time
})

print(df.head())

It gives you:

  • Realistic age curve
  • Realistic lead time
  • Realistic no-show pattern
  • Seasonal behavior
  • Clean correlations

A dataset that behaves, without ever touching PHI.

How close should the synthetic data be?

The rule of thumb I use when working with compliance and clinical partners:

“The data should behave like the real world, not mimic a real patient.”

You can verify this using three checks:

1. Distribution plot

Compare real vs. synthetic distributions at a high level.
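
A numeric companion to the plot can help when eyeballing is not enough. The sketch below compares a handful of quantiles and reports the two-sample Kolmogorov-Smirnov statistic. Note the assumption: `real_age` is a simulated stand-in, since no real data is available here; in practice that column would be read inside the secure environment and only the summary numbers would leave it.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-in for a real column (hypothetical, simulated for this example)
real_age = np.clip(rng.lognormal(3.65, 0.38, 4000), 18, 95)
# Synthetic column drawn from the generator's age distribution
synthetic_age = np.clip(rng.lognormal(3.6, 0.4, 5000), 18, 95)

quantiles = [0.10, 0.25, 0.50, 0.75, 0.90]
real_q = np.quantile(real_age, quantiles)
synth_q = np.quantile(synthetic_age, quantiles)

for q, r, s in zip(quantiles, real_q, synth_q):
    print(f"q{int(q * 100):02d}  real={r:5.1f}  synthetic={s:5.1f}")

# KS statistic: the largest gap between the two empirical CDFs (0 means identical)
ks = ks_2samp(real_age, synthetic_age)
print(f"KS statistic: {ks.statistic:.3f}")
```

A small KS statistic with similar quantiles suggests the shapes agree at a high level; an exact match is not the goal.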

2. Correlation Matrix

Compare Spearman correlation matrices: directions must match, and magnitudes must be plausible (not identical to a specific dataset).
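
The direction check can be automated with a small helper. This is a sketch under assumptions: `compare_spearman` and `make_frame` are hypothetical names, the "real" frame is simulated as a stand-in, and the 0.1 threshold for treating a correlation as material is an arbitrary choice.

```python
import numpy as np
import pandas as pd

def compare_spearman(real_df, synth_df, cols, min_abs=0.1):
    """Pairwise Spearman comparison: direction agreement and absolute gap."""
    real_corr = real_df[cols].corr(method="spearman")
    synth_corr = synth_df[cols].corr(method="spearman")
    rows = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r, s = real_corr.loc[a, b], synth_corr.loc[a, b]
            rows.append({
                "pair": f"{a} ~ {b}",
                "real": round(r, 3),
                "synthetic": round(s, 3),
                # Direction only matters when the real correlation is not near zero
                "material": bool(abs(r) >= min_abs),
                "same_direction": bool(np.sign(r) == np.sign(s)),
                "abs_gap": round(abs(r - s), 3),
            })
    return pd.DataFrame(rows)

def make_frame(rng, n, years_per_visit):
    """Simulated frame (hypothetical): visit frequency rises with age."""
    age = rng.integers(18, 96, size=n)
    visits = np.round(age / years_per_visit + rng.normal(0, 1, n)).clip(0)
    lead = rng.lognormal(2.3, 0.5, n)
    return pd.DataFrame({"age": age, "visit_frequency": visits, "lead_time_days": lead})

rng = np.random.default_rng(1)
real_df = make_frame(rng, 4000, years_per_visit=25)    # stand-in for real data
synth_df = make_frame(rng, 5000, years_per_visit=30)   # synthetic data

report = compare_spearman(real_df, synth_df, ["age", "visit_frequency", "lead_time_days"])
print(report)
```

Pairs flagged as material should agree in direction; near-zero pairs can flip sign from sampling noise alone, so they are reported but not judged.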

3. Downstream Model Accuracy

Train a simple model (for example, logistic regression on no-shows). Performance trends on the synthetic data should resemble the real world (for example, which features matter) without the metrics being identical.
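
To make this check concrete, the sketch below fits the same logistic regression on a stand-in "real" dataset and on a synthetic one, then compares coefficient directions. Everything here is assumed for illustration: the data is simulated, `fit_logistic` is a plain gradient-descent fit written out in NumPy to keep the example dependency-free (a library implementation would normally be used), and the effect sizes are made up.

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, n_iter=3000):
    """Gradient-descent logistic regression; returns weights with the intercept first."""
    Xb = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)   # gradient of the mean log-loss
    return w

def make_no_show_data(rng, n, w_new, w_lead):
    """Simulated no-show data (hypothetical effect sizes)."""
    new_patient = rng.integers(0, 2, size=n)
    lead_time = rng.lognormal(2.3, 0.5, size=n)
    lead_z = (lead_time - lead_time.mean()) / lead_time.std()  # standardized lead time
    logit = -1.5 + w_new * new_patient + w_lead * lead_z
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
    return np.column_stack([new_patient, lead_z]), y

rng = np.random.default_rng(7)
X_real, y_real = make_no_show_data(rng, 5000, w_new=0.8, w_lead=0.4)     # stand-in for real data
X_synth, y_synth = make_no_show_data(rng, 5000, w_new=0.9, w_lead=0.35)  # synthetic data

w_real = fit_logistic(X_real, y_real)
w_synth = fit_logistic(X_synth, y_synth)

for name, r, s in zip(["new_patient", "lead_time_z"], w_real[1:], w_synth[1:]):
    print(f"{name:12s} real={r:+.2f}  synthetic={s:+.2f}  same sign={np.sign(r) == np.sign(s)}")
```

If the signs or the rough ordering of the coefficients disagree between the two fits, the synthetic generator is probably missing a relationship that matters for the downstream task.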

Quick Visualization Code:

import matplotlib.pyplot as plt

# Uses the DataFrame `df` built by the generator above

# Histograms for three key columns
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1); plt.hist(df['age'], bins=30, color='#4c78a8'); plt.title('Age')
plt.subplot(1, 3, 2); plt.hist(df['lead_time_days'], bins=30, color='#f58518'); plt.title('Lead Time (days)')
plt.subplot(1, 3, 3); plt.hist(df['chronic_conditions'], bins=15, color='#54a24b'); plt.title('Chronic Conditions')
plt.tight_layout(); plt.show()

# Spearman correlation heatmap
cols = ['age', 'lead_time_days', 'chronic_conditions', 'visit_frequency']
corr = df[cols].corr(method='spearman')
plt.figure(figsize=(5, 4))
plt.imshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
plt.colorbar(); plt.xticks(range(len(cols)), cols, rotation=45); plt.yticks(range(len(cols)), cols)
plt.title('Spearman Correlation'); plt.tight_layout(); plt.show()

Distribution plots for age, lead time, and chronic conditions, generated using Python (Matplotlib)
Spearman correlation heatmap, generated using Python (Matplotlib)

A Case Study: Appointment Optimization

A healthcare organization wanted to:

  • Reduce no-shows
  • Optimize scheduling slots
  • Measure wait-time constraints

But analysts did not have access to operational data until approval was granted.

By creating a synthetic dataset like the one above, they could:

  • Test feature engineering
  • Create an initial prototype
  • Try a no-show prediction model
  • Experiment with scheduling simulation

When actual data access was granted weeks later, 80% of the pipeline was already built. The team went from idea to deployment in 13 days instead of 2-3 months.

Synthetic data did not replace real data.
It unlocked speed.

Real Benefits: Freedom to Experiment

Synthetic data gives teams:

✔ Freedom to experiment with ideas
✔ Freedom to fail safely
✔ Freedom to share datasets between teams
✔ Freedom to prototype without waiting for permission
✔ Freedom to innovate faster than bureaucracy

In an industry where delays can be measured in lives, freedom matters.

Final Thoughts

Healthcare analytics does not need to be held hostage until access approvals are complete. With thoughtful synthetic generation, teams can responsibly collaborate, create, iterate, and innovate.

Not all synthetic data is equal, but when done transparently and with behavioral realism, it becomes one of the most powerful accelerators for healthcare analytics.

Published via Towards AI
