Example data generation scenarios
The examples below show how you can combine different Syntho generators to get the data that you need for your use case.
1. Deterministic relations (e.g. male → male name)
Scenario
Hospital EHR table patients with:
patient_id
gender (M/F/X)
first_name, last_name
date_of_birth
diagnosis_code
visit_count
How you combine Syntho methods
AI synthesis for:
date_of_birth, diagnosis_code, visit_count, and other clinical / behavioral fields, so you preserve realistic age distributions, ICD code combinations, visit patterns, etc., without 1-to-1 linkage to real people.
Mockers + calculated columns for deterministic gender–name relation:
Use a Calculated Column with mockers for first_name and last_name:
IF([gender] = 'M', MOCK_FIRST_NAME_MALE, IF([gender] = 'F', MOCK_FIRST_NAME_FEMALE, MOCK_FIRST_NAME_UNISEX))
Masking for IDs:
Hash patient_id so you can still join across tables but never see the real identifier.
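The gender-conditioned mocker and the ID hashing described above can be illustrated outside Syntho with a minimal Python sketch. This is not Syntho's implementation: the name pools, the salt value, and the helper names mock_first_name and hash_id are assumptions made for illustration only.

```python
import hashlib
import random

# Hypothetical name pools; a real mocker would draw from much larger lists.
NAMES_BY_GENDER = {
    "M": ["James", "Omar", "Lucas"],
    "F": ["Emma", "Aisha", "Sofia"],
    "X": ["Alex", "Robin", "Sam"],
}


def mock_first_name(gender: str, rng: random.Random) -> str:
    """Pick a name from the pool for this gender; unknown values use the unisex pool."""
    pool = NAMES_BY_GENDER.get(gender, NAMES_BY_GENDER["X"])
    return rng.choice(pool)


def hash_id(raw_id: str, salt: str = "demo-salt") -> str:
    """Salted SHA-256, truncated to 16 hex characters.

    The same input always gives the same output, so joins across tables still work.
    The salt here is a placeholder; in practice it would have to be kept secret.
    """
    return hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()[:16]


rng = random.Random(42)  # fixed seed so the example is reproducible
patients = [
    {"patient_id": "P-1001", "gender": "M"},
    {"patient_id": "P-1002", "gender": "F"},
    {"patient_id": "P-1003", "gender": "X"},
]
synthetic = [
    {
        "patient_id": hash_id(p["patient_id"]),
        "gender": p["gender"],
        "first_name": mock_first_name(p["gender"], rng),
    }
    for p in patients
]
```

Because the name is derived from the gender column rather than generated independently, the relation holds for every row by construction.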
Added value
Business rule testing: devs can test logic like “if gender = female, show pregnancy-related questions” on fully synthetic data that always respects the gender–name relation.
Realistic UX demos: clinicians and product owners see plausible names matching gender, not nonsense combinations that break trust.
2. Absolute calculations (e.g. revenue – costs = profit)
Scenario
Banking accounts table:
account_id
revenue
costs
profit
segment, region, etc.
In real systems, profit = revenue – costs and many downstream rules rely on that.
How you combine Syntho methods
AI synthesis:
Use AI synthesis to generate realistic joint distributions of revenue, costs, segment, region, churn risk, etc. (statistically similar to the real portfolio but with no 1:1 linkage).
Rule-based calculated columns:
Overwrite profit with a Calculated Column: [revenue] - [costs]
Optionally add extra business logic: IF([revenue] - [costs] < 0, 0, [revenue] - [costs]) for scenarios where profit is stored as 0 instead of negative.
Masking / hashing:
Hash account_id and mask PII (e.g. IBAN) while preserving format.
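To make the accounting identity concrete, here is a small Python sketch of the same two rules: plain revenue minus costs, and the optional variant that stores 0 instead of a negative profit. The function name and the sample rows are hypothetical; in Syntho itself the rule is expressed as the Calculated Column formula shown above.

```python
def profit(revenue: float, costs: float, floor_at_zero: bool = False) -> float:
    """Return revenue - costs; optionally store 0 instead of a negative value."""
    value = revenue - costs
    if floor_at_zero and value < 0:
        return 0.0
    return value


# Hypothetical synthesized rows: revenue and costs come from AI synthesis,
# profit is always recomputed by the rule so the identity holds exactly.
rows = [
    {"revenue": 1200.0, "costs": 900.0},
    {"revenue": 500.0, "costs": 650.0},
]
for row in rows:
    row["profit"] = profit(row["revenue"], row["costs"])
    row["profit_floored"] = profit(row["revenue"], row["costs"], floor_at_zero=True)
```

Recomputing profit after synthesis, rather than synthesizing it as a third independent column, is what removes rounding and sampling noise from the identity.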
Added value
Guaranteed accounting consistency: AI handles realistic ranges and correlations; rule-based logic guarantees accounting identities exactly match business rules so regression tests never fail on “impossible” numbers.
Stress testing analytics: you can safely share this data with external consultants / vendors; they can calculate margins, perform profitability analysis and validate models, knowing the math holds.
Explaining model results: analysts can cross-check their profitability dashboards on synthetic data that behaves exactly like production in terms of formulas.
3. Hierarchical relationship (City > Province > Country)
Scenario
Government benefits system:
citizen_id
city
province
country
benefit_type
benefit_amount
City, province and country must always form valid combinations.
How you combine Syntho methods
AI synthesis:
Use AI synthesis on the entity table to generate realistic patterns for benefit_type, benefit_amount, city, demographics, etc.
Rule-based + mockers for geography hierarchy:
Keep a real or mock reference table geo_dim (city, province, country) with valid combinations.
Use Calculated Columns or lookups to set province and country based on city.
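The lookup against the reference table can be sketched in Python as follows. This only illustrates the logic and is not Syntho syntax: the GEO_DIM contents and the fill_geography helper are assumptions made for the example.

```python
# Hypothetical reference table geo_dim: city -> (province, country).
GEO_DIM = {
    "Utrecht": ("Utrecht", "Netherlands"),
    "Rotterdam": ("South Holland", "Netherlands"),
    "Antwerp": ("Antwerp", "Belgium"),
}


def fill_geography(row: dict) -> dict:
    """Overwrite province and country from the reference table, keyed on city.

    A KeyError signals a synthesized city that is missing from geo_dim.
    """
    province, country = GEO_DIM[row["city"]]
    return {**row, "province": province, "country": country}


# The synthesized row may carry an inconsistent province; the lookup repairs it.
synthetic_row = {"citizen_id": "c-01", "city": "Rotterdam", "province": "Utrecht"}
fixed_row = fill_geography(synthetic_row)
```

Keying on city alone assumes city names are unique within the reference table; if they are not, the key would need to include a further attribute.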
Masking:
For privacy, mask or aggregate small cities into “Other city (Province)” to avoid re-identification, while still keeping hierarchical consistency.
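One possible way to aggregate small cities is sketched below in Python. The threshold min_count and the helper name aggregate_small_cities are assumptions; the right cutoff depends on your own re-identification risk assessment.

```python
from collections import Counter


def aggregate_small_cities(rows: list, min_count: int = 10) -> list:
    """Replace cities with fewer than min_count rows by 'Other city (<province>)'.

    Province and country are left untouched, so the hierarchy stays valid.
    """
    counts = Counter(r["city"] for r in rows)
    result = []
    for r in rows:
        if counts[r["city"]] < min_count:
            result.append({**r, "city": f"Other city ({r['province']})"})
        else:
            result.append(dict(r))
    return result


# Tiny illustrative input: three rows for a large city, one for a small one.
sample = [{"city": "Rotterdam", "province": "South Holland"} for _ in range(3)]
sample.append({"city": "Brielle", "province": "South Holland"})
aggregated = aggregate_small_cities(sample, min_count=2)
```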
Added value
Location-based logic behaves correctly (regional eligibility rules, tax brackets, language settings), because city–province–country combos are always valid.
Geospatial analytics (heatmaps, per-province dashboards) remain meaningful on synthetic data, enabling external sharing without privacy issues.
Complex test scenarios: QA can create rule-based test cases (e.g. cross-border regions, provinces with special rules) without messing up global statistics that AI synthesis preserves.
4. New data creation
Scenario
You’re launching a new SaaS product and don’t have real customers yet, but you want:
A realistic multi-tenant database for demos and end-to-end integration testing.
A way to augment it with AI-synthesized rows later, once you have data.
How you combine Syntho methods
Phase 1 – no data yet (pure rule-based / mock)
Use Mock generators to generate:
company_name,user_email,first_name,last_name, etc.
Rule-based calculated columns:
Column: tenant_id
Set generator in column settings to Key generators → Generate. (Creates unique synthetic keys; no formula needed.)
Column: user_email
Column: role (weighted distribution)
Column: signup_date (recent signups)
Column: trial_end_date = signup_date + 14 days
Column: plan_tier (map by company size proxy)
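The Phase 1 column rules listed above can be approximated in Python as a rough sketch. Only the 14-day trial length comes from the text. The role names and weights, the 90-day signup window, the plan tier thresholds, and the email pattern are all hypothetical, and none of this is Syntho formula syntax.

```python
import random
from datetime import date, timedelta

rng = random.Random(7)  # fixed seed for a reproducible example

ROLES = ["admin", "member", "viewer"]  # hypothetical role names
ROLE_WEIGHTS = [0.1, 0.6, 0.3]         # hypothetical weighted distribution


def plan_tier(employee_count: int) -> str:
    """Map a company size proxy to a plan tier (thresholds are assumptions)."""
    if employee_count < 50:
        return "starter"
    if employee_count < 500:
        return "business"
    return "enterprise"


def make_user(tenant_id: int, first: str, last: str, domain: str,
              employee_count: int, today: date) -> dict:
    """Build one mock user row following the column rules above."""
    signup = today - timedelta(days=rng.randint(0, 90))  # recent signups
    return {
        "tenant_id": tenant_id,
        "user_email": f"{first}.{last}@{domain}".lower(),
        "role": rng.choices(ROLES, weights=ROLE_WEIGHTS, k=1)[0],
        "signup_date": signup,
        "trial_end_date": signup + timedelta(days=14),
        "plan_tier": plan_tier(employee_count),
    }


user = make_user(1, "Ada", "Jansen", "example.com", 120, date(2024, 6, 1))
```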
Phase 2 – once you have initial data:
Optionally, train AI synthesis on the production entity table (per tenant or global) to create more users, sessions, transactions, etc., preserving correlations (usage patterns, feature adoption).
Combine with calculated columns to enforce specific business rules (e.g. multi-tenant isolation, SLA tiers).
Column: sla_tier (carry forward or rebalance)
Column: response_sla_hours (business rule from SLA tier)
Column: trial_end_date (ensure consistent with signup_date from mixed sources)
Column: environment_flag (separate demo/test tenants)
Column: tenant_isolation_guard (hard guard rails)
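As one possible reading of the Phase 2 rules above, the Python sketch below derives response_sla_hours from sla_tier, recomputes trial_end_date from signup_date, flags demo tenants, and rejects rows for tenants outside an allowed set. The SLA hours, tenant IDs, and flag values are hypothetical, and the actual Syntho column formulas may differ.

```python
from datetime import date, timedelta

SLA_HOURS = {"gold": 4, "silver": 8, "bronze": 24}  # hypothetical business rule
DEMO_TENANTS = {"t-demo"}                           # hypothetical demo tenant IDs


def apply_rules(row: dict, allowed_tenants: set) -> dict:
    """Enforce business rules on a row that may come from mixed sources."""
    # Hard guard rail: refuse rows that belong to a tenant outside this dataset.
    if row["tenant_id"] not in allowed_tenants:
        raise ValueError(f"row belongs to unknown tenant {row['tenant_id']!r}")
    fixed = dict(row)
    fixed["response_sla_hours"] = SLA_HOURS[row["sla_tier"]]
    # Recompute instead of trusting the stored value, so both dates stay consistent.
    fixed["trial_end_date"] = row["signup_date"] + timedelta(days=14)
    fixed["environment_flag"] = "demo" if row["tenant_id"] in DEMO_TENANTS else "test"
    return fixed


# A row whose stored trial_end_date contradicts its signup_date.
mixed_row = {
    "tenant_id": "t-001",
    "sla_tier": "gold",
    "signup_date": date(2024, 1, 10),
    "trial_end_date": date(2024, 1, 1),
}
clean_row = apply_rules(mixed_row, allowed_tenants={"t-001", "t-demo"})
```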
Add masking when seeding from production:
If you seed from a production snapshot, use calculated columns, masking, or mockers for PII, and AI synthesis, duplication, masking, mockers, or calculated columns for behavioral attributes.
Added value
Instantly available realistic test DB even before going live.
As you grow, AI synthesis scales up data volume with realistic patterns, while rule-based generators maintain product-specific rules and identities (tenants, subscription tiers).
5. Edge case and rare scenario creation
Scenario
Health insurance claims system where:
99.9% of cases are “normal”
But test teams need lots of edge cases:
extremely high claim amounts
rare combinations of diagnosis codes
weird date patterns (backdated claims, overlapping coverage)
special product types with exception logic
How you combine Syntho methods
AI synthesis:
Generate the normal background population of claims based on your source data: realistic volumes, distributions, seasonal patterns, co-occurrence of diagnoses & procedures.
Rule-based edge case injection:
Use Calculated Columns and Mockers to override or add specific rare scenarios on top of AI synthetic data:
Selector for 0.1% edge rows.
Extremely high claim amounts
Rare diagnosis codes (weighted)
Force product/diagnosis combos that trigger rules
Weird date patterns: backdated claims
Weird date patterns: coverage overlaps or ends before claim
Flip eligibility flags or set error codes
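The injection steps above can be sketched in Python as follows. The 0.1% default rate comes from the text. The field names, the amount range, the three scenario labels, and the inject_edge_cases helper are assumptions for illustration, and this is not how Syntho implements the selector.

```python
import random
from datetime import date, timedelta


def inject_edge_cases(claims: list, rate: float = 0.001, seed: int = 0) -> list:
    """Flag roughly `rate` of the rows as edge cases and overwrite selected fields."""
    rng = random.Random(seed)
    result = []
    for claim in claims:
        row = dict(claim)
        row["is_edge_case"] = rng.random() < rate  # selector for edge rows
        if row["is_edge_case"]:
            scenario = rng.choice(["high_amount", "backdated", "coverage_ended"])
            row["edge_scenario"] = scenario
            if scenario == "high_amount":
                # Extremely high claim amount (range is hypothetical).
                row["claim_amount"] = round(rng.uniform(250_000, 1_000_000), 2)
            elif scenario == "backdated":
                # Claim dated before coverage started.
                row["claim_date"] = row["coverage_start"] - timedelta(days=rng.randint(1, 365))
            else:
                # Coverage ends before the claim date.
                row["coverage_end"] = row["claim_date"] - timedelta(days=rng.randint(1, 30))
        result.append(row)
    return result


# Hypothetical background population, as AI synthesis might produce it.
base_claims = [
    {
        "claim_amount": 120.0,
        "claim_date": date(2024, 3, 15),
        "coverage_start": date(2024, 1, 1),
        "coverage_end": date(2024, 12, 31),
    }
    for _ in range(200)
]
all_edge = inject_edge_cases(base_claims, rate=1.0, seed=1)  # force every row, for testing
no_edge = inject_edge_cases(base_claims, rate=0.0, seed=1)   # leave every row untouched
```

Keeping an explicit is_edge_case flag lets QA filter the injected rows back out when they want to compare statistics against the unmodified synthetic population.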
Masking:
If you started from a de-identified copy of real claims, mock/mask remaining identifiers while keeping relationships so that edge cases are still context-rich.
Added value
Massive coverage of “what-if” conditions without having to wait for those rare cases in production; key for regression testing and rules engines.
Regulatory / audit confidence: you can prove that all critical business rules, alerts and exception flows have been tested against realistic but privacy-safe data.
Balanced datasets for QA and ML: AI synthesis covers privacy and keeps realistic distributions, while rule-based generation oversamples edge cases where you need them (e.g. fraud, high-cost outliers).

