Example data generation scenarios

The examples below show how you can combine different Syntho generators to get the data you need for your use case. Note the following for multi-table setups:

  • AI synthesis works best on single tables.

  • If you need to preserve cross-table relationships, replace the AI synthesis steps in the examples below with duplicate, mask, mock, or rule-based generators that use consistent mapping.

1. Deterministic relations (e.g. male → male name)

Scenario

Hospital EHR table patients with:

  • patient_id

  • gender (M/F/X)

  • first_name, last_name

  • date_of_birth

  • diagnosis_code

  • visit_count

How you combine Syntho methods

  • AI synthesis for:

    • date_of_birth, diagnosis_code, visit_count, and other clinical / behavioral fields, so you preserve realistic age distributions, ICD code combinations, visit patterns, etc., without 1-to-1 linkage to real people.

  • Mockers + calculated columns for deterministic gender–name relation:

    • Use a Calculated Column with mockers for first_name and last_name. For first_name, for example: IF([gender] = 'M', MOCK_FIRST_NAME_MALE, IF([gender] = 'F', MOCK_FIRST_NAME_FEMALE, MOCK_FIRST_NAME_UNISEX))

  • Masking for IDs:

    • Hash patient_id so you can still join across tables but never see the real identifier.
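
The gender–name rule and the ID hashing described above can be sketched in plain Python. This illustrates the logic only and is not Syntho's calculated-column syntax; the name pools, the `mock_first_name` and `hash_id` helpers, and the salt value are hypothetical.

```python
import hashlib
import random

# Hypothetical name pools; a real mocker would draw from much larger lists.
NAME_POOLS = {
    "M": ["James", "Daniel", "Omar"],
    "F": ["Maria", "Sophie", "Aisha"],
    "X": ["Alex", "Robin", "Sam"],  # unisex pool, also used as fallback
}


def mock_first_name(gender: str, rng: random.Random) -> str:
    """Pick a first name from the pool matching the gender value."""
    pool = NAME_POOLS.get(gender, NAME_POOLS["X"])
    return rng.choice(pool)


def hash_id(patient_id: str, salt: str = "demo-salt") -> str:
    """Salted, deterministic hash: equal inputs give equal tokens, so joins still work."""
    digest = hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()
    return digest[:16]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
patients = [
    {"patient_id": "P001", "gender": "M"},
    {"patient_id": "P002", "gender": "F"},
    {"patient_id": "P003", "gender": "X"},
]
synthetic = [
    {
        "patient_id": hash_id(p["patient_id"]),
        "gender": p["gender"],
        "first_name": mock_first_name(p["gender"], rng),
    }
    for p in patients
]
```

Because `hash_id` is deterministic for a given salt, applying it to the same identifier in another table yields the same token, which is what keeps cross-table joins intact.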

Added value

  • Business rule testing: devs can test logic like “if gender = female, show pregnancy-related questions” on fully synthetic data that always respects the gender–name relation.

  • Realistic UX demos: clinicians and product owners see plausible names matching gender, not nonsense combinations that break trust.

2. Absolute calculations (e.g. revenue – costs = profit)

Scenario

Banking accounts table:

  • account_id

  • revenue

  • costs

  • profit

  • segment, region, etc.

In real systems, profit = revenue – costs and many downstream rules rely on that.

How you combine Syntho methods

  • AI synthesis:

    • Use AI synthesis to generate realistic joint distributions of revenue, costs, segment, region, churn risk, etc. (statistically similar to real portfolio but no 1:1 linkage).

  • Rule-based calculated columns:

    • Overwrite profit with a Calculated Column: [revenue] - [costs]

    • Optionally add extra business logic: IF([revenue] - [costs] < 0, 0, [revenue] - [costs]) for scenarios where profit is stored as 0 instead of negative.

  • Masking / hashing:

    • Hash account_id and mask PII (e.g. IBAN) while preserving format.
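
As a rough illustration of the two calculated columns above, the following Python sketch recomputes profit from revenue and costs, with an optional floor at zero. The `profit` helper and the sample rows are hypothetical; in Syntho the same rule would be expressed as a Calculated Column.

```python
from decimal import Decimal


def profit(revenue: Decimal, costs: Decimal, floor_at_zero: bool = False) -> Decimal:
    """Return revenue - costs; optionally store losses as 0 instead of a negative value."""
    result = revenue - costs
    if floor_at_zero and result < 0:
        return Decimal("0")
    return result


# Hypothetical rows standing in for AI-synthesized revenue and costs.
rows = [
    {"revenue": Decimal("1200.50"), "costs": Decimal("800.25")},
    {"revenue": Decimal("300.00"), "costs": Decimal("450.00")},
]

for row in rows:
    # Overwrite profit so the accounting identity always holds exactly.
    row["profit"] = profit(row["revenue"], row["costs"])
    row["profit_floored"] = profit(row["revenue"], row["costs"], floor_at_zero=True)
```

`Decimal` is used here instead of floats so that the identity profit = revenue - costs holds exactly, without rounding drift.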

Added value

  • Guaranteed accounting consistency: AI handles realistic ranges and correlations; rule-based logic guarantees accounting identities exactly match business rules so regression tests never fail on “impossible” numbers.

  • Stress testing analytics: you can safely share this data with external consultants / vendors; they can calculate margins, perform profitability analysis and validate models, knowing the math holds.

  • Explaining model results: analysts can cross-check their profitability dashboards on synthetic data that behaves exactly like production in terms of formulas.

3. Hierarchical relationship (City > Province > Country)

Scenario

Government benefits system:

  • citizen_id

  • city

  • province

  • country

  • benefit_type

  • benefit_amount

City, province and country must always form valid combinations.

How you combine Syntho methods

  • AI synthesis:

    • Use AI synthesis on the entity table to generate realistic patterns for benefit_type, benefit_amount, city, demographics, etc.

  • Rule-based + mockers for geography hierarchy:

    • Keep a real or mock reference table geo_dim:

      • city, province, country with valid combinations.

    • Use Calculated Columns or lookups to set province and country based on city.

  • Masking:

    • For privacy, mask or aggregate small cities into “Other city (Province)” to avoid re-identification, while still keeping hierarchical consistency.
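
The lookup and small-city aggregation described above could look roughly like this in Python. The contents of `GEO_DIM`, the helper names, and the `min_count` threshold are assumptions made for illustration; the sketch also assumes each city name maps to exactly one province, which a real reference table would need to guarantee.

```python
from collections import Counter

# Hypothetical geo_dim reference table: city -> (province, country).
GEO_DIM = {
    "Utrecht": ("Utrecht", "Netherlands"),
    "Rotterdam": ("South Holland", "Netherlands"),
    "Delft": ("South Holland", "Netherlands"),
    "Ghent": ("East Flanders", "Belgium"),
}


def fill_hierarchy(rows):
    """Derive province and country from city so combinations are always valid."""
    for row in rows:
        province, country = GEO_DIM[row["city"]]
        row["province"] = province
        row["country"] = country
    return rows


def aggregate_small_cities(rows, min_count):
    """Replace cities with fewer than min_count rows by 'Other city (<province>)'."""
    counts = Counter(row["city"] for row in rows)
    for row in rows:
        if counts[row["city"]] < min_count:
            row["city"] = f"Other city ({row['province']})"
    return rows


cities = ["Rotterdam", "Rotterdam", "Rotterdam", "Delft", "Utrecht", "Utrecht", "Ghent"]
rows = [{"city": c} for c in cities]
rows = aggregate_small_cities(fill_hierarchy(rows), min_count=2)
```

The hierarchy is filled before aggregation so that the province and country of a masked city stay correct.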

Added value

  • Location-based logic behaves correctly (regional eligibility rules, tax brackets, language settings), because city–province–country combos are always valid.

  • Geospatial analytics (heatmaps, per-province dashboards) remain meaningful on synthetic data, enabling external sharing without privacy issues.

  • Complex test scenarios: QA can create rule-based test cases (e.g. cross-border regions, provinces with special rules) without messing up global statistics that AI synthesis preserves.

4. New data creation

Scenario

You’re launching a new SaaS product and don’t have real customers yet, but you want:

  • A realistic multi-tenant database for demos and end-to-end integration testing.

  • Later, once you have data, to augment it with AI-synthesized rows.

How you combine Syntho methods

  1. Phase 1 – no data yet (pure rule-based / mock)

    • Use Mock generators to generate:

      • company_name, user_email, first_name, last_name, etc.

    • Rule-based calculated columns:

      • Column: tenant_id

        • Set generator in column settings to Key generators → Generate. (Creates unique synthetic keys; no formula needed.)

      • Column: user_email

      • Column: role (weighted distribution)

      • Column: signup_date (recent signups)

      • Column: trial_end_date = signup_date + 14 days

      • Column: plan_tier (map by company size proxy)

  2. Phase 2 – once you have initial data:

    • Optionally, train AI synthesis on the production entity table (per tenant or global) to create more users, sessions, transactions, etc., preserving correlations (usage patterns, feature adoption).

    • Combine with calculated columns to enforce specific business rules (e.g. multi-tenant isolation, SLA tiers).

      • Column: sla_tier (carry forward or rebalance)

      • Column: response_sla_hours (business rule from SLA tier)

      • Column: trial_end_date (ensure consistent with signup_date from mixed sources)

      • Column: environment_flag (separate demo/test tenants)

      • Column: tenant_isolation_guard (hard guard rails)

  3. Add masking when seeding from production:

    • If you seed from a production snapshot, use calculated columns, masking, or mockers for PII, and AI synthesis, duplicate, mask, mock, or calculated columns for behavioral attributes.
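
The Phase 1 rules (unique keys, weighted roles, recent signups, and a 14-day trial) can be sketched in plain Python as follows. The role names, weights, 30-day signup window, email pattern, and reference date are all hypothetical choices for this sketch, not Syntho defaults.

```python
import random
from datetime import date, timedelta

ROLES = ["admin", "member", "viewer"]
ROLE_WEIGHTS = [0.1, 0.6, 0.3]  # hypothetical weighted distribution
TRIAL_DAYS = 14
SIGNUP_WINDOW_DAYS = 30  # "recent" is assumed to mean the last 30 days


def mock_user(index: int, today: date, rng: random.Random) -> dict:
    """Build one mock row; trial_end_date is always signup_date + 14 days."""
    signup = today - timedelta(days=rng.randint(0, SIGNUP_WINDOW_DAYS))
    return {
        "tenant_id": f"T{index:04d}",  # unique synthetic key
        "user_email": f"user{index}@example.com",
        "role": rng.choices(ROLES, weights=ROLE_WEIGHTS, k=1)[0],
        "signup_date": signup,
        "trial_end_date": signup + timedelta(days=TRIAL_DAYS),
    }


rng = random.Random(7)  # fixed seed for reproducibility
today = date(2025, 1, 15)  # hypothetical reference date
users = [mock_user(i, today, rng) for i in range(1, 21)]
```

Deriving `trial_end_date` from `signup_date` inside the same function is what guarantees the two columns can never drift apart, including when rows from mixed sources are combined later.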

Added value

  • Instantly available realistic test DB even before going live.

  • As you grow, AI synthesis scales up data volume with realistic patterns, while rule-based generators maintain product-specific rules and identities (tenants, subscription tiers).

5. Edge case and rare scenario creation

Scenario

Health insurance claims system where:

  • 99.9% of cases are “normal”

  • But test teams need lots of edge cases:

    • extremely high claim amounts

    • rare combinations of diagnosis codes

    • weird date patterns (backdated claims, overlapping coverage)

    • special product types with exception logic

How you combine Syntho methods

  • AI synthesis:

    • Generate the normal background population of claims based on your source data: realistic volumes, distributions, seasonal patterns, co-occurrence of diagnoses & procedures.

  • Rule-based edge case injection:

    • Use Calculated Columns and Mockers to override or add specific rare scenarios on top of AI synthetic data:

      • Selector for 0.1% edge rows.

      • Extremely high claim amounts

      • Rare diagnosis codes (weighted)

      • Force product/diagnosis combos that trigger rules

      • Weird date patterns: backdated claims

      • Weird date patterns: coverage overlaps or ends before claim

      • Flip eligibility flags or set error codes

  • Masking:

    • If you started from a de-identified copy of real claims, mock/mask remaining identifiers while keeping relationships so that edge cases are still context-rich.
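
To illustrate edge case injection on top of a normal background population, here is a minimal Python sketch. The column names, amount ranges, scenario labels, and the `inject_edge_cases` helper are hypothetical, and the randomly generated `claims` list merely stands in for AI-synthesized output.

```python
import random
from datetime import date, timedelta

EDGE_RATE = 0.001  # 0.1% of rows become edge cases


def inject_edge_cases(claims, rng, edge_rate=EDGE_RATE):
    """Override a small random subset of rows with one rare scenario each."""
    n_edge = max(1, round(len(claims) * edge_rate))
    for i in rng.sample(range(len(claims)), n_edge):
        claim = claims[i]
        scenario = rng.choice(["high_amount", "backdated", "coverage_ended"])
        if scenario == "high_amount":
            # Extremely high claim amount
            claim["claim_amount"] = round(rng.uniform(250_000, 1_000_000), 2)
        elif scenario == "backdated":
            # Service date more than a year before submission
            claim["service_date"] = claim["submitted_date"] - timedelta(days=rng.randint(400, 900))
        else:
            # Coverage ends before the claimed service date
            claim["coverage_end"] = claim["service_date"] - timedelta(days=rng.randint(1, 60))
        claim["edge_case"] = scenario
    return claims


rng = random.Random(11)  # fixed seed for reproducibility
base = date(2024, 6, 1)
claims = [
    {
        "claim_id": i,
        "claim_amount": round(rng.uniform(50, 2_000), 2),
        "service_date": base,
        "submitted_date": base + timedelta(days=7),
        "coverage_end": date(2024, 12, 31),
        "edge_case": None,
    }
    for i in range(10_000)
]
claims = inject_edge_cases(claims, rng)
edge_rows = [c for c in claims if c["edge_case"] is not None]
```

Tagging each overridden row in an `edge_case` column makes it easy for test teams to filter for a specific scenario, or to raise `edge_rate` when they want deliberate oversampling.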

Added value

  • Massive coverage of “what-if” conditions without having to wait for those rare cases in production; key for regression testing and rules engines.

  • Regulatory / audit confidence: you can prove that all critical business rules, alerts and exception flows have been tested against realistic but privacy-safe data.

  • Balanced datasets for QA and ML: AI synthesis protects privacy and keeps realistic distributions, while rule-based generators oversample edge cases where you need them (e.g. fraud, high-cost outliers).
