Example data generation scenarios

The examples below show how you can combine different Syntho generators to get the data you need for your use case.

Note for multi-table setups:

AI synthesis works best on single tables.

If you need to preserve cross-table relationships, replace AI synthesis in the examples below with the duplicate generator, or with mask, mock, or rule-based generators that use consistent mapping.

1. Deterministic relations (e.g. male → male name)

Scenario

Hospital EHR table patients with:

patient_id
gender (M/F/X)
first_name, last_name
date_of_birth
diagnosis_code
visit_count

How you combine Syntho methods

AI synthesis for:
- date_of_birth, diagnosis_code, visit_count, and other clinical / behavioral fields, so you preserve realistic age distributions, ICD code combinations, visit patterns, etc., without 1-to-1 linkage to real people.
Mockers + calculated columns for deterministic gender–name relation:
- Use a Calculated Column with mockers for first_name and last_name. For first_name, for example: IF([gender] = 'M', MOCK_FIRST_NAME_MALE, IF([gender] = 'F', MOCK_FIRST_NAME_FEMALE, MOCK_FIRST_NAME_UNISEX))
Masking for IDs:
- Hash patient_id so you can still join across tables but never see the real identifier.
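
To illustrate why a hashed patient_id still supports joins, here is a minimal Python sketch outside Syntho. It assumes a keyed hash (HMAC-SHA256) purely for illustration; how Syntho implements hashing internally is not described here, and `SECRET_KEY` and `pseudonymize` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical secret for illustration; a real key would be stored outside the dataset.
SECRET_KEY = b"example-secret-not-for-production"


def pseudonymize(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable token for an identifier (same input, same output)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


patients = [{"patient_id": "P-1001", "gender": "F"}]
visits = [{"patient_id": "P-1001", "visit_count": 3}]

# Apply the same mapping to the key column in both tables.
masked_patients = [{**row, "patient_id": pseudonymize(row["patient_id"])} for row in patients]
masked_visits = [{**row, "patient_id": pseudonymize(row["patient_id"])} for row in visits]

# The join key still matches across tables, but the original value is no longer present.
joinable = masked_patients[0]["patient_id"] == masked_visits[0]["patient_id"]
```

Because the mapping is deterministic, every table that contains the same identifier receives the same token, which is what keeps the relationship intact.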

Added value

Business rule testing: devs can test logic like “if gender = female, show pregnancy-related questions” on fully synthetic data that always respects the gender–name relation.
Realistic UX demos: clinicians and product owners see plausible names matching gender, not nonsense combinations that break trust.

2. Absolute calculations (e.g. revenue – costs = profit)

Scenario

Banking accounts table:

account_id
revenue
costs
profit
segment, region, etc.

In real systems, profit = revenue – costs and many downstream rules rely on that.

How you combine Syntho methods

AI synthesis:
- Use AI synthesis to generate realistic joint distributions of revenue, costs, segment, region, churn risk, etc. (statistically similar to the real portfolio, but with no 1:1 linkage to real accounts).
Rule-based calculated columns:
- Overwrite profit with a Calculated Column: [revenue] - [costs]
- Optionally add extra business logic: IF([revenue] - [costs] < 0, 0, [revenue] - [costs]) for scenarios where profit is stored as 0 instead of negative.
Masking / hashing:
- Hash account_id and mask PII (e.g. IBAN) while preserving format.
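
Once the data is generated, you may want to confirm that the accounting identity holds on every row. The following is a minimal Python sketch of such a check, run outside Syntho on exported rows; the function names and the sample rows are hypothetical.

```python
from decimal import Decimal


def expected_profit(revenue: Decimal, costs: Decimal, floor_at_zero: bool = False) -> Decimal:
    """Compute profit as revenue minus costs, optionally storing losses as 0."""
    result = revenue - costs
    if floor_at_zero and result < 0:
        return Decimal("0")
    return result


def find_violations(rows, floor_at_zero: bool = False):
    """Return the account_ids whose stored profit does not match the rule."""
    return [
        row["account_id"]
        for row in rows
        if row["profit"] != expected_profit(row["revenue"], row["costs"], floor_at_zero)
    ]


# Hypothetical exported rows; Decimal avoids floating-point rounding on money values.
rows = [
    {"account_id": "a1", "revenue": Decimal("120.50"), "costs": Decimal("100.25"), "profit": Decimal("20.25")},
    {"account_id": "a2", "revenue": Decimal("50.00"), "costs": Decimal("80.00"), "profit": Decimal("-30.00")},
]

violations = find_violations(rows)
violations_with_floor = find_violations(rows, floor_at_zero=True)
```

With the plain rule both rows pass. With the floor-at-zero variant, account a2 is reported because its stored profit is negative.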

Added value

Guaranteed accounting consistency: AI handles realistic ranges and correlations; rule-based logic guarantees accounting identities exactly match business rules so regression tests never fail on “impossible” numbers.
Stress testing analytics: you can safely share this data with external consultants / vendors; they can calculate margins, perform profitability analysis and validate models, knowing the math holds.
Explaining model results: analysts can cross-check their profitability dashboards on synthetic data that behaves exactly like production in terms of formulas.

3. Hierarchical relationship (City > Province > Country)

Scenario

Government benefits system:

citizen_id
city
province
country
benefit_type
benefit_amount

City, province and country must always form valid combinations.

How you combine Syntho methods

AI synthesis:
- Use AI synthesis on the entity table to generate realistic patterns for benefit_type, benefit_amount, city, demographics, etc.
Rule-based + mockers for geography hierarchy:
- Keep a real or mock reference table geo_dim:
  - city, province, country with valid combinations.
- Use Calculated Columns or lookups to set province and country based on city:
  - e.g.
    SWITCH(UPPER(TRIM([CITY])), "TORONTO", "ONTARIO", "MONTREAL", "QUEBEC", ...)
Masking:
- For privacy, mask or aggregate small cities into “Other city (Province)” to avoid re-identification, while still keeping hierarchical consistency.
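
The lookup idea can be sketched in Python as follows. This is an illustration of the logic only, not Syntho syntax: `GEO_DIM` stands in for the geo_dim reference table, and the function name is hypothetical.

```python
# Stand-in for the geo_dim reference table: city -> (province, country).
GEO_DIM = {
    "TORONTO": ("ONTARIO", "CANADA"),
    "MONTREAL": ("QUEBEC", "CANADA"),
}


def fill_hierarchy(row: dict) -> dict:
    """Derive province and country from city so the combination is always valid."""
    key = row["city"].strip().upper()  # same normalisation as UPPER(TRIM([CITY]))
    if key not in GEO_DIM:
        raise KeyError(f"city {row['city']!r} is missing from the reference table")
    province, country = GEO_DIM[key]
    return {**row, "province": province, "country": country}


citizen = fill_hierarchy({"citizen_id": 1, "city": " toronto ", "province": "QUEBEC"})
```

Any province or country value that arrived with the row is overwritten by the value from the reference table, so an invalid combination cannot survive. Raising an error for unknown cities is one possible design choice; mapping them to an "Other" bucket is another.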

Added value

Location-based logic behaves correctly (regional eligibility rules, tax brackets, language settings), because city–province–country combos are always valid.
Geospatial analytics (heatmaps, per-province dashboards) remain meaningful on synthetic data, enabling external sharing without privacy issues.
Complex test scenarios: QA can create rule-based test cases (e.g. cross-border regions, provinces with special rules) without messing up global statistics that AI synthesis preserves.

4. New data creation

Scenario

You’re launching a new SaaS product and don’t have real customers yet, but you want:

A realistic multi-tenant database for demos and end-to-end integration testing.
Later, once you have data, to augment it with AI-synthesized rows.

How you combine Syntho methods

Phase 1 – no data yet (pure rule-based / mock)

Use Mock generators to generate:
- company_name, user_email, first_name, last_name, etc.

Rule-based calculated columns:

Column: tenant_id
- Set generator in column settings to Key generators → Generate. (Creates unique synthetic keys; no formula needed.)

Column: user_email

// Email per user, built from the name columns and a mocked free-email domain
LOWER(CONCATENATE([FIRST_NAME], ".", [LAST_NAME], "@", MOCK_FREE_EMAIL_DOMAIN_0))

Column: role (weighted distribution)

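// ~10% OWNER, ~20% ADMIN, ~70% MEMBER
// Assuming each RANDBETWEEN call draws a new value, the second threshold only
// applies to the ~90% of rows that are not OWNER: 22% of 90% is roughly 20%.
IFS(
  RANDBETWEEN(1,100) <= 10,  "OWNER",
  RANDBETWEEN(1,100) <= 22,  "ADMIN",
  TRUE,                      "MEMBER"
)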
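
Threshold-based weighting behaves differently depending on whether one random number is reused across all branches or each branch draws its own number. The Python sketch below (not Syntho syntax; all function names are hypothetical) simulates both readings with example thresholds of 10 and 30, so you can see which proportions to expect.

```python
import random
from collections import Counter


def role_single_draw(rng: random.Random) -> str:
    """One draw compared against cumulative thresholds: 10% / 20% / 70%."""
    roll = rng.randint(1, 100)
    if roll <= 10:
        return "OWNER"
    if roll <= 30:
        return "ADMIN"
    return "MEMBER"


def role_independent_draws(rng: random.Random) -> str:
    """A fresh draw per branch: the second threshold applies to the remaining 90%."""
    if rng.randint(1, 100) <= 10:
        return "OWNER"
    if rng.randint(1, 100) <= 30:
        return "ADMIN"
    return "MEMBER"


def shares(pick, n: int = 100_000, seed: int = 42) -> dict:
    """Estimate the share of each role over n simulated rows."""
    rng = random.Random(seed)
    counts = Counter(pick(rng) for _ in range(n))
    return {role: counts[role] / n for role in ("OWNER", "ADMIN", "MEMBER")}


single = shares(role_single_draw)
independent = shares(role_independent_draws)
```

With a single draw the shares come out close to 10/20/70. With independent draws they come out close to 10/27/63, because 30% of the remaining 90% of rows become ADMIN.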

Column: signup_date (recent signups)

DATEADD(TODAY(), -RANDBETWEEN(0, 90), "day")

Column: trial_end_date = signup_date + 14 days
```
DATEADD([signup_date], 14, "day")
```
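
The relation between the two date columns can be checked with a short Python sketch. It mirrors the logic of the formulas above (a signup within the last 90 days, a trial that ends 14 days later) and is not Syntho syntax; the function name and the fixed reference date are assumptions for illustration.

```python
import random
from datetime import date, timedelta


def signup_and_trial(today: date, rng: random.Random,
                     max_age_days: int = 90, trial_days: int = 14):
    """Return (signup_date, trial_end_date) for one synthetic user."""
    signup = today - timedelta(days=rng.randint(0, max_age_days))
    trial_end = signup + timedelta(days=trial_days)
    return signup, trial_end


rng = random.Random(1)
today = date(2024, 6, 30)  # fixed reference date so the example is reproducible
pairs = [signup_and_trial(today, rng) for _ in range(1000)]

# Every trial ends exactly 14 days after signup, and no signup is older than 90 days.
all_consistent = all(
    (trial_end - signup).days == 14 and 0 <= (today - signup).days <= 90
    for signup, trial_end in pairs
)
```

Deriving trial_end_date from signup_date, rather than generating it separately, is what guarantees the two columns never contradict each other.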

Column: plan_tier (map by company size proxy)

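// Random tier assignment as a stand-in for company size (no size column is used).
// Assuming each RANDBETWEEN call draws a new value: ~10% ENTERPRISE, ~36% PRO, ~54% FREE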
SWITCH(TRUE,
  RANDBETWEEN(1,100) <= 10, "ENTERPRISE",
  RANDBETWEEN(1,100) <= 40, "PRO",
  "FREE"
)

Phase 2 – once you have initial data:
- Optionally, train AI synthesis on the production entity table (per tenant or global) to create more users, sessions, transactions, etc., preserving correlations (usage patterns, feature adoption).
- Combine with calculated columns to enforce specific business rules (e.g. multi-tenant isolation, SLA tiers).
  - Column: sla_tier (carry forward or rebalance)
    // Keep existing if present, else assign
    IF(ISNULL([sla_tier]), IFS(RANDBETWEEN(1,100)<=10,"GOLD", RANDBETWEEN(1,100)<=40,"SILVER", TRUE,"BRONZE"), [sla_tier])
  - Column: response_sla_hours (business rule from SLA tier)
    SWITCH(UPPER(TRIM([sla_tier])), "GOLD", 4, "SILVER", 8, "BRONZE", 24, 24)
  - Column: trial_end_date (ensure consistent with signup_date from mixed sources)
    IF(ISNULL([trial_end_date]), DATEADD([signup_date], 14, "day"), [trial_end_date])
  - Column: environment_flag (separate demo/test tenants)
    // Oversample demo tenants for testing flows
    IFS(RANDBETWEEN(1,100) <= 20, "DEMO", TRUE, "PROD")
  - Column: tenant_isolation_guard (hard guard rails)
    // Example: block cross-tenant sharing flag in demo data
    IF([environment_flag]="DEMO", FALSE, [sharing_enabled])
Add masking when seeding from production:
- If you seed from a production snapshot, use calculated columns, masking, or mockers for PII, and AI synthesis, duplicate, mask, mock, or calculated columns for behavioral attributes.

Added value

Instantly available realistic test DB even before going live.
As you grow, AI synthesis scales up data volume with realistic patterns, while rule-based generators maintain product-specific rules and identities (tenants, subscription tiers).

5. Edge case and rare scenario creation

Scenario

Health insurance claims system where:

99.9% of cases are “normal”
But test teams need lots of edge cases:
- extremely high claim amounts
- rare combinations of diagnosis codes
- weird date patterns (backdated claims, overlapping coverage)
- special product types with exception logic

How you combine Syntho methods

AI synthesis:
- Generate the normal background population of claims based on your source data: realistic volumes, distributions, seasonal patterns, co-occurrence of diagnoses & procedures.

Rule-based edge case injection:

Use Calculated Columns and Mockers to override or add specific rare scenarios on top of AI synthetic data:

Selector for 0.1% edge rows.

// New column: EDGE_FLAG (TRUE for ~0.1% of rows)
RAND() < 0.001

Extremely high claim amounts

// Column: claim_amount
IF([EDGE_FLAG],
   RANDBETWEEN(100000, 500000),
   [claim_amount]
)

Rare diagnosis codes (weighted)

// Column: diagnosis_code
IFS(
  AND([EDGE_FLAG], RAND() < 0.33),      "E75.5",           // lysosomal storage d/o (example)
  AND([EDGE_FLAG], RAND() < 0.66),      "G12.2",           // motor neuron d/o (example)
  AND([EDGE_FLAG], TRUE),               "D42.0",           // rare tumor (example)
  TRUE,                                 [diagnosis_code]   // default keep
)

Force product/diagnosis combos that trigger rules

// Column: product_type
IF([EDGE_FLAG], "SPECIAL_X", [product_type])

// Column: diagnosis_code (pair with SPECIAL_X; an alternative to the weighted formula above, since a column takes one formula)
IF([EDGE_FLAG], "Q87.1", [diagnosis_code])

Weird date patterns: backdated claims

// Column: claim_date (move back 30–365 days)
IF([EDGE_FLAG],
   DATEADD([claim_date], -RANDBETWEEN(30, 365), "day"),
   [claim_date]
)

Weird date patterns: coverage overlaps or ends before claim

// Column: coverage_end_date (ensure < claim_date to trigger exception)
IF([EDGE_FLAG],
   DATEADD([claim_date], -RANDBETWEEN(1, 30), "day"),
   [coverage_end_date]
)

Flip eligibility flags or set error codes

// Column: eligible
IF([EDGE_FLAG], FALSE, [eligible])

// Column: error_code
IF([EDGE_FLAG], "E_RULE_123", [error_code])
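
The overall pattern (flag a small share of rows, then override several columns on exactly those rows) can be sketched in Python as follows. This illustrates the logic only and is not Syntho syntax; the function name, the sample rows, and the seed are hypothetical.

```python
import random
from datetime import date, timedelta


def inject_edge_cases(rows, rate: float = 0.001, seed: int = 7):
    """Flag roughly `rate` of the rows and override selected columns on them."""
    rng = random.Random(seed)
    result = []
    for original in rows:
        row = dict(original)  # copy so the input rows stay untouched
        row["edge_flag"] = rng.random() < rate
        if row["edge_flag"]:
            row["claim_amount"] = rng.randint(100_000, 500_000)
            # Coverage ends 1 to 30 days before the claim, which should trigger the exception flow.
            row["coverage_end_date"] = row["claim_date"] - timedelta(days=rng.randint(1, 30))
            row["eligible"] = False
        result.append(row)
    return result


# Hypothetical "normal" background population standing in for AI-synthesized claims.
base = [
    {"claim_id": i, "claim_amount": 250, "claim_date": date(2024, 1, 15),
     "coverage_end_date": date(2024, 12, 31), "eligible": True}
    for i in range(50_000)
]

claims = inject_edge_cases(base)
edge_rows = [row for row in claims if row["edge_flag"]]
normal_rows = [row for row in claims if not row["edge_flag"]]
```

Driving every override from the same flag keeps the edge cases internally consistent: a flagged row is high-value, out of coverage, and ineligible at once, while unflagged rows keep their original values.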

Masking:
- If you started from a de-identified copy of real claims, mock/mask remaining identifiers while keeping relationships so that edge cases are still context-rich.

Added value

Massive coverage of “what-if” conditions without having to wait for those rare cases in production; key for regression testing and rules engines.
Regulatory / audit confidence: you can prove that all critical business rules, alerts and exception flows have been tested against realistic but privacy-safe data.
Balanced datasets for QA and ML: AI synthesis covers privacy and keeps realistic distributions, while rule-based generators oversample edge cases where you need them (e.g. fraud, high-cost outliers).
