# Example data generation scenarios

The examples below show how to combine Syntho generators to produce the data your use case needs. Keep the following in mind:

{% hint style="info" %}
For multi-table setups:

**AI synthesis** works best on **single tables**.

If you need to preserve **cross-table relationships**, replace the AI synthesis examples below with **duplicate** or with **mask/mock/rule-based generators** that use **consistent mapping**.
{% endhint %}
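
To make "consistent mapping" concrete, here is a minimal Python sketch of the idea outside Syntho: the same input value always maps to the same masked value, so joins between tables survive masking. The table names, the `pseudonymize` helper, and the demo key are all hypothetical and only illustrate the principle; they are not Syntho's implementation.

```python
import hashlib
import hmac

# Hypothetical demo key; a real setup would use a managed secret.
SECRET_KEY = b"demo-only-key"


def pseudonymize(value, key=SECRET_KEY, length=16):
    """Return the same one-way token for the same input every time."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Two hypothetical tables that share a key column.
customers = [{"customer_id": "C-1001"}, {"customer_id": "C-1002"}]
orders = [
    {"customer_id": "C-1001", "order_no": 1},
    {"customer_id": "C-1001", "order_no": 2},
    {"customer_id": "C-1002", "order_no": 1},
]

masked_customers = [
    {**row, "customer_id": pseudonymize(row["customer_id"])} for row in customers
]
masked_orders = [
    {**row, "customer_id": pseudonymize(row["customer_id"])} for row in orders
]

# Every masked order should still point at a masked customer.
known_ids = {row["customer_id"] for row in masked_customers}
orphans = [row for row in masked_orders if row["customer_id"] not in known_ids]
```

If `orphans` is empty, the masked tables can still be joined on `customer_id` even though the original identifiers are no longer visible.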

### 1. Deterministic relations (e.g. male → male name)

#### Scenario

Hospital EHR table `patients` with:

* `patient_id`
* `gender` (M/F/X)
* `first_name`, `last_name`
* `date_of_birth`
* `diagnosis_code`
* `visit_count`

#### How you combine Syntho methods

* **AI synthesis** for:
  * `date_of_birth`, `diagnosis_code`, `visit_count`, and other clinical / behavioral fields, so you preserve realistic age distributions, ICD code combinations, visit patterns, etc., without 1-to-1 linkage to real people.
* **Mockers + calculated columns** for deterministic gender–name relation:
  * Use a **Calculated Column** with mockers for `first_name` (and an equivalent formula for `last_name` if last names depend on gender in your data):\
    `IF([gender] = 'M', MOCK_FIRST_NAME_MALE, IF([gender] = 'F', MOCK_FIRST_NAME_FEMALE, MOCK_FIRST_NAME_UNISEX))`
* **Masking** for IDs:
  * Hash `patient_id` so you can still join across tables but never see the real identifier.
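
The gender to name rule above can be checked on exported data with a few lines of Python. This is a sketch under stated assumptions: the name pools and the `mock_first_name` helper are hypothetical stand-ins for the male, female, and unisex mockers, not Syntho functions.

```python
import random

# Hypothetical name pools standing in for the gender-specific mockers.
NAME_POOLS = {
    "M": ["James", "Omar", "Lucas"],
    "F": ["Maria", "Aisha", "Emma"],
}
UNISEX = ["Alex", "Robin", "Sam"]


def mock_first_name(gender, rng):
    """Mirror the nested IF: M -> male pool, F -> female pool, else unisex."""
    return rng.choice(NAME_POOLS.get(gender, UNISEX))


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = [{"gender": g} for g in ["M", "F", "X", "F", "M"]]
for row in rows:
    row["first_name"] = mock_first_name(row["gender"], rng)

# A violation is a name drawn from a pool that does not match the gender.
violations = [
    r for r in rows
    if r["first_name"] not in NAME_POOLS.get(r["gender"], UNISEX)
]
```

An empty `violations` list is the property the calculated column is meant to guarantee for every generated row.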

#### Added value

* **Business rule testing**: devs can test logic like “if gender = female, show pregnancy-related questions” on fully synthetic data that *always* respects the gender–name relation.
* **Realistic UX demos**: clinicians and product owners see plausible names matching gender, not nonsense combinations that break trust.

### 2. Absolute calculations (e.g. revenue – costs = profit)

#### Scenario

Banking `accounts` table:

* `account_id`
* `revenue`
* `costs`
* `profit`
* `segment`, `region`, etc.

In real systems, `profit = revenue - costs`, and many downstream rules rely on that identity.

#### How you combine Syntho methods

* **AI synthesis**:
  * Use AI synthesis to generate realistic **joint distributions** of `revenue`, `costs`, `segment`, `region`, churn risk, etc. (statistically similar to the real portfolio, but with no 1:1 linkage).
* **Rule-based calculated columns**:
  * Overwrite `profit` with a **Calculated Column**:\
    `[revenue] - [costs]`
  * Optionally add extra business logic:\
    `IF([revenue] - [costs] < 0, 0, [revenue] - [costs])` for scenarios where profit is stored as 0 instead of negative.
* **Masking / hashing**:
  * Hash `account_id` and mask PII (e.g. `IBAN`) while preserving format.
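
As a sanity check on generated output, the accounting identity can be verified outside Syntho. The sketch below is illustrative only: the `profit` helper and sample rows are hypothetical, and `Decimal` is used so the comparison is exact rather than subject to floating-point rounding.

```python
from decimal import Decimal


def profit(revenue, costs, floor_at_zero=False):
    """Compute revenue - costs, optionally storing losses as 0."""
    result = revenue - costs
    if floor_at_zero and result < 0:
        return Decimal("0")
    return result


# Hypothetical sample rows: one profitable account, one loss-making account.
rows = [
    {"revenue": Decimal("1200.50"), "costs": Decimal("800.25")},
    {"revenue": Decimal("300.00"), "costs": Decimal("450.00")},
]
for row in rows:
    row["profit"] = profit(row["revenue"], row["costs"])
    row["profit_floored"] = profit(row["revenue"], row["costs"], floor_at_zero=True)

# Rows where the stored profit does not match the identity.
inconsistent = [r for r in rows if r["profit"] != r["revenue"] - r["costs"]]
```

The first variant mirrors `[revenue] - [costs]`; the floored variant mirrors the optional rule that stores 0 instead of a negative profit.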

#### Added value

* **Guaranteed accounting consistency**: AI handles realistic ranges and correlations; rule-based logic guarantees accounting identities exactly match business rules so regression tests never fail on “impossible” numbers.
* **Stress testing analytics**: you can safely share this data with external consultants / vendors; they can calculate margins, perform profitability analysis and validate models, knowing the math holds.
* **Explaining model results**: analysts can cross-check their profitability dashboards on synthetic data that behaves exactly like production in terms of formulas.

### 3. Hierarchical relationship (City > Province > Country)

#### Scenario

Government benefits system:

* `citizen_id`
* `city`
* `province`
* `country`
* `benefit_type`
* `benefit_amount`

City, province and country must **always** form valid combinations.

#### How you combine Syntho methods

* **AI synthesis**:
  * Use AI synthesis on the *entity table* to generate realistic patterns for `benefit_type`, `benefit_amount`, `city`, demographics, etc.
* **Rule-based + mockers for geography hierarchy**:
  * Keep a real or mock **reference table** `geo_dim`:
    * `city`, `province`, `country` with valid combinations.
  * Use **Calculated Columns** or lookups to set `province` and `country` based on `city`:
    * e.g.

      ```
      SWITCH(UPPER(TRIM([CITY])), "TORONTO", "ONTARIO", "MONTREAL", "QUEBEC", ...)
      ```
* **Masking**:
  * For privacy, mask or aggregate small cities into “Other city (Province)” to avoid re-identification, while still keeping hierarchical consistency.
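
The lookup against a reference table can be sketched in Python as follows. The `GEO_DIM` mapping and the `resolve_geo` helper are hypothetical; they reuse the Toronto and Montreal examples from the formula above and normalize the city the same way (`UPPER(TRIM(...))`).

```python
from typing import Optional, Tuple

# Hypothetical reference table: city -> (province, country).
GEO_DIM = {
    "TORONTO": ("ONTARIO", "CANADA"),
    "MONTREAL": ("QUEBEC", "CANADA"),
}


def resolve_geo(city: str) -> Optional[Tuple[str, str]]:
    """Normalize the city name and look up its province and country."""
    return GEO_DIM.get(city.strip().upper())


rows = [{"city": " Toronto "}, {"city": "montreal"}, {"city": "Atlantis"}]
for row in rows:
    match = resolve_geo(row["city"])
    row["province"], row["country"] = match if match else (None, None)

# Cities missing from the reference table need a fallback rule.
unresolved = [r["city"] for r in rows if r["province"] is None]
```

Listing `unresolved` cities makes gaps in the reference table visible before they turn into invalid city, province, and country combinations.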

#### Added value

* **Location-based logic behaves correctly** (regional eligibility rules, tax brackets, language settings), because city–province–country combos are always valid.
* **Geospatial analytics** (heatmaps, per-province dashboards) remain meaningful on synthetic data, enabling external sharing without privacy issues.
* **Complex test scenarios**: QA can create rule-based test cases (e.g. cross-border regions, provinces with special rules) without messing up global statistics that AI synthesis preserves.

### 4. New data creation

#### Scenario

You’re launching a **new SaaS product** and don’t have real customers yet, but you want:

* A realistic multi-tenant database for demos and end-to-end integration testing.
* Later, once you have data, to augment it with AI-synthesized rows.

#### How you combine Syntho methods

1. **Phase 1 – no data yet (pure rule-based / mock)**
   * Use **Mock generators** to generate:
     * `company_name`, `user_email`, `first_name`, `last_name`, etc.
   * **Rule-based calculated columns**:
     * Column: tenant\_id
       * Set generator in column settings to Key generators → Generate. (Creates unique synthetic keys; no formula needed.)
     * Column: user\_email

       ```
       // Email per user, built from the generated names and a mocked free email domain
       LOWER(CONCATENATE([FIRST_NAME], ".", [LAST_NAME], "@", MOCK_FREE_EMAIL_DOMAIN_0))
       ```
     * Column: role (weighted distribution)

       ```
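       // Target: ~10% OWNER, 20% ADMIN, 70% MEMBER.
       // Note: if each RANDBETWEEN call draws a new number, the actual split is
       // closer to 10% / 27% / 63%; compare one stored draw for exact weights.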
       IFS(
         RANDBETWEEN(1,100) <= 10,  "OWNER",
         RANDBETWEEN(1,100) <= 30,  "ADMIN",
         TRUE,                      "MEMBER"
       )
       ```
     * Column: signup\_date (recent signups)

       ```
       DATEADD(TODAY(), -RANDBETWEEN(0, 90), "day")
       ```
     * Column: trial\_end\_date = signup\_date + 14 days

       ```
       DATEADD([signup_date], 14, "day")
       ```
     * Column: plan\_tier (map by company size proxy)

       ```
       // Random tier assignment standing in for a company-size proxy
       SWITCH(TRUE,
         RANDBETWEEN(1,100) <= 10, "ENTERPRISE",
         RANDBETWEEN(1,100) <= 40, "PRO",
         "FREE"
       )
       ```
2. **Phase 2 – once you have initial data**:
   * Optionally, train **AI synthesis** on the production entity table (per tenant or global) to create more users, sessions, transactions, etc., preserving correlations (usage patterns, feature adoption).
   * Combine with **calculated columns** to enforce specific business rules (e.g. multi-tenant isolation, SLA tiers).
     * Column: sla\_tier (carry forward or rebalance)

       ```
       // Keep existing if present, else assign
       IF(ISNULL([sla_tier]),
          IFS(RANDBETWEEN(1,100)<=10,"GOLD", RANDBETWEEN(1,100)<=40,"SILVER", TRUE,"BRONZE"),
          [sla_tier]
       )
       ```
     * Column: response\_sla\_hours (business rule from SLA tier)

       ```
       SWITCH(UPPER(TRIM([sla_tier])),
         "GOLD",   4,
         "SILVER", 8,
         "BRONZE", 24,
         24
       )
       ```
     * Column: trial\_end\_date (ensure consistent with signup\_date from mixed sources)

       ```
       IF(ISNULL([trial_end_date]),
          DATEADD([signup_date], 14, "day"),
          [trial_end_date]
       )
       ```
     * Column: environment\_flag (separate demo/test tenants)

       ```
       // Oversample demo tenants for testing flows
       IFS(
         RANDBETWEEN(1,100) <= 20, "DEMO",
         TRUE,                    "PROD"
       )
       ```
     * Column: tenant\_isolation\_guard (hard guard rails)

       ```
       // Example: block cross-tenant sharing flag in demo data
       IF([environment_flag]="DEMO", FALSE, [sharing_enabled])
       ```
3. **Add masking when seeding from production**:
   * If you seed from a production snapshot, use mask, mock, or calculated-column generators for PII, and AI synthesis, duplicate, mask, mock, or calculated columns for behavioral attributes.
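
The weighted assignments in the phases above (`role`, `plan_tier`, `sla_tier`) chain several `RANDBETWEEN(1,100)` comparisons. Whether each call is evaluated as a separate random draw is an assumption here, not something this page confirms. The Python sketch below, with hypothetical helper names, shows why it matters: one shared draw gives the intended weights, while independent draws shift them.

```python
import random


def assign_role_single_draw(rng):
    """One draw per row, compared against cumulative thresholds."""
    roll = rng.randint(1, 100)
    if roll <= 10:
        return "OWNER"
    if roll <= 30:
        return "ADMIN"
    return "MEMBER"


def assign_role_independent_draws(rng):
    """A fresh draw per condition, as if each RANDBETWEEN call were separate."""
    if rng.randint(1, 100) <= 10:
        return "OWNER"
    if rng.randint(1, 100) <= 30:
        return "ADMIN"
    return "MEMBER"


def shares(assign, n=100_000, seed=7):
    """Simulate n rows and return the observed share of each role."""
    rng = random.Random(seed)
    counts = {"OWNER": 0, "ADMIN": 0, "MEMBER": 0}
    for _ in range(n):
        counts[assign(rng)] += 1
    return {role: count / n for role, count in counts.items()}


single = shares(assign_role_single_draw)            # roughly 10% / 20% / 70%
independent = shares(assign_role_independent_draws)  # roughly 10% / 27% / 63%
```

If exact weights matter, one option is to store a single random draw in a helper column and compare every condition against that column.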

#### Added value

* **Instantly available realistic test DB** even before going live.
* As you grow, **AI synthesis scales up data volume** with realistic patterns, while rule-based generators maintain product-specific rules and identities (tenants, subscription tiers).

### 5. Edge case and rare scenario creation

#### Scenario

Health insurance claims system where:

* 99.9% of cases are “normal”
* But test teams need **lots of edge cases**:
  * extremely high claim amounts
  * rare combinations of diagnosis codes
  * weird date patterns (backdated claims, overlapping coverage)
  * special product types with exception logic

#### How you combine Syntho methods

* **AI synthesis**:
  * Generate the **normal background population** of claims based on your source data: realistic volumes, distributions, seasonal patterns, co-occurrence of diagnoses & procedures.
* **Rule-based edge case injection**:
  * Use **Calculated Columns** and **Mockers** to override or add specific rare scenarios on top of AI synthetic data:
    * Selector for 0.1% edge rows.

      ```
      // New column: EDGE_FLAG (TRUE for ~0.1% of rows)
      RAND() < 0.001
      ```
    * Extremely high claim amounts

      ```
      // Column: claim_amount
      IF([EDGE_FLAG],
         RANDBETWEEN(100000, 500000),
         [claim_amount]
      )
      ```
    * Rare diagnosis codes (weighted)

      ```
      // Column: diagnosis_code
      IFS(
        AND([EDGE_FLAG], RAND() < 0.33),      "E75.5",           // lysosomal storage d/o (example)
        AND([EDGE_FLAG], RAND() < 0.66),      "G12.2",           // motor neuron d/o (example)
        AND([EDGE_FLAG], TRUE),               "D42.0",           // rare tumor (example)
        TRUE,                                 [diagnosis_code]   // default keep
      )
      ```
    * Force product/diagnosis combos that trigger rules

      ```
      // Column: product_type
      IF([EDGE_FLAG], "SPECIAL_X", [product_type])
      ```

      ```
      // Column: diagnosis_code (pair with SPECIAL_X)
      IF([EDGE_FLAG], "Q87.1", [diagnosis_code])
      ```
    * Weird date patterns: backdated claims

      ```
      // Column: claim_date (move back 30–365 days)
      IF([EDGE_FLAG],
         DATEADD([claim_date], -RANDBETWEEN(30, 365), "day"),
         [claim_date]
      )
      ```
    * Weird date patterns: coverage ends before the claim date

      ```
      // Column: coverage_end_date (ensure < claim_date to trigger exception)
      IF([EDGE_FLAG],
         DATEADD([claim_date], -RANDBETWEEN(1, 30), "day"),
         [coverage_end_date]
      )
      ```
    * Flip eligibility flags or set error codes

      ```
      // Column: eligible
      IF([EDGE_FLAG], FALSE, [eligible])
      ```

      ```
      // Column: error_code
      IF([EDGE_FLAG], "E_RULE_123", [error_code])
      ```
* **Masking**:
  * If you started from a de-identified copy of real claims, mock/mask remaining identifiers while keeping relationships so that edge cases are still context-rich.
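
The injection pattern above (compute one flag per row, then let every override read that flag) can be sketched in Python to check the resulting invariants. The `inject_edge_cases` helper, the sample claims, and the field names beyond those in the formulas are hypothetical and only illustrate the approach.

```python
import random
from datetime import date, timedelta


def inject_edge_cases(claims, rate=0.001, seed=11):
    """Flag roughly `rate` of the rows once, then override fields on flagged rows."""
    rng = random.Random(seed)
    result = []
    for claim in claims:
        row = dict(claim)  # copy so the background data stays untouched
        row["edge_flag"] = rng.random() < rate
        if row["edge_flag"]:
            row["claim_amount"] = rng.randint(100_000, 500_000)
            # Coverage ends 1-30 days before the claim to trigger exception logic.
            row["coverage_end_date"] = row["claim_date"] - timedelta(
                days=rng.randint(1, 30)
            )
            row["eligible"] = False
        result.append(row)
    return result


# Hypothetical "normal" background population of claims.
base = [
    {
        "claim_amount": 250,
        "claim_date": date(2024, 1, 1) + timedelta(days=i % 365),
        "coverage_end_date": date(2025, 12, 31),
        "eligible": True,
    }
    for i in range(20_000)
]
synthetic = inject_edge_cases(base)
edge_rows = [r for r in synthetic if r["edge_flag"]]
normal_rows = [r for r in synthetic if not r["edge_flag"]]
```

Because the flag is computed once per row, all overrides land on the same rows, which keeps each edge case internally coherent (high amount, expired coverage, and ineligible together).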

#### Added value

* **Massive coverage of “what-if” conditions** without having to wait for those rare cases in production; key for regression testing and rules engines.
* **Regulatory / audit confidence**: you can prove that all critical business rules, alerts and exception flows have been tested against realistic but privacy-safe data.
* **Balanced datasets for QA and ML**: AI synthesis covers privacy and keeps realistic distributions, while rule-based generators inject edge-case oversampling where you need it (e.g. fraud, high-cost outliers).

