# Use Case 6: ML model development

Choose this use case when you need synthetic feature datasets for ML model development.

### What problem this use case solves

Teams need datasets for model development and validation. Data may be scarce, sensitive, or slow to access.

Classic anonymization can reduce the statistical utility needed for ML. It can also leave indirect signals in the data that remain privacy-sensitive.

### When to choose this use case

Pick this when you build ML models and need statistical utility.

If you’re unsure, start with **Synthesize all** on a single training table (entity table or view) and run the [QA report](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/qa-report.md) before training.

* You need synthetic feature datasets for training and evaluation.
* You want new rows without 1:1 links to real people.
* You can train on an entity table or training view.
* You want privacy-safe iteration without production access.

### When to avoid this use case

Skip this when you need strict correctness or reversibility.

* You need deterministic, join-correct multi-table datasets for reconciliation or regression assertions. Use [Use Case 4: ETL & Data Pipeline Testing](/overview/get-started/use-cases-and-configuration/use-case-4-etl-and-data-pipeline-testing.md).
* You need 100% constraint adherence or stable pseudonyms that can be traced across tables. Use [Use Case 1: Application & API Testing](/overview/get-started/use-cases-and-configuration/use-case-1-application-and-api-testing.md).
* You need an analyst sandbox for exploration and BI. Use [Use Case 7: Analytics Sandboxes](/overview/get-started/use-cases-and-configuration/use-case-7-analytics-sandboxes.md).
* You need dev data for feature work, not modeling utility. Prefer mock-first generation and a small scope.

### Recommended Syntho configuration

This setup is optimized for **model development utility with strong privacy**. You generate new rows. You avoid any 1:1 link to original records.

{% stepper %}
{% step %}

#### Prerequisites

**Checklist**

* [ ] Training table or view is defined (one row per entity).
* [ ] Input, target, and leakage columns are decided.
* [ ] Direct identifiers are removed from features.

{% hint style="warning" %}
Avoid leakage. Exclude post-outcome timestamps and human decisions from features.
{% endhint %}
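
The checklist above can be captured as a small, reviewable mapping before any workspace is configured. The following is a minimal sketch, not a Syntho feature: the column names, the role labels, and both helper functions are hypothetical and only illustrate how to make the input/target/leakage decision explicit.

```python
# Hypothetical column roles for a training view; every name here is illustrative.
COLUMN_ROLES = {
    "customer_id": "identifier",
    "tenure_months": "input",
    "spend_90d": "input",
    "churned": "target",
    "churn_processed_at": "leakage",  # post-outcome timestamp
    "agent_decision": "leakage",      # human decision made after the outcome
}


def feature_columns(roles):
    """Return the columns that are classified as model inputs."""
    return sorted(col for col, role in roles.items() if role == "input")


def unassigned_columns(view_columns, roles):
    """Return view columns that have no role assigned yet."""
    return sorted(set(view_columns) - set(roles))


if __name__ == "__main__":
    view_columns = list(COLUMN_ROLES) + ["signup_channel"]
    print("features:", feature_columns(COLUMN_ROLES))
    print("needs a decision:", unassigned_columns(view_columns, COLUMN_ROLES))
```

Keeping the mapping in version control gives reviewers one place to check that identifiers and leakage columns never reach the feature list.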

* Use the [Prerequisites](/overview/get-started/prerequisites.md) checklist.
* Follow [AI synthesis: Data pre-processing](/overview/get-started/syntho-bootcamp/10.-ai-synthesis-data-pre-processing-when-using.md) when the source is not an entity table yet.
  {% endstep %}

{% step %}

#### Source & destination management

Create one workspace per feature dataset or model track. This keeps training and evaluation reproducible.

Use separate workspaces for different privacy settings. Privacy settings are part of your model governance.

#### Baseline rules

* Keep the **source stable**. Prefer snapshots or backups.
* Avoid a **live production** source for iterative work.
* Keep the **destination isolated**. Never write into production.
* Keep **schemas aligned** between source, workspace and destination.
* Use **views** when you need only a subset of the original database.

#### Lifecycle rule of thumb

* Keep the source connection when you expect schema changes.
* Remove the source connection when you don’t expect another run for a long time.
* Revalidate after schema changes. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).

**Nuances for this use case**

* Use a view to reduce leakage risk. Keep post-outcome fields out of the training cut.
* Prefer clean dataset versioning (`features_v1`, `features_v2`). Avoid a shared “analytics” schema that loses lineage.
* Don’t default to de-identification. For ML, you often need stronger unlinkability than de-identification provides.
* [Create a workspace](/setup-workspaces/create-a-workspace.md)
  {% endstep %}

{% step %}

#### Configure generators

**Workspace initialization mode**

Choose a [workspace mode](/setup-workspaces/create-a-workspace/workspace-modes.md). It applies baseline generator suggestions during workspace creation.

Recommended modes for this use case:

* **Synthesize all** for model development datasets (best default when you have an entity table).
* **De-identify** only when you must preserve multi-table behavior and don’t need maximum unlinkability.
* **From scratch** when you’re curating a very specific feature table and want manual control.

**AI-generated synthesis**

This is the primary method here. Use it when you need **statistical utility** and **strong privacy** without 1:1 record links.

**Example (training table via view):** build `training_entity_view` (features + label), then apply **AI synthesize** to generate a training dataset for modeling. Run the [QA report](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/qa-report.md) before training.

**Rule-based generation**

Use this to enforce **feature constraints**, **bucketing**, or **label logic** that must be explicit (or to remove leakage). Use [Calculated columns](/configure-a-data-generation-job/configure-column-settings/calculated-columns.md) for transparent, auditable rules.

**Example (leakage-safe bucketing):** create a calculated `age_band` (`0-17`, `18-34`, `35-54`, `55+`) and drop raw `date_of_birth`. Train on the band to reduce leakage and privacy risk.

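```excel-formula
// New column: age_band (train on band, then exclude raw date_of_birth)
// Note: the year difference approximates age. It ignores whether the
// birthday has already passed this year, and it depends on the run date.
IFS(
  YEAR(TODAY()) - YEAR([date_of_birth]) < 18, "0-17",
  YEAR(TODAY()) - YEAR([date_of_birth]) < 35, "18-34",
  YEAR(TODAY()) - YEAR([date_of_birth]) < 55, "35-54",
  TRUE, "55+"
)
```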
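
The formula above subtracts calendar years, so a person can land one band too high shortly before their birthday. If exact boundaries matter for your model, a reference implementation outside the platform can help you spot-check band assignments. This is a minimal sketch in Python; the function names are hypothetical, and passing `today` explicitly is an assumption made here to keep results reproducible.

```python
from datetime import date


def age_in_years(date_of_birth, today):
    """Completed years between date_of_birth and today."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


def age_band(date_of_birth, today):
    """Map a date of birth to the same bands used by the calculated column."""
    age = age_in_years(date_of_birth, today)
    if age < 18:
        return "0-17"
    if age < 35:
        return "18-34"
    if age < 55:
        return "35-54"
    return "55+"


if __name__ == "__main__":
    dob = date(2000, 6, 15)
    # The day before the 18th birthday versus the birthday itself.
    print(age_band(dob, date(2018, 6, 14)))
    print(age_band(dob, date(2018, 6, 15)))
```

Fixing `today` to the snapshot date of the source keeps the bands stable across reruns.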

**Masking**

Use this only for columns that must stay **format-valid** for downstream tooling. Avoid masking identifiers for ML unless strictly required.

**Example (pipeline contract):** if your training pipeline validates an `email` format, apply **Mask → Email** but exclude the column from the model features. Keep it as a non-training field for compatibility only.
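
On the consuming side, the pipeline contract and the feature selection can stay separate. The sketch below assumes a deliberately loose email pattern and a hypothetical `NON_TRAINING_COLUMNS` list. It is not a full address validator and not part of Syntho.

```python
import re

# Loose format check: something@something.tld, no whitespace. Illustrative only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Columns kept for pipeline compatibility but never used as features (hypothetical).
NON_TRAINING_COLUMNS = ("email",)


def passes_contract(row):
    """Check that the masked email still looks format-valid."""
    return bool(EMAIL_PATTERN.match(row["email"]))


def to_features(row, non_training=NON_TRAINING_COLUMNS):
    """Drop compatibility-only columns before the row reaches the model."""
    return {key: value for key, value in row.items() if key not in non_training}


if __name__ == "__main__":
    row = {"email": "user123@example.com", "spend_90d": 420.0}
    print(passes_contract(row), to_features(row))
```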

**Hybrid**

Use this when you want AI synthesis for utility, plus explicit rules for stability and governance.

**Example (utility + business rules):** AI synthesize core features, then add a deterministic segmentation flag based on a fixed threshold (an “absolute calculation” rather than a learned pattern).

```excel-formula
// New column: is_high_value (business rule segment)
IF([spend_90d] >= 1000, TRUE, FALSE)
```

**Minimal configuration steps**

1. Build `training_entity_view` (one row per entity).
2. Apply **AI synthesize** and validate with the [QA report](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/qa-report.md).
3. Add calculated columns for bucketing or governance flags only.

* [Automatic PII discovery with PII scanner](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md)
* [Manage personally identifiable information (PII)](/configure-a-data-generation-job/manage-personally-identifiable-information-pii.md)

<details>

<summary>Optional: feature engineering</summary>

* Prefer engineered features over raw identifiers and raw notes.
* Drop or recompute derived columns to avoid leakage.
* If you keep raw text, use [Free text de-identification](/overview/get-started/syntho-bootcamp/5.-generators/free-text-de-identification.md) and scope it tightly.

If your real data is relational, create a training view first. See [Use SQL views as input tables](/setup-workspaces/create-a-workspace/use-sql-views-as-input-tables.md).

</details>
{% endstep %}

{% step %}

#### Handle keys and relationships (relational schemas)

This use case typically trains on a **single entity table**. If you already have that table (or a view), you can skip PK/FK configuration.

If your source is relational, decide what becomes the entity table.

Use [Cross-table relationships limitations](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/table-relationships.md) to decide whether to reshape to a single entity table or use de-identification for relationship-heavy schemas.
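
Reshaping to a single entity table usually means aggregating child rows into one row per entity. In practice you would typically do this in a SQL view. The Python sketch below only illustrates the target shape, and the tables `customers` and `orders` and all column names are hypothetical.

```python
from collections import defaultdict


def build_entity_rows(customers, orders):
    """Aggregate order rows into one feature row per customer."""
    totals = defaultdict(lambda: {"order_count": 0, "order_total": 0.0})
    for order in orders:
        agg = totals[order["customer_id"]]
        agg["order_count"] += 1
        agg["order_total"] += order["amount"]

    rows = []
    for customer in customers:
        # Customers without orders still get a row, with zeroed aggregates.
        agg = totals.get(customer["customer_id"], {"order_count": 0, "order_total": 0.0})
        rows.append({**customer, **agg})
    return rows


if __name__ == "__main__":
    customers = [{"customer_id": 1, "segment": "a"}, {"customer_id": 2, "segment": "b"}]
    orders = [{"customer_id": 1, "amount": 10.0}, {"customer_id": 1, "amount": 5.5}]
    for row in build_entity_rows(customers, orders):
        print(row)
```

The key property is that the output has exactly one row per entity, which is the shape AI synthesis expects in this use case.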
{% endstep %}

{% step %}

#### Validate and sync

Run the [QA report](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/qa-report.md) when available. Use it to sanity-check utility and privacy before training.
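
If you want an additional, independent check on a single categorical feature, you can compare its distribution in the source and in the synthetic output. The sketch below computes the total variation distance between the two. It is not how the QA report works and does not replace it; the function name and sample values are illustrative.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Half the sum of absolute differences between category frequencies (0 = identical, 1 = disjoint)."""
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real = Counter(real_values)
    synth = Counter(synthetic_values)
    n_real = len(real_values)
    n_synth = len(synthetic_values)
    categories = set(real) | set(synth)
    return 0.5 * sum(abs(real[c] / n_real - synth[c] / n_synth) for c in categories)


if __name__ == "__main__":
    real = ["basic", "basic", "premium", "basic"]
    synthetic = ["basic", "premium", "premium", "basic"]
    print(total_variation_distance(real, synthetic))
```

What counts as an acceptable distance depends on the feature and the model, so agree on a threshold before you look at the result.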

If you update the schema or feature set, revalidate. Small schema changes can invalidate a model comparison.
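
To notice schema drift between two dataset versions before you compare models, a simple column-level diff is often enough. This is a minimal sketch that assumes you can export each schema as a mapping of column name to type. That export format and the function name are assumptions, not a Syntho API.

```python
def schema_diff(old, new):
    """Compare two {column: type} mappings and report what changed."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(col for col in shared if old[col] != new[col]),
    }


if __name__ == "__main__":
    features_v1 = {"tenure_months": "int", "spend_90d": "float"}
    features_v2 = {"tenure_months": "float", "spend_90d": "float", "age_band": "str"}
    print(schema_diff(features_v1, features_v2))
```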
{% endstep %}

{% step %}

#### Tune generation settings

Tune for training stability. Prefer fewer reruns with stable outputs over maximum speed.

Apply [Additional privacy controls](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/privacy-controls.md) before publishing datasets outside the model team.
{% endstep %}
{% endstepper %}

### Common pitfalls & misconfigurations

#### Use-case specific pitfalls

* Starting AI synthesis without an entity-table style dataset.
* Expecting AI synthesis to preserve consistency across multiple systems.
* Treating QA results as optional when the output is used for model validation.
* Training on redundant or derived columns (e.g. totals derived from components).
  * Remove derived columns first. See [AI synthesize](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation.md).
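
A quick way to find derived columns is to test whether a candidate total equals the sum of its components on every source row. The sketch below is illustrative: the column names are hypothetical and the tolerance is an assumption you should adapt to your data.

```python
import math


def is_derived_total(rows, total_col, component_cols, rel_tol=1e-9):
    """Return True when total_col equals the sum of component_cols on every row."""
    if not rows:
        return False  # No evidence either way on an empty sample.
    return all(
        math.isclose(row[total_col], sum(row[col] for col in component_cols), rel_tol=rel_tol)
        for row in rows
    )


if __name__ == "__main__":
    rows = [
        {"net": 80.0, "tax": 20.0, "total": 100.0},
        {"net": 10.0, "tax": 2.5, "total": 12.5},
    ]
    print(is_derived_total(rows, "total", ["net", "tax"]))
```

If the check passes, drop the total before synthesis and recompute it afterwards with a calculated column.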

<details>

<summary>General pitfalls</summary>

These pitfalls show up in most projects:

* Running full-scale jobs before a small validation run.
* Skipping workspace validation/sync after schema changes. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).
* Breaking relational integrity (missing PK/FK setup, missing foreign keys, missing virtual foreign keys). Start with [Manage foreign keys](/configure-a-data-generation-job/manage-foreign-keys.md) and [virtual foreign keys](/configure-a-data-generation-job/manage-foreign-keys/add-virtual-foreign-keys/add-virtual-foreign-keys.md).
* Leaving sensitive columns on [**Duplicate**](/configure-a-data-generation-job/configure-column-settings/duplicate.md), or trusting the [PII scan](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md) without reviewing false positives/negatives.
* Overusing [**Consistent mapping**](/configure-a-data-generation-job/configure-column-settings/consistent-mapping.md) (it slows down data generation and increases linkability).

</details>

### Governance, compliance, and automation

#### Use-case specific recommendations

* Version datasets like model inputs (`features_v1`, `features_v2`). Store generation settings + QA report with the experiment.
* Separate training and evaluation datasets. Don’t generate both from the same workspace settings unless that is a deliberate choice.
* Gate model training on a QA review (utility + privacy sanity check). Capture acceptance criteria in the ticket.
* If outputs leave the ML team, require an explicit privacy review and apply additional privacy controls before distribution.
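
One lightweight way to store generation settings and the QA report with an experiment is a small manifest that records the dataset version plus content hashes. The sketch below is an assumption about how you might do this in your own tooling. The settings dictionary is hypothetical and does not reflect a Syntho export format.

```python
import hashlib
import json


def build_manifest(dataset_version, generation_settings, qa_report_bytes):
    """Bundle the dataset version, settings, and hashes for experiment tracking."""
    # Sorting keys makes the hash independent of dictionary ordering.
    settings_json = json.dumps(generation_settings, sort_keys=True)
    return {
        "dataset_version": dataset_version,
        "generation_settings": generation_settings,
        "settings_sha256": hashlib.sha256(settings_json.encode("utf-8")).hexdigest(),
        "qa_report_sha256": hashlib.sha256(qa_report_bytes).hexdigest(),
    }


if __name__ == "__main__":
    settings = {"mode": "synthesize_all", "table": "training_entity_view"}  # hypothetical
    manifest = build_manifest("features_v1", settings, b"placeholder QA report bytes")
    print(json.dumps(manifest, indent=2))
```

Storing the manifest next to the experiment lets you confirm later that two model runs used the same dataset configuration.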

<details>

<summary>General recommendations</summary>

Use these recommendations for most workspaces.

#### Ownership and change control

* Assign a single **workspace owner** (data steward / privacy lead / DBA).
* Require a ticket or change request for generator changes.
* Duplicate a workspace before large edits. Keep the previous version as rollback.

#### Access control

* Default to **read-only** access for source connections.
* Restrict **who can view source data** in the UI.
* Use separate workspaces per environment or audience.

#### Automation (baseline)

* Use the [Syntho REST API](/syntho-api/syntho-rest-api.md) to standardize scans and runs.
* Automate data generation, not workspace configuration.
* Keep job logs for failed runs. This reduces back-and-forth during support.

</details>

