Use Case 6: ML model development
Generate feature datasets when real data is scarce or sensitive.
Use this use case when you need synthetic feature datasets for ML development.
What problem this use case solves
Teams need datasets for model development and validation. Data may be scarce, sensitive, or slow to access.
Classic anonymization can reduce the statistical utility needed for ML. It can also keep indirect signals that are still privacy-sensitive.
When to choose this use case
Pick this when you build ML models and need statistical utility.
If you’re unsure, start with Synthesize all on a single training table (entity table or view) and run the QA report before training.
You need synthetic feature datasets for training and evaluation.
You want new rows without 1:1 links to real people.
You can train on an entity table or training view.
You want privacy-safe iteration without production access.
When to avoid this use case
Skip this when you need strict correctness or reversibility.
You need deterministic, join-correct multi-table datasets for reconciliation or regression assertions. Use Use Case 4: ETL & Data Pipeline Testing.
You need 100% constraint adherence or stable pseudonyms that can be traced across tables. Use Use Case 1: Application & API Testing.
You need an analyst sandbox for exploration and BI. Use Use Case 7: Analytics Sandboxes.
You need dev data for feature work, not modeling utility. Prefer mock-first generation and a small scope.
Recommended Syntho configuration
This setup is optimized for model development utility with strong privacy. You generate new rows. You avoid any 1:1 link to original records.
Prerequisites
Checklist
Avoid leakage. Exclude post-outcome timestamps and human decisions from features.
Use the Prerequisites checklist.
Follow AI synthesis: Data pre-processing when the source is not an entity table yet.
Source & destination management
Create one workspace per feature dataset or model track. This keeps training and evaluation reproducible.
Use separate workspaces for different privacy settings. Privacy settings are part of your model governance.
Baseline rules
Keep the source stable. Prefer snapshots or backups.
Avoid a live production source for iterative work.
Keep the destination isolated. Never write into production.
Keep schemas aligned between source, workspace and destination.
Use views when you need only a subset of the original database.
Lifecycle rule of thumb
Keep the source connection when you expect schema changes.
Remove the source connection when you don’t expect another run for a long time.
Revalidate after schema changes. Use Validate and synchronize workspace.
Nuances for this use case
Use a view to reduce leakage risk. Keep post-outcome fields out of the training cut.
Prefer clean dataset versioning (features_v1, features_v2). Avoid a shared “analytics” schema that loses lineage.
Don’t default to de-identification. For ML, you often need stronger unlinkability than de-identification provides.
Configure generators
Workspace initialization mode
Choose a workspace mode. It applies baseline generator suggestions during workspace creation.
Recommended modes for this use case:
Synthesize all for model development datasets (best default when you have an entity table).
De-identify only when you must preserve multi-table behavior and don’t need maximum unlinkability.
From scratch when you’re curating a very specific feature table and want manual control.
AI-generated synthesis
This is the primary method here. Use it when you need statistical utility and strong privacy without 1:1 record links.
Example (training table via view): build training_entity_view (features + label), then apply AI synthesize to generate a training dataset for modeling. Run the QA report before training.
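A minimal sketch of the training view idea, using Python's built-in sqlite3 module as a stand-in for your source database. The customers table and all of its column names are hypothetical, chosen only to show post-outcome fields being left out of the view. Only training_entity_view comes from the example above.

```python
import sqlite3

# Hypothetical source table and column names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    signup_channel TEXT,
    monthly_spend REAL,
    churned INTEGER,            -- label
    churn_processed_at TEXT,    -- post-outcome timestamp: leaks the label
    retention_agent_note TEXT   -- human decision recorded after the outcome
);
INSERT INTO customers VALUES
    (1, 'web', 42.5, 0, NULL, NULL),
    (2, 'store', 13.0, 1, '2024-03-01', 'offered discount');

-- One row per entity: features plus label, no post-outcome fields.
CREATE VIEW training_entity_view AS
SELECT customer_id, signup_channel, monthly_spend, churned
FROM customers;
""")

# Inspect what the view exposes before pointing a workspace at it.
view_columns = [row[1] for row in conn.execute("PRAGMA table_info(training_entity_view)")]
row_count = conn.execute("SELECT COUNT(*) FROM training_entity_view").fetchone()[0]
print(view_columns, row_count)
```

Listing the columns explicitly (rather than SELECT *) means a new post-outcome column added to the source later does not silently enter the training cut.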
Rule-based generation
Use this to enforce feature constraints, bucketing, or label logic that must be explicit (or to remove leakage). Use Calculated columns for transparent, auditable rules.
Example (leakage-safe bucketing): create a calculated age_band (0–17, 18–34, 35–54, 55+) and drop raw date_of_birth. Train on the band to reduce leakage and privacy risk.
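The bucketing logic above can be sketched in plain Python to make the band boundaries explicit. This is not Syntho's calculated-column syntax. The function name and the as_of parameter (the snapshot date used to compute age) are assumptions made for the illustration.

```python
from datetime import date


def age_band(date_of_birth: date, as_of: date) -> str:
    """Map a date of birth to one of the four bands from the example.

    `as_of` is a fixed reference date (for instance the snapshot date),
    so the same source row always lands in the same band.
    """
    years = as_of.year - date_of_birth.year
    # Subtract one year if the birthday has not occurred yet in the as_of year.
    if (as_of.month, as_of.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    if years < 0:
        raise ValueError("date_of_birth is after as_of")
    if years <= 17:
        return "0–17"
    if years <= 34:
        return "18–34"
    if years <= 54:
        return "35–54"
    return "55+"


snapshot = date(2024, 6, 14)
print(age_band(date(2000, 6, 15), snapshot))
print(age_band(date(2006, 6, 15), snapshot))
```

Computing age against a fixed snapshot date rather than today's date keeps the band stable across reruns, which matters when you compare models trained on different dataset versions.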
Masking
Use this only for columns that must stay format-valid for downstream tooling. Avoid masking identifiers for ML unless strictly required.
Example (pipeline contract): if your training pipeline validates an email format, apply Mask → Email but exclude the column from the model features. Keep it as a non-training field for compatibility only.
Hybrid
Use this when you want AI synthesis for utility, plus explicit rules for stability and governance.
Example (utility + business rules): AI synthesize core features, then add a deterministic segmentation flag as a calculated column (in the style of “absolute calculations”).
Minimal configuration steps
Build training_entity_view (one row per entity).
Apply AI synthesize and validate with the QA report.
Add calculated columns for bucketing or governance flags only.
Optional: feature engineering
Prefer engineered features over raw identifiers and raw notes.
Drop or recompute derived columns to avoid leakage.
If you keep raw text, use Free text de-identification and scope it tightly.
If your real data is relational, create a training view first. See Use SQL views as input tables.
Handle keys and relationships (relational schemas)
This use case typically trains on a single entity table. If you already have that table (or a view), you can skip PK/FK configuration.
If your source is relational, decide what becomes the entity table.
Use Cross-table relationships limitations to decide whether to reshape to a single entity table or use de-identification for relationship-heavy schemas.
Validate and sync
Run the QA report when available. Use it to sanity-check utility and privacy before training.
If you update the schema or feature set, revalidate. Small schema changes can invalidate a model comparison.
Tune generation settings
Tune for training stability. Prefer fewer reruns with stable outputs over maximum speed.
Apply Additional privacy controls before publishing datasets outside the model team.
Common pitfalls & misconfigurations
Use-case specific pitfalls
Starting AI synthesis without an entity-table style dataset.
Expecting AI synthesis to preserve consistency across multiple systems.
Treating QA results as optional when the output is used for model validation.
Training on redundant or derived columns (e.g. totals derived from components).
Remove derived columns first. See AI synthesize.
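One way to spot the "totals derived from components" case before synthesis is a quick check on a sample of the training table. The sketch below is a hypothetical helper, not a Syntho feature. It assumes numeric columns and only detects a column that equals the sum of exactly two others; the column names are invented for the illustration.

```python
import math
from itertools import combinations


def find_sum_derived_columns(rows, tolerance=1e-9):
    """Return {column: (a, b)} for columns equal to a + b on every row.

    `rows` is a list of dicts with identical numeric columns.
    """
    if not rows:
        return {}
    columns = list(rows[0])
    derived = {}
    for target in columns:
        others = [c for c in columns if c != target]
        for a, b in combinations(others, 2):
            if all(
                math.isclose(r[target], r[a] + r[b], abs_tol=tolerance)
                for r in rows
            ):
                derived[target] = (a, b)
                break  # one explanation per column is enough
    return derived


# Hypothetical sample rows.
sample = [
    {"net_amount": 100.0, "tax_amount": 20.0, "total_amount": 120.0},
    {"net_amount": 50.0, "tax_amount": 5.0, "total_amount": 55.0},
]
derived_columns = find_sum_derived_columns(sample)
print(derived_columns)
```

Columns flagged this way are candidates to drop from the view and recompute after generation. A small sample can produce coincidental matches, so treat the result as a prompt for review rather than a decision.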
General pitfalls
These pitfalls show up in most projects:
Running full-scale jobs before a small validation run.
Skipping workspace validation/sync after schema changes. Use Validate and synchronize workspace.
Breaking relational integrity (missing PK/FK setup, missing foreign keys, missing virtual foreign keys). Start with Manage foreign keys and virtual foreign keys.
Overusing Consistent mapping (it slows down data generation and increases linkability).
Governance, compliance, and automation
Use-case specific recommendations
Version datasets like model inputs (features_v1, features_v2). Store generation settings + QA report with the experiment.
Separate training vs evaluation datasets. Don’t generate both from the same workspace settings without intent.
Gate model training on a QA review (utility + privacy sanity check). Capture acceptance criteria in the ticket.
If outputs leave the ML team, require an explicit privacy review and apply additional privacy controls before distribution.
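The versioning recommendation above could be supported by a small manifest stored next to each experiment. This is a sketch only: the manifest layout, the field names, and the settings and QA values are all hypothetical and do not reflect any Syntho export format.

```python
import hashlib
import json


def build_manifest(dataset_version, generation_settings, qa_summary, data_bytes):
    """Bundle what is needed to trace a model back to its synthetic dataset."""
    return {
        "dataset_version": dataset_version,
        # Content hash ties the manifest to one exact generated file.
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "generation_settings": generation_settings,
        "qa_summary": qa_summary,
    }


# Illustrative values; replace with your own settings and QA notes.
manifest = build_manifest(
    dataset_version="features_v1",
    generation_settings={"workspace_mode": "Synthesize all"},
    qa_summary={"reviewed": True, "ticket": "hypothetical-ticket-id"},
    data_bytes=b"age_band,label\n18-34,0\n",
)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_json)
```

Keeping the hash, the settings, and the QA outcome in one file makes it harder for a training run to reference a dataset that was never reviewed.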
General recommendations
Use these recommendations for most workspaces.
Ownership and change control
Assign a single workspace owner (data steward / privacy lead / DBA).
Require a ticket or change request for generator changes.
Duplicate a workspace before large edits. Keep the previous version as rollback.
Access control
Default to read-only access for source connections.
Restrict who can view source data in the UI.
Use separate workspaces per environment or audience.
Automation (baseline)
Use the Syntho REST API to standardize scans and runs.
Automate data generation, not workspace configuration.
Keep job logs for failed runs. This reduces back-and-forth during support.

