Use Case 12: Training & Education
Create safe, realistic datasets for onboarding, workshops, and hands-on training.
Use it when the goal is repeatable training scenarios that never expose real customer data.
What problem this use case solves
Training requires data that feels real.
Real production data is usually blocked by privacy, security, and access controls. Manually created demo data often lacks realism and breaks workflows.
You need a dataset that supports realistic exercises. You also need a quick reset between sessions.
When to choose this use case
Pick this when humans learn hands-on using realistic data.
If you’re unsure, start with Mock or mask all, keep the dataset small, and duplicate the workspace before every session.
You run onboarding, enablement, or workshops.
You need stable scenarios that always work.
Many trainees share the same dataset.
You need quick resets during sessions.
Use Consistent mapping only for storytelling.
When to avoid this use case
Skip this when training is not the purpose.
You share data externally beyond the training boundary. Use Use Case 9: Data Sharing & Monetization.
You need analytics-grade statistical utility. Use Use Case 7: Analytics Sandboxes.
You need stable demo narratives for product walkthroughs. Use Use Case 3: Demo Data.
If you need load testing at scale, prioritize volume profiles and destination tuning.
Recommended Syntho configuration
This setup is optimized for repeatable training exercises. You want datasets that reset fast. You want stable examples for step-by-step instructions.
Source & destination management
Create one workspace per training track. Examples: training-basics, training-pii, training-foreign-keys.
Duplicate a working workspace before big changes. This gives you a rollback point.
Use simple versioned names like v1, v2, baseline, or pilot-partner-x.
Baseline rules
Keep the source stable. Prefer snapshots or backups.
Avoid a live production source for iterative work.
Keep the destination isolated. Never write into production.
Keep schemas aligned between source, workspace and destination.
Use views when you need only a subset of the original database.
Lifecycle rule of thumb
Keep the source connection when you expect schema changes.
Remove the source connection when you don't expect another run for a long time.
Revalidate after schema changes. Use Validate and synchronize workspace.
Nuances for this use case
Prefer mock-first sources. Don’t use production copies for training.
Keep datasets small. Resets should be minutes, not hours.
Don’t give trainees source access. Training should not be a backdoor to production-like data.
Don’t mix “storytelling” and “privacy” goals in one dataset. Use separate tracks or separate workspaces.
Configure generators
Workspace initialization mode
Choose a workspace mode. It applies baseline generator suggestions during workspace creation.
Recommended modes for this use case:
Mock or mask all when you want safe, realistic values with minimal reliance on the source.
Mock all when you want trainees to learn configuration from scratch without any production-like input.
De-identify when you have a production-like training dataset and you want to teach privacy-safe replacement patterns.
AI-generated synthesis
Use this when training includes analytics/ML concepts and you want realistic correlations without exposing real people.
Example (training on “churn”): synthesize a training_churn_features_view so participants can build a simple model or dashboard with realistic feature relationships.
Rule-based generation
Use this to make exercises deterministic and repeatable. Use Calculated columns to keep labs stable.
Example (scripted “bad rows” lab): inject a small, known set of malformed values learners must find and fix.
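To illustrate what "a small, known set" means in practice, here is a minimal Python sketch of deterministic bad-row injection. It is not Syntho's calculated-column syntax. The function name inject_bad_rows, the seed value, and the sample rows are assumptions made for illustration; only the LAB_BAD_ROW marker comes from this page.

```python
import random


def inject_bad_rows(rows, column, bad_values, seed=12):
    """Return a copy of rows where a fixed set of rows holds malformed values.

    A fixed seed means the same rows are picked after every reset,
    so the lab instructions can refer to them reliably.
    """
    rng = random.Random(seed)
    out = [dict(row, LAB_BAD_ROW=False) for row in rows]
    picked = sorted(rng.sample(range(len(out)), k=len(bad_values)))
    for idx, bad in zip(picked, bad_values):
        out[idx][column] = bad
        out[idx]["LAB_BAD_ROW"] = True  # marker trainers can use to check answers
    return out, picked


# Hypothetical training rows; real labs would use the generated dataset.
rows = [{"id": i, "email": f"user{i}@example.com"} for i in range(20)]
lab_rows, bad_idx = inject_bad_rows(rows, "email", ["not-an-email", "", "a@@b"])
```

Because the selection depends only on the seed and the row count, rerunning the function on a freshly reset dataset of the same size marks the same rows again.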
Masking
Use this when labs require format-valid fields for validation exercises and relational joins.
Example (PII lab): mask email and phone_number, then enable consistent mapping for customer_id so learners can see that joins still work after de-identification.
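The idea behind consistent mapping can be sketched in a few lines of Python: the same input always maps to the same replacement, so keys still match across tables. This is a conceptual illustration using a keyed hash, not a description of how Syntho implements the feature. The key, the pseudonymize helper, and the sample tables are all hypothetical.

```python
import hashlib
import hmac

SECRET = b"training-only-key"  # hypothetical key, for illustration only


def pseudonymize(value: str, prefix: str = "C") -> str:
    """Map a value to a stable replacement: same input, same output."""
    digest = hmac.new(SECRET, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}{digest[:10]}"


def join_count(customers, orders):
    """Count orders whose customer_id matches a customer row."""
    ids = {c["customer_id"] for c in customers}
    return sum(1 for o in orders if o["customer_id"] in ids)


# Hypothetical source tables.
customers = [
    {"customer_id": "1001", "email": "anna@corp.example"},
    {"customer_id": "1002", "email": "ben@corp.example"},
]
orders = [
    {"order_id": "A1", "customer_id": "1001"},
    {"order_id": "A2", "customer_id": "1001"},
    {"order_id": "A3", "customer_id": "1002"},
]

# Emails are replaced with unrelated mock values; customer_id is mapped consistently.
masked_customers = [
    {"customer_id": pseudonymize(c["customer_id"]), "email": f"user{i}@example.com"}
    for i, c in enumerate(customers)
]
masked_orders = [
    {"order_id": o["order_id"], "customer_id": pseudonymize(o["customer_id"])}
    for o in orders
]
```

Comparing join_count on the original and the masked tables gives learners a quick check that no order lost its customer. The same stability is what increases linkability, which is why the mapping is applied here only to the join key.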
Hybrid
Use this when you want safe realism, plus scripted teaching scenarios.
Example (progressive lab setup): follow the hybrid patterns in Example data generation scenarios.
Mock names/addresses for safety.
Mask format-critical fields for validators (emails, UUIDs).
Use calculated columns to inject edge cases (LAB_BAD_ROW) and deterministic relations (e.g., gender → name).
If you want trainees to practice a classic “absolute calculation”, add this exercise:
Minimal configuration steps
Run a PII scan and review findings.
Apply mock/mask for the exercise scope.
Use calculated columns to inject lab tasks or derived fields.
Progressive training scenarios (recommended)
Design training so learners build confidence, then complexity.
Scenario A (Basics): PII scan + safe replacements
Goal: identify PII and apply mock/mask correctly.
Exercise: run PII scan, fix one false positive and one false negative, then generate.
Scenario B (Relational correctness): keys + foreign keys
Goal: keep joins working.
Exercise: add one virtual FK, validate, and re-run generation.
Scenario C (Edge cases): inject rare cases for testing
Goal: produce rows that trigger special logic.
Exercise: add an EDGE_FLAG and override one column.
Handle keys and relationships (relational schemas)
If the training dataset is single-table, you can skip this step.
Training breaks fast on missing relationships.
Validate foreign keys early. Use Manage foreign keys. Add virtual foreign keys if the schema is incomplete.
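Outside the platform, a quick orphan check on the generated data can confirm that a relationship holds before trainees hit it. The sketch below uses Python's built-in sqlite3 module with a hypothetical customers/orders schema. The count_orphans helper is an assumption for illustration and is separate from Manage foreign keys.

```python
import sqlite3


def count_orphans(conn, child, fk, parent, pk):
    """Count child rows whose foreign key has no matching parent row.

    Identifiers are interpolated into the SQL, so pass only trusted,
    hard-coded table and column names.
    """
    sql = (
        f'SELECT COUNT(*) FROM "{child}" c '
        f'LEFT JOIN "{parent}" p ON c."{fk}" = p."{pk}" '
        f'WHERE c."{fk}" IS NOT NULL AND p."{pk}" IS NULL'
    )
    return conn.execute(sql).fetchone()[0]


# Hypothetical training schema with one deliberately broken reference.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);
    """
)
orphans = count_orphans(conn, "orders", "customer_id", "customers", "customer_id")
print(orphans)  # prints 1: order 12 points at customer 99, which does not exist
```

A non-zero count before a session usually means a foreign key or virtual foreign key is missing from the workspace configuration.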
Validate and sync
Validate a small slice first.
Run the exercises end-to-end as a trainee would.
Re-run validation whenever the training schema changes. Use Validate and synchronize workspace.
Tune generation settings
Prioritize fast reset times.
Stable settings make labs reproducible.
Use View and adjust generation settings once the exercises are stable.
Common pitfalls & misconfigurations
Use-case specific pitfalls
Using production copies for training environments.
Making datasets too big.
Training should reset in minutes.
Changing generator configs right before a session.
Duplicate a working workspace instead. See Duplicate a workspace.
Using consistent mapping by default.
Decide based on training goals: stable storytelling vs strict unlinkability.
General pitfalls
These pitfalls show up in most projects:
Running full-scale jobs before a small validation run.
Skipping workspace validation/sync after schema changes. Use Validate and synchronize workspace.
Breaking relational integrity (missing PK/FK setup, missing foreign keys, missing virtual foreign keys). Start with Manage foreign keys and virtual foreign keys.
Overusing Consistent mapping (it slows down data generation and increases linkability).
Governance, compliance, and automation
Governance, access control, and audit evidence
Keep the workspace configuration as a controlled artifact. Treat it like a “test data release”.
Recommended roles
Workspace Owner: data steward or privacy lead. Approves generator choices and sharing.
Workspace Editor: data engineer or platform engineer. Implements configuration changes.
Workspace Reader: testers, analysts, or trainees. Can run jobs but should not change rules.
See Workspace & user management and Share a workspace.
Access control checklist
Use read-only access to the source database for day-to-day users.
Restrict who can view source data in the UI. Don’t default to broad access.
Use a dedicated destination per environment (dev, test, accept, sandbox).
Keep external recipients in a separate workspace with stricter settings.
Evidence for auditors (lightweight but useful)
Capture these items per delivery or refresh:
Workspace name, owner, and intended audience.
PII scan results and the final list of “PII columns + applied generator type”.
Any enabled privacy controls (e.g., rare category protection, free-text de-identification scope).
Validation output and/or QA report (when applicable).
Approval notes (ticket link, privacy board sign-off, or risk acceptance).
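One lightweight way to capture these items is a small structured record stored next to each refresh. The Python sketch below is only a suggestion: the RefreshEvidence class, its field names, and all sample values are hypothetical and are not a Syntho export format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class RefreshEvidence:
    """Audit evidence for one delivery or refresh (hypothetical structure)."""

    workspace_name: str
    owner: str
    intended_audience: str
    pii_columns: Dict[str, str]  # column name -> applied generator type
    privacy_controls: List[str] = field(default_factory=list)
    validation_report: Optional[str] = None  # path or reference, when applicable
    approval_note: Optional[str] = None  # ticket link or sign-off reference


# Sample values are placeholders, not real findings.
evidence = RefreshEvidence(
    workspace_name="training-pii",
    owner="privacy-lead",
    intended_audience="onboarding trainees",
    pii_columns={"email": "mask", "phone_number": "mask", "name": "mock"},
    privacy_controls=["rare category protection"],
    approval_note="TICKET-PLACEHOLDER",
)

record = asdict(evidence)
payload = json.dumps(record, indent=2)  # store this alongside the refresh
```

Keeping the record as plain JSON makes it easy to attach to a ticket or compare between refreshes.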
Automation and deployment (reference)
You can automate workspace setup, scans, and generation runs via the Syntho REST API.

