Use Case 12: Training & Education

Create safe, realistic datasets for onboarding, workshops, and hands-on training.

The focus is on repeatable training scenarios that do not expose real customer data.

What problem this use case solves

Training requires data that feels real.

Real production data is usually blocked by privacy, security, and access controls. Manually created demo data often lacks realism and breaks workflows.

You need a dataset that supports realistic exercises. You also need a quick reset between sessions.

When to choose this use case

Pick this when people learn hands-on with realistic data.

If you’re unsure, start with Mock or mask all, keep the dataset small, and duplicate the workspace before every session.

  • You run onboarding, enablement, or workshops.

  • You need stable scenarios that always work.

  • Many trainees share the same dataset.

  • You need quick resets during sessions.

  • Use Consistent mapping only for storytelling.

When to avoid this use case

Skip this when training is not the purpose.

This setup is optimized for repeatable training exercises. You want datasets that reset fast. You want stable examples for step-by-step instructions.

1. Prerequisites

Checklist

2. Source & destination management

Create one workspace per training track. Examples: training-basics, training-pii, training-foreign-keys.

  • Duplicate a working workspace before big changes. This gives you a rollback point.

  • Use simple versioned names like v1, v2, baseline, or pilot-partner-x.

Baseline rules

  • Keep the source stable. Prefer snapshots or backups.

  • Avoid a live production source for iterative work.

  • Keep the destination isolated. Never write into production.

  • Keep schemas aligned between source, workspace, and destination.

  • Use views when you need only a subset of the original database.

Lifecycle rule of thumb

  • Keep the source connection when you expect schema changes.

  • Remove the source connection when you do not expect another run for a long time.

  • Revalidate after schema changes. Use Validate and synchronize workspace.

Nuances for this use case

  • Prefer mock-first sources. Don’t use production copies for training.

  • Keep datasets small. Resets should take minutes, not hours.

  • Don’t give trainees source access. Training should not be a backdoor to production-like data.

  • Don’t mix “storytelling” and “privacy” goals in one dataset. Use separate tracks or separate workspaces.

3. Configure generators

Workspace initialization mode

Choose a workspace mode. It applies baseline generator suggestions during workspace creation.

Recommended modes for this use case:

  • Mock or mask all when you want safe, realistic values with minimal reliance on the source.

  • Mock all when you want trainees to learn configuration from scratch without any production-like input.

  • De-identify when you have a production-like training dataset and you want to teach privacy-safe replacement patterns.

AI-generated synthesis

Use this when training includes analytics/ML concepts and you want realistic correlations without exposing real people.

Example (training on “churn”): synthesize a training_churn_features_view so participants can build a simple model or dashboard with realistic feature relationships.

Rule-based generation

Use this to make exercises deterministic and repeatable. Use Calculated columns to keep labs stable.

Example (scripted “bad rows” lab): inject a small, known set of malformed values learners must find and fix.
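The scripted "bad rows" idea can be sketched outside the product as follows. This is a minimal illustration in plain Python, not Syntho's calculated-column syntax: the function name `inject_bad_rows`, the `lab_note` column, and the sample values are assumptions made for the example. The point it shows is that a fixed seed makes the same rows malformed after every reset.

```python
import random

# Hypothetical malformed values that learners must find and fix.
BAD_EMAILS = ["not-an-email", "missing-at.example.com", "trailing@dot."]


def inject_bad_rows(rows, bad_values, column, seed=42):
    """Return (new_rows, target_indexes) with one bad value per target row.

    The fixed seed means the same row positions are chosen on every run,
    so written lab instructions stay valid after a reset.
    """
    rng = random.Random(seed)
    targets = sorted(rng.sample(range(len(rows)), k=len(bad_values)))
    result = [dict(row) for row in rows]  # copy so the input stays clean
    for index, bad in zip(targets, bad_values):
        result[index][column] = bad
        result[index]["lab_note"] = "LAB_BAD_ROW"  # assumed marker column
    return result, targets


rows = [{"id": i, "email": f"user{i}@example.com"} for i in range(20)]
lab_rows, bad_indexes = inject_bad_rows(rows, BAD_EMAILS, "email")
```

Keeping the list of bad values short and explicit lets the trainer publish an answer key that matches the dataset exactly.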

Masking

Use this when labs require format-valid fields for validation exercises and relational joins.

Example (PII lab): mask email and phone_number, then enable consistent mapping for customer_id so learners can see that joins still work after de-identification.
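To show trainees why joins survive consistent mapping, a small standalone sketch can help. It assumes a keyed hash (HMAC-SHA256) as the mapping function, which is an illustrative stand-in and not a description of how Syntho implements consistent mapping. The key, the `CUST-` prefix, and the table contents are hypothetical.

```python
import hashlib
import hmac

# Hypothetical training-only key; changing it produces a fresh mapping.
SECRET_KEY = b"training-only-key"


def map_customer_id(original: str) -> str:
    """Map the same input to the same pseudonym every time."""
    digest = hmac.new(SECRET_KEY, original.encode("utf-8"), hashlib.sha256)
    return "CUST-" + digest.hexdigest()[:12]


def count_matching_orders(customers, orders):
    """Count orders whose customer_id exists in the customers table."""
    known_ids = {c["customer_id"] for c in customers}
    return sum(1 for o in orders if o["customer_id"] in known_ids)


customers = [
    {"customer_id": "1001", "email": "anna@corp.example"},
    {"customer_id": "1002", "email": "ben@corp.example"},
]
orders = [
    {"order_id": "A", "customer_id": "1001"},
    {"order_id": "B", "customer_id": "1001"},
    {"order_id": "C", "customer_id": "1002"},
]

# Mask emails with placeholder values and map IDs consistently in both tables.
masked_customers = [
    {"customer_id": map_customer_id(c["customer_id"]), "email": f"user{i}@example.com"}
    for i, c in enumerate(customers, start=1)
]
masked_orders = [
    {"order_id": o["order_id"], "customer_id": map_customer_id(o["customer_id"])}
    for o in orders
]
```

Comparing `count_matching_orders` before and after masking gives learners a quick check that every order still finds its customer.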

Hybrid

Use this when you want safe realism, plus scripted teaching scenarios.

Example (progressive lab setup): follow the hybrid patterns in Example data generation scenarios.

  1. Mock names/addresses for safety.

  2. Mask format-critical fields for validators (emails, UUIDs).

  3. Use calculated columns to inject edge cases (LAB_BAD_ROW) and deterministic relations (e.g., gender → name).

If you want trainees to practice a classic “absolute calculation”, add this exercise:

Minimal configuration steps

  1. Run a PII scan and review findings.

  2. Apply mock/mask for the exercise scope.

  3. Use calculated columns to inject lab tasks or derived fields.

4. Handle keys and relationships (relational schemas)

If the training dataset is single-table, you can skip this step.

Training exercises break quickly when relationships are missing.

Validate foreign keys early. Use Manage foreign keys. Add virtual foreign keys if the schema is incomplete.
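Independently of the Manage foreign keys screen, an early orphan check on the training dataset can be sketched like this. It uses an in-memory SQLite database purely for illustration; the table and column names are hypothetical, and `find_orphans` is a helper written for this example, not part of any Syntho tooling.

```python
import sqlite3


def find_orphans(conn, child, fk_column, parent, pk_column):
    """Return child foreign-key values that have no matching parent row.

    Identifiers are interpolated into the SQL text, so pass only trusted,
    hard-coded table and column names.
    """
    query = (
        f"SELECT c.{fk_column} FROM {child} AS c "
        f"LEFT JOIN {parent} AS p ON c.{fk_column} = p.{pk_column} "
        f"WHERE c.{fk_column} IS NOT NULL AND p.{pk_column} IS NULL"
    )
    return [row[0] for row in conn.execute(query)]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 3);
    """
)

# Order 12 points at customer 3, which does not exist.
orphans = find_orphans(conn, "orders", "customer_id", "customers", "customer_id")
```

An empty result before a session is a reasonable sign that join-based exercises will not fail on missing parents.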

5. Validate and sync

Validate a small slice first.

Run the exercises end-to-end as a trainee would.

Re-run validation whenever the training schema changes. Use Validate and synchronize workspace.

6. Tune generation settings

Prioritize fast reset times.

Stable settings make labs reproducible.

Use View and adjust generation settings once the exercises are stable.

Common pitfalls & misconfigurations

Use-case specific pitfalls

  • Using production copies for training environments.

  • Making datasets too big.

    • Training should reset in minutes.

  • Changing generator configs right before a session.

  • Using consistent mapping by default.

    • Decide based on training goals: stable storytelling vs strict unlinkability.

General pitfalls

These pitfalls show up in most projects:

Governance, compliance, and automation

Governance, access control, and audit evidence

Keep the workspace configuration as a controlled artifact. Treat it like a “test data release”.

  • Workspace Owner: data steward or privacy lead. Approves generator choices and sharing.

  • Workspace Editor: data engineer or platform engineer. Implements configuration changes.

  • Workspace Reader: testers, analysts, or trainees. Can run jobs but should not change rules.

See Workspace & user management and Share a workspace.

Access control checklist

  • Use read-only access to the source database for day-to-day users.

  • Restrict who can view source data in the UI. Don’t default to broad access.

  • Use a dedicated destination per environment (dev, test, accept, sandbox).

  • Keep external recipients in a separate workspace with stricter settings.

Evidence for auditors (lightweight but useful)

Capture these items per delivery or refresh:

  • Workspace name, owner, and intended audience.

  • PII scan results and the final list of “PII columns + applied generator type”.

  • Any enabled privacy controls (e.g., rare category protection, free-text de-identification scope).

  • Validation output and/or QA report (when applicable).

  • Approval notes (ticket link, privacy board sign-off, or risk acceptance).
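One lightweight way to capture the items above is a small structured record stored next to the ticket. The sketch below is an assumption about how a team might do this; the `EvidenceRecord` class, its field names, and the sample values are hypothetical and are not a Syntho export format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvidenceRecord:
    """One record per delivery or refresh, mirroring the checklist items."""

    workspace_name: str
    owner: str
    intended_audience: str
    pii_columns: dict = field(default_factory=dict)  # column -> generator type
    privacy_controls: list = field(default_factory=list)
    validation_report: str = ""  # may stay empty when not applicable
    approval_note: str = ""

    def missing_items(self):
        """Return the names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


record = EvidenceRecord(
    workspace_name="training-pii-v2",  # hypothetical workspace name
    owner="privacy-lead",
    intended_audience="onboarding cohort",
    pii_columns={"email": "mask", "phone_number": "mask", "name": "mock"},
    privacy_controls=["rare category protection"],
)

# Serialize for storage alongside the approval ticket.
evidence_json = json.dumps(asdict(record), indent=2, sort_keys=True)
```

Calling `record.missing_items()` before sign-off shows which items still need to be filled in or explicitly marked as not applicable.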

Automation and deployment (reference)

You can automate workspace setup, scans, and generation runs via the Syntho REST API.
