Use Case 7: Analytics sandboxes
Secure sandboxes for exploratory analytics and data science.
Choose this use case when analysts need access to data and privacy risk must stay controlled. The focus is utility for exploration with privacy-by-design controls.
What problem this use case solves
Teams need safe environments for exploration. They need distributions and correlations that remain useful.
Classic anonymization can remove detail and distort distributions. It can also require multiple iterations to meet privacy needs.
When to choose this use case
Pick this when analysts need access, but production is restricted.
If you’re unsure, start with Synthesize all on a curated entity table and restrict access via Share a workspace.
You need exploration and BI dashboards without production access.
You need correlations and distributions to stay useful.
Many users need the same refreshable dataset.
You need access control and privacy controls built-in.
Use De-identify when analysts need production-like multi-table joins.
When to avoid this use case
Skip this when exploration is not the goal.
You need strict rule compliance for every row. See Use Case 4: ETL & Data Pipeline Testing.
You need feature datasets for ML training and evaluation. See Use Case 6: ML Model Development.
You need external data sharing with approvals and evidence. See Use Case 9: Data Sharing & Monetization.
You mainly need load and stress testing at scale. Focus on volume profiles and throughput tuning instead of analyst UX.
Recommended Syntho configuration
This setup is optimized for exploration with controlled privacy risk. You preserve correlations and distributions. You reduce re-identification risk through privacy controls and access control.
Source & destination management
Create one workspace per audience or policy. Example: sandbox-analysts vs sandbox-data-science.
Baseline rules
Keep the source stable. Prefer snapshots or backups.
Avoid a live production source for iterative work.
Keep the destination isolated. Never write into production.
Keep schemas aligned between source, workspace and destination.
Use views when you need only a subset of the original database.
Lifecycle rule of thumb
Keep the source connection when you expect schema changes.
Remove the source connection when the next run is not expected for a long time.
Revalidate after schema changes. Use Validate and synchronize workspace.
Nuances for this use case
Use roles and sharing to enforce access boundaries. Only a small group should change generators.
Prefer blue/green schemas for refresh. Avoid breaking dashboards mid-refresh.
Avoid in-place refreshes. Users see partial data and inconsistent aggregates.
Don’t hide join keys in source views. Analysts will rebuild them manually and create privacy risk.
Configure generators
Workspace initialization mode
Choose a workspace mode. It applies baseline generator suggestions during workspace creation.
Recommended modes for this use case:
Synthesize all when you can provide an entity table and you want strong utility for exploration.
De-identify when analysts need multi-table joins that behave like production (and you mainly replace identifiers).
Mock or mask all for “safe-by-default” sandboxes with minimal dependency on the original distributions.
AI-generated synthesis
Use this when analysts need correlations and distributions to stay useful for exploration.
Example (BI-ready entity view): create sandbox_sales_entity_view (customer segment, channel, order totals), then AI synthesize it into a single fact table for dashboards without exposing production data.
Rule-based generation
Use this when you must enforce reporting conventions or guarantee certain slices exist for dashboards. Use Calculated columns to keep dashboards stable.
Example (stable “last 30 days” charts): add a calculated EDGE_RECENT flag (e.g., 20% of rows), then set order_date to a random value in the last 30 days when EDGE_RECENT is true. This avoids empty “recent activity” charts after refresh.
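The logic of this example can be sketched in plain Python. This is an illustration of the rule only, not Syntho's calculated-column syntax; the function name, the seed, and the row layout are assumptions made for the sketch.

```python
import random
from datetime import date, timedelta

def apply_recent_edge_case(rows, today, share=0.20, window_days=30, seed=7):
    """Flag roughly `share` of rows as EDGE_RECENT and move their order_date
    into the last `window_days` days. Other rows keep their original date."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    out = []
    for row in rows:
        edge_recent = rng.random() < share
        new_row = dict(row, EDGE_RECENT=edge_recent)
        if edge_recent:
            # Random offset of 0..window_days-1 days back from today.
            offset = rng.randrange(window_days)
            new_row["order_date"] = today - timedelta(days=offset)
        out.append(new_row)
    return out

# Hypothetical input: 1000 orders that are all old.
today = date(2024, 6, 30)
rows = [{"order_id": i, "order_date": date(2023, 1, 1)} for i in range(1000)]
result = apply_recent_edge_case(rows, today)
recent = [r for r in result if r["EDGE_RECENT"]]
print(len(recent), "of", len(result), "rows fall in the last 30 days")
```

Because the flag is drawn per row, the share is approximate rather than exactly 20%. That is usually enough to keep "recent activity" charts populated.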
Masking
Use this when BI tooling expects format-valid codes or when analysts need stable join keys in a de-identified relational sandbox.
Example (stable dimension joins): de-identify identifiers, keep consistent mapping for product_id, and mask postal_code to valid formats so geography dashboards and joins behave predictably.
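To show why a consistent mapping keeps joins intact, here is a minimal sketch using a keyed hash (HMAC-SHA256) from the Python standard library. This is a swapped-in technique for illustration and not a description of how Syntho implements Consistent mapping; the secret, the prefix, and the sample rows are hypothetical.

```python
import hashlib
import hmac

def consistent_pseudonym(value, secret, prefix="P"):
    """Map the same input to the same token, so joins still line up."""
    digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}{digest[:12]}"

secret = b"rotate-me-per-workspace"  # hypothetical key, keep it out of the sandbox
orders = [{"product_id": 42}, {"product_id": 42}, {"product_id": 7}]
products = [{"product_id": 42}, {"product_id": 7}]

# The same product_id yields the same token in both tables.
masked_orders = [consistent_pseudonym(o["product_id"], secret) for o in orders]
masked_products = {consistent_pseudonym(p["product_id"], secret) for p in products}
```

The same property that keeps joins working also makes rows linkable across tables, so it is worth limiting this kind of mapping to the keys that dashboards actually join on.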
Hybrid
Use this when you need both utility and operational stability for many users.
Example (hierarchy correctness for geography dashboards): enforce “city → province → country” in a dimension table (matches the “hierarchical relationship” scenario).
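A quick way to confirm the hierarchy after generation is to check that every child value maps to exactly one parent. The sketch below assumes a dimension table loaded as a list of dicts and assumes city names are unique within it (use a city key instead if they are not); the helper name and sample rows are hypothetical.

```python
from collections import defaultdict

def find_hierarchy_violations(rows, child, parent):
    """Return child values that map to more than one parent value."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child]].add(row[parent])
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}

geo = [
    {"city": "Utrecht", "province": "Utrecht", "country": "NL"},
    {"city": "Amsterdam", "province": "Noord-Holland", "country": "NL"},
    {"city": "Amsterdam", "province": "Zuid-Holland", "country": "NL"},  # broken row
]
city_issues = find_hierarchy_violations(geo, "city", "province")
province_issues = find_hierarchy_violations(geo, "province", "country")
print(city_issues)      # cities assigned to more than one province
print(province_issues)  # provinces assigned to more than one country
```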
Minimal configuration steps
Create one curated entity view (BI-friendly).
Prefer AI synthesize for the entity table.
Apply masking/de-identification for identifiers that remain.
Optional: BI-friendly dataset shape
Prefer one fact table + small dimensions.
Keep common filters (country, segment, channel).
Avoid raw free-text unless needed.
If you want one query-friendly table, create a view first. See Use SQL views as input tables.
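As a minimal sketch of such a view, the following uses Python's built-in sqlite3 module against an in-memory database. The customers and orders tables and their columns are hypothetical; only the view name sandbox_sales_entity_view comes from the example earlier on this page. The SELECT would need adapting to your source database's SQL dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT, country TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                     channel TEXT, order_total REAL);
INSERT INTO customers VALUES (1, 'SMB', 'NL'), (2, 'Enterprise', 'DE');
INSERT INTO orders VALUES (10, 1, 'web', 120.0), (11, 1, 'store', 80.0),
                          (12, 2, 'web', 950.0);

-- One row per order, with the common filters denormalized onto it.
-- This is a flat, join-free entity table, so no identifiers are selected.
CREATE VIEW sandbox_sales_entity_view AS
SELECT c.country, c.segment, o.channel, o.order_total
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id;
""")

rows = conn.execute(
    "SELECT country, segment, channel, order_total "
    "FROM sandbox_sales_entity_view ORDER BY order_total"
).fetchall()
for row in rows:
    print(row)
```

The view contains only the columns analysts filter and aggregate on, which keeps the input to synthesis small and free of raw free-text.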
Handle keys and relationships (relational schemas)
If you publish a single sandbox table (no joins), you can skip this step.
If the sandbox needs joins, make FKs explicit. Analysts will join tables in unpredictable ways.
If you do not need joins, flatten into an entity table before synthesis. This reduces privacy risk and simplifies validation.
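When the sandbox does keep joins, a simple orphan check on the destination can confirm that foreign keys still resolve. The sketch below uses sqlite3 with hypothetical products and order_lines tables; the helper name is an assumption, and the query may need adjusting for your destination database.

```python
import sqlite3

def count_orphans(conn, child_table, fk_column, parent_table, pk_column):
    """Count child rows whose foreign key has no matching parent row.
    Table and column names are interpolated, so pass trusted names only."""
    query = (
        f"SELECT COUNT(*) FROM {child_table} AS c "
        f"LEFT JOIN {parent_table} AS p ON c.{fk_column} = p.{pk_column} "
        f"WHERE c.{fk_column} IS NOT NULL AND p.{pk_column} IS NULL"
    )
    return conn.execute(query).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER PRIMARY KEY);
CREATE TABLE order_lines (line_id INTEGER PRIMARY KEY, product_id INTEGER);
INSERT INTO products VALUES (1), (2);
INSERT INTO order_lines VALUES (100, 1), (101, 2), (102, 3);  -- product 3 has no parent
""")
orphans = count_orphans(conn, "order_lines", "product_id", "products", "product_id")
print("orphaned order_lines rows:", orphans)
```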
Validate and sync
Use the QA report when available to validate utility and privacy.
Revalidate after each refresh cycle. Sandbox users notice drift quickly.
Tune generation settings
Tune for interactive performance. Sandboxes are query-heavy.
Apply Additional privacy controls before widening access to more users.
Use View and adjust generation settings when query latency becomes the bottleneck.
Refresh and rollback strategy (low disruption)
Avoid breaking dashboards during refreshes:
Keep two destination schemas: sandbox_blue and sandbox_green.
Refresh the inactive schema, validate dashboards, then switch BI connections.
If something breaks, roll back by switching back to the previous schema.
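The refresh, validate, switch, and rollback steps can be sketched as a small state holder. This is a minimal sketch under stated assumptions: the class name is hypothetical, and generate and validate are placeholder callables standing in for your own generation job and dashboard checks. It does not call any Syntho API.

```python
class BlueGreenSandbox:
    """Tracks which of two destination schemas BI connections should use."""

    def __init__(self, active="sandbox_blue", inactive="sandbox_green"):
        self.active = active
        self.inactive = inactive

    def refresh(self, generate, validate):
        """Write into the inactive schema; switch only if validation passes."""
        generate(self.inactive)  # never touches the active schema
        if not validate(self.inactive):
            return False  # dashboards keep using the old data
        self.active, self.inactive = self.inactive, self.active
        return True

    def rollback(self):
        """Point BI back at the previous schema, which is still intact."""
        self.active, self.inactive = self.inactive, self.active


# Placeholder callables: record which schema was written, always validate OK.
written = []
sandbox = BlueGreenSandbox()
ok = sandbox.refresh(generate=written.append, validate=lambda schema: True)
print(ok, sandbox.active, written)
```

Rollback is only safe until the next refresh overwrites the previous schema, so it helps to validate dashboards promptly after each switch.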
Common pitfalls & misconfigurations
Use-case specific pitfalls
Publishing sandboxes that still contain sensitive identifiers.
Using entity tables that are too small for stable results.
Over-sharing sandbox workspaces. Use roles and data access controls. See Share a workspace.
General pitfalls
These pitfalls show up in most projects:
Running full-scale jobs before a small validation run.
Skipping workspace validation/sync after schema changes. Use Validate and synchronize workspace.
Breaking relational integrity (missing PK/FK setup, missing foreign keys, missing virtual foreign keys). Start with Manage foreign keys and virtual foreign keys.
Overusing Consistent mapping (it slows down data generation and increases linkability).
Governance, compliance, and automation
Use-case specific recommendations
Use strict roles: many Readers, very few Editors. Analysts should not change generators.
Use blue/green refresh for sandboxes. Automate refresh into the inactive schema, validate, then switch.
Publish a lightweight data dictionary and refresh timestamp with every refresh. Analysts need lineage to trust results.
Automate drift checks on key aggregates (top segments, null rates, distinct counts). Alert when the sandbox changes materially.
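A drift check on the aggregates named above might look like the following sketch. The function names and thresholds are assumptions chosen for illustration; tune them to what counts as a material change for your dashboards.

```python
from collections import Counter

def profile(rows, column, top_n=3):
    """Null rate, distinct count, and top values for one column."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "top": [v for v, _ in Counter(non_null).most_common(top_n)],
    }

def drift_alerts(previous, current, max_null_delta=0.05, max_distinct_ratio=0.2):
    """Return the names of the aggregates that moved more than allowed."""
    alerts = []
    if abs(current["null_rate"] - previous["null_rate"]) > max_null_delta:
        alerts.append("null_rate")
    base = max(previous["distinct"], 1)
    if abs(current["distinct"] - previous["distinct"]) / base > max_distinct_ratio:
        alerts.append("distinct")
    if current["top"] != previous["top"]:
        alerts.append("top")
    return alerts

# Hypothetical data: the refresh introduced many more nulls in `segment`.
before = [{"segment": s} for s in ["SMB"] * 6 + ["Enterprise"] * 3 + [None]]
after = [{"segment": s} for s in ["SMB"] * 4 + ["Enterprise"] * 2 + [None] * 4]
alerts = drift_alerts(profile(before, "segment"), profile(after, "segment"))
print(alerts)
```

Running this against the inactive schema before the switch lets a refresh be rejected instead of surprising analysts afterwards.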
General recommendations
Use these recommendations for most workspaces.
Ownership and change control
Assign a single workspace owner (data steward / privacy lead / DBA).
Require a ticket or change request for generator changes.
Duplicate a workspace before large edits. Keep the previous version as rollback.
Access control
Default to read-only access for source connections.
Restrict who can view source data in the UI.
Use separate workspaces per environment or audience.
Automation (baseline)
Use the Syntho REST API to standardize scans and runs.
Automate data generation, not workspace configuration.
Keep job logs for failed runs. This reduces back-and-forth during support.

