Use Case 7: Analytics sandboxes

Secure sandboxes for exploratory analytics and data science.

Use this use case when analysts need access to data and privacy risk must stay controlled. The focus is utility for exploration with privacy-by-design controls.

What problem this use case solves

Teams need safe environments for exploration. They need distributions and correlations that remain useful.

Classic anonymization can remove detail and distort distributions. It can also require multiple iterations to meet privacy needs.

When to choose this use case

Pick this when analysts need data access but access to production is restricted.

If you’re unsure, start with Synthesize all on a curated entity table and restrict access via Share a workspace.

  • You need exploration and BI dashboards without production access.

  • You need correlations and distributions to stay useful.

  • Many users need the same refreshable dataset.

  • You need access control and privacy controls built-in.

  • Use De-identify when analysts need production-like multi-table joins.

When to avoid this use case

Skip this when exploration is not the goal.

This setup is optimized for exploration with controlled privacy risk. You preserve correlations and distributions. You reduce re-identification risk through privacy controls and access control.

1

Prerequisites

Checklist

2

Source & destination management

Create one workspace per audience or policy. Example: sandbox-analysts vs sandbox-data-science.

Baseline rules

  • Keep the source stable. Prefer snapshots or backups.

  • Avoid a live production source for iterative work.

  • Keep the destination isolated. Never write into production.

  • Keep schemas aligned between source, workspace and destination.

  • Use views when you need only a subset of the original database.

Lifecycle rule of thumb

  • Keep the source connection when you expect schema changes.

  • Remove the source connection when you do not expect another run for a long time.

  • Revalidate after schema changes. Use Validate and synchronize workspace.

Nuances for this use case

  • Use roles and sharing to enforce access boundaries. Only a small group should change generators.

  • Prefer blue/green schemas for refresh. Avoid breaking dashboards mid-refresh.

  • Avoid in-place refreshes. Users see partial data and inconsistent aggregates.

  • Don’t hide join keys in source views. Analysts will rebuild them manually and create privacy risk.

3

Configure generators

Workspace initialization mode

Choose a workspace mode. It applies baseline generator suggestions during workspace creation.

Recommended modes for this use case:

  • Synthesize all when you can provide an entity table and you want strong utility for exploration.

  • De-identify when analysts need multi-table joins that behave like production (and you mainly replace identifiers).

  • Mock or mask all for “safe-by-default” sandboxes with minimal dependency on the original distributions.

AI-generated synthesis

Use this when analysts need correlations and distributions to stay useful for exploration.

Example (BI-ready entity view): create sandbox_sales_entity_view (customer segment, channel, order totals), then use AI synthesis to turn it into a single fact table for dashboards without exposing production data.
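A minimal sketch of such an entity view, using Python's built-in sqlite3 module as a stand-in for the real source database. The customers and orders tables and their columns are hypothetical; only the view name comes from the example above.

```python
import sqlite3

# Hypothetical source tables; names and columns are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id),
    channel TEXT,
    total REAL
);
INSERT INTO customers VALUES (1, 'retail'), (2, 'enterprise');
INSERT INTO orders VALUES
    (10, 1, 'web', 20.0),
    (11, 1, 'store', 35.5),
    (12, 2, 'web', 900.0);

-- One flat, BI-friendly row per order; direct identifiers are left out.
CREATE VIEW sandbox_sales_entity_view AS
SELECT c.segment, o.channel, o.total AS order_total
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id;
""")

rows = conn.execute(
    "SELECT segment, channel, order_total "
    "FROM sandbox_sales_entity_view ORDER BY order_total"
).fetchall()
print(rows)
```

The view, not the underlying tables, would then serve as the input table for synthesis.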

Rule-based generation

Use this when you must enforce reporting conventions or guarantee certain slices exist for dashboards. Use Calculated columns to keep dashboards stable.

Example (stable “last 30 days” charts): add a calculated EDGE_RECENT flag (e.g., 20% of rows), then set order_date to a random value in the last 30 days when EDGE_RECENT is true. This avoids empty “recent activity” charts after refresh.
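The logic behind this rule can be sketched in plain Python. This illustrates the idea only and is not the product's calculated-column syntax; the function name, the seed, and the row layout are assumptions.

```python
import random
from datetime import date, timedelta


def apply_edge_recent(rows, today, share=0.2, window_days=30, seed=0):
    """Flag roughly `share` of rows and move their order_date into the window."""
    rng = random.Random(seed)  # fixed seed keeps refreshes reproducible
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows stay untouched
        row["EDGE_RECENT"] = rng.random() < share
        if row["EDGE_RECENT"]:
            offset = rng.randint(0, window_days - 1)  # 0..29 days back
            row["order_date"] = today - timedelta(days=offset)
        out.append(row)
    return out


today = date(2024, 6, 30)
old_rows = [{"order_id": i, "order_date": date(2023, 1, 1)} for i in range(1000)]
result = apply_edge_recent(old_rows, today)
recent = [r for r in result if r["EDGE_RECENT"]]
print(len(recent), "of", len(result), "rows fall in the last 30 days")
```

Because the flag is drawn per row, the share is approximately 20%, not exactly 20%.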

Masking

Use this when BI tooling expects format-valid codes or when analysts need stable join keys in a de-identified relational sandbox.

Example (stable dimension joins): de-identify identifiers, keep consistent mapping for product_id, and mask postal_code to valid formats so geography dashboards and joins behave predictably.
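One common way to get a consistent mapping is a keyed hash, sketched below with Python's standard library. This is an assumption about how such a mapping could work, not a description of the product's generators. SECRET_KEY and both function names are hypothetical, and the postal-code mask preserves only the shape of the value (digits stay digits, letters stay letters), not membership in a real postal-code registry.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical; keep out of source control


def pseudonymize_id(value: str, prefix: str = "P") -> str:
    """Same input always yields the same token, so joins on the key survive."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}{digest[:12]}"


def mask_postal_code(value: str) -> str:
    """Replace digits with digits and letters with letters, deterministically."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice("0123456789"))
        elif ch.isalpha():
            out.append(rng.choice("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
        else:
            out.append(ch)  # keep spaces and separators so the format stays intact
    return "".join(out)


print(pseudonymize_id("SKU-1"), mask_postal_code("1234 AB"))
```

Applying pseudonymize_id to product_id in both the fact table and the dimension table keeps the join intact, because equal inputs map to equal tokens.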

Hybrid

Use this when you need both utility and operational stability for many users.

Example (hierarchy correctness for geography dashboards): enforce “city → province → country” in a dimension table (matches the “hierarchical relationship” scenario).
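A hierarchy like this can be checked after generation with a small script. The sketch below assumes each city name belongs to exactly one province within the dimension table; if city names repeat across provinces in your data, check on surrogate keys instead. The function name and row layout are illustrative.

```python
def hierarchy_violations(rows):
    """Return (level, child, first_parent, conflicting_parent) for each conflict."""
    city_to_province = {}
    province_to_country = {}
    violations = []
    for row in rows:
        for child_key, parent_key, mapping in (
            ("city", "province", city_to_province),
            ("province", "country", province_to_country),
        ):
            child, parent = row[child_key], row[parent_key]
            seen = mapping.setdefault(child, parent)  # remember the first parent seen
            if seen != parent:
                violations.append((child_key, child, seen, parent))
    return violations


dimension = [
    {"city": "Utrecht", "province": "Utrecht", "country": "NL"},
    {"city": "Amersfoort", "province": "Utrecht", "country": "NL"},
    {"city": "Amersfoort", "province": "Gelderland", "country": "NL"},
]
problems = hierarchy_violations(dimension)
print(problems)
```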

Minimal configuration steps

  1. Create one curated entity view (BI-friendly).

  2. Prefer AI synthesis for the entity table.

  3. Apply masking/de-identification for identifiers that remain.

Optional: BI-friendly dataset shape

  • Prefer one fact table + small dimensions.

  • Keep common filters (country, segment, channel).

  • Avoid raw free-text unless needed.

  • If you want one query-friendly table, create a view first. See Use SQL views as input tables.

4

Handle keys and relationships (relational schemas)

If you publish a single sandbox table (no joins), you can skip this step.

If the sandbox needs joins, make foreign keys (FKs) explicit. Analysts will join tables in unpredictable ways.

If you do not need joins, flatten into an entity table before synthesis. This reduces privacy risk and simplifies validation.
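When the sandbox does keep joins, a quick orphan check on the generated output confirms that every foreign key value still resolves to a parent row. This is a minimal sketch; the table and column names are hypothetical.

```python
def orphaned_keys(child_rows, parent_rows, fk, pk):
    """Return foreign key values in child_rows that have no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted(
        {row[fk] for row in child_rows if row[fk] is not None and row[fk] not in parent_keys}
    )


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},     # no such customer
    {"order_id": 12, "customer_id": None},  # nullable FK, ignored
]
orphans = orphaned_keys(orders, customers, fk="customer_id", pk="customer_id")
print(orphans)
```

An empty result means analysts can join the two tables without silently losing rows.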

5

Validate and sync

Use the QA report when available to validate utility and privacy.

Revalidate after each refresh cycle. Sandbox users notice drift quickly.

6

Tune generation settings

Tune for interactive performance. Sandboxes are query-heavy.

Apply Additional privacy controls before widening access to more users.

Use View and adjust generation settings when query latency becomes the bottleneck.

Refresh and rollback strategy (low disruption)

Avoid breaking dashboards during refreshes:

  • Keep two destination schemas: sandbox_blue and sandbox_green.

  • Refresh the inactive schema, validate dashboards, then switch BI connections.

  • If something breaks, roll back by switching back to the previous schema.
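The switch logic can be sketched as a small state holder. The load and validate callables are placeholders for whatever runs the generation job and checks the dashboards in your environment; the class name is an assumption, and only the schema names come from the list above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlueGreenSandbox:
    active: str = "sandbox_blue"
    standby: str = "sandbox_green"

    def refresh(self, load: Callable[[str], None], validate: Callable[[str], bool]) -> bool:
        """Load into the standby schema and switch only if validation passes."""
        load(self.standby)
        if not validate(self.standby):
            return False  # dashboards keep reading the untouched active schema
        self.active, self.standby = self.standby, self.active
        return True

    def rollback(self) -> None:
        """Point BI connections back at the previous schema."""
        self.active, self.standby = self.standby, self.active


loaded = []
sandbox = BlueGreenSandbox()
switched = sandbox.refresh(load=loaded.append, validate=lambda schema: True)
print(switched, sandbox.active, loaded)
```

Because the previous schema is never overwritten during a refresh, rollback is a pointer swap and needs no regeneration.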

Common pitfalls & misconfigurations

Use-case specific pitfalls

  • Publishing sandboxes that still contain sensitive identifiers.

  • Using entity tables that are too small for stable results.

  • Over-sharing sandbox workspaces.

General pitfalls

These pitfalls show up in most projects:

Governance, compliance, and automation

Use-case specific recommendations

  • Use strict roles: many Readers, very few Editors. Analysts should not change generators.

  • Use blue/green refresh for sandboxes. Automate refresh into the inactive schema, validate, then switch.

  • Publish a lightweight data dictionary and refresh timestamp with every refresh. Analysts need lineage to trust results.

  • Automate drift checks on key aggregates (top segments, null rates, distinct counts). Alert when the sandbox changes materially.
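A drift check on key aggregates might look like the sketch below. The thresholds, function names, and row layout are assumptions to adapt; the metrics (null rate, distinct count, top value) follow the list above.

```python
from collections import Counter


def column_profile(rows, column):
    """Summarize one column: null rate, distinct count, and most common value."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    top = Counter(non_null).most_common(1)
    return {
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "distinct": len(set(non_null)),
        "top": top[0][0] if top else None,
    }


def drift_alerts(baseline, current, max_null_delta=0.05, max_distinct_ratio=0.2):
    """Return the names of metrics that moved more than the allowed tolerance."""
    alerts = []
    if abs(current["null_rate"] - baseline["null_rate"]) > max_null_delta:
        alerts.append("null_rate")
    base_distinct = max(baseline["distinct"], 1)  # avoid division by zero
    if abs(current["distinct"] - baseline["distinct"]) / base_distinct > max_distinct_ratio:
        alerts.append("distinct")
    if current["top"] != baseline["top"]:
        alerts.append("top")
    return alerts


previous = [{"segment": "retail"}] * 6 + [{"segment": "enterprise"}] * 4
refreshed = [{"segment": "retail"}] * 3 + [{"segment": "enterprise"}] * 2 + [{"segment": None}] * 5
alerts = drift_alerts(column_profile(previous, "segment"), column_profile(refreshed, "segment"))
print(alerts)
```

Running this against the inactive schema before the switch catches material changes before analysts see them.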

General recommendations

Use these recommendations for most workspaces.

Ownership and change control

  • Assign a single workspace owner (data steward / privacy lead / DBA).

  • Require a ticket or change request for generator changes.

  • Duplicate a workspace before large edits. Keep the previous version as rollback.

Access control

  • Default to read-only access for source connections.

  • Restrict who can view source data in the UI.

  • Use separate workspaces per environment or audience.

Automation (baseline)

  • Use the Syntho REST API to standardize scans and runs.

  • Automate data generation, not workspace configuration.

  • Keep job logs for failed runs. This reduces back-and-forth during support.
