# Use Case 7: Analytics sandboxes

Choose this use case when analysts need access to data and privacy risk must stay controlled. The focus is utility for exploration, backed by privacy-by-design controls.

### What problem this use case solves

Teams need safe environments for exploration. They need distributions and correlations that remain useful.

Classic anonymization can remove detail and distort distributions. It can also require multiple iterations to meet privacy needs.

### When to choose this use case

Pick this when analysts need access, but production is restricted.

If you’re unsure, start with **Synthesize all** on a curated entity table and restrict access via [Share a workspace](/setup-workspaces/share-a-workspace.md).

* You need exploration and BI dashboards without production access.
* You need correlations and distributions to stay useful.
* Many users need the same refreshable dataset.
* You need access control and privacy controls built in.
* Use **De-identify** when analysts need production-like multi-table joins.

### When to avoid this use case

Skip this when exploration is not the goal.

* You need strict rule compliance for every row. Use [Use Case 4: ETL & Data Pipeline Testing](/overview/get-started/use-cases-and-configuration/use-case-4-etl-and-data-pipeline-testing.md).
* You need feature datasets for ML training and evaluation. Use [Use Case 6: ML Model Development](/overview/get-started/use-cases-and-configuration/use-case-6-ml-model-development.md).
* You need external data sharing with approvals and evidence. Use [Use Case 9: Data Sharing & Monetization](/overview/get-started/use-cases-and-configuration/use-case-9-data-sharing-and-monetization.md).
* You mainly need load and stress testing at scale. Focus on volume profiles and throughput tuning instead of analyst UX.

### Recommended Syntho configuration

This setup is optimized for **exploration with controlled privacy risk**. You preserve correlations and distributions. You reduce re-identification risk through privacy controls and access control.

{% stepper %}
{% step %}

#### Prerequisites

**Checklist**

* [ ] Audience and access policy defined (analysts vs data science).
* [ ] Refresh cadence defined (daily/weekly).
* [ ] Destination schema strategy chosen (blue/green if needed).

- Use the [Prerequisites](/overview/get-started/prerequisites.md) checklist.
  {% endstep %}

{% step %}

#### Source & destination management

Create one workspace per audience or policy. Example: `sandbox-analysts` vs `sandbox-data-science`.

#### Baseline rules

* Keep the **source stable**. Prefer snapshots or backups.
* Avoid a **live production** source for iterative work.
* Keep the **destination isolated**. Never write into production.
* Keep **schemas aligned** between source, workspace and destination.
* Use **views** when you need only a subset of the original database.

#### Lifecycle rule of thumb

* Keep the source connection when you expect schema changes.
* Remove the source connection when you do not expect another run for a long time.
* Revalidate after schema changes. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).

**Nuances for this use case**

* Use roles and sharing to enforce access boundaries. Only a small group should change generators.
* Prefer blue/green schemas for refresh. Avoid breaking dashboards mid-refresh.
* Avoid in-place refreshes. Users see partial data and inconsistent aggregates.
* Don’t hide join keys in source views. Analysts will rebuild them manually and create privacy risk.
* [Create a workspace](/setup-workspaces/create-a-workspace.md)
* [Share a workspace](/setup-workspaces/share-a-workspace.md)
  {% endstep %}

{% step %}

#### Configure generators

**Workspace initialization mode**

Choose a [workspace mode](/setup-workspaces/create-a-workspace/workspace-modes.md). It applies baseline generator suggestions during workspace creation.

Recommended modes for this use case:

* **Synthesize all** when you can provide an entity table and you want strong utility for exploration.
* **De-identify** when analysts need multi-table joins that behave like production (and you mainly replace identifiers).
* **Mock or mask all** for “safe-by-default” sandboxes with minimal dependency on the original distributions.

**AI-generated synthesis**

Use this when analysts need **correlations and distributions** to stay useful for exploration.

**Example (BI-ready entity view):** create `sandbox_sales_entity_view` (customer segment, channel, order totals), then use AI synthesis to turn it into a single fact table for dashboards without exposing production data.

**Rule-based generation**

Use this when you must enforce **reporting conventions** or guarantee certain slices exist for dashboards. Use [Calculated columns](/configure-a-data-generation-job/configure-column-settings/calculated-columns.md) to keep dashboards stable.

**Example (stable “last 30 days” charts):** add a calculated `EDGE_RECENT` flag (e.g., 20% of rows), then set `order_date` to a random value in the last 30 days when `EDGE_RECENT` is true. This avoids empty “recent activity” charts after refresh.

```excel-formula
// New column: EDGE_RECENT (≈20% of rows)
RAND() < 0.20
```

```excel-formula
// Override: order_date (force recent rows for dashboards)
IF([EDGE_RECENT], DATEADD(TODAY(), -RANDBETWEEN(0, 30), "day"), [order_date])
```
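
After a refresh, you may want to confirm that the "recent" slice actually exists in the destination. The following is a minimal Python sketch of such a check, not part of Syntho. The function name `recent_share`, the sample dates, and the fixed `today` value are assumptions for illustration.

```python
from datetime import date, timedelta


def recent_share(order_dates, today, window_days=30):
    """Return the fraction of dates that fall within the last `window_days` days."""
    if not order_dates:
        return 0.0
    cutoff = today - timedelta(days=window_days)
    recent = sum(1 for d in order_dates if cutoff <= d <= today)
    return recent / len(order_dates)


# Hypothetical sample of order_date values read from the refreshed sandbox.
today = date(2024, 6, 30)
sample = [
    date(2024, 6, 29),
    date(2024, 6, 1),
    date(2024, 3, 15),
    date(2023, 12, 1),
    date(2024, 6, 15),
]

share = recent_share(sample, today)
print(f"{share:.0%} of rows fall in the last 30 days")
```

A share well below the configured `EDGE_RECENT` rate suggests the override was not applied to the column.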

**Masking**

Use this when BI tooling expects **format-valid codes** or when analysts need stable join keys in a de-identified relational sandbox.

**Example (stable dimension joins):** de-identify identifiers, keep consistent mapping for `product_id`, and mask `postal_code` to valid formats so geography dashboards and joins behave predictably.
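
To illustrate why a consistent mapping keeps joins stable, here is a minimal Python sketch of deterministic pseudonymization with a keyed hash. It is only a conceptual illustration and makes no claim about how Syntho implements consistent mapping. The function `pseudonymize`, the `secret` value, and the `P` prefix are assumptions.

```python
import hashlib
import hmac


def pseudonymize(value, secret, prefix="P"):
    """Map a value to a stable pseudonym: the same input and secret give the same output."""
    digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}{digest[:12]}"


# Hypothetical secret; in practice it would come from a secrets manager.
secret = b"example-secret"

# The same product_id in two tables maps to the same pseudonym, so joins still match.
orders_key = pseudonymize(1001, secret)
products_key = pseudonymize(1001, secret)
print(orders_key == products_key)
```

The same property that keeps joins working also makes records linkable across tables, so apply it only to the keys analysts actually join on.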

**Hybrid**

Use this when you need both **utility** and **operational stability** for many users.

**Example (hierarchy correctness for geography dashboards):** enforce “city → province → country” in a dimension table (matches the “hierarchical relationship” scenario).

```excel-formula
// New column: province (derived from city)
SWITCH(UPPER(TRIM([city])),
  "TORONTO",  "ONTARIO",
  "MONTREAL", "QUEBEC",
  "VANCOUVER","BRITISH_COLUMBIA",
  "OTHER"
)
```
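
A quick way to confirm the hierarchy holds in the generated table is to check that every city maps to exactly one province. The sketch below is a hedged Python example that runs outside Syntho. The function `hierarchy_violations` and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def hierarchy_violations(rows, child="city", parent="province"):
    """Return each child value that maps to more than one parent value."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child]].add(row[parent])
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}


# Hypothetical rows exported from the sandbox dimension table.
rows = [
    {"city": "TORONTO", "province": "ONTARIO"},
    {"city": "MONTREAL", "province": "QUEBEC"},
    {"city": "TORONTO", "province": "QUEBEC"},  # inconsistent on purpose
]

print(hierarchy_violations(rows))
```

An empty result means each city has a single province, which is what geography drill-downs rely on.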

**Minimal configuration steps**

1. Create one curated entity view (BI-friendly).
2. Prefer **AI synthesize** for the entity table.
3. Apply masking/de-identification for identifiers that remain.

* [Automatic PII discovery with PII scanner](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md)
* [Manage personally identifiable information (PII)](/configure-a-data-generation-job/manage-personally-identifiable-information-pii.md)

<details>

<summary>Optional: BI-friendly dataset shape</summary>

* Prefer one fact table + small dimensions.
* Keep common filters (`country`, `segment`, `channel`).
* Avoid raw free-text unless needed.
* If you want one query-friendly table, create a view first. See [Use SQL views as input tables](/setup-workspaces/create-a-workspace/use-sql-views-as-input-tables.md).

</details>
{% endstep %}

{% step %}

#### Handle keys and relationships (relational schemas)

If you publish a **single sandbox table** (no joins), you can skip this step.

If the sandbox needs joins, make FKs explicit. Analysts will join tables in unpredictable ways.

If you do not need joins, flatten into an entity table before synthesis. This reduces privacy risk and simplifies validation.
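
If the sandbox does keep joins, a referential integrity check on the generated tables can catch orphaned foreign keys before analysts do. This is a minimal Python sketch under stated assumptions: the function `orphan_keys`, the table contents, and the column names `customer_id` and `id` are hypothetical.

```python
def orphan_keys(child_rows, parent_rows, fk, pk):
    """Return foreign key values in the child table with no matching parent key."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted(
        {
            row[fk]
            for row in child_rows
            if row[fk] is not None and row[fk] not in parent_keys
        }
    )


# Hypothetical generated tables.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},     # orphan: no customer with id 3
    {"order_id": 12, "customer_id": None},  # NULL foreign keys are ignored
]

print(orphan_keys(orders, customers, fk="customer_id", pk="id"))
```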

* [Key generators](/configure-a-data-generation-job/configure-column-settings/key-generators.md)
  {% endstep %}

{% step %}

#### Validate and sync

Use the [QA report](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/qa-report.md) when available to validate utility and privacy.

Revalidate after each refresh cycle. Sandbox users notice drift quickly.
{% endstep %}

{% step %}

#### Tune generation settings

Tune for interactive performance. Sandboxes are query-heavy.

Apply [Additional privacy controls](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/privacy-controls.md) before widening access to more users.

Use [View and adjust generation settings](/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings.md) when query latency becomes the bottleneck.

**Refresh and rollback strategy (low disruption)**

Avoid breaking dashboards during refreshes:

* Keep two destination schemas: `sandbox_blue` and `sandbox_green`.
* Refresh the inactive schema, validate dashboards, then switch BI connections.
* If something breaks, roll back by switching back to the previous schema.
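
The switch logic above can be sketched in a few lines of Python. This is an illustrative outline, not a Syntho feature: `refresh_and_switch` is a hypothetical helper, and the `refresh` and `validate` callables stand in for whatever job trigger and dashboard checks you use.

```python
def refresh_and_switch(active, refresh, validate,
                       schemas=("sandbox_blue", "sandbox_green")):
    """Refresh the inactive schema and return the schema BI should point at."""
    if active not in schemas:
        raise ValueError(f"unknown schema: {active}")
    inactive = schemas[1] if active == schemas[0] else schemas[0]
    refresh(inactive)  # write new data only into the inactive schema
    if validate(inactive):
        return inactive  # switch BI connections to the refreshed schema
    return active  # validation failed: keep serving the previous schema


# Hypothetical callables: record which schema was refreshed, accept every refresh.
refreshed = []
new_active = refresh_and_switch("sandbox_blue", refreshed.append, lambda s: True)
print(new_active)
```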
  {% endstep %}
  {% endstepper %}

### Common pitfalls & misconfigurations

#### Use-case specific pitfalls

* Publishing sandboxes that still contain sensitive identifiers.
* Using entity tables that are too small for stable results.
* Over-sharing sandbox workspaces.
  * Use roles and data access controls. See [Share a workspace](/setup-workspaces/share-a-workspace.md).

<details>

<summary>General pitfalls</summary>

These pitfalls show up in most projects:

* Running full-scale jobs before a small validation run.
* Skipping workspace validation/sync after schema changes. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).
* Breaking relational integrity (missing PK/FK setup, missing foreign keys, missing virtual foreign keys). Start with [Manage foreign keys](/configure-a-data-generation-job/manage-foreign-keys.md) and [virtual foreign keys](/configure-a-data-generation-job/manage-foreign-keys/add-virtual-foreign-keys/add-virtual-foreign-keys.md).
* Leaving sensitive columns on [**Duplicate**](/configure-a-data-generation-job/configure-column-settings/duplicate.md), or trusting the [PII scan](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md) without reviewing false positives/negatives.
* Overusing [**Consistent mapping**](/configure-a-data-generation-job/configure-column-settings/consistent-mapping.md) (it slows down data generation and increases linkability).

</details>

### Governance, compliance, and automation

#### Use-case specific recommendations

* Use strict roles: many **Readers**, very few **Editors**. Analysts should not change generators.
* Use blue/green refresh for sandboxes. Automate refresh into the inactive schema, validate, then switch.
* Publish a lightweight data dictionary and refresh timestamp with every refresh. Analysts need lineage to trust results.
* Automate drift checks on key aggregates (top segments, null rates, distinct counts). Alert when the sandbox changes materially.
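
A drift check on key aggregates can start small. The Python sketch below compares null rates and distinct counts for one column between two refreshes. The function names `profile` and `drift_alerts`, the thresholds, and the sample rows are assumptions, so tune them to your own data.

```python
def profile(rows, column):
    """Compute the null rate and distinct count for one column."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    null_rate = (len(values) - len(non_null)) / len(values) if values else 0.0
    return {"null_rate": null_rate, "distinct": len(set(non_null))}


def drift_alerts(previous, current, column,
                 max_null_delta=0.05, max_distinct_ratio=0.20):
    """Return the names of the metrics that changed more than the thresholds allow."""
    before, after = profile(previous, column), profile(current, column)
    alerts = []
    if abs(after["null_rate"] - before["null_rate"]) > max_null_delta:
        alerts.append("null_rate")
    base = max(before["distinct"], 1)  # avoid division by zero
    if abs(after["distinct"] - before["distinct"]) / base > max_distinct_ratio:
        alerts.append("distinct_count")
    return alerts


# Hypothetical snapshots of the same sandbox table from two refreshes.
previous = [{"segment": "A"}, {"segment": "B"}, {"segment": "C"}, {"segment": None}]
current = [{"segment": "A"}, {"segment": "A"}, {"segment": None}, {"segment": None}]

print(drift_alerts(previous, current, "segment"))
```

Running a check like this after each refresh, before switching schemas, keeps material changes from reaching dashboards unnoticed.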

<details>

<summary>General recommendations</summary>

Use these recommendations for most workspaces.

#### Ownership and change control

* Assign a single **workspace owner** (data steward / privacy lead / DBA).
* Require a ticket or change request for generator changes.
* Duplicate a workspace before large edits. Keep the previous version as rollback.

#### Access control

* Default to **read-only** access for source connections.
* Restrict **who can view source data** in the UI.
* Use separate workspaces per environment or audience.

#### Automation (baseline)

* Use the [Syntho REST API](/syntho-api/syntho-rest-api.md) to standardize scans and runs.
* Automate data generation, not workspace configuration.
* Keep job logs for failed runs. This reduces back-and-forth during support.

</details>

