# Use Case 10: Data subsetting

Use this use case when you need a smaller, representative dataset that still behaves like production.

### What problem this use case solves

Teams need smaller non-production datasets. They want faster jobs and lower storage costs.

Working with full-size databases can be time-consuming. Large workloads can be infeasible without reducing scope.

### When to choose this use case

Pick this when your non-production dataset is too large to refresh, store, or test against efficiently:

* Full-size copies are too slow or too expensive.
* You need a smaller dataset that still supports key joins.
* You need faster refresh cycles for dev and test.
* You can define deterministic subset criteria.

If you’re unsure, start with an entity-based subset (example: “recent customers”), then follow [Configure subsetting](/subsetting/configure-subsetting.md) and apply [Mask](/configure-a-data-generation-job/configure-column-settings/mask.md) to sensitive columns. Run a [PII scan](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md) on the retained slice to catch forgotten identifiers.
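As a minimal sketch of what “deterministic subset criteria” can mean in practice, the Python snippet below selects customers by region and a pinned reference date, so every run returns the same IDs. The field names, the `EU` region, the 90-day window, and `REFERENCE_DATE` are illustrative assumptions, not Syntho settings.

```python
from datetime import date, timedelta

# Assumption: a pinned reference date, so reruns select the same rows.
REFERENCE_DATE = date(2024, 1, 1)
WINDOW = timedelta(days=90)


def in_subset(customer: dict) -> bool:
    """Return True for EU customers active within the window before REFERENCE_DATE."""
    recent = customer["last_login"] >= REFERENCE_DATE - WINDOW
    return customer["region"] == "EU" and recent


# Hypothetical sample rows.
customers = [
    {"id": 1, "region": "EU", "last_login": date(2023, 12, 15)},
    {"id": 2, "region": "EU", "last_login": date(2023, 6, 1)},
    {"id": 3, "region": "US", "last_login": date(2023, 12, 20)},
]

# Sorting makes the resulting ID list stable and easy to compare between runs.
selected_ids = sorted(c["id"] for c in customers if in_subset(c))
print(selected_ids)
```

Pinning the reference date, instead of reading the current date at run time, is one way to keep repeated extractions comparable.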

### When to avoid this use case

Skip this when you cannot safely slice the dataset.

* You need all tables and all rows (full copy), or you can’t define subset rules without breaking flows. Use [Use Case 1: Application & API Testing](/overview/get-started/use-cases-and-configuration/use-case-1-application-and-api-testing.md).
* You need to increase row counts (upsampling). Use [Use Case 2: Load & Stress](/overview/get-started/use-cases-and-configuration/use-case-2-load-and-stress-testing.md).
* You need strict statistical fidelity for modeling. Use [Use Case 6: ML Model Development](/overview/get-started/use-cases-and-configuration/use-case-6-ml-model-development.md).

### Recommended Syntho configuration

This setup is optimized for **right-sizing non-production datasets**. You reduce runtime and storage. You keep relationships so the subset remains usable.

{% stepper %}
{% step %}

#### Prerequisites

**Checklist**

* [ ] Target size and acceptance criteria defined.
* [ ] Subset criteria are deterministic (cohort, timeframe, region).
* [ ] FK graph is understood (what must be retained to keep joins).
* [ ] PII handling decided for the retained slice.

- Use the [Prerequisites](/overview/get-started/prerequisites.md) checklist.

{% hint style="info" %}
If your Syntho version shows “Coming soon” in the Subsetting UI, treat subsetting as a **pre-step**.

Create a smaller source dataset first (database-native subsetting, extraction, or limiting the scope of tables). Then run de-identification or generation on that reduced dataset.
{% endhint %}

<details>

<summary>Optional: if subsetting is “Coming soon” in your UI</summary>

You can still run the use case by preparing a subset in the database first:

1. Pick one target entity (often `customers`, `patients`, or `accounts`).
2. Extract a deterministic set of IDs (region, cohort, last activity).
3. Copy the entity rows plus linked tables needed for your flows.
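The three steps above could look roughly like this. The sketch uses Python's built-in `sqlite3` module with an in-memory database; the `customers`/`orders` schema, the `subset_` table names, and the `region` rule are hypothetical and stand in for your own database and tooling.

```python
import sqlite3


def build_subset(conn: sqlite3.Connection, region: str) -> None:
    """Copy one region's customers and their linked orders into subset_* tables."""
    conn.executescript("""
        CREATE TABLE subset_customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE subset_orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    """)
    # Steps 1 and 2: target entity plus a deterministic selection rule.
    conn.execute(
        "INSERT INTO subset_customers SELECT id, region FROM customers WHERE region = ?",
        (region,),
    )
    # Step 3: copy only the linked rows that belong to the selected entities.
    conn.execute("""
        INSERT INTO subset_orders
        SELECT o.id, o.customer_id, o.total
        FROM orders o
        JOIN subset_customers c ON c.id = o.customer_id
    """)
    conn.commit()


# Hypothetical source data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         total REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 3, 15.0), (13, 3, 5.0);
""")
build_subset(conn, "EU")

kept_customers = conn.execute("SELECT COUNT(*) FROM subset_customers").fetchone()[0]
kept_orders = conn.execute("SELECT COUNT(*) FROM subset_orders").fetchone()[0]
print(kept_customers, kept_orders)
```

A real schema usually needs the same copy step for every table reachable from the target entity, not just one child table.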

</details>
{% endstep %}

{% step %}

#### Source & destination management

Keep subsetting workspaces separate from full-copy de-identification workspaces.

#### Baseline rules

* Keep the **source stable**. Prefer snapshots or backups.
* Avoid a **live production** source for iterative work.
* Keep the **destination isolated**. Never write into production.
* Keep **schemas aligned** between source, workspace and destination.
* Use **views** when you need only a subset of the original database.

#### Lifecycle rule of thumb

* Keep the source connection when you expect schema changes.
* Remove the source connection when the next run is a long way off.
* Revalidate after schema changes. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).

**Nuances for this use case**

* Your destination size is part of the goal. Prefer a destination that enforces “smaller by design”.
* Validate the FK graph before trusting the subset. Missing FKs cause silent data loss.
* Don’t write subsets into schemas that already contain full tables. You’ll create mixed-scale datasets and confusing joins.
* Don’t forget shared “dimension” (lookup) tables. Decide explicitly how much of each to retain, because following relationships through them can keep most of the database by accident.
* [Create a workspace](/setup-workspaces/create-a-workspace.md)
  {% endstep %}

{% step %}

#### Configure generators

**Workspace initialization mode**

Choose a [workspace mode](/setup-workspaces/create-a-workspace/workspace-modes.md). It applies baseline generator suggestions during workspace creation.

Recommended modes for this use case:

* **De-identify** when the subset will still be production-like and must preserve joins.
* **Mock or mask all** when you’re building a smaller dataset mainly for fast iteration and you don’t need original distributions.

**AI-generated synthesis**

Not usually the first choice for subsetting. Use it only when the subset becomes an **analytics-style entity table** and you want stronger unlinkability.

**Example (subset + synthesize a shareable slice):** subset to “last 90 days orders”, then synthesize a flattened `orders_entity_view` so the smaller dataset can be shared internally without row-level links.

**Rule-based generation**

Use this to enforce subset properties (coverage, deterministic criteria) and to add test-friendly flags. Use [Calculated columns](/configure-a-data-generation-job/configure-column-settings/calculated-columns.md) for deterministic labeling.

**Example (deterministic cohort label):** tag rows so teams can validate “this subset still matches our criteria”.

```excel-formula
// New column: subset_cohort (based on recency)
IF([last_login_date] >= DATEADD(TODAY(), -90, "day"), "RECENT_ACTIVE", "OLDER")
```

**Masking**

This is the common path: keep the smaller dataset production-like, but remove identifiers while preserving joins.

**Example (subset stays relational):** de-identify `customers` + linked tables, mask `email` and `phone`, and enable consistent mapping for `customer_id` so foreign keys remain valid in the reduced dataset.
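To illustrate why consistent mapping keeps foreign keys valid, here is a small Python sketch of the underlying property: the same input always maps to the same output, so parent and child rows still match after masking. This is not how Syntho implements consistent mapping; the HMAC approach, `SECRET_KEY`, `map_key`, and the sample rows are assumptions made for the example.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-only-key"  # Hypothetical key; never hard-code real secrets.


def map_key(value: int) -> str:
    """Map a key to a stable pseudonym: equal inputs give equal outputs."""
    digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


# Hypothetical subset rows.
customers = [
    {"customer_id": 1, "email": "ann@example.com"},
    {"customer_id": 2, "email": "bob@example.com"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 2},
    {"order_id": 12, "customer_id": 1},
]

# Mask identifiers, and map the join key the same way in both tables.
masked_customers = [
    {"customer_id": map_key(c["customer_id"]), "email": "masked@example.invalid"}
    for c in customers
]
masked_orders = [
    {"order_id": o["order_id"], "customer_id": map_key(o["customer_id"])}
    for o in orders
]

# Every masked order should still point at a masked customer.
parent_keys = {c["customer_id"] for c in masked_customers}
orphans = [o for o in masked_orders if o["customer_id"] not in parent_keys]
print(len(orphans))
```

If the key were masked independently per table, the last check would be expected to report orphaned orders.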

**Hybrid**

Use this when you want a relational subset for testing, plus an analysis-friendly table for convenience.

**Example (two outputs, two purposes):**

1. Keep the relational subset for app testing (de-identify + consistent mapping for join keys).
2. Build a flattened `subset_summary_view` for analysis.
3. Synthesize the flattened view with AI-generated synthesis for stronger unlinkability (single-table sweet spot).

If you need a stable “subset label” in the flattened view, add it with a calculated column:

```excel-formula
// New column: subset_label
IF([order_date] >= DATEADD(TODAY(), -30, "day"), "RECENT_30D", "HISTORIC")
```

**Minimal configuration steps**

1. Define the subset criteria (entity + selection rule).
2. Keep required linked tables to preserve joins.
3. Apply de-identification/masking to the retained slice.
4. Validate joins with real application queries.

* [Automatic PII discovery with PII scanner](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md)
* [Manage personally identifiable information (PII)](/configure-a-data-generation-job/manage-personally-identifiable-information-pii.md)
  {% endstep %}

{% step %}

#### Handle keys and relationships (relational schemas)

If your reduced dataset is **one table only**, you can skip this step.

Subsets fail on missing relationships. Fix foreign keys before you trust the slice.

Use [foreign key inheritance](/configure-a-data-generation-job/manage-foreign-keys/foreign-key-inheritance.md). Add [virtual foreign keys](/configure-a-data-generation-job/manage-foreign-keys/add-virtual-foreign-keys/add-virtual-foreign-keys.md) when the database doesn’t define them. This ensures linked tables are pulled correctly.
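One way to spot missing relationships outside the platform is to count child rows whose parent is absent from the subset. The sketch below uses Python's built-in `sqlite3` module; `count_orphans` and the `orders`/`order_items` tables are hypothetical, and the example assumes `order_items.order_id` has no declared foreign key, so it was copied without filtering.

```python
import sqlite3


def count_orphans(conn: sqlite3.Connection, child: str, fk_column: str,
                  parent: str, pk_column: str = "id") -> int:
    """Count child rows whose foreign key value has no matching parent row."""
    # Identifiers are interpolated into the SQL text: pass trusted names only.
    query = (
        f"SELECT COUNT(*) FROM {child} c "
        f"LEFT JOIN {parent} p ON p.{pk_column} = c.{fk_column} "
        f"WHERE c.{fk_column} IS NOT NULL AND p.{pk_column} IS NULL"
    )
    return conn.execute(query).fetchone()[0]


# Hypothetical subset: order 11 was not retained, but its item was copied.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (10, 1);
    INSERT INTO order_items VALUES (100, 10), (101, 11);
""")

orphaned_items = count_orphans(conn, "order_items", "order_id", "orders")
print(orphaned_items)
```

A non-zero count points to a relationship the subset did not follow, which is a candidate for a virtual foreign key.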

* [Verify foreign keys](/subsetting/verify-foreign-keys.md)
* [Key generators](/configure-a-data-generation-job/configure-column-settings/key-generators.md)
  {% endstep %}

{% step %}

#### Validate and sync

Validate the subset with real application queries. Confirm that joins return expected results.

If you iterate on the schema or FK graph, re-run validation in [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md) before the next extraction.
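Validation with an application query can be sketched as running the same query against the source and the subset for entities that were retained, then comparing the results. The Python example below uses the built-in `sqlite3` module; `APP_QUERY`, `make_db`, `mismatches`, and the schema are hypothetical. Comparing sums only makes sense for columns that are not masked; after de-identification, compare counts and result shape instead.

```python
import sqlite3

# Hypothetical application query: order count and revenue for one customer.
APP_QUERY = """
    SELECT c.id, COUNT(o.id) AS order_count, COALESCE(SUM(o.total), 0) AS revenue
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    WHERE c.id = ?
    GROUP BY c.id
"""


def make_db(customer_ids, orders):
    """Build an in-memory database with the given customers and orders."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?)", [(c,) for c in customer_ids])
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", orders)
    return conn


def mismatches(source, subset, retained_ids):
    """Return retained IDs for which the query result differs between databases."""
    return [
        cid for cid in retained_ids
        if source.execute(APP_QUERY, (cid,)).fetchall()
        != subset.execute(APP_QUERY, (cid,)).fetchall()
    ]


source = make_db([1, 2, 3], [(10, 1, 25.0), (11, 2, 40.0), (12, 3, 15.0), (13, 3, 5.0)])
good_subset = make_db([1, 3], [(10, 1, 25.0), (12, 3, 15.0), (13, 3, 5.0)])
bad_subset = make_db([1, 3], [(10, 1, 25.0), (12, 3, 15.0)])  # order 13 is missing

print(mismatches(source, good_subset, [1, 3]))
print(mismatches(source, bad_subset, [1, 3]))
```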
{% endstep %}

{% step %}

#### Tune generation settings

Right-size early and iterate fast. This is the main ROI of subsetting.

Use [View and adjust generation settings](/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings.md) and [Large workloads](/overview/get-started/syntho-bootcamp/9.-large-workloads.md) tuning when your subset job becomes the bottleneck.

* [View and adjust generation settings](/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings.md)
* [Large workloads](/overview/get-started/syntho-bootcamp/9.-large-workloads.md)
  {% endstep %}
  {% endstepper %}

### Common pitfalls & misconfigurations

#### Use-case specific pitfalls

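* Expecting “5% of target table” to equal 5% of the full database. Linked tables are retained according to the FK graph, so the overall reduction can differ considerably from the percentage set on the target table.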
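A small worked example shows how the overall size can drift from the target percentage. All row counts below are invented for illustration: 5% of `customers` is selected, those customers happen to own a larger share of `orders`, and the `products` lookup table is kept in full.

```python
def retained_fraction(tables: dict) -> float:
    """Return kept rows divided by total rows across all tables."""
    total = sum(t["rows"] for t in tables.values())
    kept = sum(t["kept"] for t in tables.values())
    return kept / total


# Hypothetical row counts per table.
tables = {
    "customers": {"rows": 100_000, "kept": 5_000},    # 5% of the target table
    "orders": {"rows": 1_000_000, "kept": 120_000},   # selected customers order more
    "products": {"rows": 20_000, "kept": 20_000},     # lookup table kept in full
}

fraction = retained_fraction(tables)
print(f"{fraction:.1%} of all rows retained")
```

With these numbers, 145,000 of 1,120,000 rows remain, which is roughly 13% of the database rather than 5%.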

<details>

<summary>General pitfalls</summary>

These pitfalls show up in most projects:

* Running full-scale jobs before a small validation run.
* Skipping workspace validation/sync after schema changes. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).
* Breaking relational integrity (missing PK/FK setup, missing foreign keys, missing virtual foreign keys). Start with [Manage foreign keys](/configure-a-data-generation-job/manage-foreign-keys.md) and [virtual foreign keys](/configure-a-data-generation-job/manage-foreign-keys/add-virtual-foreign-keys/add-virtual-foreign-keys.md).
* Leaving sensitive columns on [**Duplicate**](/configure-a-data-generation-job/configure-column-settings/duplicate.md), or trusting the [PII scan](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md) without reviewing false positives/negatives.
* Overusing [**Consistent mapping**](/configure-a-data-generation-job/configure-column-settings/consistent-mapping.md) (it slows down data generation and increases linkability).

</details>

<details>

<summary>Governance, compliance, and automation</summary>

#### Governance, access control, and audit evidence

Keep the workspace configuration as a controlled artifact. Treat it like “test data release”.

#### Recommended roles

* **Workspace Owner**: data steward or privacy lead. Approves generator choices and sharing.
* **Workspace Editor**: data engineer or platform engineer. Implements configuration changes.
* **Workspace Reader**: testers, analysts, or trainees. Can run jobs but should not change rules.

See [Workspace & user management](/overview/get-started/syntho-bootcamp/8.-workspace-and-user-management.md) and [Share a workspace](/setup-workspaces/share-a-workspace.md).

#### Access control checklist

* Use **read-only** access to the **source** database for day-to-day users.
* Restrict **who can view source data** in the UI. Don’t default to broad access.
* Use a **dedicated destination** per environment (`dev`, `test`, `accept`, `sandbox`).
* Keep external recipients in a **separate workspace** with stricter settings.

#### Evidence for auditors (lightweight but useful)

Capture these items per delivery or refresh:

* Workspace name, owner, and intended audience.
* PII scan results and the final list of “PII columns + applied generator type”.
* Any enabled privacy controls (e.g., rare category protection, free-text de-identification scope).
* Validation output and/or QA report (when applicable).
* Approval notes (ticket link, privacy board sign-off, or risk acceptance).

#### Automation and deployment (reference)

You can automate workspace setup, scans, and generation runs via the [Syntho REST API](/syntho-api/syntho-rest-api.md).

</details>

