Use Case 10: Data subsetting

Reduce data footprint while preserving integrity.

Use this use case when you need a smaller, representative dataset that still behaves like production.

What problem this use case solves

Teams need smaller non-production datasets. They want faster jobs and lower storage costs.

Working with full-size databases can be time-consuming. Large workloads can be infeasible without reducing scope.

When to choose this use case

Pick this when your non-prod dataset is too big to handle.

If you’re unsure, start with an entity-based subset (example: “recent customers”), then follow Configure subsetting and apply Mask to sensitive columns.

  • Full-size copies are too slow or too expensive.

  • You need a smaller dataset that still supports key joins.

  • You need faster refresh cycles for dev and test.

  • You can define deterministic subset criteria.

  • Run a PII scan on the retained slice to catch forgotten identifiers.

When to avoid this use case

Skip this when you cannot safely slice the dataset.

This setup is optimized for right-sizing non-production datasets. You reduce runtime and storage. You keep relationships so the subset remains usable.

1

Prerequisites

Checklist

circle-info

If your Syntho version shows “Coming soon” in the Subsetting UI, treat subsetting as a pre-step.

Create a smaller source dataset first (database-native subsetting, extraction, or limiting the scope of tables). Then run de-identification or generation on that reduced dataset.

chevron-rightOptional: if subsetting is “Coming soon” in your UIhashtag

You can still run the use case by preparing a subset in the database first:

  1. Pick one target entity (often customers, patients, or accounts).

  2. Extract a deterministic set of IDs (region, cohort, last activity).

  3. Copy the entity rows plus linked tables needed for your flows.

2

Source & destination management

Keep subsetting workspaces separate from full-copy de-identification workspaces.

Baseline rules

  • Keep the source stable. Prefer snapshots or back-ups.

  • Avoid a live production source for iterative work.

  • Keep the destination isolated. Never write into production.

  • Keep schemas aligned between source, workspace and destination.

  • Use views when you need only a subset of the original database.

Lifecycle rule of thumb

  • Keep the source connection when you expect schema changes.

  • Remove the source connection when you expect a new run only much later.

  • Revalidate after schema changes. Use Validate and synchronize workspace.

Nuances for this use case

  • Your destination size is part of the goal. Prefer a destination that enforces “smaller by design”.

  • Validate the FK graph before trusting the subset. Missing FKs cause silent data loss.

  • Don’t write subsets into schemas that already contain full tables. You’ll create mixed-scale datasets and confusing joins.

  • Don’t forget “dimension” tables. Otherwise you keep most of the database by accident.

3

Configure generators

Workspace initialization mode

Choose a workspace mode. It applies baseline generator suggestions during workspace creation.

Recommended modes for this use case:

  • De-identify when the subset will still be production-like and must preserve joins.

  • Mock or mask all when you’re building a smaller dataset mainly for fast iteration and you don’t need original distributions.

AI-generated synthesis

Not usually the first choice for subsetting. Use it only when the subset becomes an analytics-style entity table and you want stronger unlinkability.

Example (subset + synthesize a shareable slice): subset to “last 90 days orders”, then synthesize a flattened orders_entity_view so the smaller dataset can be shared internally without row-level links.

Rule-based generation

Use this to enforce subset properties (coverage, deterministic criteria) and to add test-friendly flags. Use Calculated columns for deterministic labeling.

Example (deterministic cohort label): tag rows so teams can validate “this subset still matches our criteria”.

Masking

This is the common path: keep the smaller dataset production-like, but remove identifiers while preserving joins.

Example (subset stays relational): de-identify customers + linked tables, mask email and phone, and enable consistent mapping for customer_id so foreign keys remain valid in the reduced dataset.

Hybrid

Use this when you want a relational subset for testing, plus an analysis-friendly table for convenience.

Example (two outputs, two purposes):

  1. Keep the relational subset for app testing (de-identify + consistent mapping for join keys).

  2. Build a flattened subset_summary_view for analysis.

  3. AI synthesize the flattened view for stronger unlinkability (single-table sweet spot).

If you need a stable “subset label” in the flattened view, add it with a calculated column:

Minimal configuration steps

  1. Define the subset criteria (entity + selection rule).

  2. Keep required linked tables to preserve joins.

  3. Apply de-identification/masking to the retained slice.

  4. Validate joins with real application queries.

4

Handle keys and relationships (relational schemas)

If your reduced dataset is one table only, you can skip this step.

Subsets fail on missing relationships. Fix foreign keys before you trust the slice.

Use foreign key inheritance. Add virtual foreign keys when the database doesn’t define them. This ensures linked tables are pulled correctly.

5

Validate and sync

Validate the subset with real application queries. Confirm that joins return expected results.

If you iterate on the schema or FK graph, re-run validation in Validate and synchronize workspace before the next extraction.

6

Tune generation settings

Right-size early and iterate fast. This is the main ROI of subsetting.

Use View and adjust generation settings and Large workloads tuning when your subset job becomes the bottleneck.

Common pitfalls & misconfigurations

Use-case specific pitfalls

  • Expecting “5% of target table” to equal 5% of the full database.

chevron-rightGeneral pitfallshashtag

These pitfalls show up in most projects:

chevron-rightGovernance, compliance, and automationhashtag

Governance, access control, and audit evidence

Keep the workspace configuration as a controlled artifact. Treat it like “test data release”.

  • Workspace Owner: data steward or privacy lead. Approves generator choices and sharing.

  • Workspace Editor: data engineer or platform engineer. Implements configuration changes.

  • Workspace Reader: testers, analysts, or trainees. Can run jobs but should not change rules.

See Workspace & user management and Share a workspace.

Access control checklist

  • Use read-only access to the source database for day-to-day users.

  • Restrict who can view source data in the UI. Don’t default to broad access.

  • Use a dedicated destination per environment (dev, test, accept, sandbox).

  • Keep external recipients in a separate workspace with stricter settings.

Evidence for auditors (lightweight but useful)

Capture these items per delivery or refresh:

  • Workspace name, owner, and intended audience.

  • PII scan results and the final list of “PII columns + applied generator type”.

  • Any enabled privacy controls (e.g., rare category protection, free-text de-identification scope).

  • Validation output and/or QA report (when applicable).

  • Approval notes (ticket link, privacy board sign-off, or risk acceptance).

Automation and deployment (reference)

You can automate workspace setup, scans, and generation runs via the Syntho REST API.

Last updated

Was this helpful?