# Use Case 4: ETL & data pipeline testing

Choose this use case when you need an end-to-end test dataset for pipelines. The focus is correctness across tables and across systems.

### What problem this use case solves

Teams need to validate transformations and integrations. To do that, they need test data with stable keys, intact relationships, and consistent formats.

Classic anonymization can over-generalize values and reduce realism. In relational databases, you also need explicit handling of keys and foreign keys to keep referential integrity.

### When to choose this use case

Pick this when you test data pipelines, not just tables.

If you’re unsure, start with **de-identify** and enable [Consistent mapping](/configure-a-data-generation-job/configure-column-settings/consistent-mapping.md) for join keys. Then run [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).

* You validate ETL/ELT transformations end-to-end.
* Your tests rely on stable joins and key coverage.
* Your pipeline runs on a cadence (CI, nightly, releases).
* Downstream systems validate formats and constraints.

Make joins explicit with [Manage foreign keys](/configure-a-data-generation-job/manage-foreign-keys.md).

### When to avoid this use case

Skip this when data pipelines and ETL testing are not the target.

* You only test one curated table with no joins. Use [Use Case 1: Application & API Testing](/overview/get-started/use-cases-and-configuration/use-case-1-application-and-api-testing.md).
* You mainly need performance and volume testing. Use [Use Case 2: Load & Stress](/overview/get-started/use-cases-and-configuration/use-case-2-load-and-stress-testing.md).
* You validate workflows during a migration between platforms. Use [Use Case 8: Cloud & Data Migration](/overview/get-started/use-cases-and-configuration/use-case-8-cloud-and-data-migration.md).

### Recommended Syntho configuration

This setup is optimized for **join-correct, repeatable pipeline runs**. Your goal is functional correctness across transforms. Broken keys are a test failure.

{% stepper %}
{% step %}

#### Prerequisites

**Checklist**

* [ ] Pipeline checkpoints listed (counts, joins, null rules).
* [ ] Source snapshot fixed for the run.
* [ ] Key columns identified (business keys + join keys).
* [ ] PII handling decided at the right stage (ingest vs curated).

{% hint style="info" %}
Aim for repeatability first. Optimize later.
{% endhint %}

* Use the [Prerequisites](/overview/get-started/prerequisites.md) checklist.
  {% endstep %}

{% step %}

#### Source & destination management

Create one workspace per pipeline (or per environment). This keeps generator changes traceable to pipeline changes.

#### Baseline rules

* Keep the **source stable**. Prefer snapshots or backups.
* Avoid a **live production** source for iterative work.
* Keep the **destination isolated**. Never write into production.
* Keep **schemas aligned** between source, workspace and destination.
* Use **views** when you need only a subset of the original database.

#### Lifecycle rule of thumb

* Keep the source connection when you expect schema changes.
* Remove the source connection when you don't expect another run for a long time.
* Revalidate after schema changes. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).

**Nuances for this use case**

* If you validate row-level equality, freeze the source snapshot. Don’t regenerate from moving extracts.
* If you change a view definition, resync before blaming the pipeline. Otherwise tests fail for the wrong reason.
* Don’t write test inputs into schemas used by prod-like data. Keep a dedicated namespace for pipeline test inputs.
* [Create a workspace](/setup-workspaces/create-a-workspace.md)
  {% endstep %}

{% step %}

#### Configure generators

**Workspace initialization mode**

Choose a [workspace mode](/setup-workspaces/create-a-workspace/workspace-modes.md). It applies baseline generator suggestions during workspace creation.

Recommended modes for this use case:

* **De-identify** when you want stable row-level behavior and predictable outputs for assertions.
* **Mock or mask all** when you want to remove more than just PII but still keep formats stable.
* **From scratch** when you only test a small number of pipeline-relevant tables.

**AI-generated synthesis**

Usually not the default for pipeline testing. It can change row-level behavior, which makes deterministic assertions harder.

**Example (non-assertive smoke runs):** synthesize a single `staging_events_view` to generate a larger, privacy-safe stream and validate pipeline robustness (parsing, scaling). Avoid using it for exact record-level checks.

**Rule-based generation**

Use this for **known-good** and **known-bad** rows. Use [Calculated columns](/configure-a-data-generation-job/configure-column-settings/calculated-columns.md) to inject boundary cases deterministically.

**Example (type-conversion boundary):** add a calculated override that makes `amount` negative for `BAD_ROW_FLAG` rows, as in the formulas below. You can set `amount = 0` for another slice in the same way. Assert that the pipeline rejects or routes these rows correctly.

```excel-formula
// New column: BAD_ROW_FLAG (≈0.5% of rows)
RAND() < 0.005
```

```excel-formula
// Override: amount (negative values trigger validation logic)
IF([BAD_ROW_FLAG], -ABS([amount]), [amount])
```
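
To act on the "assert" part of this example, a test can compare the flagged input rows with what the pipeline accepted and rejected. The sketch below is a minimal Python illustration and not part of Syntho: the `order_id` row identifier and the `accepted` / `rejected` outputs are assumptions about your pipeline. Only `BAD_ROW_FLAG` and `amount` come from the formulas above.

```python
from __future__ import annotations


def check_bad_row_routing(
    generated: list[dict],
    accepted: list[dict],
    rejected: list[dict],
    key: str = "order_id",  # assumed row identifier, not defined on this page
) -> list[str]:
    """Return human-readable failures; an empty list means the check passed."""
    failures = []
    flagged = {row[key] for row in generated if row["BAD_ROW_FLAG"]}
    accepted_keys = {row[key] for row in accepted}
    rejected_keys = {row[key] for row in rejected}

    # Every intentionally bad row must end up in the reject output.
    missing = flagged - rejected_keys
    if missing:
        failures.append(f"flagged rows not rejected: {sorted(missing)}")

    # No flagged row may leak into the accepted output.
    leaked = flagged & accepted_keys
    if leaked:
        failures.append(f"flagged rows accepted: {sorted(leaked)}")

    # Independent of the flag, accepted rows must not carry negative amounts.
    if any(row["amount"] < 0 for row in accepted):
        failures.append("negative amount found in accepted rows")
    return failures
```

Returning a list of failures instead of raising on the first one lets a CI job report every routing problem in a single run.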

**Masking**

Use this when you need **format-preserving replacements** while keeping stable joins across stages and systems.

**Example (stable business keys):** enable **Consistent mapping** for `customer_id` and `order_id`, then mask `email` and `iban` so joins remain stable from ingest → curated, without leaking PII.
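
Consistent mapping keeps joins intact because the same input value always maps to the same output value, in every table. The Python sketch below illustrates that property with a keyed hash. It is an illustration only and not Syntho's implementation; `pseudonymize`, `mask_column`, `joined_pairs`, the secret, and the row shapes are all hypothetical.

```python
from __future__ import annotations

import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Map a value to a stable token: same value and secret give the same token."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_column(rows: list[dict], column: str, secret: bytes) -> list[dict]:
    """Return copies of the rows with one column replaced by its token."""
    return [{**row, column: pseudonymize(str(row[column]), secret)} for row in rows]


def joined_pairs(customers: list[dict], orders: list[dict]) -> int:
    """Count orders whose customer_id matches a customer row."""
    known = {c["customer_id"] for c in customers}
    return sum(1 for o in orders if o["customer_id"] in known)
```

If `joined_pairs` returns the same count before and after `mask_column` is applied to both tables with the same secret, the join survived the replacement.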

**Hybrid**

Use this when you need stable joins (mask/de-identify) plus **rule-driven guarantees** for transformation correctness. It maps to “absolute calculations” from [Example data generation scenarios](/overview/get-started/syntho-bootcamp/example-data-generation-scenarios.md).

**Example (pipeline invariant always holds):** if your pipeline derives `net_amount = gross_amount - tax_amount`, enforce that invariant in the input so you can test transformation drift.

```excel-formula
// Override: net_amount (guaranteed identity)
[gross_amount] - [tax_amount]
```
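
On the test side, the same identity can be checked against both the generated input and the pipeline output. The following Python sketch assumes rows are available as dictionaries with the three column names from the formula above; the function name and the `tolerance` parameter are hypothetical.

```python
from __future__ import annotations

from decimal import Decimal


def find_net_amount_violations(
    rows: list[dict],
    tolerance: Decimal = Decimal("0.00"),
) -> list[dict]:
    """Return rows where net_amount differs from gross_amount - tax_amount."""
    violations = []
    for row in rows:
        # Convert through str so float inputs do not carry binary rounding noise.
        expected = Decimal(str(row["gross_amount"])) - Decimal(str(row["tax_amount"]))
        actual = Decimal(str(row["net_amount"]))
        if abs(actual - expected) > tolerance:
            violations.append(row)
    return violations
```

A violation in the generated input points at the generator configuration. A violation that appears only in the pipeline output, with clean input, points at the transformation.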

**Minimal configuration steps**

1. Run a [PII scan](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md) on the pipeline inputs.
2. Set join keys to stable handling (key generators + consistent mapping where needed).
3. Add calculated-column assertions for the transformations you care about.
4. Validate on a small slice before scaling.

{% hint style="warning" %}
If joins break, you are no longer testing the pipeline. Fix PK/FK first.
{% endhint %}

* [Automatic PII discovery with PII scanner](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md)
* [Manage personally identifiable information (PII)](/configure-a-data-generation-job/manage-personally-identifiable-information-pii.md)

<details>

<summary>Concrete example: creating “known bad” rows for pipeline assertions</summary>

Use a dedicated workspace when you need intentional “bad data”. Example: `etl-negative-cases`.

If [Calculated columns](/configure-a-data-generation-job/configure-column-settings/calculated-columns.md) are available in your version, inject a small percentage of erroneous rows:

```excel-formula
// New column: BAD_ROW_FLAG (≈0.5% of rows)
RAND() < 0.005
```

```excel-formula
// Example override: postal_code (invalid format to test parsing)
IF([BAD_ROW_FLAG], "XX-INVALID", [postal_code])
```

```excel-formula
// Example override: amount (negative to test validation rules)
IF([BAD_ROW_FLAG], -ABS([amount]), [amount])
```

</details>
{% endstep %}

{% step %}

#### Handle keys and relationships (relational schemas)

If your pipeline test is based on a **single curated table** (no joins), you can skip this step.

Make foreign keys explicit before the first full run. Otherwise you test the wrong join behavior.

Use [Manage foreign keys](/configure-a-data-generation-job/manage-foreign-keys.md). Add [virtual foreign keys](/configure-a-data-generation-job/manage-foreign-keys/add-virtual-foreign-keys/add-virtual-foreign-keys.md) where the source schema is incomplete.

* [Key generators](/configure-a-data-generation-job/configure-column-settings/key-generators.md)

If you don’t have FKs in the database, start with the [foreign key scanner](/configure-a-data-generation-job/manage-foreign-keys/add-virtual-foreign-keys/use-foreign-key-scanner.md). Then validate key coverage on real joins.
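
One way to validate key coverage on a real join is to look for orphaned child keys and to measure how many parent keys are referenced at all. This Python sketch assumes both tables are loaded as lists of dictionaries; the function names and the column arguments are hypothetical and not part of Syntho.

```python
from __future__ import annotations


def orphaned_keys(child_rows: list[dict], parent_rows: list[dict], fk: str, pk: str) -> set:
    """Return non-null foreign key values that have no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    child_keys = {row[fk] for row in child_rows if row[fk] is not None}
    return child_keys - parent_keys


def key_coverage(child_rows: list[dict], parent_rows: list[dict], fk: str, pk: str) -> float:
    """Return the share of parent keys referenced by at least one child row."""
    parent_keys = {row[pk] for row in parent_rows}
    if not parent_keys:
        return 0.0
    referenced = {row[fk] for row in child_rows} & parent_keys
    return len(referenced) / len(parent_keys)
```

A non-empty result from `orphaned_keys` suggests a missing or wrong foreign key definition. A very low `key_coverage` suggests the join exists but the test data exercises only a small part of it.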
{% endstep %}

{% step %}

#### Validate and sync

Validate early on a subset. Confirm row counts and join cardinalities at key stages.
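
A join cardinality check can be as small as counting how often each key value occurs on either side of the join. The Python sketch below assumes both sides are lists of dictionaries that share a key column; `join_cardinality` is a hypothetical helper.

```python
from collections import Counter


def join_cardinality(left_rows, right_rows, key):
    """Classify a join as "1:1", "1:N", "N:1", or "N:N" by key multiplicity."""
    # Highest number of rows that share a single key value, per side.
    left_max = max(Counter(row[key] for row in left_rows).values(), default=0)
    right_max = max(Counter(row[key] for row in right_rows).values(), default=0)
    left_side = "1" if left_max <= 1 else "N"
    right_side = "1" if right_max <= 1 else "N"
    return f"{left_side}:{right_side}"
```

If a join that should be `1:N` comes back as `N:N`, the parent key is duplicated and row counts after the join will be inflated.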

Re-run validation whenever schemas change. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md). Pipeline tests depend on schema stability.

* [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md)

Also validate *intermediate outputs* in the pipeline:

* Snapshot row counts per stage (ingest → staging → curated).
* Compare distinct counts of business keys after dedup steps.
* Assert null-rate expectations for critical columns.
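
The three checks above can be sketched as a small profile per stage plus a comparison across a dedup step. This is a minimal Python sketch under assumptions: each stage is available as a list of dictionaries, and `profile_stage`, `check_dedup_step`, and the `max_null_rate` threshold are hypothetical names, not Syntho features.

```python
from __future__ import annotations


def profile_stage(rows: list[dict], key: str, critical_columns: list[str]) -> dict:
    """Snapshot row count, distinct business keys, and null rates for one stage."""
    total = len(rows)
    null_rates = {
        col: (sum(1 for row in rows if row.get(col) is None) / total if total else 0.0)
        for col in critical_columns
    }
    return {
        "row_count": total,
        "distinct_keys": len({row[key] for row in rows}),
        "null_rates": null_rates,
    }


def check_dedup_step(before: dict, after: dict, max_null_rate: float) -> list[str]:
    """Compare two stage profiles around a dedup step; return failures."""
    failures = []
    # After dedup, every business key should appear exactly once.
    if after["row_count"] != after["distinct_keys"]:
        failures.append("duplicates remain after dedup")
    # Dedup should drop duplicate rows, not business keys.
    if after["distinct_keys"] != before["distinct_keys"]:
        failures.append("business keys lost or added during dedup")
    for col, rate in after["null_rates"].items():
        if rate > max_null_rate:
            failures.append(f"null rate for {col} is {rate:.3f}")
    return failures
```

Storing the profile dictionaries per run also gives you a baseline to compare against when a later run fails.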
  {% endstep %}

{% step %}

#### Tune generation settings

Tune for repeatable runtime and stable write behavior. Pipeline tests often run in CI/CD.

Use [View and adjust generation settings](/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings.md) after the join graph is correct.

* [View and adjust generation settings](/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings.md)
  {% endstep %}
  {% endstepper %}

### Common pitfalls & misconfigurations

#### Use case-specific pitfalls

* Testing transformations without including representative edge-case inputs.

<details>

<summary>General pitfalls</summary>

These pitfalls show up in most projects:

* Running full-scale jobs before a small validation run.
* Skipping workspace validation/sync after schema changes. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).
* Breaking relational integrity (missing PK/FK setup or missing virtual foreign keys). Start with [Manage foreign keys](/configure-a-data-generation-job/manage-foreign-keys.md) and [virtual foreign keys](/configure-a-data-generation-job/manage-foreign-keys/add-virtual-foreign-keys/add-virtual-foreign-keys.md).
* Leaving sensitive columns on [**Duplicate**](/configure-a-data-generation-job/configure-column-settings/duplicate.md), or trusting the [PII scan](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md) without reviewing false positives/negatives.
* Overusing [**Consistent mapping**](/configure-a-data-generation-job/configure-column-settings/consistent-mapping.md) (it slows down data generation and increases linkability).

</details>

### Governance, compliance, and automation

#### Use case-specific recommendations

* Align workspace versions with pipeline versions (example: `etl_orders_v3`). Don’t reuse a workspace across major pipeline rewrites.
* Automate pipeline checks against the generated dataset (row counts, PK uniqueness, null-rate expectations, join cardinalities).
* Keep join-key strategy explicit and reviewed (which keys are duplicated/hashed/generated). Document it with the pipeline test plan.
* If you inject “known bad rows”, keep the flags and percentages stable across runs. Otherwise tests become flaky.
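
To illustrate the last point: a purely random flag may select different rows on each run, depending on how the generator is seeded, whereas a flag derived from the row's key selects the same rows every time. The Python sketch below shows the idea with a hash-based selection. It is not a Syntho calculated-column formula, and `is_bad_row`, the salt value, and the key format are hypothetical.

```python
import hashlib


def is_bad_row(key: str, rate: float = 0.005, salt: str = "etl-negative-cases") -> bool:
    """Deterministically flag roughly `rate` of all keys as bad rows."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Changing the salt picks a different but equally stable set of rows, which is useful when a new pipeline version needs fresh negative cases.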

<details>

<summary>General recommendations</summary>

Use these recommendations for most workspaces.

#### Ownership and change control

* Assign a single **workspace owner** (data steward / privacy lead / DBA).
* Require a ticket or change request for generator changes.
* Duplicate a workspace before large edits. Keep the previous version as rollback.

#### Access control

* Default to **read-only** access for source connections.
* Restrict **who can view source data** in the UI.
* Use separate workspaces per environment or audience.

#### Automation (baseline)

* Use the [Syntho REST API](/syntho-api/syntho-rest-api.md) to standardize scans and runs.
* Automate data generation, not workspace configuration.
* Keep job logs for failed runs. This reduces back-and-forth during support.

</details>


