Use Case 4: ETL & data pipeline testing

End-to-end pipeline testing without using production data.

Use this use case when you need an end-to-end test dataset for pipelines. Focus on correctness across tables and across systems.

What problem this use case solves

Teams need to validate transformations and integrations. They need stable keys, relationships, and formats.

Classic anonymization can over-generalize values and reduce realism. In relational databases, you also need explicit handling of keys and foreign keys to keep referential integrity.

When to choose this use case

Pick this when you test data pipelines, not just tables.

If you’re unsure, start with de-identify and enable Consistent mapping for join keys. Then run Validate and synchronize workspace.

  • You validate ETL/ELT transformations end-to-end.

  • Your tests rely on stable joins and key coverage.

  • Your pipeline runs on a cadence (CI, nightly, releases).

  • Downstream systems validate formats and constraints.

  • Make joins explicit with Manage foreign keys.

When to avoid this use case

Skip this when data pipelines and ETL testing are not the target.

This setup is optimized for join-correct, repeatable pipeline runs. Your goal is functional correctness across transforms. Broken keys are a test failure.

1

Prerequisites

Checklist

circle-info

Aim for repeatability first. Optimize later.

2

Source & destination management

Create one workspace per pipeline (or per environment). This keeps generator changes traceable to pipeline changes.

Baseline rules

  • Keep the source stable. Prefer snapshots or back-ups.

  • Avoid a live production source for iterative work.

  • Keep the destination isolated. Never write into production.

  • Keep schemas aligned between source, workspace and destination.

  • Use views when you need only a subset of the original database.

Lifecycle rule of thumb

  • Keep the source connection when you expect schema changes.

  • Remove the source connection when you expect a new run only much later.

  • Revalidate after schema changes. Use Validate and synchronize workspace.

Nuances for this use case

  • If you validate row-level equality, freeze the source snapshot. Don’t regenerate from moving extracts.

  • If you change a view definition, resync before blaming the pipeline. Otherwise tests fail for the wrong reason.

  • Don’t write test inputs into schemas used by prod-like data. Keep a dedicated namespace for pipeline test inputs.

3

Configure generators

Workspace initialization mode

Choose a workspace mode. It applies baseline generator suggestions during workspace creation.

Recommended modes for this use case:

  • De-identify when you want stable row-level behavior and predictable outputs for assertions.

  • Mock or mask all when you want to remove more than just PII but still keep formats stable.

  • From scratch when you only test a small number of pipeline-relevant tables.

AI-generated synthesis

Usually not the default for pipeline testing. It can change row-level behavior, which makes deterministic assertions harder.

Example (non-assertive smoke runs): synthesize a single staging_events_view to generate a larger, privacy-safe stream and validate pipeline robustness (parsing, scaling). Avoid using it for exact record-level checks.

Rule-based generation

Use this for known-good and known-bad rows. Use Calculated columns to inject boundary cases deterministically.

Example (type-conversion boundary): add a calculated override that sets amount = -1 for BAD_ROW_FLAG rows, and amount = 0 for another slice. Assert the pipeline rejects or routes them correctly.

Masking

Use this when you need format-preserving replacements while keeping stable joins across stages and systems.

Example (stable business keys): enable Consistent mapping for customer_id and order_id, then mask email and iban so joins remain stable from ingest → curated, without leaking PII.

Hybrid

Use this when you need stable joins (mask/de-identify) plus rule-driven guarantees for transformation correctness. It maps to “absolute calculations” from Example data generation scenarios.

Example (pipeline invariant always holds): if your pipeline derives net_amount = gross_amount - tax_amount, enforce that invariant in the input so you can test transformation drift.

Minimal configuration steps

  1. Run a PII scan on the pipeline inputs.

  2. Set join keys to stable handling (key generators + consistent mapping where needed).

  3. Add calculated-column assertions for the transformations you care about.

  4. Validate on a small slice before scaling.

circle-exclamation
chevron-rightConcrete example: creating “known bad” rows for pipeline assertionshashtag

Use a dedicated workspace when you need intentional “bad data”. Example: etl-negative-cases.

If Calculated columns are available in your version, inject a small percentage of erroneous rows:

4

Handle keys and relationships (relational schemas)

If your pipeline test is based on a single curated table (no joins), you can skip this step.

Make foreign keys explicit before the first full run. Otherwise you test the wrong join behavior.

Use Manage foreign keys. Add virtual foreign keys where the source schema is incomplete.

If you don’t have FKs in the database, start with the foreign key scanner. Then validate key coverage on real joins.

5

Validate and sync

Validate early on a subset. Confirm row counts and join cardinalities at key stages.

Re-run validation whenever schemas change. Use Validate and synchronize workspace. Pipeline tests depend on schema stability.

Also validate intermediate outputs in the pipeline:

  • Snapshot row counts per stage (ingest → staging → curated).

  • Compare distinct counts of business keys after dedup steps.

  • Assert null-rate expectations for critical columns.

6

Tune generation settings

Tune for repeatable runtime and stable write behavior. Pipeline tests often run in CI/CD.

Use View and adjust generation settings after the join graph is correct.

Common pitfalls & misconfigurations

Use case-specific pitfalls

  • Testing transformations without including representative edge-case inputs.

chevron-rightGeneral pitfallshashtag

These pitfalls show up in most projects:

Governance, compliance, and automation

Use case-specific recommendations

  • Align workspace versions with pipeline versions (example: etl_orders_v3). Don’t reuse a workspace across major pipeline rewrites.

  • Automate pipeline checks against the generated dataset (row counts, PK uniqueness, null-rate expectations, join cardinalities).

  • Keep join-key strategy explicit and reviewed (which keys are duplicated/hashed/generated). Document it with the pipeline test plan.

  • If you inject “known bad rows”, keep the flags and percentages stable across runs. Otherwise tests become flaky.

chevron-rightGeneral recommendationshashtag

Use these recommendations for most workspaces.

Ownership and change control

  • Assign a single workspace owner (data steward / privacy lead / DBA).

  • Require a ticket or change request for generator changes.

  • Duplicate a workspace before large edits. Keep the previous version as rollback.

Access control

  • Default to read-only access for source connections.

  • Restrict who can view source data in the UI.

  • Use separate workspaces per environment or audience.

Automation (baseline)

  • Use the Syntho REST API to standardize scans and runs.

  • Automate data generation not workspace configuration.

  • Keep job logs for failed runs. This reduces back-and-forth during support.

Last updated

Was this helpful?