Use Case 4: ETL & data pipeline testing
End-to-end pipeline testing without using production data.
Use this use case when you need an end-to-end test dataset for pipelines. Focus on correctness across tables and across systems.
What problem this use case solves
Teams need to validate transformations and integrations. They need stable keys, relationships, and formats.
Classic anonymization can over-generalize values and reduce realism. In relational databases, you also need explicit handling of keys and foreign keys to keep referential integrity.
When to choose this use case
Pick this when you test data pipelines, not just tables.
If you’re unsure, start with de-identify and enable Consistent mapping for join keys. Then run Validate and synchronize workspace.
You validate ETL/ELT transformations end-to-end.
Your tests rely on stable joins and key coverage.
Your pipeline runs on a cadence (CI, nightly, releases).
Downstream systems validate formats and constraints.
Make joins explicit with Manage foreign keys.
When to avoid this use case
Skip this when data pipelines and ETL testing are not the target.
You only test one curated table with no joins. Use Use Case 1: Application & API Testing.
You mainly need performance and volume testing. Use Use Case 2: Load & Stress.
You validate workflows during a migration between platforms. Use Use Case 8: Cloud & Data Migration.
Recommended Syntho configuration
This setup is optimized for join-correct, repeatable pipeline runs. Your goal is functional correctness across transforms. Broken keys are a test failure.
Source & destination management
Create one workspace per pipeline (or per environment). This keeps generator changes traceable to pipeline changes.
Baseline rules
Keep the source stable. Prefer snapshots or back-ups.
Avoid a live production source for iterative work.
Keep the destination isolated. Never write into production.
Keep schemas aligned between source, workspace and destination.
Use views when you need only a subset of the original database.
Lifecycle rule of thumb
Keep the source connection when you expect schema changes.
Remove the source connection when you expect a new run only much later.
Revalidate after schema changes. Use Validate and synchronize workspace.
Nuances for this use case
If you validate row-level equality, freeze the source snapshot. Don’t regenerate from moving extracts.
If you change a view definition, resync before blaming the pipeline. Otherwise tests fail for the wrong reason.
Don’t write test inputs into schemas used by prod-like data. Keep a dedicated namespace for pipeline test inputs.
Configure generators
Workspace initialization mode
Choose a workspace mode. It applies baseline generator suggestions during workspace creation.
Recommended modes for this use case:
De-identify when you want stable row-level behavior and predictable outputs for assertions.
Mock or mask all when you want to remove more than just PII but still keep formats stable.
From scratch when you only test a small number of pipeline-relevant tables.
AI-generated synthesis
Usually not the default for pipeline testing. It can change row-level behavior, which makes deterministic assertions harder.
Example (non-assertive smoke runs): synthesize a single staging_events_view to generate a larger, privacy-safe stream and validate pipeline robustness (parsing, scaling). Avoid using it for exact record-level checks.
Rule-based generation
Use this for known-good and known-bad rows. Use Calculated columns to inject boundary cases deterministically.
Example (type-conversion boundary): add a calculated override that sets amount = -1 for BAD_ROW_FLAG rows, and amount = 0 for another slice. Assert the pipeline rejects or routes them correctly.
Masking
Use this when you need format-preserving replacements while keeping stable joins across stages and systems.
Example (stable business keys): enable Consistent mapping for customer_id and order_id, then mask email and iban so joins remain stable from ingest → curated, without leaking PII.
Hybrid
Use this when you need stable joins (mask/de-identify) plus rule-driven guarantees for transformation correctness. It maps to “absolute calculations” from Example data generation scenarios.
Example (pipeline invariant always holds): if your pipeline derives net_amount = gross_amount - tax_amount, enforce that invariant in the input so you can test transformation drift.
Minimal configuration steps
Run a PII scan on the pipeline inputs.
Set join keys to stable handling (key generators + consistent mapping where needed).
Add calculated-column assertions for the transformations you care about.
Validate on a small slice before scaling.
If joins break, you are no longer testing the pipeline. Fix PK/FK first.
Concrete example: creating “known bad” rows for pipeline assertions
Use a dedicated workspace when you need intentional “bad data”. Example: etl-negative-cases.
If Calculated columns are available in your version, inject a small percentage of erroneous rows:
Handle keys and relationships (relational schemas)
If your pipeline test is based on a single curated table (no joins), you can skip this step.
Make foreign keys explicit before the first full run. Otherwise you test the wrong join behavior.
Use Manage foreign keys. Add virtual foreign keys where the source schema is incomplete.
If you don’t have FKs in the database, start with the foreign key scanner. Then validate key coverage on real joins.
Validate and sync
Validate early on a subset. Confirm row counts and join cardinalities at key stages.
Re-run validation whenever schemas change. Use Validate and synchronize workspace. Pipeline tests depend on schema stability.
Also validate intermediate outputs in the pipeline:
Snapshot row counts per stage (ingest → staging → curated).
Compare distinct counts of business keys after dedup steps.
Assert null-rate expectations for critical columns.
Tune generation settings
Tune for repeatable runtime and stable write behavior. Pipeline tests often run in CI/CD.
Use View and adjust generation settings after the join graph is correct.
Common pitfalls & misconfigurations
Use case-specific pitfalls
Testing transformations without including representative edge-case inputs.
General pitfalls
These pitfalls show up in most projects:
Running full-scale jobs before a small validation run.
Skipping workspace validation/sync after schema changes. Use Validate and synchronize workspace.
Breaking relational integrity (missing PK/FK setup, missing foreign keys, missing virtual foreign keys). Start with Manage foreign keys and virtual foreign keys.
Overusing Consistent mapping (it slows down data generation and increases linkability).
Governance, compliance, and automation
Use case-specific recommendations
Align workspace versions with pipeline versions (example:
etl_orders_v3). Don’t reuse a workspace across major pipeline rewrites.Automate pipeline checks against the generated dataset (row counts, PK uniqueness, null-rate expectations, join cardinalities).
Keep join-key strategy explicit and reviewed (which keys are duplicated/hashed/generated). Document it with the pipeline test plan.
If you inject “known bad rows”, keep the flags and percentages stable across runs. Otherwise tests become flaky.
General recommendations
Use these recommendations for most workspaces.
Ownership and change control
Assign a single workspace owner (data steward / privacy lead / DBA).
Require a ticket or change request for generator changes.
Duplicate a workspace before large edits. Keep the previous version as rollback.
Access control
Default to read-only access for source connections.
Restrict who can view source data in the UI.
Use separate workspaces per environment or audience.
Automation (baseline)
Use the Syntho REST API to standardize scans and runs.
Automate data generation not workspace configuration.
Keep job logs for failed runs. This reduces back-and-forth during support.
Last updated
Was this helpful?

