Use Case 2: Load & stress testing
Generate large volumes and edge cases for performance testing without sensitive production data.
Use this use case when performance testing needs realistic-looking data at scale. This often includes generating more rows than the original dataset.
What problem this use case solves
Teams need to validate performance under peak load. They need predictable schemas and realistic distributions.
Classic anonymization can reduce realism and change distributions. It can also keep direct identifiers unless handled carefully.
When to choose this use case
Pick this when the question is “will it perform at scale?”.
You need more rows than the source has.
You need realistic distributions for throughput and latency tests.
You need repeatable profiles (baseline vs peak vs worst-case).
You want to inject heavy rows to trigger worst-case behavior.
Add edge cases with Calculated columns.
When to avoid this use case
Skip this when correctness is more important than scale.
You need record-level parity, reconciliation, or preserving specific original values. Use Use Case 4: ETL & Data Pipeline Testing.
You need strict multi-table correctness for business logic and testing. Use Use Case 1: Application & API Testing.
You need a smaller dataset, not a larger one. Use Use Case 10: Data Subsetting.
Recommended Syntho configuration
This setup is optimized for volume expansion and repeatable load profiles. You generate more rows than the source. You keep schemas stable so performance results are comparable.
Prerequisites
Use the Prerequisites checklist.
Checklist
Masking does not increase row counts. Use AI synthesis for upsampling.
Source & destination management
Create one workspace per performance profile. Examples: baseline, peak-load. Pick a destination that matches the target platform. Performance issues are platform-specific.
Baseline rules
Keep the source stable. Prefer snapshots or backups.
Avoid a live production source for iterative work.
Keep the destination isolated. Never write into production.
Keep schemas aligned between source, workspace and destination.
Use views when you need only a subset of the original database.
Lifecycle rule of thumb
Keep the source connection when you expect schema changes.
Remove the source connection when you do not expect another run for a long time.
Revalidate after schema changes. Use Validate and synchronize workspace.
Nuances for this use case
Don’t mix profiles in one destination. Use distinct schemas or databases per load test type.
Masking is not upsampling. It does not create new rows.
Heavy indexing and constraints can impact data generation throughput. Disable or minimize them when you measure app bottlenecks.
Configure generators
Workspace initialization mode
Choose a workspace mode. It applies baseline generator suggestions during workspace creation.
Recommended modes for this use case:
De-identify when you mainly need production-like multi-table behavior (joins, constraints) and performance parity.
From scratch when you only load-test a small subset of tables and want to skip generator suggestions.
AI-generated synthesis
Use this when you need more rows while keeping realistic distributions for a single table. This is the primary method for load and stress testing on single tables.
Example (10M event rows): pick events as the entity table, set Rows to generate to 10,000,000, apply AI synthesize on non-key columns, and keep key generation on Generate so oversampling works.
Rule-based generation
Use this when you must inject worst-case rows that impact performance (large payloads, high cardinality keys, null-heavy records). Use Calculated columns to control the rate.
Example (payload spikes): add EDGE_FLAG (≈0.1%), then override request_payload_size_bytes to a large range only when EDGE_FLAG is true. Validate on 100k rows before scaling.
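The payload-spike rule can be prototyped outside the platform before you configure it. The following is a minimal Python sketch of the logic only, not Syntho calculated-column syntax. The function names, the normal payload range (200 to 20,000 bytes) and the spike range (5 MB to 50 MB) are assumptions chosen for illustration.

```python
import random


def make_row(rng: random.Random, edge_rate: float = 0.001) -> dict:
    """Build one illustrative event row with a rare worst-case payload."""
    # EDGE_FLAG is true for roughly edge_rate of all rows (0.1% by default).
    edge_flag = rng.random() < edge_rate
    if edge_flag:
        # Assumed "spike" range: 5 MB to 50 MB.
        payload = rng.randint(5_000_000, 50_000_000)
    else:
        # Assumed "normal" range: 200 bytes to 20 KB.
        payload = rng.randint(200, 20_000)
    return {"EDGE_FLAG": edge_flag, "request_payload_size_bytes": payload}


def observed_edge_rate(rows: list[dict]) -> float:
    """Share of rows where EDGE_FLAG is true."""
    return sum(r["EDGE_FLAG"] for r in rows) / len(rows)


# Validate on 100k rows before scaling, as recommended above.
rng = random.Random(42)  # fixed seed keeps the check repeatable
rows = [make_row(rng) for _ in range(100_000)]
print(f"observed edge rate: {observed_edge_rate(rows):.4%}")
```

If the observed rate on the small run is far from the intended 0.1%, adjust the rule before generating the full dataset.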
Masking
Use this when the workload needs format-valid data for ingestion validators, or when IDs must match a specific shape (UUID, IBAN, codes). It does not increase row count.
Example (format-safe ingestion): mask session_id to UUID format and country_code to an allowed code list so your ingestion pipeline doesn’t reject rows during stress tests.
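To make the format-safe idea concrete, here is a small Python sketch of what such a mapping does to the values. It uses a salted hash as a stand-in technique and is not how Syntho implements masking. The salt, the allow-list and the function names are hypothetical.

```python
import hashlib
import uuid

# Hypothetical allow-list; use the codes your ingestion validator accepts.
ALLOWED_COUNTRY_CODES = ("NL", "DE", "FR", "US")


def mask_session_id(original: str, salt: str = "demo-salt") -> str:
    """Map any input string to a UUID-shaped value, deterministically."""
    digest = hashlib.sha256((salt + original).encode("utf-8")).digest()
    # version=4 sets the version and variant bits so the result is a valid UUID.
    return str(uuid.UUID(bytes=digest[:16], version=4))


def mask_country_code(original: str, salt: str = "demo-salt") -> str:
    """Map any input string to one of the allowed country codes."""
    digest = hashlib.sha256((salt + original).encode("utf-8")).digest()
    return ALLOWED_COUNTRY_CODES[digest[0] % len(ALLOWED_COUNTRY_CODES)]


print(mask_session_id("session-123"))
print(mask_country_code("Netherlands"))
```

Each input row produces exactly one output row, which is why this approach cannot be used for upsampling.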
Hybrid
Use this when you want AI synthesis for volume, plus rule-based extremes to trigger bottlenecks. This is the “background population + edge injection” pattern from Example data generation scenarios.
Example (peak + worst-case):
AI synthesize the entity table to the target row count.
Add EDGE_FLAG and override 1–2 columns that drive worst-case performance.
Keep overrides rare enough to not dominate the benchmark.
Here’s a concrete “hot partition key” pattern that often reveals bottlenecks:
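The sketch below illustrates the general idea in Python. It is an assumption-based illustration, not the platform's own configuration: a fixed share of rows is routed to one key so that a single partition receives disproportionate traffic. The key naming (tenant_0000), the 30% hot share and the 1,000-key space are hypothetical values.

```python
import random
from collections import Counter


def pick_partition_key(
    rng: random.Random,
    n_keys: int = 1000,
    hot_key: str = "tenant_0000",
    hot_share: float = 0.30,
) -> str:
    """Return the hot key for about hot_share of rows, else a uniform cold key."""
    if rng.random() < hot_share:
        return hot_key
    # Cold keys are drawn from 1..n_keys-1, so they never collide with the hot key.
    return f"tenant_{rng.randrange(1, n_keys):04d}"


rng = random.Random(7)  # fixed seed keeps the profile repeatable
keys = [pick_partition_key(rng) for _ in range(100_000)]
counts = Counter(keys)
hot_fraction = counts["tenant_0000"] / len(keys)
print(f"hot key share: {hot_fraction:.2%}")
```

Raising the hot share makes the skew more aggressive. Keep it within what the system could plausibly see, or the benchmark measures an unrealistic workload.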
Minimal configuration steps
Pick an entity table/view.
Set target row count (start with 100,000, then scale).
Apply AI synthesis to non-key columns and keep key generation on Generate.
Inject edge cases with calculated columns only when needed.
Concrete example: configuring a “peak load” profile
Example workspace name: peak-load-v1.
Pick the main workload table (entity table). Example: events.
In Table view, set Rows to generate for events to 10,000,000.
Apply AI synthesize on non-key columns. Keep key generation on Generate so oversampling is supported.
Validate distributions on a small run before scaling (e.g., generate 100,000 rows first).
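One simple way to check the distributions from the small run is to compare category frequencies between the source and the generated sample. The Python sketch below uses total variation distance, which is one possible metric and not one prescribed by the platform. The column values and any pass/fail threshold are assumptions you should set per column.

```python
from collections import Counter


def total_variation_distance(source_values, synthetic_values) -> float:
    """Half the sum of absolute differences between category frequencies.

    0.0 means identical distributions; 1.0 means no overlap at all.
    """
    src = Counter(source_values)
    syn = Counter(synthetic_values)
    n_src, n_syn = sum(src.values()), sum(syn.values())
    categories = set(src) | set(syn)
    # Counter returns 0 for categories missing on one side.
    return 0.5 * sum(abs(src[c] / n_src - syn[c] / n_syn) for c in categories)


# Illustrative values for a hypothetical event_type column.
source = ["click"] * 70 + ["view"] * 25 + ["purchase"] * 5
synthetic = ["click"] * 68 + ["view"] * 28 + ["purchase"] * 4
tvd = total_variation_distance(source, synthetic)
print(f"total variation distance: {tvd:.3f}")
```

A small distance on the validation run suggests the categorical mix survived synthesis. It does not check numeric columns or correlations between columns.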
Handle keys and relationships (relational schemas)
If you generate a single entity table, you can skip this step.
Decide if the performance test truly needs multi-table joins. AI synthesis has known limits for cross-table consistency.
If you need multiple tables, read Cross-table relationships limitations. Prefer a single entity table, or switch to de-identification.
If you still generate relational outputs, plan new PKs and stable FKs.
Tune generation settings
This use case fails on throughput first. Tune write settings before running at full target size.
Use View and adjust generation settings and Large workloads tuning. Reduce batch sizes if you hit parameter-limit write errors.
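A rough upper bound for the batch size can be estimated from the column count. The Python sketch below assumes that a multi-row parameterized insert binds one parameter per column per row, and that the destination enforces a fixed per-statement parameter limit. Whether this maps one-to-one onto the batch size setting is an assumption, so treat the result as a starting point. The function name is hypothetical, and 2,100 is used only as an example limit (the figure commonly cited for SQL Server). Check your destination platform's documentation.

```python
def max_rows_per_batch(column_count: int, parameter_limit: int) -> int:
    """Largest row count per batch that stays within the parameter limit."""
    if column_count <= 0:
        raise ValueError("column_count must be positive")
    if parameter_limit <= 0:
        raise ValueError("parameter_limit must be positive")
    # One bound parameter per column per row; always allow at least one row.
    return max(1, parameter_limit // column_count)


# Example: a 40-column table against a 2,100-parameter limit.
print(max_rows_per_batch(column_count=40, parameter_limit=2100))  # 52
```

Wide tables hit the limit much sooner than narrow ones, so size the batch for the widest table in the job.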
Common pitfalls & misconfigurations
Use case-specific pitfalls
Using masking for upsampling.
Using AI synthesis for complex multi-table consistency requirements. See Cross-table relationships limitations.
Upsampling without validating that rare/edge values are actually present at target rates. Rebalance or inject edge cases using a hybrid approach. See Example data generation scenarios.
Running into write failures on big jobs due to batch sizing. If you hit parameter-limit errors, reduce the write batch size. See View and adjust generation settings.
General pitfalls
These pitfalls show up in most projects:
Running full-scale jobs before a small validation run.
Skipping workspace validation/sync after schema changes. Use Validate and synchronize workspace.
Breaking relational integrity (missing PK/FK setup, missing foreign keys, missing virtual foreign keys). Start with Manage foreign keys and virtual foreign keys.
Overusing Consistent mapping (it slows down data generation and increases linkability).
Governance, compliance, and automation
Use case-specific recommendations
Treat each performance profile as a versioned artifact (baseline_v1, peak_v1).
Record benchmark parameters alongside the dataset (row count, batch sizes, connections, destination platform).
Automate a run: generate dataset → run benchmark → capture metrics + job logs → publish a short run report.
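Recording the benchmark parameters can be as simple as writing a small manifest file next to each dataset. The Python sketch below shows one possible shape. The class name, field names and example values are hypothetical and not part of any Syntho API.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Parameters needed to reproduce and compare a benchmark run."""
    profile: str               # versioned profile name, e.g. "peak_v1"
    row_count: int             # rows generated for the entity table
    write_batch_size: int      # batch size used when writing to the destination
    connections: int           # parallel connections used during the run
    destination_platform: str  # platform the benchmark ran against


def to_json(manifest: RunManifest) -> str:
    """Serialize the manifest with stable key order so diffs stay readable."""
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)


manifest = RunManifest(
    profile="peak_v1",
    row_count=10_000_000,
    write_batch_size=50,
    connections=4,
    destination_platform="postgresql",
)
print(to_json(manifest))
```

Storing this file with the job logs and the run report makes baseline and peak results comparable across versions.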
General recommendations
Use these recommendations for most workspaces.
Ownership and change control
Assign a single workspace owner (data steward / privacy lead / DBA).
Require a ticket or change request for generator changes.
Duplicate a workspace before large edits. Keep the previous version as rollback.
Access control
Default to read-only access for source connections.
Restrict who can view source data in the UI.
Use separate workspaces per environment or audience.
Automation (baseline)
Use the Syntho REST API to standardize scans and runs.
Automate data generation, not workspace configuration.
Keep job logs for failed runs. This reduces back-and-forth during support.

