Use Case 2: Load & stress testing

Generate large volumes and edge cases for performance testing without sensitive production data.

Use this use case when performance testing needs realistic-looking data at scale. This often includes generating more rows than the original dataset.

What problem this use case solves

Teams need to validate performance under peak load. They need predictable schemas and realistic distributions.

Classic anonymization can reduce realism and change distributions. It can also keep direct identifiers unless handled carefully.

When to choose this use case

Pick this when the question is “will it perform at scale?”

  • You need more rows than the source has.

  • You need realistic distributions for throughput and latency tests.

  • You need repeatable profiles (baseline vs peak vs worst-case).

  • You want to inject heavy rows to trigger worst-case behavior.

  • You want to add edge cases with Calculated columns.

When to avoid this use case

Skip this when correctness is more important than scale.

This setup is optimized for volume expansion and repeatable load profiles. You generate more rows than the source. You keep schemas stable so performance results are comparable.

1. Prerequisites

Checklist

2. Source & destination management

Create one workspace per performance profile. Examples: baseline, peak-load. Pick a destination that matches the target platform. Performance issues are platform-specific.

Baseline rules

  • Keep the source stable. Prefer snapshots or backups.

  • Avoid a live production source for iterative work.

  • Keep the destination isolated. Never write into production.

  • Keep schemas aligned between source, workspace and destination.

  • Use views when you need only a subset of the original database.

Lifecycle rule of thumb

  • Keep the source connection when you expect schema changes.

  • Remove the source connection when you do not expect another run for a long time.

  • Revalidate after schema changes. Use Validate and synchronize workspace.

Nuances for this use case

  • Don’t mix profiles in one destination. Use distinct schemas or databases per load test type.

  • Masking is not upsampling. It does not create new rows.

  • Heavy indexing and constraints can impact data generation throughput. Disable or minimize them when you measure app bottlenecks.

3. Configure generators

Workspace initialization mode

Choose a workspace mode. It applies baseline generator suggestions during workspace creation.

Recommended modes for this use case:

  • De-identify when you mainly need production-like multi-table behavior (joins, constraints) and performance parity.

  • From scratch when you only load-test a small subset of tables and want to skip generator suggestions.

AI-generated synthesis

Use this when you need more rows while keeping realistic distributions for a single table. This is the primary method for load and stress testing on single tables.

Example (10M event rows): pick events as the entity table, set Rows to generate to 10,000,000, apply AI synthesize on non-key columns, and keep key generation on Generate so oversampling works.

Rule-based generation

Use this when you must inject worst-case rows that impact performance (large payloads, high cardinality keys, null-heavy records). Use Calculated columns to control the rate.

Example (payload spikes): add EDGE_FLAG (≈0.1%), then override request_payload_size_bytes to a large range only when EDGE_FLAG is true. Validate on 100k rows before scaling.
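
The logic behind this example can be sketched in plain Python. This is not Syntho's Calculated column syntax; it only illustrates the intended behavior. The 0.1% rate and the 100k validation size come from the example above, while the payload ranges, the seed, and the function name are assumptions chosen for illustration.

```python
import random

EDGE_RATE = 0.001  # about 0.1% of rows, as in the example above
NORMAL_PAYLOAD_BYTES = (200, 4_000)            # assumed typical range
EDGE_PAYLOAD_BYTES = (5_000_000, 20_000_000)   # assumed worst-case range


def generate_rows(n_rows, seed=42):
    """Yield rows where a rare EDGE_FLAG switches the payload size range."""
    rng = random.Random(seed)  # fixed seed keeps the profile repeatable
    for _ in range(n_rows):
        edge_flag = rng.random() < EDGE_RATE
        low, high = EDGE_PAYLOAD_BYTES if edge_flag else NORMAL_PAYLOAD_BYTES
        yield {
            "EDGE_FLAG": edge_flag,
            "request_payload_size_bytes": rng.randint(low, high),
        }


# Check the observed rate on a 100k-row sample before scaling up.
sample = list(generate_rows(100_000))
observed_rate = sum(row["EDGE_FLAG"] for row in sample) / len(sample)
print(f"observed edge rate: {observed_rate:.4%}")
```

If the observed rate drifts far from the target on the small run, fix the flag definition before generating the full volume.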

Masking

Use this when the workload needs format-valid data for ingestion validators, or when IDs must match a specific shape (UUID, IBAN, codes). It does not increase row count.

Example (format-safe ingestion): mask session_id to UUID format and country_code to an allowed code list so your ingestion pipeline doesn’t reject rows during stress tests.
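
A quick way to confirm that masked output will pass ingestion is to run a sample through the same kind of checks the pipeline applies. The sketch below is an assumption about what such a validator might look like; the allow-list and the function names are hypothetical, and only `session_id` and `country_code` come from the example above.

```python
import uuid

ALLOWED_COUNTRY_CODES = {"NL", "DE", "FR", "US"}  # assumed allow-list


def is_valid_uuid(value):
    """Return True if value is a UUID string in canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def rejected_rows(rows):
    """Return the rows a validator of this shape would reject."""
    return [
        row for row in rows
        if not is_valid_uuid(row["session_id"])
        or row["country_code"] not in ALLOWED_COUNTRY_CODES
    ]


rows = [
    {"session_id": str(uuid.uuid4()), "country_code": "NL"},
    {"session_id": "not-a-uuid", "country_code": "DE"},
    {"session_id": str(uuid.uuid4()), "country_code": "XX"},
]
print(len(rejected_rows(rows)))  # 2
```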

Hybrid

Use this when you want AI synthesis for volume, plus rule-based extremes to trigger bottlenecks. This is the “background population + edge injection” pattern from Example data generation scenarios.

Example (peak + worst-case):

  1. AI synthesize the entity table to the target row count.

  2. Add EDGE_FLAG and override 1–2 columns that drive worst-case performance.

  3. Keep overrides rare enough to not dominate the benchmark.

Here’s a concrete “hot partition key” pattern that often reveals bottlenecks:
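
A minimal sketch of such a pattern in plain Python, assuming a skewed key distribution is what you want to reproduce. It is not Syntho configuration syntax. The key names, the 5% hot share, and the number of background keys are all hypothetical values to tune for your own workload.

```python
import random
from collections import Counter

HOT_KEY = "tenant_0001"      # hypothetical hot partition key
HOT_SHARE = 0.05             # assumed share of rows routed to the hot key
N_BACKGROUND_KEYS = 10_000   # assumed number of ordinary keys


def partition_key(rng):
    """Return the hot key with probability HOT_SHARE, else a uniform background key."""
    if rng.random() < HOT_SHARE:
        return HOT_KEY
    # Background keys start at 2, so they never collide with the hot key.
    return f"tenant_{rng.randint(2, N_BACKGROUND_KEYS + 1):04d}"


rng = random.Random(7)
keys = [partition_key(rng) for _ in range(100_000)]
counts = Counter(keys)
hot_fraction = counts[HOT_KEY] / len(keys)
print(f"hot key share: {hot_fraction:.1%}, distinct keys: {len(counts)}")
```

With these assumed values, one key receives hundreds of times more rows than any background key, which is the kind of skew that tends to expose partition-level contention.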

Minimal configuration steps

  1. Pick an entity table/view.

  2. Set target row count (start with 100,000, then scale).

  3. Apply AI synthesis to non-key columns and keep key generation on Generate.

  4. Inject edge cases with calculated columns only when needed.

Concrete example: configuring a “peak load” profile

Example workspace name: peak-load-v1.

  1. Pick the main workload table (entity table). Example: events.

  2. In Table view, set Rows to generate for events to 10,000,000.

  3. Apply AI synthesize on non-key columns. Keep key generation on Generate so oversampling is supported.

  4. Validate distributions on a small run before scaling (e.g., generate 100,000 rows first).
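
One way to make the small-run validation in step 4 concrete is to compare deciles of a numeric column between the source and the trial output. The sketch below uses only the Python standard library. The 10% tolerance and the stand-in data are assumptions; in practice you would load the real column values from both sides.

```python
import random
import statistics


def max_decile_gap(source_values, synthetic_values):
    """Return the largest relative gap between matching deciles of two samples."""
    source_q = statistics.quantiles(source_values, n=10)
    synthetic_q = statistics.quantiles(synthetic_values, n=10)
    return max(
        (abs(s - t) / abs(s) for s, t in zip(source_q, synthetic_q) if s != 0),
        default=0.0,
    )


# Stand-in data: replace with one numeric column from the source and the
# same column from the 100,000-row trial output.
rng = random.Random(0)
source_sample = [rng.lognormvariate(6.0, 1.0) for _ in range(20_000)]
synthetic_sample = [rng.lognormvariate(6.0, 1.0) for _ in range(20_000)]

gap = max_decile_gap(source_sample, synthetic_sample)
TOLERANCE = 0.10  # assumed acceptance threshold
print(f"max decile gap: {gap:.2%}, ok: {gap <= TOLERANCE}")
```

Deciles are a coarse check. They catch shifted or compressed distributions cheaply, but they do not replace a review of the columns that drive your query plans.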

4. Handle keys and relationships (relational schemas)

If you generate a single entity table, you can skip this step.

Decide if the performance test truly needs multi-table joins. AI synthesis has known limits for cross-table consistency.

If you need multiple tables, read Cross-table relationships limitations. Prefer a single entity table, or switch to de-identification.

If you still generate relational outputs, plan new PKs and stable FKs.
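
To illustrate what "new PKs and stable FKs" means in practice, the sketch below assigns fresh primary keys to a parent table and rewrites the child foreign keys through the same mapping, so every child still points at the row it originally referenced. This is an illustration only, not a description of how Syntho handles keys internally. The table and column names are hypothetical.

```python
from itertools import count


def remap_keys(parents, children, pk="id", fk="parent_id", start=1):
    """Assign new PKs to parents and rewrite child FKs through the same mapping."""
    new_ids = count(start)
    mapping = {row[pk]: next(new_ids) for row in parents}
    new_parents = [{**row, pk: mapping[row[pk]]} for row in parents]
    # A KeyError here means a child references a parent that was not generated.
    new_children = [{**row, fk: mapping[row[fk]]} for row in children]
    return new_parents, new_children


customers = [{"id": "c-17", "segment": "A"}, {"id": "c-42", "segment": "B"}]
orders = [
    {"order_id": 1, "parent_id": "c-42"},
    {"order_id": 2, "parent_id": "c-17"},
    {"order_id": 3, "parent_id": "c-42"},
]

new_customers, new_orders = remap_keys(customers, orders)
print(new_customers)
print(new_orders)
```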

5. Validate and sync

Validate the configuration before scaling. If you scale first, you may have to correct it later.

6. Tune generation settings

This use case typically hits throughput limits before anything else. Tune write settings before running at full target size.

Use View and adjust generation settings and Large workloads tuning. Reduce batch sizes if you hit parameter-limit write errors.
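
Parameter-limit errors usually come from multi-row inserts that bind one parameter per column value. Under that assumption, a safe upper bound for the batch size is simple arithmetic. The helper below is a hypothetical sketch, and the limit you pass in must come from your destination platform's documentation; 2,100 is used here because it is the documented per-statement limit for SQL Server.

```python
def max_rows_per_batch(parameter_limit, n_columns):
    """Largest row count whose bound parameters fit in one INSERT statement."""
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    # Each row binds one parameter per column, so rows * columns <= limit.
    return max(1, parameter_limit // n_columns)


# Assumed example: a 2,100-parameter limit and a table with 35 columns.
print(max_rows_per_batch(2_100, 35))  # 60
```

Wide tables shrink the safe batch size quickly, so recompute it per table rather than using one global value.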

Common pitfalls & misconfigurations

Use case-specific pitfalls

General pitfalls

These pitfalls show up in most projects:

Governance, compliance, and automation

Use case-specific recommendations

  • Treat each performance profile as a versioned artifact (baseline_v1, peak_v1).

  • Record benchmark parameters alongside the dataset (row count, batch sizes, connections, destination platform).

  • Automate a run: generate dataset → run benchmark → capture metrics + job logs → publish a short run report.
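
The last two recommendations can be combined in a small wrapper that records the benchmark parameters and writes a run report next to them. The sketch below is an assumption about how such a wrapper might look. `generate_dataset` and `run_benchmark` are hypothetical hooks that you supply (for example, a call to your generation job and your load-test tool); no Syntho API calls are shown, and the metric in the usage example is a placeholder, not a measurement.

```python
import json
import tempfile
import time
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class RunManifest:
    profile: str                # e.g. "peak_v1"
    row_count: int
    batch_size: int
    connections: int
    destination_platform: str
    metrics: dict


def run_profile(manifest, generate_dataset, run_benchmark, report_dir):
    """Generate data, run the benchmark, and write a JSON run report."""
    started = time.time()
    generate_dataset(manifest)                   # hypothetical hook
    manifest.metrics = run_benchmark(manifest)   # hypothetical hook returning a dict
    manifest.metrics["wall_clock_seconds"] = round(time.time() - started, 3)
    report_path = Path(report_dir) / f"{manifest.profile}.json"
    report_path.parent.mkdir(parents=True, exist_ok=True)
    report_path.write_text(json.dumps(asdict(manifest), indent=2))
    return report_path


# Usage with stub hooks; the values below are placeholders.
manifest = RunManifest("peak_v1", 10_000_000, 1_000, 8, "postgresql", {})
path = run_profile(
    manifest,
    generate_dataset=lambda m: None,
    run_benchmark=lambda m: {"p95_latency_ms": None},
    report_dir=tempfile.mkdtemp(),
)
print(path.name)  # peak_v1.json
```

Keeping the parameters and the metrics in one file makes two runs of the same profile directly comparable.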

General recommendations

Use these recommendations for most workspaces.

Ownership and change control

  • Assign a single workspace owner (data steward / privacy lead / DBA).

  • Require a ticket or change request for generator changes.

  • Duplicate a workspace before large edits. Keep the previous version as rollback.

Access control

  • Default to read-only access for source connections.

  • Restrict who can view source data in the UI.

  • Use separate workspaces per environment or audience.

Automation (baseline)

  • Use the Syntho REST API to standardize scans and runs.

  • Automate data generation, not workspace configuration.

  • Keep job logs for failed runs. This reduces back-and-forth during support.
