# Use Case 2: Load & stress testing

Use this use case when performance testing needs realistic-looking data at scale. This often includes generating more rows than the original dataset.

### What problem this use case solves

Teams need to validate performance under peak load. They need predictable schemas and realistic distributions.

Classic anonymization can reduce realism and change distributions. It can also keep direct identifiers unless handled carefully.

### When to choose this use case

Pick this when the question is “will it perform at scale?”.

* You need more rows than the source has.
* You need realistic distributions for throughput and latency tests.
* You need repeatable profiles (baseline vs peak vs worst-case).
* You want to inject heavy rows to trigger worst-case behavior.
* You want to add edge cases with [Calculated columns](/configure-a-data-generation-job/configure-column-settings/calculated-columns.md).

### When to avoid this use case

Skip this when correctness is more important than scale.

* You need record-level parity, reconciliation, or preserving specific original values. Use [Use Case 4: ETL & Data Pipeline Testing](/overview/get-started/use-cases-and-configuration/use-case-4-etl-and-data-pipeline-testing.md).
* You need strict multi-table correctness for business logic and testing. Use [Use Case 1: Application & API Testing](/overview/get-started/use-cases-and-configuration/use-case-1-application-and-api-testing.md).
* You need a smaller dataset, not a larger one. Use [Use Case 10: Data Subsetting](/overview/get-started/use-cases-and-configuration/use-case-10-data-subsetting.md).

### Recommended Syntho configuration

This setup is optimized for **volume expansion and repeatable load profiles**. You generate more rows than the source. You keep schemas stable so performance results are comparable.

{% stepper %}
{% step %}

#### Prerequisites

* Use the [Prerequisites](/overview/get-started/prerequisites.md) checklist.

**Checklist**

* [ ] Target scale is defined (rows, GB, or TPS).
* [ ] Benchmark goal is clear (ingest, query, write throughput).
* [ ] PII handling decided before scaling.
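
If the target scale is expressed in GB or TPS, it helps to translate it into a row count before configuring the job. The sketch below shows the arithmetic in Python. It is not part of Syntho. The function names, the average row size, and the rows-per-transaction factor are assumptions you would replace with your own measurements. GB is treated as 1024³ bytes here.

```python
import math


def rows_for_target_size(target_gb, avg_row_bytes):
    """Rows needed to reach target_gb, given an assumed average row size in bytes."""
    if target_gb <= 0 or avg_row_bytes <= 0:
        raise ValueError("target_gb and avg_row_bytes must be positive")
    return math.ceil(target_gb * 1024**3 / avg_row_bytes)


def rows_for_target_tps(tps, duration_seconds, rows_per_transaction=1):
    """Rows needed to sustain tps for duration_seconds (assumed rows per transaction)."""
    if tps <= 0 or duration_seconds <= 0 or rows_per_transaction <= 0:
        raise ValueError("all inputs must be positive")
    return math.ceil(tps * duration_seconds * rows_per_transaction)


# Illustrative inputs only: 1 GB at 1 KiB per row, and 500 TPS for 10 minutes.
print(rows_for_target_size(target_gb=1, avg_row_bytes=1024))
print(rows_for_target_tps(tps=500, duration_seconds=600))
```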

{% hint style="warning" %}
Masking does **not** increase row counts. Use AI synthesis for upsampling.
{% endhint %}
{% endstep %}

{% step %}

#### Source & destination management

Create one workspace per performance profile. Examples: `baseline`, `peak-load`. Pick a destination that matches the target platform. Performance issues are platform-specific.

#### Baseline rules

* Keep the **source stable**. Prefer snapshots or backups.
* Avoid a **live production** source for iterative work.
* Keep the **destination isolated**. Never write into production.
* Keep **schemas aligned** between source, workspace and destination.
* Use **views** when you need only a subset of the original database.

#### Lifecycle rule of thumb

* Keep the source connection when you expect schema changes.
* Remove the source connection when you don't expect another run for a long time.
* Revalidate after schema changes. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).

**Nuances for this use case**

* Don’t mix profiles in one destination. Use distinct schemas or databases per load test type.
* Masking is not upsampling. It does not create new rows.
* Heavy indexing and constraints can impact data generation throughput. Disable or minimize them when you measure app bottlenecks.
  {% endstep %}

{% step %}

#### Configure generators

**Workspace initialization mode**

Choose a [workspace mode](/setup-workspaces/create-a-workspace/workspace-modes.md). It applies baseline generator suggestions during workspace creation.

Recommended modes for this use case:

* **De-identify** when you mainly need production-like multi-table behavior (joins, constraints) and performance parity.
* **From scratch** when you only load-test a small subset of tables and want to skip generator suggestions.

**AI-generated synthesis**

Use this when you need **more rows** while keeping realistic distributions for a single table. This is the primary method for load and stress testing on single tables.

**Example (10M event rows):** pick `events` as the entity table, set **Rows to generate** to `10,000,000`, apply **AI synthesize** on non-key columns, and keep key generation on **Generate** so oversampling works.

**Rule-based generation**

Use this when you must inject **worst-case rows** that impact performance (large payloads, high cardinality keys, null-heavy records). Use [Calculated columns](/configure-a-data-generation-job/configure-column-settings/calculated-columns.md) to control the rate.

**Example (payload spikes):** add `EDGE_FLAG` (≈0.1%), then override `request_payload_size_bytes` to a large range only when `EDGE_FLAG` is true. Validate on 100k rows before scaling.

```excel-formula
// New column: EDGE_FLAG (≈0.1% of rows)
RAND() < 0.001
```

```excel-formula
// Override: request_payload_size_bytes
IF([EDGE_FLAG], RANDBETWEEN(500000, 2000000), [request_payload_size_bytes])
```
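
To support the 100k-row validation run, you can compare the observed `EDGE_FLAG` rate in a sample against the target rate. The following is a minimal Python sketch, not part of Syntho. The function name, the tolerance of four standard errors, and the simulated sample (a stand-in for an exported `EDGE_FLAG` column) are assumptions for illustration.

```python
import math
import random


def edge_rate_within_tolerance(flags, target_rate=0.001, z=4.0):
    """Return (observed_rate, ok) for a sample of boolean edge flags.

    ok is True when the observed rate lies within z standard errors of
    target_rate, using the binomial standard error sqrt(p * (1 - p) / n).
    """
    n = len(flags)
    if n == 0:
        raise ValueError("sample is empty")
    observed = sum(1 for flag in flags if flag) / n
    std_err = math.sqrt(target_rate * (1 - target_rate) / n)
    return observed, abs(observed - target_rate) <= z * std_err


# Simulated 100k-row sample; replace with the real EDGE_FLAG values.
rng = random.Random(42)
sample = [rng.random() < 0.001 for _ in range(100_000)]
observed, ok = edge_rate_within_tolerance(sample)
print(f"observed rate: {observed:.5f}, within tolerance: {ok}")
```

A rate far outside the tolerance on the small run suggests the formula or its threshold needs another look before you scale.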

**Masking**

Use this when the workload needs **format-valid data** for ingestion validators, or when IDs must match a specific shape (UUID, IBAN, codes). It does not increase row count.

**Example (format-safe ingestion):** mask `session_id` to UUID format and `country_code` to an allowed code list so your ingestion pipeline doesn’t reject rows during stress tests.

**Hybrid**

Use this when you want **AI synthesis for volume**, plus **rule-based extremes** to trigger bottlenecks. This is the “background population + edge injection” pattern from [Example data generation scenarios](/overview/get-started/syntho-bootcamp/example-data-generation-scenarios.md).

**Example (peak + worst-case):**

1. AI synthesize the entity table to the target row count.
2. Add `EDGE_FLAG` and override 1–2 columns that drive worst-case performance.
3. Keep overrides rare enough to not dominate the benchmark.

Here’s a concrete “hot partition key” pattern that often reveals bottlenecks:

```excel-formula
// Override: partition_key (create skew/hotspot on purpose)
IF([EDGE_FLAG], "HOT_TENANT", [partition_key])
```
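
To confirm the hotspot is present without dominating the benchmark, you can measure what share of rows the most frequent partition key holds. This is a hedged Python sketch that runs outside Syntho. The function name and the sample keys are hypothetical.

```python
from collections import Counter


def top_key_share(keys):
    """Return (most_common_key, share_of_rows) for a sequence of partition keys."""
    if not keys:
        raise ValueError("no keys provided")
    key, count = Counter(keys).most_common(1)[0]
    return key, count / len(keys)


# Illustrative sample: 3 hot rows among 10 rows in total.
sample_keys = ["HOT_TENANT"] * 3 + ["a", "b", "c", "d", "e", "f", "g"]
print(top_key_share(sample_keys))
```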

**Minimal configuration steps**

1. Pick an entity table/view.
2. Set target row count (start with `100,000`, then scale).
3. Apply AI synthesis to non-key columns and keep key generation on **Generate**.
4. Inject edge cases with calculated columns only when needed.

<details>

<summary>Concrete example: configuring a “peak load” profile</summary>

Example workspace name: `peak-load-v1`.

1. Pick the main workload table (entity table). Example: `events`.
2. In [Table view](/configure-a-data-generation-job/configure-table-settings.md), set **Rows to generate** for `events` to `10,000,000`.
3. Apply **AI synthesize** on non-key columns. Keep key generation on **Generate** so oversampling is supported.
4. Validate distributions on a small run before scaling (e.g., generate `100,000` rows first).

</details>
{% endstep %}

{% step %}

#### Handle keys and relationships (relational schemas)

If you generate a **single entity table**, you can skip this step.

Decide if the performance test truly needs multi-table joins. AI synthesis has known limits for cross-table consistency.

If you need multiple tables, read [Cross-table relationships limitations](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/table-relationships.md). Prefer a single entity table, or switch to de-identification.

If you still generate relational outputs, plan new PKs and stable FKs.

* [Manage foreign keys](/configure-a-data-generation-job/manage-foreign-keys.md)
* [Key generators](/configure-a-data-generation-job/configure-column-settings/key-generators.md)
  {% endstep %}

{% step %}

#### Validate and sync

Validate the configuration before scaling. If you scale first, you may have to correct it later.
{% endstep %}

{% step %}

#### Tune generation settings

In this use case, jobs tend to fail on write throughput first. Tune write settings before running at full target size.

Use [View and adjust generation settings](/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings.md) and [Large workloads](/overview/get-started/syntho-bootcamp/9.-large-workloads.md) tuning. Reduce batch size if you hit parameter-limit write errors.
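
As a rough guide for choosing a smaller batch size, the sketch below estimates how many rows fit in one batch under a parameter limit. It assumes one bound parameter per column per row, which may not match how your destination driver writes data. The function name and the limit of `2100` are illustrative assumptions. Check your destination database's documented limit.

```python
def max_rows_per_batch(column_count, parameter_limit):
    """Largest row count whose bound parameters fit under parameter_limit.

    Assumes one bound parameter per column per row.
    """
    if column_count <= 0 or parameter_limit <= 0:
        raise ValueError("column_count and parameter_limit must be positive")
    rows = parameter_limit // column_count
    if rows == 0:
        raise ValueError("a single row exceeds the parameter limit")
    return rows


# Illustrative values: a 40-column table and a limit of 2100 parameters.
print(max_rows_per_batch(column_count=40, parameter_limit=2100))  # 52
```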
{% endstep %}
{% endstepper %}

### Common pitfalls & misconfigurations

#### Use case-specific pitfalls

* Using masking for upsampling.
* Using AI synthesis for complex multi-table consistency requirements. See [Cross-table relationships limitations](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/table-relationships.md).
* Upsampling without validating that rare/edge values are actually present at target rates. Rebalance or inject edge cases using a hybrid approach. See [Example data generation scenarios](/overview/get-started/syntho-bootcamp/example-data-generation-scenarios.md).
* Running into write failures on big jobs due to batch sizing. If you hit parameter-limit errors, reduce the batch size. See [View and adjust generation settings](/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings.md).
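
One way to check whether rare values survived upsampling is to compare value frequencies between a source sample and the generated output. This is a minimal Python sketch under stated assumptions, not a Syntho feature. The function name, the `min_ratio` threshold, and the sample values are hypothetical.

```python
from collections import Counter


def underrepresented_values(source_values, generated_values, min_ratio=0.5):
    """Return {value: (source_rate, generated_rate)} for values whose
    generated rate is below min_ratio times their source rate."""
    n_src, n_gen = len(source_values), len(generated_values)
    if n_src == 0 or n_gen == 0:
        raise ValueError("both samples must be non-empty")
    src_counts = Counter(source_values)
    gen_counts = Counter(generated_values)
    flagged = {}
    for value, count in src_counts.items():
        src_rate = count / n_src
        gen_rate = gen_counts.get(value, 0) / n_gen
        if gen_rate < min_ratio * src_rate:
            flagged[value] = (src_rate, gen_rate)
    return flagged


# Illustrative data: "timeout" is rare in the source and missing after upsampling.
source = ["ok"] * 98 + ["timeout"] * 2
generated = ["ok"] * 1000
print(underrepresented_values(source, generated))
```

Values returned by the check are candidates for rebalancing or for rule-based injection.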

<details>

<summary>General pitfalls</summary>

These pitfalls show up in most projects:

* Running full-scale jobs before a small validation run.
* Skipping workspace validation/sync after schema changes. Use [Validate and synchronize workspace](/configure-a-data-generation-job/generation-and-validation/validate-and-synchronize-workspace.md).
* Breaking relational integrity (missing PK/FK setup, missing foreign keys, missing virtual foreign keys). Start with [Manage foreign keys](/configure-a-data-generation-job/manage-foreign-keys.md) and [virtual foreign keys](/configure-a-data-generation-job/manage-foreign-keys/add-virtual-foreign-keys/add-virtual-foreign-keys.md).
* Leaving sensitive columns on [**Duplicate**](/configure-a-data-generation-job/configure-column-settings/duplicate.md), or trusting the [PII scan](/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner.md) without reviewing false positives/negatives.
* Overusing [**Consistent mapping**](/configure-a-data-generation-job/configure-column-settings/consistent-mapping.md) (it slows down data generation and increases linkability).

</details>

### Governance, compliance, and automation

#### Use case-specific recommendations

* Treat each performance profile as a versioned artifact (`baseline_v1`, `peak_v1`).
* Record benchmark parameters alongside the dataset (row count, batch size, connections, destination platform).
* Automate a run: generate dataset → run benchmark → capture metrics + job logs → publish a short run report.
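
The recommendations above can be sketched as a small Python wrapper that stores the benchmark parameters with the results. This is an illustration only. `RunManifest`, `run_profile`, and the `generate` and `benchmark` callables are hypothetical hooks you would implement yourself, for example on top of the Syntho REST API and your own benchmark tool. The stub values below are placeholders, not real results.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class RunManifest:
    """Benchmark parameters recorded alongside the dataset."""
    profile: str
    row_count: int
    batch_size: int
    connections: int
    destination_platform: str


def run_profile(manifest, generate, benchmark):
    """Generate the dataset, run the benchmark, and return a JSON run report.

    generate and benchmark are caller-supplied callables (hypothetical hooks).
    generate returns job log lines; benchmark returns a metrics dict.
    """
    started_at = datetime.now(timezone.utc).isoformat()
    job_log = generate(manifest)
    metrics = benchmark(manifest)
    report = {
        "started_at": started_at,
        "parameters": asdict(manifest),
        "metrics": metrics,
        "job_log": job_log,
    }
    return json.dumps(report, indent=2)


# Stub hooks for illustration; replace with real generation and benchmark calls.
manifest = RunManifest("peak_v1", 10_000_000, 1000, 4, "example-platform")
report_json = run_profile(
    manifest,
    generate=lambda m: [f"generated {m.row_count} rows"],
    benchmark=lambda m: {"placeholder_metric": 0},
)
print(report_json)
```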

<details>

<summary>General recommendations</summary>

Use these recommendations for most workspaces.

#### Ownership and change control

* Assign a single **workspace owner** (data steward / privacy lead / DBA).
* Require a ticket or change request for generator changes.
* Duplicate a workspace before large edits. Keep the previous version as rollback.

#### Access control

* Default to **read-only** access for source connections.
* Restrict **who can view source data** in the UI.
* Use separate workspaces per environment or audience.

#### Automation (baseline)

* Use the [Syntho REST API](/syntho-api/syntho-rest-api.md) to standardize scans and runs.
* Automate data generation, not workspace configuration.
* Keep job logs for failed runs. This reduces back-and-forth during support.

</details>

