Large workloads
Speeding up data generation jobs and reducing memory footprint
Working with large databases can significantly impact the performance and success of your synthetic data generation jobs. These tips will help you configure your workspace for large workloads by minimizing memory consumption and optimizing execution speed.
Lowering memory footprint
To reduce memory usage and avoid potential timeouts or job failures, consider these strategies (a brief sketch follows the list):
Decrease parallel connections: Lower the number of concurrent connections to reduce memory usage.
Decrease read and write batch size: Smaller batches consume less memory per operation.
Limit free text PII detection: This is a resource-intensive process. Only enable it when absolutely necessary.
Reduce the number of training rows (AI synthesis only): Limiting the training data size speeds up processing and conserves resources.
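The sketch below is a generic, Syntho-agnostic illustration of the batch-size and parallelism trade-off: it streams tables in fixed-size chunks over a bounded number of database connections so peak memory stays predictable. The connection string, table names, and tuning values are placeholder assumptions; in practice you would set the equivalent options in your workspace configuration.

```python
# Generic sketch (not Syntho's API): stream large tables in small batches over a
# bounded connection pool. All names and values below are placeholder assumptions.
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from sqlalchemy import create_engine

SOURCE_URI = "postgresql://user:password@source-host:5432/source_db"  # placeholder
MAX_PARALLEL_CONNECTIONS = 2   # fewer parallel connections -> lower memory use
READ_BATCH_SIZE = 10_000       # smaller batches -> less memory per operation

engine = create_engine(SOURCE_URI, pool_size=MAX_PARALLEL_CONNECTIONS)

def process_table(table_name: str) -> int:
    """Stream one table in fixed-size chunks instead of loading it all at once."""
    rows = 0
    with engine.connect() as conn:
        for chunk in pd.read_sql_table(table_name, conn, chunksize=READ_BATCH_SIZE):
            rows += len(chunk)  # replace with the actual per-batch work
    return rows

tables = ["customers", "orders", "payments"]  # placeholder table list
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_CONNECTIONS) as pool:
    for name, count in zip(tables, pool.map(process_table, tables)):
        print(f"{name}: processed {count} rows")
```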
Speeding up data generation jobs
To accelerate data generation for large-scale datasets, apply the following optimizations (a Parquet export sketch follows the list):
Increase parallel connections: More connections can speed up data reading and writing through parallel execution.
Enable schema-independent scheduling: By removing constraints in the destination schema, Syntho can parallelize processing based on the number of records instead of schema dependencies.
Write to Parquet instead of a database: Writing directly to a database is often slower. When dealing with very large datasets, consider exporting to efficient columnar file formats like Parquet.
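As a rough illustration of the Parquet route, the sketch below streams a source table in batches and appends each batch to a single Parquet file with pyarrow, instead of issuing inserts against a destination database. The connection string, table name, output path, and batch size are placeholder assumptions.

```python
# Generic sketch: export data to a Parquet file in batches with pyarrow instead
# of writing it into a destination database. Paths, table name, and batch size
# are placeholder assumptions.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from sqlalchemy import create_engine

SOURCE_URI = "postgresql://user:password@source-host:5432/source_db"  # placeholder
OUTPUT_PATH = "orders.parquet"
WRITE_BATCH_SIZE = 50_000

engine = create_engine(SOURCE_URI)
writer = None
with engine.connect() as conn:
    for chunk in pd.read_sql_table("orders", conn, chunksize=WRITE_BATCH_SIZE):
        batch = pa.Table.from_pandas(chunk, preserve_index=False)
        if writer is None:
            # Create the writer lazily so the file schema comes from the first batch.
            writer = pq.ParquetWriter(OUTPUT_PATH, batch.schema)
        writer.write_table(batch)
if writer is not None:
    writer.close()
```

Columnar files avoid per-row insert overhead and constraint checks on the destination, which is typically where very large write jobs slow down.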
Best practice
Always aim to use the minimal viable dataset to validate your configurations before executing large jobs. Scaling up becomes much easier and more stable when you're confident in your setup.
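One way to build such a minimal dataset is to pull a small sample of each table for a dry run before the full job. The sketch below assumes a SQL source reachable via SQLAlchemy; the connection string, table list, and sample size are placeholder assumptions.

```python
# Generic sketch: sample a small slice of each table to validate a configuration
# before running the full job. Connection string, table list, and sample size
# are placeholder assumptions.
import pandas as pd
from sqlalchemy import create_engine

SOURCE_URI = "postgresql://user:password@source-host:5432/source_db"  # placeholder
SAMPLE_ROWS = 1_000

engine = create_engine(SOURCE_URI)
with engine.connect() as conn:
    for table in ["customers", "orders", "payments"]:
        sample = pd.read_sql_query(f"SELECT * FROM {table} LIMIT {SAMPLE_ROWS}", conn)
        print(f"{table}: sampled {len(sample)} rows for a dry-run check")
```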