Frequently asked questions

AI synthesis

Can real data be found in AI-generated synthetic data?

No. Synthetic data does not directly replicate the real data it is based on.

It uses learned patterns and statistical properties to generate new records, independent of the source dataset.

Privacy control mechanisms such as rare category protection and clipping thresholds further reduce the risk of exposing unique values or outliers.

Are there scenarios where similarities might appear between synthetic and real data?

In rare cases, synthetic data may contain values that match those in the original dataset. For example, if several individuals share identical characteristics in the real data, those attributes might also appear in the synthetic dataset. However, this does not pose a privacy risk, as these instances are not linked to specific individuals. Methods like k-anonymity are used to minimize such risks.
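The idea behind k-anonymity is that every combination of quasi-identifying attributes should be shared by at least k records, so no record stands out. A minimal sketch of how such a check works (the records and column names are illustrative, not Syntho's implementation):

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the smallest group size over all quasi-identifier combinations."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())

records = [
    {"age": 34, "zip": "1011", "diagnosis": "A"},
    {"age": 34, "zip": "1011", "diagnosis": "B"},
    {"age": 51, "zip": "2022", "diagnosis": "A"},
]

k = k_anonymity(records, ["age", "zip"])
print(k)  # 1: the (51, "2022") group contains a single record
```

A dataset with k = 1 contains at least one unique quasi-identifier combination; protections such as generalization or suppression raise k before release.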

What measures are in place to prevent exposure of sensitive data?

Syntho implements various privacy-preserving techniques.

See Additional privacy controls for details.

How many training records do I need for AI synthesis?

To minimize privacy risks and improve the training algorithm's generalization ability, as a rule of thumb, a minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
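The rule of thumb above is a simple multiplication; a small sketch:

```python
def min_training_rows(num_columns, ratio=500):
    """Minimum recommended rows per the 1:500 column-to-row rule of thumb."""
    return num_columns * ratio

print(min_training_rows(6))  # 3000 rows for a 6-column table
```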

Does AI synthesis work for multiple tables?

AI synthesis works best on single tables.

For best utility with minimal resources, prepare your data as a single entity table.

If you have sequence data, AI synthesis can work with 2 tables using Syntho’s sequence model.

If you need to anonymize multiple related tables while keeping consistency across tables, Syntho’s other generators are typically more practical.

With AI synthesis, what can I do to improve synthetic data utility?

There are several ways to improve the utility (a.k.a. 'quality' or 'fidelity') of the generated synthetic data. A few options:

  1. Prepare a single entity table. Follow Preparing your data.

  2. Increase Maximum rows used for training. Consider using all rows.

  3. For sequence data, structure your data as 2 related tables. Follow Sequence model.

Deployment

What permissions do I need to deploy Syntho?

Syntho requires an administrative user for installation. If you deploy with Docker, this user needs sudo rights to run docker and docker-compose.

Databases and generation jobs

How should I provision my databases for data generation?

The Syntho platform works with source and destination databases.

  • The source database stores the input data for your data generation job and can be read-only.

  • The destination database must contain empty tables and columns that are structurally identical to those in the source database. It requires write access, as the generated data will be written to it.

The source and destination databases can run on the same server, or be hosted on different servers.
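Provisioning a structurally identical, empty destination database can be scripted by copying only the schema (DDL), not the data. A minimal sketch using the standard-library `sqlite3` module; real deployments would use the target database's own schema tools instead:

```python
import sqlite3

# Source database: schema plus data.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
source.execute("INSERT INTO customers VALUES (1, 'alice')")

# Destination database: replay only the table definitions.
destination = sqlite3.connect(":memory:")
for (ddl,) in source.execute("SELECT sql FROM sqlite_master WHERE type = 'table'"):
    destination.execute(ddl)

rows = destination.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(rows)  # 0: identical structure, no data
```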

How does Syntho handle constraints on my database?

Syntho uses relevant metadata stored in your source database for generating the data that is written into the destination database. For example, foreign key constraints are inferred to reproduce table relationships and referential integrity in the generated database.

Syntho can handle destination databases whose constraints are either enabled or disabled.
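Why this matters: with constraints disabled during a load, tables can be written in any order, and integrity can be verified once all rows are present. A toy illustration with SQLite's foreign-key pragma (illustrative only, not Syntho's write path):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE parents (id INTEGER PRIMARY KEY)")
db.execute(
    "CREATE TABLE children (id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES parents(id))"
)

# With foreign keys disabled, a child row can be written before its parent.
db.execute("PRAGMA foreign_keys = OFF")
db.execute("INSERT INTO children VALUES (1, 10)")
db.execute("INSERT INTO parents VALUES (10)")
db.commit()

# Verify referential integrity after the load.
violations = db.execute("PRAGMA foreign_key_check").fetchall()
print(violations)  # []: integrity holds once both rows are present
```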

I have non-sensitive tables that I still want in my destination database. What should I do?

Non-sensitive tables (for example, definition tables such as language or product tables) can be copied as-is when writing to the destination database.

Use the Duplicate generator for the relevant columns.

For more information, see Table view.

I do not have any data yet. Can I generate data from scratch?

Yes. If you do not have any data yet, you can generate data from scratch to fill an empty database.

You can do this in the Syntho platform as follows:

  1. First, set up an empty source database and destination database with the tables and columns that you want to fill. The source database should be different from the destination database.

  2. Ensure primary and foreign key columns in your source database are correctly configured and already have some key values present.

  3. On the Job configuration panel, drag all relevant tables under Include.

  4. Use the shortcut CTRL + SHIFT + ALT + 0 and set the value under the entry "key_generation_method" to "generate".

  5. For each column, go to Column settings and select Mock as the generator. Note: Calculated columns are currently not supported for generating data from scratch.

  6. Configure the relevant Mock settings.

  7. Select the number of rows to generate under the Table settings.

  8. Finally, select Generate to generate your database.
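Conceptually, mock generation fills each column with plausible random values while key columns stay consistent. A toy sketch of that idea using only the standard library (this is not Syntho's Mock generator; the table and column names are invented for illustration):

```python
import random

random.seed(0)  # reproducible mock data

FIRST_NAMES = ["Alice", "Bob", "Chris", "Dana"]

def mock_rows(n):
    """Generate n mock customer rows with sequential primary keys."""
    return [
        {
            "id": i + 1,                                   # generated key values
            "name": random.choice(FIRST_NAMES),            # mock categorical column
            "balance": round(random.uniform(0, 1000), 2),  # mock numeric column
        }
        for i in range(n)
    ]

rows = mock_rows(5)
print(len(rows), rows[0]["id"])  # 5 1
```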

Can I use Syntho for data de-identification, masking, or pseudonymization?

Yes. Syntho also allows you to mask or de-identify your most sensitive columns. The following section describes how:

Database de-identification

Performance and scaling

How can I speed up my data generation jobs?

The Syntho platform and its generative models are optimized to generate data fast and efficiently.

For example, task parallelization is applied throughout the pipeline, from reading source data to writing destination data.

Here are some additional tips to boost the speed of your data generation jobs:

  1. Exclude large tables from the generation process.

  2. Reduce the number of generated rows for large tables.

    • This only works if the table is not referenced by other tables.

  3. For columns with AI synthesis enabled, lower the Maximum rows used for training.

  4. Disable PII obfuscation for columns containing free text.

  5. In some database types, write performance can become a bottleneck when working with larger tables. Syntho offers several options to optimize write performance:

    1. Increase the Number of simultaneous connections to allow parallel writing across multiple tables.

    2. Bypassing database checks (foreign key constraints, indexes, and identity/auto-increment behavior) can significantly reduce write time. This is especially effective with a higher maximum number of connections.

    3. Increasing the Write batch size may modestly improve write speed.

    4. Write generated data directly to Parquet files in Azure Data Lake Storage (ADLS) or Amazon Simple Storage Service (S3) to skip database checks.

    5. Consider synthesizing a representative subset of tables instead of the full dataset. You can prepare these subsets in your source database as (MATERIALIZED) VIEWS, which simplifies preprocessing and reduces storage costs. Syntho supports reading and processing views directly.
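The effect of a larger write batch size can be shown in miniature: inserting rows in batches issues far fewer statements than row-by-row writes. A sketch with the standard-library `sqlite3` module (the table and batch size are illustrative, not Syntho defaults):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

rows = [(i, f"event-{i}") for i in range(10_000)]
batch_size = 1_000  # analogous to a "Write batch size" setting

# One executemany call per batch instead of one INSERT per row.
for start in range(0, len(rows), batch_size):
    db.executemany(
        "INSERT INTO events VALUES (?, ?)", rows[start:start + batch_size]
    )
db.commit()

count = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 10000
```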

How can I optimize memory utilization of my cluster?

The Syntho platform offers several features to optimize memory utilization during data generation. Understanding how these features work can help you manage memory more effectively and prevent issues such as out-of-memory (OOM) errors.

  1. Batch size: The number of data points processed in a single batch. Higher values can speed up generation, but increase memory usage.

  2. Number of simultaneous connections: The number of tables that can be read or written in parallel. Higher values speed up generation, but increase memory usage.

  3. Maximum rows used for training: The number of source rows used to train the model. More rows can improve utility, but increase memory usage.

  4. Ray dashboard: Monitor memory usage, CPU utilization, and other runtime metrics.

  5. OOM error logs: Errors like Workers (tasks/actors) killed due to memory pressure (OOM) mean the cluster ran out of memory. Reduce batch size, reduce parallel connections, or lower training rows.

Data handling

Does Syntho collect or store any data?

Syntho does not collect or store any data.

Syntho runs in the customer’s environment and cannot access customer data.

For more information, see Does Syntho collect any data?.
