Frequently asked questions

What permissions do I need to deploy Syntho?

Syntho requires an administrative user for installation. If you use Docker for the deployment, the user needs sudo rights for docker/docker-compose.

How should I provision my databases for data generation?

The Syntho platform works with source and destination databases.

  • The source database stores the input data for your data generation job and can be read-only.

  • The destination database must contain empty tables and columns that are structurally identical to those in the source database. It requires write access, as the generated data will be written to it.

The source and destination database can run on the same server, or be hosted on different servers.
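For illustration, below is a minimal sketch (outside the Syntho platform) of one way to provision such a destination database, assuming SQLAlchemy and placeholder PostgreSQL connection strings: the source schema is reflected and recreated as empty tables in the destination.

```python
# Minimal sketch (not part of the Syntho platform): create empty destination
# tables that mirror the source schema, using SQLAlchemy reflection.
# Both connection strings are placeholders for your own databases.
from sqlalchemy import create_engine, MetaData

source_engine = create_engine("postgresql://user:pass@source-host/source_db")  # read-only access is enough
destination_engine = create_engine("postgresql://user:pass@dest-host/dest_db")  # needs write access

# Reflect the table, column, and constraint structure of the source database...
metadata = MetaData()
metadata.reflect(bind=source_engine)

# ...and create the same tables, still empty, in the destination database.
metadata.create_all(bind=destination_engine)
```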

How does Syntho handle constraints on my database?

Syntho uses relevant metadata stored in your source database for generating the destination database. For example, foreign key constraints are inferred to reproduce table relationships in the generated database.

Syntho is capable of handling scenarios where the constraints of the destination database are either enabled or disabled.
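For illustration, the foreign key metadata referred to above can be inspected directly in your source database. The sketch below uses SQLAlchemy's inspector with a placeholder connection string; it only illustrates the kind of metadata involved and is not how Syntho implements this internally.

```python
# Minimal sketch: list the foreign key constraints stored as metadata in a
# source database. The connection string is a placeholder.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@source-host/source_db")
inspector = inspect(engine)

for table in inspector.get_table_names():
    for fk in inspector.get_foreign_keys(table):
        print(table, fk["constrained_columns"], "->",
              fk["referred_table"], fk["referred_columns"])
```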

I have some non-sensitive tables that I still wish to include in my destination database. What do I do with those?

Non-sensitive tables (e.g. 'definition tables' such as language or product tables) can be copied over as-is when writing the data to the destination database. This can be done on the Job configuration panel by marking the table as De-identify.

For more information, visit: Configure table settings > Table modes.

How many training records do I need for AI-powered generation?

To minimize privacy risks and improve the training algorithm's generalization ability, as a rule of thumb, a minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
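For reference, the rule of thumb boils down to a simple calculation; a minimal sketch:

```python
# Minimal sketch: the 1:500 column-to-row rule of thumb as a calculation.
def minimum_training_rows(num_columns: int, ratio: int = 500) -> int:
    return num_columns * ratio

print(minimum_training_rows(6))  # 3000 rows for a table with 6 columns
```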

I do not have any data yet. Can I also generate data from scratch?

Yes. In some scenarios, where you do not have any data yet, you may want to generate data from scratch to fill your empty database. This can be done using the Syntho platform in the following way:

  1. First, set up an empty source database and an empty destination database with the tables and columns that you want to fill. The source database should be different from the destination database.

  2. Ensure primary and foreign key columns in your source database are correctly configured and already contain some key values (see the sketch after this list).

  3. On the Job configuration panel, drag all tables under De-identify.

  4. Use the shortcut CTRL + SHIFT + ALT + 0 and set the value under the entry "key_generation_method" to "generate".

  5. For each column, go to the Column settings and select Mocker or Calculated Column as the generator.

  6. Configure the relevant Mocker and Calculated Columns settings.

  7. Select the number of rows to generate under the Table settings.

  8. Finally, select Generate to generate your database.
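As referenced in step 2, below is a minimal sketch of seeding an empty source table with a handful of key values, using SQLAlchemy with a hypothetical customers table and a placeholder connection string.

```python
# Minimal sketch (hypothetical table and placeholder connection string):
# seed an empty source table with a handful of primary key values so that
# key generation has something to start from (step 2 above).
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@source-host/source_db")

with engine.begin() as connection:
    for customer_id in range(1, 11):
        connection.execute(
            text("INSERT INTO customers (customer_id) VALUES (:id)"),
            {"id": customer_id},
        )
```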

What can I do to improve the synthetic data utility?

There are several ways to improve the utility (a.k.a. 'quality' or 'fidelity') of the generated synthetic data. Possible options include:

  1. Ensure your data is prepared according to the data preparation requirements described in the Preparing your data section.

  2. Increase the value for the Maximum rows used for training parameter, and consider including all rows. If the maximum number of rows used for training is set at a lower value than the total number of input rows, you can enable the Take random sample parameter to get a more representative subset (a sketch of this idea follows this list).

  3. If you need to synthesize multiple related tables, follow the instructions and limitations in the Synthesize table relationships with entity-table ranking feature.
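For illustration, the sketch below shows the idea behind taking a random sample: rows drawn uniformly at random usually preserve the overall distribution better than the first N rows of a table. It uses pandas with a placeholder file name and sample size, and is not the platform's internal implementation.

```python
# Minimal sketch: what taking a random sample means. Rows drawn uniformly at
# random usually preserve the overall distribution better than the first N rows.
# The file name and sample size are placeholders.
import pandas as pd

df = pd.read_csv("source_table.csv")
training_subset = df.sample(n=min(100_000, len(df)), random_state=42)
print(len(training_subset), "rows selected for training")
```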

Can I also use Syntho for data de-identification or pseudonymization?

Yes. Besides its core functionality of synthesizing data, Syntho also allows you to de-identify your most sensitive columns. The following section describes how this can be done:

Use Case: Database de-identification

How can I speed up my data generation jobs?

The Syntho platform and Syntho's generative AI models are optimized to generate your data as fast and efficiently as possible. For example, task parallelization is applied at every step in the data generation process, from reading the source data to writing the destination data.

Here are some additional tips to boost the speed of your data generation jobs:

  1. Exclude any large tables (i.e. tables with many rows and columns) from the generation process.

  2. For columns with AI-powered generation enabled, consider adjusting the advanced generation settings and / or reducing the number of high-cardinality columns (i.e. columns with many distinct values) in your tables.

  3. When working with multiple tables, you can increase the Number of simultaneous connections to allow parallel writing of generated data.

  4. Syntho is limited by the query and write performance of the database it is connected to. In particular, database writing speeds can have a significant impact for some database types as tables grow larger. Options to mitigate writing-speed limitations include:

    1. Consider writing the generated data to (Parquet) files in Azure Data Lake Storage or Amazon Simple Storage Service (S3), as shown in the sketch after this list.

    2. Increase the Write batch size; this can have a slight positive impact on writing speeds.

    3. Consider taking a representative subset of those larger tables before synthesizing.
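For illustration of the first mitigation option above, below is a minimal sketch of writing a table to Parquet on object storage with pandas. It assumes the pyarrow and s3fs packages (or adlfs for Azure Data Lake Storage) and uses a placeholder bucket path and example data.

```python
# Minimal sketch: write a generated table to Parquet on object storage instead
# of a database. Requires the pyarrow and s3fs packages (or adlfs for Azure
# Data Lake Storage); the bucket path and data are placeholders.
import pandas as pd

generated_table = pd.DataFrame(
    {"customer_id": [1, 2, 3], "country": ["NL", "DE", "BE"]}
)
generated_table.to_parquet("s3://my-bucket/generated/customers.parquet", index=False)
```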

How can I optimize memory utilization of my cluster?

The Syntho platform offers several features to optimize memory utilization during data generation. Understanding how these features work can help you manage memory more effectively and prevent issues such as out-of-memory (OOM) errors.

  1. Batch Size: The number of data points processed in a single batch. A higher batch size can increase the speed of data generation but requires more memory. Adjust the batch size based on the available memory in your cluster to find the optimal balance between performance and memory usage (see the sketch after this list).

  2. Parallel Connections (N): This parameter controls the number of tables that can be read or written in parallel. Increasing N allows multiple tables to be processed simultaneously, which can speed up data generation. However, it also increases memory usage, so adjust this parameter according to the available memory and the complexity of your database schema (considering potential foreign key relationships).

  3. Training Rows (N): The number of rows from the source data used to train the generative model. Using more rows can improve the quality of the synthesized data but will require more memory. Monitor the memory usage and adjust N to avoid exceeding memory limits.

  4. Ray Dashboard: Allows you to observe real-time memory usage, CPU utilization, and other resource metrics. Regularly monitor the Ray dashboard to track memory consumption and make necessary adjustments to batch size, parallel connections, and training rows.

  5. OOM Error Logs: Errors such as "Workers (tasks/actors) killed due to memory pressure (OOM)" indicate that the cluster has run out of memory. If you encounter OOM errors, reduce the batch size, decrease the number of parallel connections, or lower the number of training rows to mitigate memory pressure.
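For illustration, the sketch below connects to an existing Ray cluster, reads the memory that is currently available, and derives a batch size from a simple heuristic. The bytes-per-row estimate and safety factor are illustrative assumptions, not Syntho defaults.

```python
# Minimal sketch: connect to an existing Ray cluster, read the memory that is
# currently available, and derive a batch size from a simple heuristic.
# The bytes-per-row estimate and safety factor are illustrative assumptions.
import ray

ray.init(address="auto")
memory_bytes = ray.available_resources().get("memory", 0)

estimated_bytes_per_row = 2_000   # rough guess for a wide table
safety_factor = 0.25              # leave most memory free for other work
batch_size = int(memory_bytes * safety_factor / estimated_bytes_per_row)
print(f"Suggested batch size: {batch_size}")
```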
