Frequently asked questions
No, synthetic data does not directly replicate the real data it is based on. Syntho learns the patterns and statistical properties of the original dataset and uses them to generate new data that is independent of it. Additional privacy-preserving techniques are employed to further ensure that sensitive data points, such as unique values or outliers, are not exposed.
In rare cases, synthetic data may contain values that match those in the original dataset. For example, if several individuals share identical characteristics in the real data, those attributes might also appear in the synthetic dataset. However, this does not pose a privacy risk, as these instances are not linked to specific individuals. Methods like K-anonymity are used to minimize such risks.
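To make the k-anonymity idea concrete, here is a minimal Python sketch, not Syntho's implementation, that checks whether every combination of quasi-identifier values occurs at least k times; the column names and the threshold are assumptions for the example:

```python
# Minimal k-anonymity check (illustrative only; not Syntho's implementation).
import pandas as pd

def satisfies_k_anonymity(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical example data: one group has only 2 rows, so k=3 fails.
df = pd.DataFrame({
    "age_bracket": ["20-30", "20-30", "20-30", "30-40", "30-40"],
    "zip_prefix":  ["10", "10", "10", "20", "20"],
})
print(satisfies_k_anonymity(df, ["age_bracket", "zip_prefix"], k=3))  # False
```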
Syntho implements various privacy-preserving techniques; for more information, please see the dedicated documentation section.
Syntho requires an administrative user for installation. If you use Docker for the deployment, the user needs sudo rights for docker/docker-compose.
The Syntho platform works with source and destination databases.
The source database stores the input data for your data generation job and can be read-only.
The destination database must have empty tables and columns, which are structurally identical to the source database. It requires write access, as the generated data will be written to it.
The source and destination databases can run on the same server, or be hosted on different servers.
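To make these requirements concrete, the following hedged sketch, which is not part of the Syntho platform, shows one way you might verify that a destination database contains empty tables that are structurally identical to the source; the connection URLs are placeholders:

```python
# Illustrative pre-flight check: same tables/columns in both databases,
# and all destination tables empty. Connection URLs are placeholders.
from sqlalchemy import create_engine, inspect, text

source = create_engine("postgresql://user:pass@source-host/db")
destination = create_engine("postgresql://user:pass@dest-host/db")
src_insp, dst_insp = inspect(source), inspect(destination)

for table in src_insp.get_table_names():
    src_cols = {c["name"] for c in src_insp.get_columns(table)}
    dst_cols = {c["name"] for c in dst_insp.get_columns(table)}
    assert src_cols == dst_cols, f"Column mismatch in table {table}"

    with destination.connect() as conn:
        count = conn.execute(text(f'SELECT COUNT(*) FROM "{table}"')).scalar()
        assert count == 0, f"Destination table {table} is not empty"
```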
Syntho uses relevant metadata stored in your source database for generating the destination database. For example, foreign key constraints are inferred to reproduce table relationships in the generated database.
Syntho is capable of handling scenarios where the constraints of the destination database are either enabled or disabled.
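As an illustration of the kind of metadata involved, this sketch lists the foreign key relationships of a source database with SQLAlchemy's inspector; Syntho's internal inference is not shown, and the connection URL is a placeholder:

```python
# Listing foreign key constraints, the metadata used to reproduce
# table relationships in the generated database.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@source-host/db")  # placeholder
inspector = inspect(engine)

for table in inspector.get_table_names():
    for fk in inspector.get_foreign_keys(table):
        print(f"{table}.{fk['constrained_columns']} -> "
              f"{fk['referred_table']}.{fk['referred_columns']}")
```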
Non-sensitive tables (e.g. 'definition tables' such as language or product tables) can be copied over as-is when writing the data to the destination database. This can be done on the Job configuration panel by marking the table as De-identify.
To minimize privacy risks and improve the training algorithm's generalization ability, as a rule of thumb, a minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
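The rule of thumb translates directly into a minimum row count, as this small check shows:

```python
# 1:500 column-to-row rule of thumb: minimum recommended rows per table.
def minimum_rows(num_columns: int, ratio: int = 500) -> int:
    return num_columns * ratio

print(minimum_rows(6))  # 3000 rows recommended for a 6-column table
```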
Yes. In some scenarios, where you do not have any data yet, you may want to generate data from scratch to fill your empty database. This can be done using the Syntho platform in the following way:
First, set up an empty source database and an empty destination database with the tables and columns that you want to fill. The source database should be different from the destination database.
Ensure the primary and foreign key columns in your source database are correctly configured and already contain some key values (see the sketch after this list).
On the Job configuration panel, drag all relevant tables under Include.
Use the shortcut CTRL + SHIFT + ALT + 0 and set the value under the entry "key_generation_method" to "generate".
Configure the relevant Mocker settings.
Select the number of rows to generate under the Table settings.
Finally, select Generate to generate your database.
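As an example of the key-seeding step above, here is a hypothetical sketch; the table names, columns, and connection URL are illustrative assumptions, not part of the Syntho platform:

```python
# Hypothetical example: seeding primary and foreign key values in an
# otherwise empty source database. Table/column names are illustrative.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@source-host/db")  # placeholder

with engine.begin() as conn:
    # A handful of primary key values in the parent table...
    for customer_id in range(1, 6):
        conn.execute(text("INSERT INTO customers (id) VALUES (:id)"),
                     {"id": customer_id})
    # ...and matching foreign key values in the child table.
    conn.execute(text("INSERT INTO orders (id, customer_id) VALUES (1, 1), (2, 2)"))
```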
There are several ways to improve the utility (a.k.a. 'quality' or 'fidelity') of the generated synthetic data. Possible options include:
Yes. Besides its core functionality of synthesizing data, Syntho also allows you to de-identify your most sensitive columns. The following section describes how this can be done:
The Syntho platform and Syntho's generative AI models are optimized to generate your data as fast and efficiently as possible. For example, task parallelization is applied at every step in the data generation process, from reading the source data to writing the destination data.
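To illustrate the general idea of task parallelization across tables, here is a minimal generic sketch; process_table is a hypothetical stand-in for reading, generating, and writing a single table, not Syntho's API:

```python
# Generic sketch of parallel per-table processing (not Syntho's internals).
from concurrent.futures import ThreadPoolExecutor

def process_table(table_name: str) -> str:
    # Hypothetical stand-in for read -> generate -> write of one table.
    return f"{table_name} done"

tables = ["customers", "orders", "products"]
max_parallel_connections = 2  # analogous to a parallel-connections setting

with ThreadPoolExecutor(max_workers=max_parallel_connections) as pool:
    for result in pool.map(process_table, tables):
        print(result)
```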
Here are some additional tips to boost the speed of your data generation jobs:
Temporarily disable constraints and indexes on the destination database to bypass database checks and maximize parallel processing, enhancing writing speed; a sketch follows below.
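As an example of what this can look like, here is a hedged PostgreSQL-specific sketch; other database types use different statements, and the table name and connection URL are placeholders:

```python
# PostgreSQL example: disabling triggers (including foreign key checks) on a
# destination table before a bulk load, then re-enabling them afterwards.
# Requires sufficient privileges; table name and URL are placeholders.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@dest-host/db")

with engine.begin() as conn:
    conn.execute(text('ALTER TABLE "orders" DISABLE TRIGGER ALL'))

# ... run the data generation job while checks are off ...

with engine.begin() as conn:
    conn.execute(text('ALTER TABLE "orders" ENABLE TRIGGER ALL'))
```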
Syntho allows you to prevent the database's write performance from becoming a bottleneck, which can be a factor for some database types as tables grow larger. Options to mitigate database write-speed limitations include:
The Syntho platform offers several features to optimize memory utilization during data generation. Understanding how these features work can help you manage memory more effectively and prevent issues such as out-of-memory (OOM) errors.
Syntho does not collect or store any data. The Syntho application will run in the secure environment of the customer. Syntho will be unable to access the platform, and does not collect any data from the application.
For more information, see Does Syntho collect any data?
For more information, visit the relevant documentation page.
For each column, go to the Column settings and select a Mocker as the generator. Note that some generator types are currently not supported for generating data from scratch.
Ensure your data is prepared according to the data preparation requirements described in the corresponding section.
Increase the value for the maximum number of rows used for training, and consider including all rows. If the maximum number of rows used for training is set to a lower value than the total number of input rows, you can enable the corresponding sampling option to get a more representative subset.
If you need to synthesize multiple related tables, follow the instructions and limitations described for that feature.
Exclude any large tables (i.e. tables with many rows and columns) from the generation process.
Generate a subset of rows for larger tables; the Syntho platform will ensure referential integrity remains intact (i.e. subsetting).
When working with multiple tables, increase the number of parallel connections to allow parallel writing of generated data.
For columns with privacy protection enabled, consider adjusting the relevant settings and/or reducing the number of high-cardinality columns (i.e. columns with many distinct values) in your tables.
For free text columns with text generation applied, disable that feature.
Write the generated data to (Parquet) files in cloud object storage or your local file storage (see the sketch after this list).
Increasing the batch size could have a slight impact on writing speeds.
Consider taking a representative sample of those larger tables before synthesizing.
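As a sketch of the file-based option mentioned above, generated data can be written to Parquet files with pandas and pyarrow; the path and data are illustrative:

```python
# Writing generated data to a Parquet file instead of a database.
import os
import pandas as pd

os.makedirs("generated", exist_ok=True)
generated = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
generated.to_parquet("generated/customers.parquet", engine="pyarrow", index=False)
```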
Batch size: The number of data points processed in a single batch. A higher batch size can increase the speed of data generation but requires more memory. Adjust the batch size based on the available memory in your cluster to find the optimal balance between performance and memory usage.
Number of parallel connections (N): This parameter controls the number of tables that can be read or written in parallel. Increasing N allows multiple tables to be processed simultaneously, which can speed up data generation. However, it also increases memory usage, so adjust this parameter according to the available memory and the complexity of your database schema (considering potential foreign key relationships).
Maximum rows used for training (N): The number of rows from the source data used to train the generative model. Using more rows can improve the quality of the synthesized data but will require more memory. Monitor the memory usage and adjust N to avoid exceeding memory limits.
Ray dashboard: Allows you to observe real-time memory usage, CPU utilization, and other resource metrics. Regularly monitor the Ray dashboard to track memory consumption and make necessary adjustments to batch size, parallel connections, and training rows.
Out-of-memory (OOM) errors: Errors such as "Workers (tasks/actors) killed due to memory pressure (OOM)" indicate that the cluster has run out of memory. If you encounter OOM errors, reduce the batch size, decrease the number of parallel connections, or lower the number of training rows to mitigate memory pressure.
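The batch-size trade-off described above can be illustrated with a small generic sketch, not Syntho's internals: only one batch of rows is held in memory at a time, so a larger batch size means fewer iterations at the cost of a higher peak memory footprint.

```python
# Generic batching: peak memory is bounded by batch_size, not dataset size.
def batched(rows, batch_size):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

for batch in batched(range(10_000), batch_size=1_000):
    # Only batch_size rows are in memory here; larger batches run faster
    # but raise the peak memory footprint.
    pass
```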