Introduction to data generators
The Syntho platform offers several data generators for different scenarios. Which one fits best depends on the nature of the data, the privacy requirements, and the specific use case. The summary table below gives an overview of the generators, describing each one and when to use it. Select any data generator to go to the corresponding section of the user guide.
| Data generators | Description | When to use | When not to use |
|---|---|---|---|
| AI-generated synthetic data | Entirely new rows that mimic, but have no 1-to-1 relation with, the original rows. | | |
| Smart de-identification | Smart discovery and protection of the most sensitive data columns (i.e. PII/PHI) in a database. | | |
| Rule-based synthetic data (using Mockers and Calculated Columns) | Generate data from scratch based on user-defined logic and rules. | | |
The features below are key to the smart de-identification and rule-based synthetic data methods.
| Key feature | Description | When to use | When not to use |
|---|---|---|---|
| | Training a generative AI model on the original data to generate new rows that mimic, but have no 1-to-1 relation with, the original rows. | | |
| Mockers | Generating entirely new, user-defined values | For custom data generation without regard to preserving original column value relationships | When you need to maintain relationships with original data |
| Consistent mapping | Generating mock values that are consistently mapped from original values (e.g. Hank always becomes Jeffrey) | To ensure data consistency across tables, systems and data generation jobs | If fully random data, without consistency, is desired |
| Calculated columns | Generating user-defined values based on custom logic | For complex data manipulations requiring specific business logic | For simple data generation tasks that don't need custom logic |
| | Automatic discovery of the most sensitive (i.e. PII/PHI) columns in your database | To discover the most sensitive columns (i.e. PII/PHI) | When your data is not sensitive |
Comparison of data generated with different generators
We demonstrate each generator on a real baseball dataset, which includes a players table and a seasons table.
AI-generated synthetic data is applied to players table
In the first example, the generative AI model generated an entirely new synthetic dataset based on the original dataset. The synthetic dataset preserves the statistics of the original dataset, but there is no 1-to-1 correspondence between synthetic records and original records. Note that for AI-generated synthetic data, a rare category replacement value of 10 was applied. This means that any name appearing fewer than 10 times in the nameFirst and nameLast columns was replaced with an asterisk to protect privacy.
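The rare category replacement rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (values seen fewer than 10 times become an asterisk), not Syntho's implementation. The function name `replace_rare_categories`, its parameters, and the toy data are assumptions made for the example.

```python
from collections import Counter


def replace_rare_categories(values, threshold=10, placeholder="*"):
    """Replace every value that occurs fewer than `threshold` times.

    Hypothetical helper: it mirrors the rule in the text, where names
    appearing fewer than 10 times are replaced with an asterisk.
    """
    counts = Counter(values)
    return [v if counts[v] >= threshold else placeholder for v in values]


# Toy nameFirst column: "John" occurs 12 times, "Zebulon" only twice.
name_first = ["John"] * 12 + ["Zebulon"] * 2
protected = replace_rare_categories(name_first, threshold=10)

print(Counter(protected))
```

In this toy column the frequent name is kept, while the rare name is masked because it would otherwise make individual players easy to single out.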
Mockers are applied to players table
Mockers are applied to specific columns of the players table, highlighted in yellow in the table above: 'country', 'birthDate', 'deathDate', 'nameFirst', and 'nameLast'.
Consistent Mapping with Mockers is applied to players table
If you enable consistent mapping, each original value is always mapped to the same mock value across tables. In this example, consistent mapping is enabled for two columns, "nameFirst" and "nameLast", so that every original first name and last name always receives the same mock name. See the illustrations from MySQL tables below, where mockers with consistent mapping map the name "Bill Kennedy" to "Danielle Olson".
Please note that other names can also be mapped to "Danielle" or "Olson"; however, whenever Syntho detects "Bill", it will always replace it with the mock first name "Danielle". The same applies to "Kennedy" and "Olson" in the last name column. Consistency can be verified using the other columns: they are duplicated unchanged from source to destination, so original and synthetic rows can be matched side by side.
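One common way to get this behaviour is to derive the mock value deterministically from the original value, for example with a keyed hash. The sketch below illustrates that idea and is not a description of how Syntho implements consistent mapping. The name pools, the `secret` parameter, and the function `consistent_mock` are all hypothetical.

```python
import hashlib

# Hypothetical pools of mock values; a real mocker would draw from far larger sets.
MOCK_FIRST_NAMES = ["Danielle", "Jeffrey", "Maria", "Omar", "Lena"]
MOCK_LAST_NAMES = ["Olson", "Garcia", "Nguyen", "Patel", "Meyer"]


def consistent_mock(original, pool, secret="demo-key"):
    """Deterministically pick a mock value from `pool` for `original`.

    The same (secret, original) pair always yields the same mock value,
    so the mapping is stable across tables and across runs.
    """
    digest = hashlib.sha256(f"{secret}:{original}".encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


# Toy rows of (nameFirst, nameLast); each column is mapped independently.
players = [("Bill", "Kennedy"), ("Bill", "Smith"), ("Ann", "Kennedy")]
mocked = [
    (consistent_mock(first, MOCK_FIRST_NAMES), consistent_mock(last, MOCK_LAST_NAMES))
    for first, last in players
]

for original, mock in zip(players, mocked):
    print(original, "->", mock)
```

Because the pools are finite, two different original names can land on the same mock name, which matches the note above. What is guaranteed is only that a given original value never maps to two different mock values.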
Calculated columns allow users to perform a broad spectrum of operations on data, ranging from simple arithmetic to complex logical and statistical computations. In the illustration above, the following operation is applied:
IFNA(IFS(height>74, "Tall", height>72, "Medium", height>70, "Small"), "NA")
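For readers more used to code than to spreadsheet-style formulas, the expression above can be read as the following Python function. This is a sketch of the logic only. The function name `height_category` is hypothetical, and treating a missing height as "NA" is an assumption about how empty values are handled.

```python
def height_category(height):
    """Python reading of:
    IFNA(IFS(height>74, "Tall", height>72, "Medium", height>70, "Small"), "NA")
    """
    if height is None:
        # Assumption: a missing height falls through to the IFNA default.
        return "NA"
    # IFS returns the result of the first condition that holds.
    if height > 74:
        return "Tall"
    if height > 72:
        return "Medium"
    if height > 70:
        return "Small"
    # No condition matched: IFS yields N/A, which IFNA replaces with "NA".
    return "NA"


for h in (76, 73, 71, 68, None):
    print(h, "->", height_category(h))
```

The conditions are checked in order, so a height of 76 is labelled "Tall" even though it also satisfies the later `height>72` and `height>70` tests.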