Introduction to data generators

The Syntho platform offers several data generators for different scenarios, depending on the nature of the data, the privacy requirements, and the specific use case. The summary table below gives an overview of each generator and indicates when to use it and when not to. Select any data generator to go to its detailed user guide section.

Data generators | Description | When to use | When not to use

AI-generated synthetic data consists of entirely new rows that mimic, but have no 1-to-1 relation with, the original rows.

  • To generate a synthetic feature dataset for ML model development

  • When statistical accuracy and maximum privacy are needed

  • To expand dataset rows while maintaining original statistical distributions

  • When working with multiple interrelated tables

  • When data consistency across systems is required

  • When you need to be able to revert to original records

  • If entirely new, unseen categories must be generated

Smart discovery and protection of the most sensitive data columns (i.e. PII/PHI) in a database.

  • When data consistency across tables, systems, and data generation jobs must be preserved

  • When working with large and complex databases for internal purposes

  • To expand dataset size (i.e. upsampling)

  • When data is not sensitive

Rule-based synthetic data (using Mockers and Calculated Columns)

Generate data from scratch based on user-defined logic and rules.

  • When there is no real data available yet

  • To extend or enhance existing data

  • As data used for analytics or ML modeling purposes

The features below are key to the smart de-identification and rule-based synthetic data methods.

Key feature | Description | When to use | When not to use

Training a generative AI model on the original data to generate new rows that mimic, but have no 1-to-1 relation with original rows.

  • To generate a synthetic feature dataset for ML model development

  • When statistical accuracy and maximum privacy are needed

  • To expand dataset rows while maintaining original statistical properties

  • When working with multiple related tables

  • When data consistency across systems is required

  • When you need to be able to revert to original records

  • If entirely new, unseen text values must be generated

Generating entirely new, user-defined values

For custom data generation without regard to preserving original column value relationships

When you need to maintain relationships with original data

To generate mock values that are consistently mapped from original values (e.g. Hank always becomes Jeffrey)

To ensure data consistency across tables, systems and data generation jobs

If fully random data without consistency is desired

Generating user-defined values based on custom logic

For complex data manipulations requiring specific business logic

For simple data generation tasks that don't need custom logic

Automatic discovery of the most sensitive (i.e. PII/PHI) columns in your database

To discover the most sensitive columns (i.e. PII/PHI)

When your data is not sensitive

Comparison of data generated with different generators

We demonstrate each generator on a real baseball dataset, which includes a players table and a seasons table.

AI-generated synthetic data is applied to the players table

In the first example, an entirely new synthetic dataset was generated by the generative AI model based on the original dataset. The synthetic dataset preserves the statistics of the original dataset, but there is no 1-to-1 correspondence between synthetic records and original records. Note that for AI-generated synthetic data, a rare category replacement value of 10 was applied. This means that any name appearing fewer than 10 times in the nameFirst and nameLast columns was replaced with an asterisk to protect privacy.
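The rare category replacement described above can be illustrated with a minimal Python sketch. It is not Syntho's implementation, and the page does not state at which stage of the pipeline the replacement happens. The function name `replace_rare_categories` and the sample names are hypothetical. The sketch only shows the rule as stated: values that occur fewer than the threshold number of times are replaced with an asterisk.

```python
from collections import Counter


def replace_rare_categories(values, threshold=10, placeholder="*"):
    """Replace every value that occurs fewer than `threshold` times.

    Hypothetical helper for illustration; not part of the Syntho platform.
    """
    counts = Counter(values)
    # Keep a value only if it appears at least `threshold` times in the column.
    return [v if counts[v] >= threshold else placeholder for v in values]


# Made-up column: "John" appears 12 times, "Bill" 10 times, "Zebulon" 3 times.
names = ["John"] * 12 + ["Bill"] * 10 + ["Zebulon"] * 3
protected = replace_rare_categories(names, threshold=10)
```

With a threshold of 10, a name that appears exactly 10 times is kept, and only names with 9 or fewer occurrences are masked.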

Mockers are applied to the players table

Mockers are applied to specific columns in the players table, which are highlighted in yellow in the table above: 'country', 'birthDate', 'deathDate', 'nameFirst', and 'nameLast'.

Consistent mapping with mockers is applied to the players table

If you enable consistent mapping, each original value is always mapped to the same mock value across tables. In this example, consistent mapping is enabled for two columns, "nameFirst" and "nameLast", so that every original first name and last name always receives the same mock replacement. See the illustrations from MySQL tables below, where mockers with consistent mapping map the name "Bill Kennedy" to "Danielle Olson".

Please note that other names can also be mapped to "Danielle" or "Olson"; however, whenever Syntho detects "Bill", it will always replace it with the mock first name "Danielle". The same applies to "Kennedy" and "Olson" in the last name column. Consistency can be verified using the other columns, which are copied unchanged from source to destination, so rows in the original and synthetic tables can be matched.
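One common way to obtain this kind of consistent, many-to-one mapping is to derive the mock value deterministically from the original value. The Python sketch below takes that approach. It is an illustration under stated assumptions, not a description of how Syntho implements consistent mapping. The function `consistent_mock`, the `secret` parameter, and the candidate name list are all hypothetical.

```python
import hashlib

# Hypothetical pool of mock first names to draw from.
FIRST_NAMES = ["Danielle", "Jeffrey", "Maria", "Omar", "Priya"]


def consistent_mock(original, candidates, secret="demo-key"):
    """Map `original` to one of `candidates`, always the same one.

    The choice depends only on the original value and the secret, so the
    same input yields the same mock value in every table and every run.
    Different originals may share a mock value (many-to-one).
    """
    digest = hashlib.sha256((secret + ":" + original).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]


# The same original name receives the same mock name wherever it appears.
players_first = [consistent_mock(n, FIRST_NAMES) for n in ["Bill", "Ann", "Bill"]]
seasons_first = [consistent_mock(n, FIRST_NAMES) for n in ["Bill"]]
```

Because the mapping depends only on the input value and the secret, it needs no lookup table. Changing the secret produces a different but equally consistent mapping.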

Calculated columns allow users to perform a broad spectrum of operations on data, ranging from simple arithmetic to complex logical and statistical computations. In the illustration above, the following operation is applied:

IFNA(IFS(height>74, "Tall", height>72, "Medium", height>70, "Small"), "NA")
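To make the evaluation order of this formula explicit, here is a rough Python equivalent. It assumes spreadsheet-style semantics: IFS returns the result of the first condition that is true, and IFNA supplies "NA" when no condition matches. The function name `height_category` is hypothetical, and treating a missing height as "NA" is an assumption rather than documented behavior.

```python
def height_category(height):
    """Mirror IFNA(IFS(height>74, "Tall", height>72, "Medium", height>70, "Small"), "NA")."""
    if height is None:
        # Assumption: a missing height matches no condition and falls back to "NA".
        return "NA"
    # Conditions are checked in order; the first true condition wins.
    if height > 74:
        return "Tall"
    if height > 72:
        return "Medium"
    if height > 70:
        return "Small"
    # No condition matched, which is the case IFNA handles.
    return "NA"


categories = [height_category(h) for h in (75, 74, 72, 70, None)]
```

Because the comparisons are strict, a height of exactly 74 falls into "Medium", 72 falls into "Small", and 70 or below yields "NA".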
