Introduction to data generators


Last updated 24 days ago


The Syntho platform offers several data generators for different scenarios. Which one fits best depends on the nature of the data, the privacy requirements, and the specific use case. The summary table below gives an overview of each method, including when to use it and when not to. Select any of the data generators to go to its detailed user guide section.

The features below are key to the smart de-identification and rule-based synthetic data methods.

| Key feature | Description | When to use | When not to use |
| --- | --- | --- | --- |
| AI-generated synthetic data | Trains a generative AI model on the original data to generate new rows that mimic the original rows but have no 1-to-1 relation with them | To generate a synthetic feature dataset for ML model development; when statistical accuracy and maximum privacy are needed; to expand dataset rows while maintaining the original statistical properties | When working with multiple related tables; when data consistency across systems is required; when you need to be able to revert to original records; if entirely new, unseen text values must be generated |
| Mockers | Generates entirely new, user-defined values | For custom data generation without regard to preserving original column value relationships | When you need to maintain relationships with original data |
| Consistent Mapping with Mockers | Generates mock values that are consistently mapped from original values (e.g. Hank always becomes Jeffrey) | To ensure data consistency across tables, systems, and data generation jobs | If fully random data, without consistency, is desired |
| Mask | Anonymizes data by modifying values directly while preserving the format | When data needs to remain recognizable in format; for anonymizing PII fields in non-production environments | When preserving exact relationships or values is required |
| Calculated Columns | Generates user-defined values based on custom logic | For complex data manipulations requiring specific business logic | For simple data generation tasks that don't need custom logic |
| Key generators | Generates unique keys to ensure referential integrity across related tables | When working with multiple tables needing unique keys for foreign-key relationships | If relationships or foreign keys are not required |
| PII scanner | Automatically discovers the most sensitive (i.e. PII/PHI) columns in your database | To discover the most sensitive columns (i.e. PII/PHI) | When your data is not sensitive |
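The consistent-mapping behaviour summarized above (e.g. Hank always becomes Jeffrey) can be illustrated with a small sketch. This is not Syntho's implementation: the mock name pool, the `consistent_mock` function, and the keyed-hash approach are all assumptions chosen only to show how a deterministic mapping can stay stable across tables and jobs.

```python
import hashlib
import hmac
from typing import Sequence

# Hypothetical pool of mock first names; not Syntho's actual mocker data.
MOCK_FIRST_NAMES = ["Jeffrey", "Danielle", "Maria", "Omar", "Priya"]


def consistent_mock(value: str, pool: Sequence[str], secret: bytes) -> str:
    """Pick a mock value deterministically from the original value.

    The same (value, secret) pair always selects the same pool entry,
    so the mapping is repeatable across tables and across runs.
    """
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


secret = b"workspace-secret"  # assumption: one secret shared by all jobs

first = consistent_mock("Hank", MOCK_FIRST_NAMES, secret)
second = consistent_mock("Hank", MOCK_FIRST_NAMES, secret)
other = consistent_mock("Bill", MOCK_FIRST_NAMES, secret)
print(first == second)  # the same original value gives the same mock value
```

In a sketch like this, two different original values may land on the same mock value because the pool is finite; only the direction from original to mock is guaranteed to be stable.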

Comparison of data generated with different generators

We demonstrate the application of each generator on a real baseball dataset, which includes the players and seasons tables.

AI-generated synthetic data is applied to players table

In the first example, we see that an entirely new synthetic dataset was generated by the generative AI model based on the original dataset. The synthetic dataset preserves the statistics of the original dataset, but there is no 1-to-1 correspondence between synthetic records and original records. Note that for AI-generated synthetic data, a rare category replacement value of 10 was applied. This means that any name appearing fewer than 10 times in the nameFirst and nameLast columns was replaced with an asterisk to protect privacy.
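The rare category rule described above can be sketched in a few lines. This only illustrates the thresholding logic (values seen fewer than 10 times become an asterisk); the function name `replace_rare_categories` and the sample names are hypothetical, and the sketch says nothing about how or where Syntho applies the rule internally.

```python
from collections import Counter


def replace_rare_categories(values, threshold=10, placeholder="*"):
    """Replace every value that occurs fewer than `threshold` times."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else placeholder for v in values]


# Hypothetical nameFirst column: "John" x12, "Bill" x3, "Mike" x10.
name_first = ["John"] * 12 + ["Bill"] * 3 + ["Mike"] * 10
protected = replace_rare_categories(name_first, threshold=10)

print(Counter(protected))
```

Here "Bill" appears only 3 times and is replaced, while "Mike" appears exactly 10 times and is kept, since the rule applies to values seen fewer than 10 times.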

Mockers are applied to players table

Figure: Mocker is applied to players table

Mockers are applied to specific columns in the players table, which are highlighted in yellow in the table above: 'country', 'birthDate', 'deathDate', 'nameFirst', and 'nameLast'.

Consistent Mapping with Mockers is applied to players table

If you enable consistent mapping, the values will be consistently mapped to the same value across the tables. For example, we enabled consistent mapping for two columns, "nameFirst" and "nameLast", so that each original first name and last name always receives the same mock value. See the illustrations from MySQL tables below, where mockers with consistent mapping map the name "Bill Kennedy" to "Danielle Olson".

Figure: Enabling Consistent Mapping "nameFirst" in players table

Figure: Consistent mapping with mockers is applied to players tables

Please note that other names can also be mapped to "Danielle" or "Olson"; however, whenever Syntho detects "Bill", it will always replace it with the mock first name "Danielle". The same applies to "Kennedy" and "Olson" in the last name column. Consistency can be verified against the other columns: they are duplicated without any change from source to destination, so original and synthetic rows can be matched.

Calculated Columns is applied to players table

Calculated columns allow users to perform a broad spectrum of operations on data, ranging from simple arithmetic to complex logical and statistical computations. In the illustration below, the following operation is applied:

IF([Gender] = 'M', MOCK_FIRST_NAME, IF([Gender] = 'F', MOCK_FIRST_NAME_FEMALE, 'nothing'))

Figure: Preview of the result of above Calculated Column function

Mask is applied to players table

Figure: Mask is applied to players table

The Mask generator modifies values directly without creating new records or altering the original dataset structure. The data stays recognizable in its format while being anonymized, which is particularly useful for fields containing identifiable attributes. In this example, the Mask generator is applied to columns in the players table that contain sensitive information that could potentially identify individuals. Each column is anonymized with its own masking function: country (Random Character Swap), birthDate (Datetime Noise), deathDate (Hasher), nameFirst (Format Preserving Encryption), and nameLast (Random Character Swap). When consistent mapping is enabled in the Mask settings, identical input values across records will always map to the same masked output values.

Hash is applied to players table

Figure: Hash is applied to players table

In this example, the Hash generator is applied to the key column of the players table to produce unique identifiers while preserving referential integrity. The key column, id, is hashed using a consistent algorithm to create anonymized yet unique values. This ensures that identifiers in the players table can be anonymized without compromising the ability to link related records across tables. With unique hashing enabled, the output value will always be unique.

Key generator is applied to players table

Figure: 3 different key generators are applied to id column of players table

In this example, Key Generators are used to generate unique keys or duplicate keys in the players table, ensuring referential integrity across related tables.

  • Duplicate copies the original key values exactly as they appear in the source data, preserving the relationships between primary and foreign keys. This ensures the key structure remains intact.

  • Hash converts original key values into hashed representations while preserving relationships across tables. The hashed values are obfuscated and irreversible, and relational integrity is maintained.

  • Generate creates new, synthetic key values that do not correspond to the original keys. It produces entirely new keys and does not preserve the order or relationships of the source data.
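The difference between the three key generators can be made concrete with a minimal sketch. The functions `duplicate_key`, `hash_key`, and `generate_keys`, the salt, and the sample rows are all hypothetical; Syntho's actual hashing algorithm and key formats are not specified here, so SHA-256 and UUIDs are stand-ins.

```python
import hashlib
import uuid


def duplicate_key(key: str) -> str:
    """Duplicate: return the original key unchanged."""
    return key


def hash_key(key: str, salt: str = "demo-salt") -> str:
    """Hash: the same input always yields the same obfuscated output."""
    return hashlib.sha256(f"{salt}:{key}".encode("utf-8")).hexdigest()


def generate_keys(count: int) -> list:
    """Generate: brand-new keys with no link to the original values."""
    return [uuid.uuid4().hex for _ in range(count)]


# Hypothetical rows: seasons.player_id references players.id.
players = [{"id": "p1"}, {"id": "p2"}]
seasons = [{"player_id": "p1"}, {"player_id": "p2"}, {"player_id": "p1"}]

hashed_players = [hash_key(p["id"]) for p in players]
hashed_seasons = [hash_key(s["player_id"]) for s in seasons]
new_keys = generate_keys(len(players))

# Every hashed foreign key still points at a hashed primary key.
print(set(hashed_seasons) <= set(hashed_players))
```

Because hashing is deterministic, the players and seasons tables can be processed independently and still join on the hashed values, whereas freshly generated keys carry no information that would let a foreign key find its original parent row.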