AI synthesize


Last updated 8 days ago


AI synthesize can be especially useful in the following situations:

  1. To generate a synthetic feature dataset for ML model development.

  2. When statistical accuracy and maximum privacy are needed.

  3. To expand dataset rows while maintaining original statistical properties.

Apply AI synthesize

  1. Open your workspace.

  2. On the Table view tab, select the column icon on the top left of the column where you want to apply a generator.

  3. Under Column parameters > Generator, select AI synthesize to enable Syntho's machine learning (ML) models to automatically synthesize the data in your tables.

  4. Set the relevant AI synthesize parameters.

  5. Select Confirm.

Preparing your data

When using AI-powered synthetic data generation, it is important that your data is fit to synthesize.

Entity tables

Syntho expects your data to be stored in entity tables that satisfy the following:

  • To minimize privacy risks and improve the training algorithm's generalization ability, as a rule of thumb, a minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.

  • Each entity is described in one row.

  • Each row can be treated independently. The order of the rows does not convey any information. The contents of one row also do not affect other rows.

  • Avoid column names that contain privacy-sensitive information, such as patient_a_medications, patient_b_medications, and so on. Instead, use a single patient column that holds the names. This prevents patient names from being exposed in metadata or from bypassing rare category protection (for example, a patient_a column exists, but this patient appears only five times in the whole dataset).

  • Remove columns that are derived directly from other columns. For example, you may have a net_amount column that is derived from the gross_amount and taxes columns. For categorical columns, there could be hierarchical relationships, such as a redundant Treatment category column referring to a Treatment column. Removing such redundant columns will simplify the modeling process and will lead to higher quality synthetic data.
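The column-to-row rule of thumb above can be checked before a table is synthesized. The following is a minimal sketch, not part of Syntho: the function names `minimum_rows` and `meets_ratio` are hypothetical, and the default of 500 rows per column comes from the 1:500 ratio stated above.

```python
def minimum_rows(n_columns: int, rows_per_column: int = 500) -> int:
    """Return the recommended minimum row count for a table with n_columns columns."""
    return n_columns * rows_per_column


def meets_ratio(n_rows: int, n_columns: int, rows_per_column: int = 500) -> bool:
    """Return True if the table has at least the recommended number of rows."""
    return n_rows >= minimum_rows(n_columns, rows_per_column)


# A table with 6 columns should contain at least 3000 rows.
print(minimum_rows(6))       # 3000
print(meets_ratio(2500, 6))  # False
```

Because this is only a rule of thumb, a table that falls slightly short may still be usable, but it likely deserves a closer look at privacy risk and output quality.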

Entity table-linked table dataset

Syntho can process data in the form of lists, sequences, or time series when it is structured as an entity table with a linked table. Ensure your data satisfies the following:

  • The structure is tailored for handling lists, sequences, or time-series data.

  • It includes two tables:

    • an entity table.

    • a linked table.

  • Each record in the entity table needs a unique ID (primary key).

  • Each record in the linked table must reference the unique ID from the entity table (foreign key).

  • Remove row values that are derived directly from values in other rows. For instance, if your dataset includes sequences with start_date and end_date columns, and each start_date matches the end_date of the row before it, remove one of these redundant columns, either start_date or end_date.

  • For more information on preparing your data when synthesizing complex table relationships, see: Sequence model.
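The primary key and foreign key requirements above can be verified on a small extract before configuring the workspace. This is an illustrative sketch only: `check_keys`, the `patients` and `visits` sample rows, and the `patient_id` column name are assumptions made for the example and do not come from Syntho.

```python
from collections import Counter


def check_keys(entity_rows, linked_rows, primary_key, foreign_key):
    """Return (duplicate primary keys, linked rows without a matching entity)."""
    id_counts = Counter(row[primary_key] for row in entity_rows)
    # Primary keys must be unique: report any ID that appears more than once.
    duplicate_ids = sorted(key for key, count in id_counts.items() if count > 1)
    # Every linked row must reference an ID that exists in the entity table.
    orphan_rows = [row for row in linked_rows if row[foreign_key] not in id_counts]
    return duplicate_ids, orphan_rows


# Hypothetical sample data: one row per patient, many visits per patient.
patients = [{"patient_id": 1}, {"patient_id": 2}]
visits = [
    {"patient_id": 1, "visit_date": "2024-01-05"},
    {"patient_id": 1, "visit_date": "2024-02-10"},
    {"patient_id": 3, "visit_date": "2024-03-01"},  # no such patient
]

duplicates, orphans = check_keys(patients, visits, "patient_id", "patient_id")
print(duplicates)  # []
print(orphans)     # [{'patient_id': 3, 'visit_date': '2024-03-01'}]
```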

Discrete

Syntho uses a discrete encoding type to synthesize numerical values that have a countable number of values between any two values. For example, the number of customer complaints or the number of flaws or defects.

Continuous

To synthesize numerical values that have an infinite number of values between any two values, such as weight and height, Syntho uses a continuous encoding type.

Categorical

A categorical column has one of a fixed number of possible values. These variables, like the blood type of a person (i.e., A, B, AB or O), have a fixed set of categories. Categorical encoding prevents random values (for instance, M, X or Z) from appearing in your synthetic dataset.

Note: The categorical encoding type is the default fallback encoding type used by Syntho. This means that any database types that are unknown by Syntho will automatically be encoded as categorical.

Datetime

The Datetime encoding type is used for values that contain a date component, a time component, or both.

By using this encoding type, Syntho is able to synthesize these values and generate dates and times that are statistically valid and representative.

Limitations

  • Datetime columns support precision up to milliseconds. Nanosecond precision is not supported.

Rare category protection

Following the privacy-by-design principle, Syntho automatically replaces all rare categorical observations with a user-defined value in a column encoded as a categorical column.

Replacing those rare categories helps prevent sensitive values from leaking into the synthetic data.

  • Rare category protection threshold: All column values that occur this many times or fewer are automatically replaced.

  • Rare category replacement value: The value that replaces all column values that occur no more often than the rare category protection threshold.

Under Column parameters > Encoding type, select Advanced settings to adjust the Rare category protection threshold and the Rare category replacement value.

By default, the rare category protection threshold is set to 10. This means that all column values that occur 10 times or fewer are automatically replaced by the replacement value.

By default, the rare category replacement value is an asterisk (*).
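To make the threshold behavior concrete, here is a minimal sketch of the replacement rule as described above. It is an illustration, not Syntho's implementation: the function name `protect_rare_categories` is hypothetical, and the defaults mirror the documented ones (threshold 10, replacement `*`).

```python
from collections import Counter


def protect_rare_categories(values, threshold=10, replacement="*"):
    """Replace every value that occurs `threshold` times or fewer."""
    counts = Counter(values)
    return [replacement if counts[value] <= threshold else value for value in values]


# "A" occurs 11 times and is kept; "B" (10 times) and "AB" (2 times) are replaced.
blood_types = ["A"] * 11 + ["B"] * 10 + ["AB"] * 2
protected = protect_rare_categories(blood_types)
print(Counter(protected))  # Counter({'*': 12, 'A': 11})
```

Note that a value occurring exactly as often as the threshold is still replaced, which matches the "10 times or fewer" wording.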

Advanced settings

Advanced generator settings

In Table settings on the right panel, scroll down to Advanced settings to view and adjust settings at the generator level. Depending on the job configuration, a generator is applied to one or more columns.

You can adjust the following advanced generator settings:

  1. Maximum rows used for training: The maximum number of rows to be used for training. Using fewer rows can speed up the process. Leave this value at None to use all rows for training.

  2. Take random sample:

    • On: takes a random sample of rows used for training.

    • Off: takes the top rows as defined in the database.
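The two settings above interact as follows. This is a sketch under stated assumptions rather than Syntho's actual logic: `select_training_rows` and its parameters are hypothetical names, `None` stands for the "use all rows" setting, and the optional `seed` exists only to make the example reproducible.

```python
import random


def select_training_rows(rows, max_rows=None, take_random_sample=False, seed=None):
    """Pick the rows used for training according to the two settings."""
    # None means no limit: all rows are used for training.
    if max_rows is None or max_rows >= len(rows):
        return list(rows)
    if take_random_sample:
        # On: draw max_rows rows at random, without replacement.
        return random.Random(seed).sample(rows, max_rows)
    # Off: take the top rows in the order the database returns them.
    return list(rows[:max_rows])


rows = list(range(100))
print(select_training_rows(rows, max_rows=5))  # [0, 1, 2, 3, 4]
print(len(select_training_rows(rows, max_rows=5, take_random_sample=True)))  # 5
```

Taking the top rows is deterministic but can bias training if the table is stored in a meaningful order, which is the usual reason to prefer a random sample.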

Advanced column settings

Select Advanced settings under Encoding type to view and adjust settings at the column level.

You can adjust the following advanced column settings, depending on the selected encoding type:

Discrete | Continuous | Datetime

  1. Clipping threshold: Sets the floor and ceiling of a column to the Nth lowest and Nth highest value, where N is the clipping threshold. Values are then limited so that they do not fall below the floor or exceed the ceiling.
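One way to read the clipping threshold is sketched below. This is an interpretation for illustration, not Syntho's implementation: the function name `clip_to_nth_extremes` is hypothetical, and the sketch assumes that duplicate values count as separate observations when finding the Nth lowest and highest value.

```python
def clip_to_nth_extremes(values, n):
    """Clip values to the range [Nth lowest value, Nth highest value]."""
    ordered = sorted(values)
    if not 1 <= n <= len(ordered):
        raise ValueError("n must be between 1 and the number of values")
    floor = ordered[n - 1]  # Nth lowest value
    ceiling = ordered[-n]   # Nth highest value
    if floor > ceiling:
        raise ValueError("n is too large: the floor would exceed the ceiling")
    return [min(max(value, floor), ceiling) for value in values]


# With N = 2, the floor is 50 and the ceiling is 60, so the outliers 1 and 400 are clipped.
weights = [1, 50, 52, 55, 58, 60, 400]
print(clip_to_nth_extremes(weights, 2))  # [50, 50, 52, 55, 58, 60, 60]
```

With N = 1 the floor and ceiling equal the minimum and maximum, so no value changes.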

Categorical | Text containing PII

  1. Rare category protection threshold: All column values that occur this many times or fewer are automatically replaced.

  2. Rare category replacement value: The value that replaces all column values that occur no more often than the rare category protection threshold.

  3. Locale: The locale used by the text processing models for columns with text containing PII.

In an entity table-linked table dataset, the entity table must satisfy the Entity tables requirements.

Similar to the requirements for Entity tables, eliminate linked-table columns whose values are directly derived from other columns.

Under Encoding type > Advanced settings, the Rare category protection settings appear. Use them to protect rare categories, which could otherwise re-identify outliers within the synthetic data.

Syntho supports all date and datetime data types for the Syntho connectors.

Figure: Selecting generators in column parameters.
Figure: Example of an entity table (each row describes an individual patient and can be treated independently).
Figure: Example of a linked table (multiple rows can be linked to the same patient, describing a series of time events for that patient).
Figure: Advanced settings for a rare category.

Supported data types

The Syntho platform supports a wide variety of data types. Under the hood, Syntho uses an encoding scheme where each data type is mapped to one of the following encoding types.

Data type | Description
Discrete | Numerical counts (e.g. number of visits)
Continuous | Continuous values (e.g. weight, temperature)
Categorical | Predefined values (e.g. blood type, country)
Datetime | Timestamps and dates (e.g. created at)