AI-powered generation

Under Column settings > Generation Method, select AI-powered generator to enable Syntho's machine learning (ML) models to automatically synthesize the data in your tables.

Preparing your data

When using AI-powered synthetic data generation, it is important that your data is fit to synthesize.

Entity tables

Syntho expects your data to be stored in entity tables that satisfy the following:

  • To minimize privacy risks and improve the training algorithm's generalization ability, as a rule of thumb, a minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.

  • Each entity is described in one row.

  • Each row can be treated independently. The order of the rows does not convey any information. The contents of one row also do not affect other rows.

  • Avoid column names that contain privacy-sensitive information, such as patient_a_medications, patient_b_medications, etc. Instead, use a single patient column that holds the names. This prevents patient names from being exposed in metadata and from bypassing rare category protection (e.g., there is a patient_a column, but this patient appears only five times in the whole dataset).

  • Remove columns that are derived directly from other columns. For example, you may have a net_amount column that is derived from the gross_amount and taxes columns. For categorical columns, there could be hierarchical relationships, such as a redundant Treatment category column referring to a Treatment column. Removing such redundant columns will simplify the modeling process and will lead to higher quality synthetic data.

Example of an entity table (each row describes an individual patient and can be treated independently)
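The 1:500 column-to-row rule of thumb can be checked before a table is synthesized. The snippet below is a minimal sketch of that arithmetic; the function names min_rows_for and meets_ratio are hypothetical helpers for illustration and are not part of the Syntho platform.

```python
def min_rows_for(n_columns: int, rows_per_column: int = 500) -> int:
    """Minimum row count suggested by the 1:500 column-to-row rule of thumb."""
    return n_columns * rows_per_column


def meets_ratio(n_columns: int, n_rows: int, rows_per_column: int = 500) -> bool:
    """True when the table has at least rows_per_column rows for every column."""
    return n_rows >= min_rows_for(n_columns, rows_per_column)


print(min_rows_for(6))       # 3000, matching the 6-column example
print(meets_ratio(6, 2500))  # False: 2500 rows is below the suggested minimum
```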

Entity table-linked table dataset

Syntho can process data in the form of lists, sequences, or time series when it is organized in an entity table-linked table structure. Ensure your data satisfies the following:

  • The structure is tailored for handling lists, sequences, or time-series data.

  • It includes two tables: an entity table and a linked table.

  • Each record in the entity table needs a unique ID (primary key).

  • Each record in the linked table must reference the unique ID from the entity table (foreign key).

  • Similar to the requirements for Entity tables, eliminate columns whose values are directly derived from other columns.

  • Remove row values that are derived directly from values in other rows. For instance, if your dataset includes sequences with start_date and end_date columns, and each start_date matches the end_date of the row before it, remove one of these redundant columns, either start_date or end_date.

  • For more information on preparing your data when synthesizing complex table relationships see: Sequence model.

Example of a linked table (multiple rows can be linked to the same patient, describing a series of events over time for that patient)
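The primary key and foreign key requirements above can be verified with a short script before synthesizing. This is only a sketch under stated assumptions: the tables are represented as lists of dictionaries, and the column name patient_id and the function check_keys are hypothetical and chosen for illustration.

```python
def check_keys(entity_rows, linked_rows, pk="patient_id", fk="patient_id"):
    """Return (duplicate primary keys, foreign keys with no matching entity row)."""
    seen, duplicates = set(), set()
    for row in entity_rows:
        key = row[pk]
        if key in seen:
            duplicates.add(key)  # primary key is not unique
        seen.add(key)
    # Linked rows whose foreign key does not appear in the entity table.
    orphans = {row[fk] for row in linked_rows if row[fk] not in seen}
    return duplicates, orphans


patients = [{"patient_id": 1}, {"patient_id": 2}]
visits = [
    {"patient_id": 1, "visit_date": "2023-01-05"},
    {"patient_id": 1, "visit_date": "2023-02-11"},
    {"patient_id": 3, "visit_date": "2023-03-02"},  # no such patient
]

duplicates, orphans = check_keys(patients, visits)
print(duplicates)  # set()
print(orphans)     # {3}
```

Both returned sets should be empty for a dataset that satisfies the requirements.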

Supported data types

The Syntho platform supports a wide variety of data types. Under the hood, Syntho uses an encoding scheme where each data type is mapped to one of the following encoding types.

Discrete

Syntho uses a discrete encoding type to synthesize numerical values that have a countable number of values between any two values. For example, the number of customer complaints or the number of flaws or defects.

Continuous

To synthesize numerical values that have an infinite number of values between any two values, such as weight and height, Syntho uses a continuous encoding type.

Categorical

A categorical column has one of a fixed number of possible values. These variables, like the blood type of a person (i.e., A, B, AB or O), have a fixed set of categories. Categorical encoding prevents random values (for instance, M, X or Z) from appearing in your synthetic dataset.

Under Encoding > Advanced settings, the Rare category protection settings appear. Use them to protect rare categories, which could otherwise be used to re-identify outliers in the synthetic data.

Note: The categorical encoding type is the default fallback encoding type used by Syntho. This means that any database types that are unknown to Syntho will automatically be encoded as categorical.

Datetime

The Datetime encoding type is used for values that contain a date component, a time component, or both.

By using this encoding type, Syntho is able to synthesize these values and generate dates and times that are statistically valid and representative.

Syntho supports all date and datetime data types for the Syntho connectors.

Limitations

  • Datetime columns support precision up to milliseconds. Nanosecond precision is not supported.
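To see what millisecond precision means for a source value, the sketch below truncates a Python datetime (which stores microseconds) to milliseconds. Whether Syntho truncates or rounds finer-grained values is not stated here, so truncation is an assumption, and truncate_to_milliseconds is a hypothetical helper.

```python
from datetime import datetime


def truncate_to_milliseconds(value: datetime) -> datetime:
    """Drop everything below millisecond precision (assumes truncation, not rounding)."""
    return value.replace(microsecond=(value.microsecond // 1000) * 1000)


ts = datetime(2024, 3, 1, 12, 30, 45, 123456)
print(truncate_to_milliseconds(ts).isoformat())  # 2024-03-01T12:30:45.123000
```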

UUID

A universally unique identifier (UUID) is a 128-bit value that is practically guaranteed to be different from every other generated UUID. This property is used for fast and reliable indexing of data. Because a UUID does not follow any distribution and holds no intrinsic information beyond its use for indexing, it cannot be modeled.

GEO

GEO types require special handling logic because of their diversity of format and logical representation. Types such as POINT, POLYGON, and LINE can represent single geolocations as well as geographical areas or paths.

Limitations

  • Syntho can generate POINT values, unrestricted by any external logic or heuristic. Some GEO data sets limits for new data points, such as the boundaries of countries or cities. Syntho does not automatically preserve such logic.

Rare category protection

Following the privacy-by-design principle, Syntho automatically replaces all rare categorical observations with a user-defined value in a column encoded as a categorical column.

Replacing those rare categories helps prevent sensitive values from leaking into the synthetic data.

  • Rare category protection threshold: All column values that occur no more often than the rare category protection threshold are automatically replaced.

  • Rare category replacement value: The value that replaces all column values that occur no more often than the rare category protection threshold.

Under Column settings > Encoding type, select Advanced settings to adjust the rare category protection threshold.

By default, the rare category protection threshold value is set at 10. This means that all column values that occur 10 times or fewer are automatically replaced by the user-defined value.

Under Column settings > Encoding type, select Advanced settings to adjust the rare category replacement value.

By default, the rare category replacement value is an asterisk (*). This means that all values that occur no more often than the rare category protection threshold value will be replaced with the replacement value.
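The effect of the two settings can be illustrated with a small sketch that follows the rule as described here: values that occur no more often than the threshold are replaced. It is not Syntho's implementation, and protect_rare_categories is a hypothetical function name.

```python
from collections import Counter


def protect_rare_categories(values, threshold=10, replacement="*"):
    """Replace every value that occurs `threshold` times or fewer."""
    counts = Counter(values)
    return [replacement if counts[v] <= threshold else v for v in values]


# AB occurs exactly 10 times and B only 3 times, so both are replaced.
blood_types = ["A"] * 12 + ["O"] * 11 + ["AB"] * 10 + ["B"] * 3
protected = protect_rare_categories(blood_types)
print(sorted(Counter(protected).items()))  # [('*', 13), ('A', 12), ('O', 11)]
```

Note that a value occurring exactly 10 times is still replaced with the default threshold, because the comparison is inclusive.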

Advanced settings

Advanced generator settings

In the right panel, go to Table settings and scroll down to Advanced settings to view and adjust settings at the generator level. Depending on the job configuration, a generator is applied to one or more columns.

You can adjust the following advanced generator settings:

  1. Maximum rows used for training: The maximum number of rows to be used for training. Using fewer rows can speed up the process. Leave this value at None to use all rows for training.

  2. Take random sample:

    • On: takes a random sample of rows used for training.

    • Off: takes the top rows as defined in the database.
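The interaction between the two settings can be sketched as follows. This is an illustration of the behavior described above, not Syntho's code: select_training_rows and its seed parameter are hypothetical, and the input list stands in for rows in database order.

```python
import random


def select_training_rows(rows, max_rows=None, take_random_sample=False, seed=None):
    """Pick the rows used for training according to the two settings."""
    if max_rows is None or max_rows >= len(rows):
        return list(rows)  # None (or a large limit) means all rows are used
    if take_random_sample:
        # On: a random sample of max_rows rows, without replacement.
        return random.Random(seed).sample(rows, max_rows)
    # Off: the top rows, in the order the database returns them.
    return list(rows[:max_rows])


rows = list(range(100))
print(select_training_rows(rows, max_rows=5))  # [0, 1, 2, 3, 4]
print(len(select_training_rows(rows, max_rows=5, take_random_sample=True)))  # 5
```

If the table is stored in a meaningful order (for example, sorted by date), taking the top rows can bias the training data, which is a reason to consider the random sample option.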

Advanced column settings

Select Advanced settings under Encoding type to view and adjust settings on the column-level.

You can adjust the following advanced column settings, depending on the selected encoding type:

Discrete | Continuous | Datetime

  1. Clipping threshold: Sets the floor and ceiling of a column to the Nth lowest and Nth highest value, where N is the clipping threshold. Values are processed so that they do not fall below the floor or exceed the ceiling.
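A minimal sketch of this kind of clipping is shown below. It assumes that the Nth lowest and highest values are taken over all values in the column, duplicates included, which is not confirmed here; clip_to_nth_extremes is a hypothetical function name.

```python
def clip_to_nth_extremes(values, n):
    """Clip values to the n-th lowest (floor) and n-th highest (ceiling) value."""
    if n < 1 or 2 * n > len(values) + 1:
        raise ValueError("n must be at least 1 and leave floor <= ceiling")
    ordered = sorted(values)
    floor, ceiling = ordered[n - 1], ordered[-n]
    return [min(max(v, floor), ceiling) for v in values]


ages = [1, 18, 22, 25, 31, 40, 47, 52, 60, 119]
print(clip_to_nth_extremes(ages, 2))
# [18, 18, 22, 25, 31, 40, 47, 52, 60, 60]
```

With a threshold of 2, the outliers 1 and 119 are pulled in to the second-lowest and second-highest values.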

Categorical | Text containing PII

  1. Rare category protection threshold: All column values that occur no more often than the rare category protection threshold are automatically replaced.

  2. Rare category replacement value: The value that replaces all column values that occur no more often than the rare category protection threshold.

  3. Locale: The locale used by the text processing models for columns with text containing PII.
