AI-powered generation
AI-powered generation can be especially useful in the following situations:
To generate a synthetic feature dataset for ML model development.
When statistical accuracy and maximum privacy are needed.
To expand dataset rows while maintaining original statistical properties.
Open your Workspace.
On the Job Configuration tab, select the column icon at the top left of the column where you want to apply AI-powered generation.
Under Column settings > Generation Method, select AI-powered generator to enable Syntho's machine learning (ML) models to automatically synthesize the data in your tables.
Set the relevant AI-powered generation parameters.
Select Confirm.
When using AI-powered synthetic data generation, it is important that your data is fit to synthesize.
Syntho expects your data to be stored in entity tables that satisfy the following:
To minimize privacy risks and improve the training algorithm's generalization ability, as a rule of thumb, a minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
Each entity is described in one row.
Each row can be treated independently. The order of the rows does not convey any information. The contents of one row also do not affect other rows.
Avoid column names that contain privacy-sensitive information, such as patient_a_medications, patient_b_medications, etc. Instead, use a single patient column that holds the names. This prevents patient names from being exposed in metadata and from bypassing rare category protection (e.g., there is a patient_a column, but this patient appears only five times in the whole dataset).
Remove columns that are derived directly from other columns. For example, you may have a net_amount column that is derived from the gross_amount and taxes columns. For categorical columns, there could be hierarchical relationships, such as a redundant Treatment category column referring to a Treatment column. Removing such redundant columns simplifies the modeling process and leads to higher-quality synthetic data.
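The row-count rule of thumb and the derived-column check described above can be sketched as a quick pre-flight check on your source data. This is a minimal illustration in plain Python and not part of the Syntho platform: the function names are hypothetical, and the formula net_amount = gross_amount - taxes is an assumption made only for the example.

```python
import math

ROWS_PER_COLUMN = 500  # rule of thumb from the text: column-to-row ratio of 1:500


def min_rows_required(n_columns: int) -> int:
    """Minimum recommended number of rows for a table with n_columns columns."""
    return n_columns * ROWS_PER_COLUMN


def meets_ratio(n_columns: int, n_rows: int) -> bool:
    """True if the table has at least 500 rows per column."""
    return n_rows >= min_rows_required(n_columns)


def is_derived_difference(rows, target, minuend, subtrahend, tol=1e-9):
    """True if target == minuend - subtrahend in every row (assumed formula)."""
    return all(
        math.isclose(row[target], row[minuend] - row[subtrahend], abs_tol=tol)
        for row in rows
    )


# Hypothetical sample rows, invented for illustration only.
rows = [
    {"gross_amount": 100.0, "taxes": 21.0, "net_amount": 79.0},
    {"gross_amount": 50.0, "taxes": 10.5, "net_amount": 39.5},
]

print(min_rows_required(6))  # 3000, matching the 6-column example in the text
if is_derived_difference(rows, "net_amount", "gross_amount", "taxes"):
    print("net_amount looks derived; consider removing it before synthesis")
```

A column flagged this way is only a candidate for removal; whether it is truly redundant depends on how the data was produced.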
Syntho can process data in the form of lists, sequences, or time series when it is structured as an entity table with a linked table. Ensure your data satisfies the following:
The structure is tailored for handling lists, sequences, or time-series data.
It includes two tables:
an entity table that satisfies the Entity tables requirements.
a linked table.
Each record in the entity table needs a unique ID (primary key).
Each record in the linked table must reference the unique ID from the entity table (foreign key).
Similar to the requirements for Entity tables, eliminate columns whose values are directly derived from other columns.
Remove row values that are derived directly from values in other rows. For instance, if your dataset includes sequences with start_date and end_date columns, and each start_date matches the end_date of the row before it, remove one of these redundant columns, either start_date or end_date.
For more information on preparing your data when synthesizing complex table relationships, see: Sequence model.
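The key requirements for an entity table and its linked table can be checked before a job is configured. The sketch below is a plain-Python illustration, not Syntho functionality: the column names id and entity_id, the sample rows, and the helper names are all hypothetical, and the redundancy check assumes the linked rows are already in sequence order for each entity.

```python
from datetime import date

# Hypothetical sample data, invented for illustration only.
entity_rows = [{"id": 1}, {"id": 2}]
linked_rows = [
    {"entity_id": 1, "start_date": date(2024, 1, 1), "end_date": date(2024, 2, 1)},
    {"entity_id": 1, "start_date": date(2024, 2, 1), "end_date": date(2024, 3, 1)},
    {"entity_id": 2, "start_date": date(2024, 1, 15), "end_date": date(2024, 4, 1)},
]


def primary_key_is_unique(entity_rows, pk="id"):
    """True if every entity record has a distinct primary key."""
    ids = [row[pk] for row in entity_rows]
    return len(ids) == len(set(ids))


def orphaned_references(entity_rows, linked_rows, pk="id", fk="entity_id"):
    """Return linked rows whose foreign key matches no entity record."""
    known_ids = {row[pk] for row in entity_rows}
    return [row for row in linked_rows if row[fk] not in known_ids]


def start_repeats_previous_end(linked_rows, fk="entity_id"):
    """True if every start_date equals the previous end_date of the same entity.

    Returns False when no entity has more than one row, because there is
    nothing to compare in that case.
    """
    last_end = {}
    compared_any = False
    for row in linked_rows:
        key = row[fk]
        if key in last_end:
            compared_any = True
            if row["start_date"] != last_end[key]:
                return False
        last_end[key] = row["end_date"]
    return compared_any


print(primary_key_is_unique(entity_rows))
print(orphaned_references(entity_rows, linked_rows))
print(start_repeats_previous_end(linked_rows))
```

If the last check returns True, one of the two date columns carries no extra information and is a candidate for removal.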
The Syntho platform supports a wide variety of data types. Under the hood, Syntho uses an encoding scheme where each data type is mapped to one of the following encoding types.
Syntho uses a discrete encoding type to synthesize numerical values that have a countable number of values between any two values. For example, the number of customer complaints or the number of flaws or defects.
To synthesize numerical values that have an infinite number of values between any two values, such as weight and height, Syntho uses a continuous encoding type.
A categorical column has one of a fixed number of possible values. These variables, like the blood type of a person (i.e., A, B, AB or O), have a fixed set of categories. Categorical encoding prevents random values (for instance, M, X or Z) from appearing in your synthetic dataset.
Under Encoding > Advanced settings, the Rare category protection settings appear. Use them to protect rare categories, which could otherwise be used to re-identify outliers in the synthetic data.
Note: The categorical encoding type is the default fallback encoding type used by Syntho. This means that any database types that are unknown by Syntho will automatically be encoded as categorical.
The Datetime encoding type is used for values that contain a date component, a time component, or both.
By using this encoding type, Syntho is able to synthesize these values and generate dates and times that are statistically valid and representative.
Syntho supports all date and datetime data types for the Syntho connectors.
Datetime columns support precision up to milliseconds. Nanosecond precision is not supported.
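If a source system stores timestamps at a finer precision than milliseconds, one option is to truncate them before synthesis. This is an assumption about a possible preprocessing step, not a documented Syntho requirement, and the helper name below is hypothetical. Python's datetime type holds microseconds at most, so the sketch truncates from microseconds to milliseconds.

```python
from datetime import datetime


def truncate_to_milliseconds(value: datetime) -> datetime:
    """Drop everything below millisecond precision (truncates, does not round)."""
    return value.replace(microsecond=(value.microsecond // 1000) * 1000)


ts = datetime(2024, 5, 17, 13, 45, 30, 123456)
print(truncate_to_milliseconds(ts).isoformat(timespec="milliseconds"))
# 2024-05-17T13:45:30.123
```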
A universally unique identifier (UUID) is a 128-bit value that is practically guaranteed to differ from every other generated UUID. This property is used for fast and reliable indexing of data. Because UUIDs do not follow any distribution and hold no intrinsic information beyond indexing, they cannot be modeled.
GEO types require special handling logic, due to their diversity of format and logical representation. There are options like POINT, POLYGON, LINE which can represent information like single geolocations, but also geographical areas or paths.
Syntho can generate POINT values, unrestricted by any external logic or heuristic. Some GEO data sets impose limits on new data points, such as country or city boundaries. Syntho does not automatically preserve such logic.
Following the privacy-by-design principle, Syntho automatically replaces all rare categorical observations with a user-defined value in a column encoded as a categorical column.
Replacing those rare categories helps prevent sensitive values from leaking into the synthetic data.
Rare category protection threshold: All column values that occur no more often than this threshold are automatically replaced.
Rare category replacement value: The value that replaces all column values occurring no more often than the rare category protection threshold.
Under Column settings > Encoding type, select Advanced settings to adjust the rare category protection threshold.
By default, the rare category protection threshold value is set at 10. This means that all column values that occur 10 times or fewer are automatically replaced by the user-defined value.
Under Column settings > Encoding type, select Advanced settings to adjust the rare category replacement value.
By default, the rare category replacement value is an asterisk (*). This means that all values that occur no more often than the rare category protection threshold are replaced with an asterisk.
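The replacement rule can be illustrated with a short sketch. Syntho applies rare category protection internally; the function below is a hypothetical stand-in that only mirrors the rule as described here, using the default threshold of 10 and the default replacement value of an asterisk. The sample column is invented for the example.

```python
from collections import Counter


def protect_rare_categories(values, threshold=10, replacement="*"):
    """Replace every value that occurs `threshold` times or fewer."""
    counts = Counter(values)
    return [replacement if counts[v] <= threshold else v for v in values]


# Hypothetical column: A occurs 12 times, O 11, AB 10, and B 3 times.
blood_types = ["A"] * 12 + ["O"] * 11 + ["AB"] * 10 + ["B"] * 3
protected = protect_rare_categories(blood_types)
print(Counter(protected))
```

Here AB is replaced as well as B, because a value that occurs exactly as often as the threshold also counts as rare.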
In the right panel, go to Table settings and scroll down to Advanced settings to view and adjust settings at the generator level. Depending on the job configuration, a generator is applied to one or more columns.
You can adjust the following advanced generator settings:
Maximum rows used for training: The maximum number of rows to be used for training. Using fewer rows can speed up the process. Leave this value at None to use all rows for training.
Take random sample:
On: takes a random sample of rows used for training.
Off: takes the top rows as defined in the database.
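How these two settings interact can be sketched as follows. This is an illustration of the behavior described above, not Syntho's implementation; the function name, its parameters, and the seed argument are hypothetical.

```python
import random


def select_training_rows(rows, max_rows=None, take_random_sample=False, seed=None):
    """Pick the rows used for training.

    max_rows=None uses all rows. Otherwise at most max_rows rows are used:
    a random sample when take_random_sample is on, the top rows when it is off.
    """
    if max_rows is None or max_rows >= len(rows):
        return list(rows)
    if take_random_sample:
        return random.Random(seed).sample(rows, max_rows)
    return list(rows[:max_rows])


rows = list(range(1000))  # stand-in for table rows in database order
print(select_training_rows(rows, max_rows=5))
print(select_training_rows(rows, max_rows=5, take_random_sample=True, seed=42))
```

Taking the top rows is only representative when the storage order of the table is unrelated to its contents, which is one reason a random sample can be preferable.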
Select Advanced settings under Encoding type to view and adjust settings on the column-level.
You can adjust the following advanced column settings, depending on the selected encoding type:
Clipping threshold: Sets the floor and ceiling of a column to its Nth lowest and Nth highest value, where N is the clipping threshold. Values below the floor or above the ceiling are clipped so that they do not exceed these limits.
Rare category protection threshold: All column values that occur no more often than this threshold are automatically replaced.
Rare category replacement value: The value that replaces all column values occurring no more often than the rare category protection threshold.
Locale: The locale used by the text processing models for columns with text containing PII.
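The clipping threshold can be illustrated with a small sketch. It reflects one reading of the description above, in which N is counted from 1 and duplicate values are counted separately; Syntho's exact behavior may differ, and the function name is hypothetical.

```python
def clip_to_nth_extremes(values, n):
    """Clip values to the range [Nth lowest value, Nth highest value]."""
    if n < 1 or 2 * n - 1 > len(values):
        raise ValueError("n must be at least 1 and leave a non-empty range")
    ordered = sorted(values)
    floor, ceiling = ordered[n - 1], ordered[-n]
    return [min(max(v, floor), ceiling) for v in values]


# Hypothetical column with one low and one high outlier.
ages = [1, 5, 7, 8, 9, 250]
print(clip_to_nth_extremes(ages, 2))  # [5, 5, 7, 8, 9, 9]
```

With N = 2, the second lowest value (5) becomes the floor and the second highest (9) becomes the ceiling, so the outliers 1 and 250 no longer appear in the data.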