Duplicate

Under Column settings > Generation Method, select Duplicate to copy the column from the source table to the destination table as-is.

Note: When you duplicate a column, the column is still used during the training process, as it can contain valuable information.

This means, however, that duplicating columns cannot be used to reduce hardware requirements or increase the speed of your synthetic data jobs.

Shuffle

Enable the Shuffle button to shuffle the generated values while maintaining the overall frequency of values. For example, if you have 4 High, 3 Medium and 5 Low values in the source database, the same counts of values will exist in the destination database, but they will appear in a different order.

Note that the shuffle functionality works batch-wise: each batch is shuffled independently, according to the set Generation Batch Size (the default value is 100k).

Note that NULL is also treated as a distinct value, so NULL values are shuffled like any other value.
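The batch-wise behaviour described above can be illustrated with a minimal Python sketch. This is not Syntho's implementation; the function name `shuffle_in_batches`, its parameters, and the sample data are hypothetical and only demonstrate the idea that each batch is shuffled on its own, that value counts are preserved, and that NULL (represented here as `None`) is shuffled like any other value.

```python
import random
from collections import Counter


def shuffle_in_batches(values, batch_size=100_000, seed=None):
    """Shuffle values batch by batch (hypothetical helper, not Syntho's code)."""
    rng = random.Random(seed)
    shuffled = []
    for start in range(0, len(values), batch_size):
        # Each batch is copied and shuffled independently of the others.
        batch = list(values[start:start + batch_size])
        rng.shuffle(batch)
        shuffled.extend(batch)
    return shuffled


# Sample column: 4 High, 3 Medium, 5 Low, plus 2 NULLs (None).
source = ["High"] * 4 + ["Medium"] * 3 + ["Low"] * 5 + [None] * 2
result = shuffle_in_batches(source, batch_size=5, seed=42)

# The overall frequency of every value, including None, is unchanged.
same_counts = Counter(result) == Counter(source)
```

Because shuffling happens per batch in this sketch, a value never moves from one batch to another; only the order inside each batch changes.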

Detect and obfuscate PII

Caution: Because it uses the same underlying modelling techniques as the PII text obfuscation module, the Detect and obfuscate PII feature can take a very long time to run.

Enable the toggle Detect and obfuscate PII to use Syntho's PII text obfuscation module to detect and obfuscate PII entities in columns containing free text information.

When enabled, select the correct Locale based on the data in your text column. This ensures that Syntho uses the appropriate language models to identify and obfuscate PII in that column.

After enabling this option and setting the right locale, any identified PII entities are obfuscated and then copied to the destination table.

Ordering and Indexing Considerations

To ensure accurate ordering, it is essential for the application to have either an index or a primary key in the source table. In the absence of these, the application defaults to sorting based on the first column of the table. However, if the first column contains duplicate values, the ordering cannot be guaranteed, as it relies on the database's sorting algorithm to handle duplicate values. Adding an index to the source table will resolve this issue.
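The ordering problem with duplicate values can be shown with a small Python sketch. It does not use a database or any Syntho API; the rows, the function names, and the use of Python's `sorted` in place of a database sort are assumptions made purely for illustration. The same rows are stored in two different physical orders: sorting on the first column alone gives different results for each, while sorting on a unique key gives the same result for both.

```python
# Rows are (category, id); the first column contains duplicates,
# the second column is unique and plays the role of a primary key.
rows_a = [("A", 1), ("A", 2), ("B", 3)]
rows_b = [("A", 2), ("A", 1), ("B", 3)]  # same rows, different physical order


def sort_by_first_column(rows):
    # Ties on the first column keep whatever order the input had.
    return sorted(rows, key=lambda row: row[0])


def sort_by_unique_key(rows):
    # The unique id leaves no ties, so the result does not depend on input order.
    return sorted(rows, key=lambda row: row[1])


first_column_is_ambiguous = sort_by_first_column(rows_a) != sort_by_first_column(rows_b)
unique_key_is_deterministic = sort_by_unique_key(rows_a) == sort_by_unique_key(rows_b)
```

In this sketch both flags evaluate to `True`: only the sort on the unique key is independent of how the rows happen to be stored.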

Column Set for "ORDER BY" Clause

Hive only

In the Table Settings panel, a new dropdown field allows users to specify which columns should be used in the "ORDER BY" clause. This feature enables users to define a set of columns that ensure the uniqueness of the returned results for a given table. By selecting the appropriate columns, users can achieve deterministic ordering even in the absence of primary keys or indexes.

  • Order By Dropdown: Located in the Table Settings panel on the right side of the Table/Job Configuration screen, this dropdown lets users choose the columns for the "ORDER BY" clause.
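Before selecting columns in the dropdown, it can help to check whether a candidate column set really identifies each row uniquely. The following is a minimal Python sketch under stated assumptions: the helper names `is_unique_key` and `build_order_by`, the sample rows, and the column names are all hypothetical, and the generated clause only illustrates the general shape of an "ORDER BY" clause (with Hive-style backtick quoting), not the exact SQL the application issues.

```python
def is_unique_key(rows, columns):
    """Return True if the given columns take a distinct combination of values in every row."""
    seen = set()
    for row in rows:
        key = tuple(row[column] for column in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def build_order_by(columns):
    """Build an illustrative ORDER BY clause with backtick-quoted column names."""
    return "ORDER BY " + ", ".join(f"`{column}`" for column in columns)


# Hypothetical sample of a table without a primary key or index.
sample_rows = [
    {"region": "EU", "order_date": "2024-01-01", "amount": 10},
    {"region": "EU", "order_date": "2024-01-02", "amount": 10},
    {"region": "US", "order_date": "2024-01-01", "amount": 7},
]

region_only_unique = is_unique_key(sample_rows, ["region"])
region_date_unique = is_unique_key(sample_rows, ["region", "order_date"])
clause = build_order_by(["region", "order_date"])
```

Here `region` alone contains duplicates, whereas the combination of `region` and `order_date` is unique in the sample, so that pair would be the better candidate for deterministic ordering. A check on a sample does not prove uniqueness for the full table.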
