Duplicate
Duplicate can be especially useful in the following situations:
When data does not contain personally identifiable information (PII) or sensitive elements, duplicating it allows for efficient replication without modification.
Open your Workspace.
On the Job Configuration tab, select the column icon at the top left of the column you want to duplicate.
Under Column settings > Generation Method, select Duplicate to copy the column from the source table to the destination table as-is.
Set the relevant duplicate parameters.
Select Confirm.
Note: When you duplicate a column, the column is still used during the training process, as it can contain valuable information.
This means, however, that duplicating columns cannot be used to reduce hardware requirements or increase the speed of your synthetic data jobs.
Enable the Shuffle button to shuffle the generated values while maintaining the overall frequency of values. For example, if you have 4 High, 3 Medium and 5 Low values in the source database, the same counts of values will exist in the destination database, but they will appear in a different order.
Note that the shuffle functionality works batch-wise: each batch, sized according to the set Generation Batch Size (the default value is 100k), is shuffled independently.
Note that NULL values are also considered a distinct value, and will be shuffled like any other value.
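The shuffle behaviour described above can be illustrated with a minimal Python sketch. This is not Syntho's implementation: the function name `shuffle_in_batches`, its parameters, and the use of `None` to stand in for NULL are assumptions made for illustration only.

```python
import random
from collections import Counter


def shuffle_in_batches(values, batch_size=100_000, seed=None):
    """Shuffle values independently within consecutive batches.

    Hypothetical helper for illustration; batch_size mirrors the
    documented Generation Batch Size default of 100k.
    """
    rng = random.Random(seed)
    shuffled = []
    for start in range(0, len(values), batch_size):
        batch = list(values[start:start + batch_size])
        rng.shuffle(batch)  # None (NULL) is shuffled like any other value
        shuffled.extend(batch)
    return shuffled


# 4 High, 3 Medium, 5 Low, plus two NULLs represented as None.
source = ["High"] * 4 + ["Medium"] * 3 + ["Low"] * 5 + [None] * 2
result = shuffle_in_batches(source, batch_size=7, seed=42)

# The overall frequency of every value, including NULL, is unchanged.
print(Counter(result) == Counter(source))
# Values never move across a batch boundary.
print(Counter(result[:7]) == Counter(source[:7]))
```

Because each batch is shuffled on its own, a value can only move within its batch. With a small batch size relative to the table, the shuffled column therefore stays locally similar to the source order.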
Caution: The Detect and obfuscate PII feature uses the same underlying modelling techniques as the PII text obfuscation module and can take a very long time to run.
Enable the toggle Detect and obfuscate PII to use Syntho's PII text obfuscation module to detect and obfuscate PII entities in columns containing free text information.
When enabled, select the Locale that matches the language of the data in your text column, so that Syntho uses the appropriate language models to identify and obfuscate PII.
After you enable this option and set the right locale, any identified PII entities are obfuscated and then copied to the destination table.
Syntho automatically replaces any infrequent categorical values in a column with a user-defined value, ensuring that sensitive data does not appear in the synthetic output.
Rare Category Protection Threshold: Column values that appear with a frequency at or below this threshold are automatically replaced to prevent data leakage.
Rare Category Replacement Value: Values that appear at or below the threshold frequency are substituted with this user-specified replacement value.
By default, the rare category protection threshold is set to 10, meaning any value that appears 10 times or fewer will be replaced. The default replacement value is an asterisk (*), so all values at or below the threshold are replaced with (*).
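As a rough illustration of the two settings above, the following Python sketch replaces every value that occurs at or below a threshold with a replacement value. It is a simplified model under stated assumptions, not Syntho's implementation: the function name `protect_rare_categories` and the example city names are hypothetical, and only the defaults (threshold 10, replacement `*`) come from the text.

```python
from collections import Counter


def protect_rare_categories(values, threshold=10, replacement="*"):
    """Replace values occurring `threshold` times or fewer.

    Hypothetical helper for illustration; defaults follow the
    documented threshold of 10 and replacement value "*".
    """
    counts = Counter(values)
    return [replacement if counts[v] <= threshold else v for v in values]


# Example column: two frequent categories and two rare ones.
column = (
    ["Amsterdam"] * 25
    + ["Utrecht"] * 11
    + ["Zoetermeer"] * 10  # exactly at the threshold, so replaced
    + ["Urk"] * 1          # below the threshold, so replaced
)
protected = protect_rare_categories(column)

print(Counter(protected))
```

In this example, a value that appears 11 times is kept, while values that appear 10 times or once are both mapped to the same replacement, so they can no longer be told apart in the output.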
To ensure accurate ordering, the source table must have either an index or a primary key. Without one, the application defaults to sorting on the first column of the table. If that column contains duplicate values, the ordering cannot be guaranteed, because the relative order of rows with equal values depends on the database's sorting algorithm. Adding an index to the source table resolves this issue.
Hive only
In the Table Settings panel, a new dropdown field allows users to specify which columns should be used in the "ORDER BY" clause. This feature enables users to define a set of columns that ensure the uniqueness of the returned results for a given table. By selecting the appropriate columns, users can achieve deterministic ordering even in the absence of primary keys or indexes.
Order By Dropdown: Located in the Table Settings panel on the right side of the Table/Job Configuration screen, this dropdown lets users choose the columns for the "ORDER BY" clause.
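To see why the selected columns should uniquely identify each row, consider the following Python sketch. It models a table as a list of dictionaries and is not tied to Hive or to Syntho's internals: the helper names `is_unique_key` and `deterministic_sort`, and the sample columns, are assumptions made for illustration.

```python
def is_unique_key(rows, columns):
    """Return True if the chosen columns identify every row uniquely."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(keys) == len(set(keys))


def deterministic_sort(rows, columns):
    """Sort rows by the chosen columns, analogous to an ORDER BY clause."""
    return sorted(rows, key=lambda row: tuple(row[c] for c in columns))


# Hypothetical table without a primary key or index.
rows = [
    {"region": "North", "customer": "B", "amount": 10},
    {"region": "North", "customer": "A", "amount": 20},
    {"region": "South", "customer": "A", "amount": 30},
]

# The first column alone contains duplicates, so ties are possible.
print(is_unique_key(rows, ["region"]))
# Adding a second column makes every sort key unique.
print(is_unique_key(rows, ["region", "customer"]))

ordered = deterministic_sort(rows, ["region", "customer"])
# The result no longer depends on the order in which rows arrive.
same = deterministic_sort(list(reversed(rows)), ["region", "customer"])
print(ordered == same)
```

When the selected columns do not form a unique combination, rows that share the same key can still come back in any relative order, which is the situation the dropdown is meant to avoid.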