Duplicate

Duplicate can be especially useful in the following situations:

  1. When data does not contain personally identifiable information (PII) or sensitive elements, duplicating it allows for efficient replication without modification.

Apply duplicate

  1. Open your Workspace.

  2. On the Job Configuration tab, select the column icon at the top left of the column you want to duplicate.

  3. Under Column settings > Generation Method, select Duplicate to copy the column from the source table to the destination table as-is.

  4. Set the relevant duplicate parameters.

  5. Select Confirm.

Selecting Duplicate in the Generation Method panel

Note: When you duplicate a column, the column is still used during the training process, as it can contain valuable information.

This means, however, that duplicating columns cannot be used to reduce hardware requirements or increase the speed of your synthetic data jobs.

Shuffle data

Enable the Shuffle button to shuffle the generated values while maintaining the overall frequency of each value. For example, if the source database has 4 High, 3 Medium, and 5 Low values, the destination database contains the same counts of each value, but in a different order.

Note that shuffling works batch-wise: each batch, sized according to the set Generation Batch Size (the default value is 100k), is shuffled independently.

Note that NULL is also treated as a distinct value and is shuffled like any other value.
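
The shuffle behavior described above can be illustrated with a minimal Python sketch. This is not Syntho's implementation: the function name `shuffle_in_batches` and its parameters are hypothetical, `None` stands in for NULL, and a small batch size is used only to keep the example readable.

```python
import random
from collections import Counter


def shuffle_in_batches(values, batch_size=100_000, seed=None):
    """Shuffle values independently within consecutive batches.

    Hypothetical helper for illustration only; not Syntho's implementation.
    """
    rng = random.Random(seed)
    shuffled = []
    for start in range(0, len(values), batch_size):
        batch = list(values[start:start + batch_size])
        # Reorder this batch only; None (NULL) is shuffled like any other value.
        rng.shuffle(batch)
        shuffled.extend(batch)
    return shuffled


source = ["High"] * 4 + ["Medium"] * 3 + ["Low"] * 5 + [None] * 2
result = shuffle_in_batches(source, batch_size=7, seed=0)

# Overall frequencies are unchanged, including the NULL count.
assert Counter(result) == Counter(source)
# Each batch contains the same values as the corresponding source batch.
assert Counter(result[:7]) == Counter(source[:7])
```

Because each batch is shuffled on its own, a value in this sketch never moves into a different batch; only its position within the batch changes.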

Detect and obfuscate PII

Enable the toggle Detect and obfuscate PII to use Syntho's PII text obfuscation module to detect and obfuscate PII entities in columns containing free text information.

When enabled, select the correct Locale based on the data in your text column, so that Syntho uses the appropriate language models to identify and obfuscate PII.

After you enable this option and set the right locale, any identified PII entities are obfuscated and then copied to the destination table.

Rare category protection

Syntho automatically replaces any infrequent categorical values in a column with a user-defined value, ensuring that sensitive data does not appear in the synthetic output.

  • Rare category protection threshold: Column values that appear with a frequency at or below this threshold are automatically replaced to prevent data leakage.

  • Rare category replacement value: Values meeting the frequency threshold are substituted with this user-specified replacement value.

By default, the rare category protection threshold is set to 10, meaning any value that appears 10 times or fewer will be replaced. The default replacement value is an asterisk (*), so all values at or below the threshold are replaced with (*).
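
The threshold rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Syntho's implementation: the function name `protect_rare_categories` is hypothetical, and the sketch assumes that frequency means the absolute number of occurrences across the whole column.

```python
from collections import Counter


def protect_rare_categories(values, threshold=10, replacement="*"):
    """Replace values occurring `threshold` times or fewer with `replacement`.

    Hypothetical helper for illustration only; not Syntho's implementation.
    """
    counts = Counter(values)
    return [replacement if counts[v] <= threshold else v for v in values]


column = ["Amsterdam"] * 12 + ["Utrecht"] * 10 + ["Delft"] * 3
protected = protect_rare_categories(column)

# "Amsterdam" (12 occurrences) is above the default threshold and is kept.
assert protected.count("Amsterdam") == 12
# "Utrecht" (exactly 10) and "Delft" (3) are at or below it and are replaced.
assert protected.count("*") == 13
```

Note that a value appearing exactly 10 times is replaced, because the threshold is inclusive.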

Ordering and indexing considerations

To ensure accurate ordering, please see ordering and indexing considerations.

Supported data types

Generator: Duplicate

Supported data types: Categorical, Continuous, Discrete, Datetime, Bytes, Bool, UUID, JSON, XML, Geo, Sets, Unknown
