# Duplicate

Duplicate can be especially useful in the following situations:

1. When data does not contain personally identifiable information (PII) or sensitive elements, duplicating it allows for efficient replication without modification.

## Apply duplicate

1. Open your **Workspace**.
2. From the **Main hub** or **Table view** tab, select the column where you want to apply a generator.
3. Under **Generator**, select **Duplicate** to copy the column from the source table to the destination table *as-is*.
4. Set the relevant duplicate parameters.
5. Select **Confirm**.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/MpirpCqjRkW1Kdycrrhc/image.png" alt="" width="563"><figcaption><p>Selecting Duplicate in Generation Method panel</p></figcaption></figure>

{% hint style="info" %}
**Note:** When you duplicate a column, the column is still used during the training process, as it can contain valuable information.

This means, however, that duplicating columns *cannot* be used to reduce hardware requirements or increase the speed of your synthetic data jobs.
{% endhint %}

## Shuffle data

Enable the **Shuffle** button to shuffle the generated values while maintaining the overall frequency of values. For example, if you have 4 High, 3 Medium and 5 Low values in the source database, the same counts of values will exist in the destination database, but they appear in a different order.

Note that the shuffle functionality works batch-wise: each batch is shuffled independently, according to the set **Generation Batch Size** (the default value is 100k).

Note that `NULL` is also treated as a distinct value and is shuffled like any other value.
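
The following is a minimal Python sketch of batch-wise shuffling, intended only to illustrate the behaviour described above. It is not Syntho's implementation: the function name `shuffle_in_batches`, its parameters, and the use of `None` to stand in for `NULL` are all assumptions made for this example.

```python
import random
from collections import Counter


def shuffle_in_batches(values, batch_size, seed=None):
    """Shuffle values independently within consecutive batches.

    Hypothetical helper for illustration only; not part of Syntho.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    rng = random.Random(seed)
    shuffled = []
    for start in range(0, len(values), batch_size):
        # Copy the batch so the input sequence is left untouched.
        batch = list(values[start:start + batch_size])
        rng.shuffle(batch)  # reorders values within this batch only
        shuffled.extend(batch)
    return shuffled


# None stands in for NULL and is shuffled like any other value.
source = ["High"] * 4 + ["Medium"] * 3 + ["Low"] * 5 + [None] * 2
result = shuffle_in_batches(source, batch_size=7, seed=0)

# The overall frequency of every value is unchanged.
print(Counter(result) == Counter(source))
```

Because each batch is shuffled on its own, a value never moves from one batch to another. In this sketch, the first seven output values are always a reordering of the first seven input values.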

## Detect and obfuscate PII

{% hint style="warning" %}
**Caution:** The *Detect and obfuscate PII* feature uses the same underlying modelling techniques as the [PII text obfuscation module](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/duplicate/automatic-pii-discovery-and-de-identification-in-free-text-columns) and can take a very long time to run.
{% endhint %}

Enable the toggle **Detect and obfuscate PII** to use Syntho's [PII text obfuscation module](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/duplicate/automatic-pii-discovery-and-de-identification-in-free-text-columns) to detect and obfuscate PII entities in columns containing free text information.

When enabled, select the **Locale** that matches the data in your text column. This ensures Syntho uses the appropriate language models to identify and obfuscate PII in that column.

After enabling this option and setting the right locale, any identified PII entities are obfuscated and then copied to the destination table.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/hjAX0rGZTwSQCWzr0yrj/image.png" alt="" width="538"><figcaption><p>Detect and obfuscate PII</p></figcaption></figure>

### **Rare category protection**

Syntho automatically replaces any infrequent categorical values in a column with a user-defined value, ensuring that sensitive data does not appear in the synthetic output.

* **Rare category protection threshold**: Column values that appear with a frequency at or below this threshold are automatically replaced to prevent data leakage.
* **Rare category replacement value**: Values meeting the frequency threshold are substituted with this user-specified replacement value.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/D75dkxvhq7PFqLpq8jkw/image.png" alt="" width="530"><figcaption><p>Rare category protection</p></figcaption></figure>

By default, the rare category protection threshold is set to 10, meaning any value that appears 10 times or fewer will be replaced. The default replacement value is an asterisk (\*), so all values at or below the threshold are replaced with (\*).
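
The threshold rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than Syntho's implementation: the function name `protect_rare_categories` is hypothetical, and counting frequencies over the whole column (rather than per batch) is an assumption of this example.

```python
from collections import Counter


def protect_rare_categories(values, threshold=10, replacement="*"):
    """Replace values that occur `threshold` times or fewer.

    Hypothetical helper for illustration only; not part of Syntho.
    Frequencies are counted over the full input list (an assumption).
    """
    counts = Counter(values)
    return [replacement if counts[v] <= threshold else v for v in values]


# "A" occurs 11 times (kept); "B" occurs 10 times and "C" once (both replaced).
column = ["A"] * 11 + ["B"] * 10 + ["C"]
protected = protect_rare_categories(column)
print(Counter(protected))
```

With the default threshold of 10, a value that occurs exactly 10 times is still replaced, because the rule applies to frequencies at or below the threshold.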

## **Ordering and indexing considerations**

To ensure accurate ordering, please see [ordering and indexing considerations](https://docs.syntho.ai/configure-a-data-generation-job/consistent-mapping#ordering-and-indexing-considerations).

## Supported data types

| Generator | Supported data types                                                                          |
| --------- | --------------------------------------------------------------------------------------------- |
| Duplicate | Categorical, Continuous, Discrete, Datetime, Bytes, Bool, UUID, JSON, XML, Geo, Sets, Unknown |
