# Detect and obfuscate PII

{% hint style="warning" %}
**Caution**: this feature will slow down your data generation jobs significantly. Consider using a GPU or reducing the number of input records to speed up your job.
{% endhint %}

For more information about the PII entities that Syntho supports, see [Supported PII entities](https://docs.syntho.ai/configure-a-data-generation-job/manage-personally-identifiable-information-pii/supported-pii-entities).

## Use Syntho PII text scanner

Syntho's PII text scanner can be used in combination with either of two column generation methods: **Duplicate** or **AI-powered generation**.

### Use PII text scanner with duplicated columns

When using the PII text scanner in combination with the **Duplicate** generation method, the column will be duplicated after the PII text scanner has been applied. To apply this:

1. Under **Column settings** > **Generation Method**, select **Duplicate**.
2. In the dropdown below it, select the **locale** to use for detecting the PII entities.
3. Optionally, enable **Replace PII with mock data**. When this option is enabled, PII is replaced with mock values. When this option is ***disabled***, PII is annotated with a PII label.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/IQre8k9Ld1QGBzJ8OyRX/image.png" alt="" width="508"><figcaption></figcaption></figure>

{% hint style="info" %}
**Note**

When you use the PII text scanner together with the **AI-powered generation** method, processing happens in the following order:

1. **Data Preprocessing**: Initially, settings like the "Rare category protection threshold" and "replacement value" will be applied to your data.
2. **PII Text Processing**: Next, the PII text scanner will go through the data to identify and handle PII.
3. **AI-Powered Generation**: Finally, the AI will generate new data, treating the processed text column as if it were a category encoding type.

By understanding this sequence, you can better anticipate what the generated data will look like.
{% endhint %}

## PII detection flow

When you apply the PII text scanner to specific columns, Syntho automatically scans for PII elements in those columns. Identified PII elements can then be replaced with mock data. Syntho employs a variety of algorithms and methods to improve the scanning process.

Here's an overview of the steps taken in the detection process, in chronological order:

1. **Regex**: to recognize patterns.
2. **Named Entity Recognition (NER)**: to recognize natural language PII entities.
3. **Checksums**: to validate detected patterns.
4. **Context words**: to increase detection certainty.
5. **Label**: to label each detected PII entity with a descriptor of its type.
6. **(Optional) Obfuscate**: to replace the labeled PII with mock data.
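
The steps above can be illustrated with a minimal Python sketch for a single entity type. This is not Syntho's implementation: the pattern, the context words, the confidence scores, the `CREDIT_CARD` label format, and the mock value are all hypothetical, and the NER step is only marked with a comment because it requires a trained model.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical rule set for one entity type; not Syntho's actual rules.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
CONTEXT_WORDS = {"card", "visa", "credit"}
MOCK_CARD = "0000 0000 0000 0000"  # placeholder mock value


@dataclass
class Detection:
    start: int
    end: int
    label: str
    score: float


def luhn_valid(number: str) -> bool:
    """Checksum step: validate a candidate card number with the Luhn algorithm."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_cards(text: str, window: int = 30) -> List[Detection]:
    detections = []
    for m in CARD_PATTERN.finditer(text):       # 1. regex
        # 2. NER would add natural-language entities here (omitted in this sketch).
        if not luhn_valid(m.group()):           # 3. checksum
            continue
        score = 0.5                             # assumed base confidence
        nearby = text[max(0, m.start() - window): m.end() + window].lower()
        if any(word in nearby for word in CONTEXT_WORDS):  # 4. context words (naive substring check)
            score += 0.35
        detections.append(Detection(m.start(), m.end(), "CREDIT_CARD", score))  # 5. label
    return detections


def scan(text: str, replace_with_mock: bool = False) -> str:
    """Annotate detected PII with its label, or replace it with mock data (step 6)."""
    out = text
    # Replace from right to left so earlier offsets stay valid.
    for d in sorted(detect_cards(text), key=lambda d: d.start, reverse=True):
        replacement = MOCK_CARD if replace_with_mock else f"<{d.label}>"
        out = out[:d.start] + replacement + out[d.end:]
    return out


sample = "Paid with card 4111 1111 1111 1111 yesterday, order 1234 5678 9012 3456."
print(scan(sample))
print(scan(sample, replace_with_mock=True))
```

In this sketch, the second number matches the regex but fails the checksum, so it is left untouched. This shows why validation runs after pattern matching.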

## Supported languages

Under **Encoding type > Locale**, you can define the locale used by the text processing models for text columns containing PII.

Syntho supports detection and de-identification of PII in free-text columns for **English** and **Dutch**.

Syntho allows adding **NLP (natural language processing)** models with limited support for different languages (see next section).

## Configure to use other NLP models (limited support)

{% hint style="info" %}
**Note**: using non-default NLP models requires having an active internet connection to retrieve those models.
{% endhint %}

Syntho uses NLP engines for two main tasks: NER-based PII identification, and feature extraction for custom rule-based logic (such as leveraging context words for improved detection).

By default, with each deployment, Syntho ships the following open-source models from spaCy:

* `en_core_web_md` for English.
* `nl_core_news_md` for Dutch.
* `de_core_news_md` for German.

These models can be replaced by leveraging other NLP models, either public or proprietary. As its internal NLP engine, Syntho supports both [spaCy](https://spacy.io/usage/models) and [Stanza](https://github.com/stanfordnlp/stanza).

This feature can be enabled via the workspace default settings. Hold **CTRL + SHIFT + ALT + 0** to open the **Workspace Default Settings** and enable the model by setting the **model\_name** to any model name as defined in [spaCy](https://spacy.io/usage/models) or [Stanza](https://github.com/stanfordnlp/stanza). For example, to use the English transformer spaCy model:

```
"text_processor_model_settings": {
    "models": [
        {
            "lang_code": "en",
            "model_name": "en_core_web_trf"
        }, ...
    ],
    "nlp_engine_name": "spacy",
    "gpu": false
}
```
```

Optionally, if you have configured a GPU in your deployment setup, the `"gpu"` parameter can be set to `true` for faster results.
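
If you maintain these settings for several languages, generating the fragment from a script can help avoid syntax mistakes. The following is a minimal sketch: the helper `build_text_processor_settings` is hypothetical and not part of Syntho, and only the key names and the `"spacy"` engine value are taken from the example above. It wraps the fragment in an enclosing JSON object, which may or may not match how your workspace settings are structured.

```python
import json
from typing import Dict


def build_text_processor_settings(
    models: Dict[str, str],
    nlp_engine_name: str = "spacy",
    gpu: bool = False,
) -> dict:
    """Build the settings fragment from a {lang_code: model_name} mapping.

    Hypothetical helper: it only assembles the keys shown in the example above
    and does not check that the model names exist in spaCy or Stanza.
    """
    if not models:
        raise ValueError("at least one model is required")
    entries = []
    for lang_code, model_name in models.items():
        if not lang_code or not model_name:
            raise ValueError("lang_code and model_name must be non-empty")
        entries.append({"lang_code": lang_code, "model_name": model_name})
    return {
        "text_processor_model_settings": {
            "models": entries,
            "nlp_engine_name": nlp_engine_name,
            "gpu": gpu,
        }
    }


settings = build_text_processor_settings(
    {"en": "en_core_web_trf", "nl": "nl_core_news_md"}
)
print(json.dumps(settings, indent=4))
```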

### Other model requests

Other NLP models, such as [transformer models](https://github.com/huggingface/transformers), can be added on request with limited support from Syntho. It is important to remember that using other models will impact the PII detection flow and its performance.

## Considerations & limitations

* **PII Detection and Confidence Score:** The PII text scanner may identify multiple potential **Personally Identifiable Information (PII)** entities within a text column. When this occurs, the entity with the highest confidence score is presented to the user. However, it's important to understand that a high confidence score doesn't guarantee accuracy. This could result in mislabeling the type of PII detected.
* **Internet Requirement for Non-Default NLP Models:** If you opt to use specialized **Natural Language Processing (NLP)** models to accommodate different languages or regions, an active internet connection is necessary to download these models.
* **Detection Methods:** The scanner employs a multi-method approach for PII detection, including the use of **Regex** patterns, **Named Entity Recognition (NER)** models, checksum validation, and examination of context words. Note that the effectiveness of NER models can vary with the context in which they are used. For instance, a NER model trained on Wikipedia text may not perform well when applied to medical data.
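
The first consideration can be illustrated with a short Python sketch of highest-confidence selection. It assumes that each candidate carries a character span, a label, and a score. The `Candidate` type, the labels, and the scores are hypothetical and do not reflect Syntho's internal data model.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    start: int   # start offset of the detected span
    end: int     # end offset (exclusive)
    label: str   # hypothetical entity label
    score: float # confidence score between 0 and 1


def pick_best(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the highest-scoring candidate among overlapping spans."""
    chosen: List[Candidate] = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        # Accept a candidate only if it does not overlap an already chosen span.
        if all(cand.end <= kept.start or cand.start >= kept.end for kept in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda c: c.start)


# The same ten characters could be read as a date or as a phone number.
candidates = [
    Candidate(0, 10, "PHONE_NUMBER", 0.55),
    Candidate(0, 10, "DATE", 0.60),
    Candidate(20, 25, "PERSON", 0.90),
]
for best in pick_best(candidates):
    print(best.label, best.score)
```

Here the span is labeled `DATE` because of a slightly higher score, even though it may in fact be a phone number. This is the kind of mislabeling described above.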

By understanding these details, you can better navigate how the PII text scanner works and what its limitations may be.
