Automatic PII discovery and de-identification in free text columns

Caution: this feature will slow down your data generation jobs significantly. Consider using a GPU or reducing the number of input records to speed up your job.

For more information about the PII entities that Syntho supports, see Supported PII entities.

Use Syntho PII text scanner

There are two ways to use Syntho's PII text scanner: in combination with the Duplicate column generation method, or in combination with AI-powered generation.

Use PII text scanner with duplicated columns

When you use the PII text scanner with the Duplicate generation method, the scanner is applied first and the processed column is then duplicated. To set this up:

Under Column settings > Generation Method, select Duplicate.
Then, under the dropdown, select the locale to use for detecting the PII entities.
Optionally, enable Replace PII with mock data. When this option is enabled, PII will be replaced with mock values. When this option is disabled, PII will be annotated with a PII label.
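
The difference between the two output modes can be illustrated with a minimal Python sketch. It assumes detection has already produced character spans. The Entity type, the <LABEL> annotation format, and the mock values are hypothetical and are not Syntho's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    start: int  # index of the first character of the PII span
    end: int    # index one past the last character of the span
    label: str  # entity type, e.g. "PERSON"


# Hypothetical mock values; a real system would generate these per entity.
MOCK_VALUES = {"PERSON": "Jane Doe", "EMAIL": "user@example.com"}


def deidentify(text: str, entities: list[Entity], replace_with_mock: bool) -> str:
    """Replace each span with a mock value, or annotate it with its label."""
    result = text
    # Work from right to left so earlier offsets stay valid after each edit.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        if replace_with_mock:
            substitute = MOCK_VALUES.get(ent.label, f"<{ent.label}>")
        else:
            substitute = f"<{ent.label}>"
        result = result[:ent.start] + substitute + result[ent.end:]
    return result


text = "Contact John Smith at john@corp.com."
entities = [Entity(8, 18, "PERSON"), Entity(22, 35, "EMAIL")]

print(deidentify(text, entities, replace_with_mock=True))
print(deidentify(text, entities, replace_with_mock=False))
```

With the option enabled, the sketch substitutes the mock values; with it disabled, each span is replaced by its label in angle brackets.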

Note

When you use the PII text scanner together with AI-powered generation, the following steps occur in this order:

Data Preprocessing: Initially, settings like the "Rare category protection threshold" and "replacement value" will be applied to your data.
PII Text Processing: Next, the PII text scanner will go through the data to identify and handle PII.
AI-Powered Generation: Finally, the AI will generate new data, treating the processed text column as if it were a category encoding type.

By understanding this sequence, you can better anticipate what the generated data will look like.

PII detection flow

When you apply the PII text scanner to specific columns, Syntho automatically scans for PII elements in those columns. Identified PII elements can then be replaced with mock data. Syntho employs a variety of algorithms and methods to improve the scanning process.

Here's an overview of the steps taken in the detection process, in chronological order:

Regex: for pattern recognition.
Named Entity Recognition (NER): to recognize natural language PII entities.
Checksums: to validate detected patterns.
Context words: to increase detection certainty.
Label: to label each detected PII entity with a descriptor of its entity type.
(Optional) Obfuscate: to replace the labeled PII with mock data.
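
As a rough illustration of how several of these steps can combine, the Python sketch below detects credit card numbers using a regex, a Luhn checksum, and context words, and then labels the result. It is not Syntho's implementation: the pattern, the context word list, the CREDIT_CARD label, and the score values are all assumptions chosen for the example, and the NER step is omitted because it requires a trained model.

```python
import re

# Assumed pattern: 13 to 19 digits, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
# Assumed context words that make a card number more plausible.
CONTEXT_WORDS = {"card", "credit", "visa", "mastercard"}


def luhn_valid(number: str) -> bool:
    """Checksum step: validate the digits with the Luhn algorithm."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_cards(text: str) -> list[dict]:
    findings = []
    words = set(re.findall(r"[a-z]+", text.lower()))
    for match in CARD_PATTERN.finditer(text):      # regex step
        if not luhn_valid(match.group()):          # checksum step
            continue
        score = 0.5                                # assumed base score
        if words & CONTEXT_WORDS:                  # context word step
            score += 0.4
        findings.append({                          # label step
            "label": "CREDIT_CARD",
            "start": match.start(),
            "end": match.end(),
            "score": score,
        })
    return findings


sample = "My card number is 4111 1111 1111 1111."
for finding in detect_cards(sample):
    print(finding["label"], sample[finding["start"]:finding["end"]])
```

A number that matches the pattern but fails the checksum is discarded, and a nearby context word raises the score of a match that passes it.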

Supported languages

Under Encoding type > Locale, you can define the locale used by the text processing models for text columns containing PII.

Syntho supports detection and de-identification of PII in free text columns for English and Dutch.

NLP (natural language processing) models for other languages can be added with limited support (see the next section).

Configure to use other NLP models (limited support)

Note: using non-default NLP models requires having an active internet connection to retrieve those models.

Syntho uses NLP engines for two main tasks: NER-based PII identification, and feature extraction for custom rule-based logic (such as leveraging context words for improved detection).

By default, with each deployment, Syntho ships the following open-source models from spaCy:

en_core_web_md for English.
nl_core_news_md for Dutch.
de_core_news_md for German.

These models can be replaced with other NLP models, either public or proprietary. Syntho supports both spaCy and Stanza as its internal NLP engine.

This feature can be enabled via the workspace default settings. Press CTRL + SHIFT + ALT + 0 to open the Workspace Default Settings, then set model_name to any model name defined in spaCy or Stanza. For example, to use the English transformer spaCy model:

"text_processor_model_settings": {
    "models": [
        {
            "lang_code": "en",
            "model_name": "en_core_web_trf"
        }, ...
    ],
    "nlp_engine_name": "spacy",
    "gpu": false
}

Optionally, if you have configured a GPU in your deployment setup, the "gpu" parameter can be set to true for faster results.
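
Before pasting a settings fragment into the workspace defaults, it can help to check its shape. The Python sketch below is only a hypothetical pre-check: the key names come from the example above, but the validation rules and the "stanza" engine identifier are assumptions, and Syntho's own validation may differ.

```python
import json

# "spacy" appears in the example above; "stanza" is an assumed identifier.
SUPPORTED_ENGINES = {"spacy", "stanza"}


def validate_model_settings(settings: dict) -> list[str]:
    """Return a list of problems found in the settings (empty if none)."""
    problems = []
    if settings.get("nlp_engine_name") not in SUPPORTED_ENGINES:
        problems.append(
            "nlp_engine_name must be one of: " + ", ".join(sorted(SUPPORTED_ENGINES))
        )
    if not isinstance(settings.get("gpu"), bool):
        problems.append("gpu must be true or false")
    models = settings.get("models")
    if not isinstance(models, list) or not models:
        problems.append("models must be a non-empty list")
        return problems
    for i, model in enumerate(models):
        if not isinstance(model, dict):
            problems.append(f"models[{i}] must be an object")
            continue
        for key in ("lang_code", "model_name"):
            value = model.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"models[{i}].{key} must be a non-empty string")
    return problems


raw = """
{
    "models": [{"lang_code": "en", "model_name": "en_core_web_trf"}],
    "nlp_engine_name": "spacy",
    "gpu": false
}
"""
settings = json.loads(raw)
print(validate_model_settings(settings))
```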

Other model requests

Other NLP models, such as transformer models, can be added on request with limited support from Syntho. It is important to remember that using other models will impact the PII detection flow and its performance.

Considerations & limitations

PII Detection and Confidence Score: The PII text scanner may identify multiple potential Personally Identifiable Information (PII) entities within a text column. When this occurs, the entity with the highest confidence score is presented to the user. However, a high confidence score does not guarantee accuracy, so the detected PII type can still be mislabeled.
Internet Requirement for Non-Default NLP Models: If you opt to use specialized Natural Language Processing (NLP) models to accommodate different languages or regions, an active internet connection is necessary to download these models.
Detection Methods: The scanner employs a multi-method approach for PII detection, including the use of Regex patterns, Named Entity Recognition (NER) models, checksum validation, and examination of context words. Note that the effectiveness of NER models varies with the context in which they are used. For instance, an NER model trained on Wikipedia text may not perform well when applied to medical data.

By understanding these details, you can better navigate how the PII text scanner works and what its limitations may be.
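
The confidence score consideration above can be sketched in a few lines of Python. The candidate structure, labels, and scores are hypothetical. The point is only that selecting by highest score returns one label even when a competing candidate scores nearly as well.

```python
def pick_best(candidates: list[dict]) -> dict:
    """Return the candidate with the highest confidence score."""
    return max(candidates, key=lambda c: c["score"])


# Two hypothetical interpretations of the same digit sequence.
candidates = [
    {"label": "PHONE_NUMBER", "score": 0.55},
    {"label": "CREDIT_CARD", "score": 0.60},
]

best = pick_best(candidates)
print(best["label"], best["score"])
```

Here the sequence is reported as CREDIT_CARD although the PHONE_NUMBER candidate is almost as likely, which is how a plausible but incorrect label can surface.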


