Automatic PII discovery and de-identification in free text columns
Caution: this feature will slow down your data generation jobs significantly. Consider using a GPU or reducing the number of input records to speed up your job.
For more information about the PII entities that Syntho supports, see Supported PII entities.
There are two ways to use Syntho's PII text scanner: in combination with the Duplicate generation method, or in combination with AI-powered generation.
When using the PII text scanner in combination with the Duplicate generation method, the column will be duplicated after the PII text scanner has been applied. To apply this:
Under Column settings > Generation Method, select Duplicate.
Then, under the dropdown, select the locale to use for detecting the PII entities.
Optionally, enable Replace PII with mock data. When this option is enabled, PII will be replaced with mock values. When this option is disabled, PII will be annotated with a PII label.
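The difference between the two modes of Replace PII with mock data can be illustrated with a minimal sketch. It is not Syntho's implementation: a single email regex stands in for the scanner, and the mock value, the label format, and the function name deidentify are assumptions made for illustration.

```python
import re

# Toy stand-in for the PII text scanner: detects email addresses only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def deidentify(text: str, replace_with_mock: bool) -> str:
    """Return text with emails either replaced by a mock value or annotated."""
    mock_value = "jane.doe@example.com"  # fixed placeholder, not a real mock generator
    label = "<EMAIL_ADDRESS>"            # assumed label format
    return EMAIL.sub(mock_value if replace_with_mock else label, text)

text = "Contact me at p.jansen@mail.nl for details."
print(deidentify(text, replace_with_mock=True))
# Contact me at jane.doe@example.com for details.
print(deidentify(text, replace_with_mock=False))
# Contact me at <EMAIL_ADDRESS> for details.
```

With the option enabled, the output still reads like natural text; with it disabled, the output shows which entity type was found at each position.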
Note: when you use the PII text scanner together with AI-powered generation, the following steps occur in this order:
Data Preprocessing: Initially, settings like the "Rare category protection threshold" and "replacement value" will be applied to your data.
PII Text Processing: Next, the PII text scanner will go through the data to identify and handle PII.
AI-Powered Generation: Finally, the AI will generate new data, treating the processed text column as if it had the category encoding type.
By understanding this sequence, you can better anticipate what the generated data will look like.
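The order of these steps can be sketched roughly as follows. This is an illustration under stated assumptions, not Syntho's code: protect_rare assumes that values occurring no more often than the threshold are swapped for the replacement value, and a ten-digit regex stands in for the PII text scanner.

```python
from collections import Counter
import re

def protect_rare(values, threshold, replacement):
    """Step 1 (assumed semantics): replace values seen <= threshold times."""
    counts = Counter(values)
    return [v if counts[v] > threshold else replacement for v in values]

PHONE = re.compile(r"\b\d{10}\b")  # toy stand-in for the PII text scanner

def scan_pii(values):
    """Step 2: annotate detected PII with a label."""
    return [PHONE.sub("<PHONE_NUMBER>", v) for v in values]

column = ["call 0612345678", "call 0612345678", "no answer", "no answer", "left note"]
step1 = protect_rare(column, threshold=1, replacement="*")
step2 = scan_pii(step1)

# Step 3: the generator would treat the distinct processed texts as categories.
categories = sorted(set(step2))
print(categories)
# ['*', 'call <PHONE_NUMBER>', 'no answer']
```

Because rare values are replaced before the scan, a text that occurs only once never reaches the PII scanner in its original form.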
When you apply the PII text scanner to specific columns, Syntho automatically scans for PII elements in those columns. Identified PII elements can then be replaced with mock data. Syntho employs a variety of algorithms and methods to improve the scanning process.
Here's an overview of the steps taken in the detection process, in chronological order:
Regex: for pattern recognition.
Named Entity Recognition (NER): to recognize natural language PII entities.
Checksums: to validate detected patterns.
Context words: to increase detection certainty.
Label: to tag each detected PII entity with a descriptor of its entity type.
(Optional) Obfuscate: to replace the labeled PII entities with mock data.
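The steps above can be sketched for a single entity type. The example below is a simplified illustration, not Syntho's detection code: it covers credit card numbers only, skips the NER step (which needs a language model), and the scores, context words, and names such as Finding, detect, and apply_findings are assumptions.

```python
import re
from dataclasses import dataclass

@dataclass
class Finding:
    start: int
    end: int
    label: str
    score: float

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
CONTEXT_WORDS = ("card", "credit", "visa")  # assumed context words

def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def detect(text: str) -> list:
    findings = []
    for m in CARD.finditer(text):                    # regex: pattern recognition
        digits = re.sub(r"\D", "", m.group())
        if not luhn_ok(digits):                      # checksum: validate the pattern
            continue
        score = 0.5                                  # assumed base confidence
        window = text[max(0, m.start() - 30):m.start()].lower()
        if any(word in window for word in CONTEXT_WORDS):
            score += 0.25                            # context words raise certainty
        findings.append(Finding(m.start(), m.end(), "CREDIT_CARD", score))  # label
    return findings

def apply_findings(text: str, findings, mock=None) -> str:
    """Annotate with a label, or obfuscate with a mock value if one is given."""
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        replacement = mock if mock is not None else f"<{f.label}>"
        text = text[:f.start] + replacement + text[f.end:]
    return text

text = "Paid with credit card 4111 1111 1111 1111, ref 1234567890123."
found = detect(text)
print(apply_findings(text, found))
# Paid with credit card <CREDIT_CARD>, ref 1234567890123.
```

The reference number matches the regex but fails the checksum, so it is left untouched; this is the kind of false positive the validation step is meant to filter out.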
Under Encoding type > Locale, you can define the locale used by the text processing models for text columns containing PII.
Syntho supports detection and de-identification of PII in free text columns for English and Dutch.
Syntho allows adding NLP (natural language processing) models with limited support for different languages (see next section).
Note: using non-default NLP models requires having an active internet connection to retrieve those models.
Syntho uses NLP engines for two main tasks: NER-based PII identification, and feature extraction for custom rule-based logic (such as leveraging context words for improved detection).
By default, with each deployment, Syntho ships the following open-source models from spaCy:
en_core_web_md for English.
nl_core_news_md for Dutch.
de_core_news_md for German.
These models can be replaced by leveraging other NLP models, either public or proprietary. As its internal NLP engine, Syntho supports both spaCy and Stanza.
This feature can be enabled via the workspace default settings. Hold CTRL + SHIFT + ALT + 0 to open the Workspace Default Settings and enable the model by setting the model_name to any model name as defined in spaCy or Stanza. For example, to use the English transformer spaCy model:
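A minimal sketch of what such a setting could look like, written as a Python dictionary and printed as JSON. Only the model_name and gpu parameters are named on this page; the surrounding structure is an assumption, so check the Workspace Default Settings in your own deployment for the actual layout. en_core_web_trf is the name of spaCy's English transformer pipeline.

```python
import json

# Hypothetical settings payload: the key names come from this page,
# the overall structure is assumed for illustration only.
nlp_settings = {
    "model_name": "en_core_web_trf",  # spaCy's English transformer pipeline
    "gpu": False,                     # set to True only if a GPU is configured
}

print(json.dumps(nlp_settings, indent=2))
```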
Optionally, if you have configured a GPU in your deployment setup, the "gpu" parameter can be set to true for faster results.
Other NLP models, such as transformer models, can be added on request with limited support from Syntho. It is important to remember that using other models will impact the PII detection flow and its performance.
PII Detection and Confidence Score: The PII text scanner may identify multiple potential Personally Identifiable Information (PII) entities within a text column. When this occurs, the entity with the highest confidence score is presented to the user. However, it's important to understand that a high confidence score doesn't guarantee accuracy. This could result in mislabeling the type of PII detected.
Internet Requirement for Non-Default NLP Models: If you opt to use specialized Natural Language Processing (NLP) models to accommodate different languages or regions, an active internet connection is necessary to download these models.
Detection Methods: The scanner employs a multi-method approach for PII detection, including the use of Regex patterns, Named Entity Recognition (NER) models, checksum validation, and examination of context words. Note that the effectiveness of NER models varies with the context in which they are used. For instance, an NER model trained on Wikipedia text may not perform well when applied to medical data.
By understanding these details, you can better navigate how the PII text scanner works and what its limitations may be.
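The confidence-score limitation described above can be made concrete with a small sketch. The candidate labels and scores are made up for illustration, and pick_label is a hypothetical helper, not part of Syntho.

```python
# Two competing interpretations of the same text span, with assumed scores.
candidates = [
    {"label": "PHONE_NUMBER", "score": 0.60},
    {"label": "BANK_ACCOUNT", "score": 0.75},
]

def pick_label(candidates):
    """Return the label of the candidate with the highest confidence score."""
    return max(candidates, key=lambda c: c["score"])["label"]

print(pick_label(candidates))
# BANK_ACCOUNT
```

If the span is in fact a phone number, the higher-scoring candidate still wins, and the text is labeled, or replaced with mock data, as a bank account.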