Free text de-identification

Free text de-identification enables Syntho to automatically detect and anonymize personally identifiable information (PII) hidden in unstructured text columns. This is particularly useful for columns such as notes, comments, or descriptions, where names and other sensitive values may be embedded in the text.

Caution: Using this feature significantly increases processing time. Consider limiting the number of input rows or enabling GPU acceleration.

When to use

To identify and anonymize PII within text fields like "notes", "comments", or "descriptions"
When working with unstructured data that may contain embedded identifiers
To prepare free-text data for AI-powered generation or duplication without privacy risk

When not to use

When the text only contains a single identifiable value (e.g., just a name or number)
When the text exceeds 1,000 characters or contains complex, domain-specific language
When performance and speed are critical and anonymization of text is not essential

PII detection flow

When you apply the PII text scanner to specific columns, Syntho automatically scans for PII elements in those columns. Identified PII elements can then be replaced with mock data. Syntho employs a variety of algorithms and methods to improve the scanning process.

Here's an overview of the steps taken in the detection process, in chronological order:

Regex: for pattern recognition.
Named Entity Recognition (NER): to recognize natural language PII entities.
Checksums: to validate detected patterns.
Context words: to increase detection certainty.
Label: to tag each detected PII entity with a descriptor of its type.
(Optional) Obfuscate: to replace the labeled PII entities with mock data.
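
To make the flow above concrete, here is a minimal Python sketch for one PII type, a payment card number. It is not Syntho's implementation: the pattern, context words, confidence scores, and function names are all assumptions chosen for illustration, and the NER step is left out because it requires a trained language model.

```python
import re

# Hypothetical pattern and context words, chosen only for illustration.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
CONTEXT_WORDS = {"card", "credit", "visa", "mastercard"}


def luhn_valid(digits: str) -> bool:
    """Checksum step: return True if the digits pass the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_cards(text: str) -> list[dict]:
    """Run the regex, checksum, context-word, and label steps on one text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    findings = []
    for match in CARD_PATTERN.finditer(text):      # regex step
        digits = re.sub(r"\D", "", match.group())
        if not luhn_valid(digits):                 # checksum step
            continue
        # Context-word step: assumed scores, raised when a hint word is present.
        score = 0.9 if words & CONTEXT_WORDS else 0.6
        findings.append({
            "label": "CREDIT_CARD",                # label step
            "start": match.start(),
            "end": match.end(),
            "score": score,
        })
    return findings


def obfuscate(text: str, findings: list[dict], mock: str = "<CREDIT_CARD>") -> str:
    """Optional step: replace each finding, working right to left so offsets stay valid."""
    for f in sorted(findings, key=lambda f: f["start"], reverse=True):
        text = text[:f["start"]] + mock + text[f["end"]:]
    return text
```

Calling `detect_cards("Customer paid with card 4111 1111 1111 1111 yesterday.")` yields a single `CREDIT_CARD` finding, and passing that result to `obfuscate` replaces the number with the placeholder. A number that matches the pattern but fails the checksum is discarded, which is how the checksum step reduces false positives from the regex step.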

Supported languages

Under Encoding type > Locale, you can define the locale used by the text processing models for text columns containing PII.

Syntho supports detection and de-identification of PII in free text columns for English and Dutch.

Syntho also allows you to add NLP (natural language processing) models for other languages, although support for these is limited (see the next section).

Considerations & limitations

PII Detection and Confidence Score: The PII text scanner may identify multiple potential PII entities within a text column. When this occurs, the entity with the highest confidence score is presented to the user. A high confidence score does not guarantee accuracy, so the detected PII may still be labeled with the wrong type.
Internet Requirement for Non-Default NLP Models: If you opt to use specialized Natural Language Processing (NLP) models to accommodate different languages or regions, an active internet connection is necessary to download these models.
Detection Methods: The scanner employs a multi-method approach for PII detection, including the use of Regex patterns, Named Entity Recognition (NER) models, checksum validation, and examination of context words. Note that the effectiveness of NER models can vary with the context in which they are used. For instance, an NER model trained on Wikipedia text may not perform well when applied to medical data.
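
The confidence-score behaviour described in the first item above can be sketched as follows. This is an illustration only: the `Candidate` class, the `pick_label` function, the `margin` parameter, and the scores are hypothetical and do not reflect Syntho's internals. The ambiguity flag is an addition for illustration, showing one way a close runner-up could be surfaced.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str    # proposed PII type
    score: float  # detector confidence between 0 and 1


def pick_label(candidates, margin=0.15):
    """Return (best candidate, ambiguous flag) for one detected text span."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    best = ranked[0]
    # Flag the result when the runner-up is within `margin` of the winner.
    ambiguous = len(ranked) > 1 and best.score - ranked[1].score < margin
    return best, ambiguous


# Hypothetical scores for the word "Jordan", which can be a person or a place.
candidates = [Candidate("PERSON", 0.62), Candidate("LOCATION", 0.71)]
best, ambiguous = pick_label(candidates)
print(best.label, ambiguous)
```

In this example the highest-scoring label is `LOCATION`, even if the text actually refers to a person named Jordan. The narrow gap between the two scores is a hint that the label deserves a manual check.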

Free text de-identification helps ensure even your unstructured data is privacy-safe. Use it when working with comments, descriptions, or any column that may include sensitive terms embedded in natural language.
