Free text de-identification
Free text de-identification enables Syntho to automatically detect and anonymize personally identifiable information (PII) hidden in unstructured text columns. This is particularly useful for columns containing names, notes, comments, or descriptions that may include sensitive information.
Caution: Using this feature significantly increases processing time. Consider limiting the number of input rows or enabling GPU acceleration.
To identify and anonymize PII within text fields like "notes", "comments", or "descriptions"
When working with unstructured data that may contain embedded identifiers
To prepare free-text data for AI-powered generation or duplication without privacy risk
Follow the interactive guide below to apply free text de-identification.
Free text de-identification helps ensure even your unstructured data is privacy-safe. Use it when working with comments, descriptions, or any column that may include sensitive terms embedded in natural language.
Under Encoding type > Locale, you can define the locale used by the text processing models for text columns containing PII.
Syntho supports detection and de-identification of PII in free text columns written in English and Dutch.
Syntho allows adding NLP (natural language processing) models with limited support for different languages (see next section).
When you apply the PII text scanner to specific columns, Syntho automatically scans for PII elements in those columns. Identified PII elements can then be replaced with mock data. Syntho employs a variety of algorithms and methods to improve the scanning process.
Here's an overview of the steps taken in the detection process, in chronological order:
Regex: for pattern recognition.
Named Entity Recognition (NER): to recognize natural language PII entities.
Checksums: to validate detected patterns.
Context words: to increase detection certainty.
Label: to tag each detected PII entity with a descriptor of its type.
(Optional) Obfuscate: to replace detected PII descriptors with mock data.
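The steps above can be illustrated with a minimal Python sketch. This is not Syntho's implementation: the pattern, context words, scores, and function names are all hypothetical, chosen only to show how a regex match, a checksum, and context words can combine before labeling. The NER step is left out because it requires a trained language model.

```python
import re
from typing import NamedTuple

# Hypothetical values for illustration only; not Syntho's actual configuration.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
CONTEXT_WORDS = ("card", "visa", "mastercard", "credit")


class Finding(NamedTuple):
    start: int
    end: int
    label: str
    score: float


def luhn_valid(digits: str) -> bool:
    """Checksum step: validate a digit string with the Luhn algorithm."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan(text, base_score=0.5, context_boost=0.35, window=30):
    """Regex -> checksum -> context words, returning labeled findings."""
    findings = []
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", m.group())
        if not luhn_valid(digits):
            continue  # pattern matched, but the checksum rejects it
        before = text[max(0, m.start() - window):m.start()].lower()
        score = base_score
        if any(word in before for word in CONTEXT_WORDS):
            score += context_boost  # nearby context raises certainty
        findings.append(Finding(m.start(), m.end(), "CREDIT_CARD", score))
    return findings


def label(text, findings):
    """Label step: replace each finding with a descriptor of its type."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:f.start] + f"<{f.label}>" + text[f.end:]
    return text


note = "Customer paid with card 4111 1111 1111 1111; ref 1234 5678 9012 3456."
findings = scan(note)
labeled = label(note, findings)
print(labeled)
```

In this sketch the second number matches the pattern but fails the checksum, so it is left untouched. An obfuscation step would substitute a mock value for the `<CREDIT_CARD>` descriptor.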
PII Detection and Confidence Score: The PII text scanner may identify multiple potential Personally Identifiable Information (PII) entities within a text column. When this occurs, the entity with the highest confidence score is presented to the user. However, it's important to understand that a high confidence score doesn't guarantee accuracy. This could result in mislabeling the type of PII detected.
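As a rough illustration of why the highest score can still produce a wrong label, the sketch below assumes that candidates compete for the same span of text and that the scanner keeps the highest-scoring one. The `Candidate` type, the labels, and the scores are hypothetical and do not reflect Syntho's internal data model.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    start: int
    end: int
    label: str
    score: float


def resolve_overlaps(candidates):
    """Keep the highest-scoring candidate among overlapping spans."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        # Accept a candidate only if it does not overlap one already kept.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.start)


# Two detectors claim the same ten characters; a third span is unrelated.
candidates = [
    Candidate(5, 15, "PHONE_NUMBER", 0.60),
    Candidate(5, 15, "ACCOUNT_NUMBER", 0.75),
    Candidate(20, 30, "PERSON", 0.90),
]
resolved = resolve_overlaps(candidates)
print([c.label for c in resolved])
```

If the ten characters were in fact a phone number, the higher-scoring `ACCOUNT_NUMBER` label would still win, which is the kind of mislabeling described above.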
Internet Requirement for Non-Default NLP Models: If you opt to use specialized Natural Language Processing (NLP) models to accommodate different languages or regions, an active internet connection is necessary to download these models.
Detection Methods: The scanner employs a multi-method approach for PII detection, including the use of Regex patterns, Named Entity Recognition (NER) models, checksum validation, and examination of context words. Note that the effectiveness of an NER model can vary with the context in which it is used. For instance, an NER model trained on Wikipedia text may not perform well when applied to medical data.