Free text de-identification


Last updated 16 days ago


Free text de-identification enables Syntho to automatically detect and anonymize personally identifiable information (PII) hidden in unstructured text columns. This is particularly useful for columns containing names, notes, comments, or descriptions that may include sensitive information.

Caution: Using this feature significantly increases processing time. Consider limiting the number of input rows or enabling GPU acceleration.

When to use

  • To identify and anonymize PII within text fields like "notes", "comments", or "descriptions"

  • When working with unstructured data that may contain embedded identifiers

  • To prepare free-text data for AI-powered generation or duplication without privacy risk

When not to use

  • When the text only contains a single identifiable value (e.g., just a name or number)

  • When the text exceeds 1,000 characters or contains complex, domain-specific language

  • When performance and speed are critical and anonymization of text is not essential
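Before enabling the feature, it can help to check how many values in a column exceed the 1,000-character guideline above. The following is a minimal sketch that runs outside Syntho on values you have already exported. The function name `split_by_length` and the constant `MAX_CHARS` are hypothetical and not part of Syntho.

```python
# Hypothetical pre-check, not a Syntho API: find values longer than the
# 1,000-character guideline mentioned in "When not to use".
MAX_CHARS = 1_000


def split_by_length(values, max_chars=MAX_CHARS):
    """Return (suitable, too_long) lists of row indices.

    None values are treated as empty and therefore as suitable.
    """
    suitable, too_long = [], []
    for index, value in enumerate(values):
        if value is not None and len(value) > max_chars:
            too_long.append(index)
        else:
            suitable.append(index)
    return suitable, too_long
```

If a large share of rows lands in `too_long`, the column may be a poor fit for free text de-identification.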




Free text de-identification helps ensure even your unstructured data is privacy-safe. Use it when working with comments, descriptions, or any column that may include sensitive terms embedded in natural language.


Supported languages

Under Encoding type > Locale, you can define the locale used by the text processing models for text columns containing PII.

Syntho supports detection and de-identification of PII in free text columns for English and Dutch.

Syntho also allows adding NLP (natural language processing) models for other languages, with limited support (see Considerations & limitations).

PII detection flow

When you apply the PII text scanner to specific columns, Syntho automatically scans for PII elements in those columns. Identified PII elements can then be replaced with mock data. Syntho employs a variety of algorithms and methods to improve the scanning process.

Here's an overview of the steps taken in the detection process, in chronological order:

  1. Regex: for pattern recognition.

  2. Named Entity Recognition (NER): to recognize natural language PII entities.

  3. Checksums: to validate detected patterns.

  4. Context words: to increase detection certainty.

  5. Label: to tag each detected PII entity with a descriptor of its type.

  6. (Optional) Obfuscate: to replace the labeled PII entities with mock data.
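The steps above can be sketched in a few lines of Python. This is an illustration of the general technique only, not Syntho's implementation. The patterns, score increments, context words, mock values, and all function names are assumptions made for the example, and a simple name lookup stands in for a real NER model.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    start: int
    end: int
    label: str
    score: float


# Illustrative patterns, context words, and mock values (all assumptions).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
}
CONTEXT_WORDS = {"CREDIT_CARD": ("card", "visa"), "EMAIL": ("mail", "contact")}
MOCK_VALUES = {
    "EMAIL": "user@example.com",
    "CREDIT_CARD": "0000 0000 0000 0000",
    "PERSON": "Jane Doe",
}
KNOWN_NAMES = ("Alice Jansen",)  # stand-in for an NER model


def ner_stand_in(text: str) -> list[Detection]:
    """Placeholder for step 2: a real pipeline would call an NER model here."""
    return [
        Detection(m.start(), m.end(), "PERSON", 0.5)
        for name in KNOWN_NAMES
        for m in re.finditer(re.escape(name), text)
    ]


def luhn_valid(number: str) -> bool:
    """Luhn checksum, used here to validate card-like digit sequences."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0


def detect(text: str) -> list[Detection]:
    # 1. Regex: collect pattern matches with a base score.
    found = [
        Detection(m.start(), m.end(), label, 0.5)
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    # 2. NER: add natural-language entities.
    found.extend(ner_stand_in(text))
    # 3. Checksums: drop card-like matches that fail, boost those that pass.
    checked = []
    for d in found:
        if d.label == "CREDIT_CARD":
            if not luhn_valid(text[d.start:d.end]):
                continue
            d.score = min(1.0, d.score + 0.25)
        checked.append(d)
    # 4. Context words: boost when a related word appears shortly before.
    lowered = text.lower()
    for d in checked:
        window = lowered[max(0, d.start - 30):d.start]
        if any(word in window for word in CONTEXT_WORDS.get(d.label, ())):
            d.score = min(1.0, d.score + 0.25)
    return checked


def deidentify(text: str, obfuscate: bool = False) -> str:
    # 5. Label each entity, or 6. replace it with mock data.
    out, cutoff = text, len(text)
    # Work right to left so earlier offsets stay valid; skip overlapping spans.
    for d in sorted(detect(text), key=lambda d: d.start, reverse=True):
        if d.end > cutoff:
            continue
        replacement = MOCK_VALUES[d.label] if obfuscate else f"<{d.label}>"
        out = out[:d.start] + replacement + out[d.end:]
        cutoff = d.start
    return out


sample = "Alice Jansen paid with card 4111 1111 1111 1111, mail alice@example.org."
print(deidentify(sample))
# prints: <PERSON> paid with card <CREDIT_CARD>, mail <EMAIL>.
```

Calling `deidentify(sample, obfuscate=True)` replaces each entity with the corresponding entry in `MOCK_VALUES` instead of a label. A digit sequence that fails the checksum, such as `1234 5678 9012 3456`, is left untouched in this sketch.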

Considerations & limitations

  • PII Detection and Confidence Score: The PII text scanner may identify multiple potential Personally Identifiable Information (PII) entities within a text column. When this occurs, the entity with the highest confidence score is presented to the user. However, it's important to understand that a high confidence score doesn't guarantee accuracy. This could result in mislabeling the type of PII detected.

  • Internet Requirement for Non-Default NLP Models: If you opt to use specialized Natural Language Processing (NLP) models to accommodate different languages or regions, an active internet connection is necessary to download these models.

  • Detection Methods: The scanner employs a multi-method approach for PII detection, including Regex patterns, Named Entity Recognition (NER) models, checksum validation, and examination of context words. Note that the effectiveness of NER models can vary depending on the context in which they are used. For instance, an NER model trained on Wikipedia text may not perform well when applied to medical data.
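The confidence-score behavior described in the first consideration can be illustrated with a short sketch. It assumes that competing candidates carry a character span, a label, and a score, and that the highest-scoring candidate wins for each overlapping span. The `Candidate` type, the labels, and the scores are hypothetical. How Syntho resolves competing entities internally is not documented here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    start: int
    end: int
    label: str
    score: float


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True when the two character spans share at least one position."""
    return a.start < b.end and b.start < a.end


def pick_highest_confidence(candidates: list[Candidate]) -> list[Candidate]:
    """Keep, for each group of overlapping spans, only the top-scoring one."""
    kept: list[Candidate] = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if not any(overlaps(cand, other) for other in kept):
            kept.append(cand)
    return sorted(kept, key=lambda c: c.start)


# The same ten digits could be a phone number or an account number.
candidates = [
    Candidate(5, 15, "PHONE_NUMBER", 0.6),
    Candidate(5, 15, "BANK_ACCOUNT", 0.7),
]
print(pick_highest_confidence(candidates)[0].label)  # prints: BANK_ACCOUNT
```

In this example the digits might in fact be a phone number, yet the slightly higher score causes them to be labeled as a bank account. This is the kind of mislabeling the consideration warns about, so reviewing detected labels on a sample of rows remains worthwhile.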