Automatic PII discovery with PII scanner


Last updated 13 days ago


On the PII tab of the Job Configuration panel, you can launch a personally identifiable information (PII) scan that checks all columns in your database for PII.

All positive scan results will show up in the list of PII entities on the PII tab.

On the Job Settings tab, columns listed as PII entities on the PII tab are also labeled PII in the column header.

You can launch a metadata PII scan (shallow scan) or a data PII scan (deep scan). The metadata scan runs faster, because it only applies regular expression rules to the name of each column. The data scan is likely to be more accurate in detecting PII columns, because it analyzes the data inside each column using state-of-the-art natural language processing (NLP) models.

Hint: When using the PII scanner, always validate the columns that it marks as PII. The scanner might flag columns that do not contain PII, and it might miss columns that do.

Shallow scan (uses metadata)

On the PII tab, select the dropdown icon to the right of the Start scan button and select Shallow scan.

The shallow scan evaluates all columns available in the database and uses regular expression rules to deduce the type of PII each column might contain. This process is optimized for speed and runs in parallel; as a result, its predictions might sometimes be less accurate.

Due to the nature of the metadata scan, results generally have a high confidence score. This is because they rely on rules established by Syntho. It is possible to add new rules to detect custom-defined PII entities. For more details, please contact your Syntho representative.
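As an illustration of how a rule-based metadata scan can work, the sketch below matches column names against regular expressions. The rule set, the entity names and the `shallow_scan` function are hypothetical and are not Syntho's actual rules or API.

```python
import re

# Hypothetical rules for illustration only; not Syntho's actual rule set.
METADATA_RULES = {
    "FIRST_NAME": re.compile(r"first[_\s]?name|given[_\s]?name", re.IGNORECASE),
    "EMAIL": re.compile(r"e[_\-\s]?mail", re.IGNORECASE),
    "PHONE_NUMBER": re.compile(r"phone|mobile", re.IGNORECASE),
}


def shallow_scan(column_names):
    """Map each column name to the first PII entity whose rule matches it."""
    results = {}
    for name in column_names:
        for entity, pattern in METADATA_RULES.items():
            if pattern.search(name):
                results[name] = entity
                break  # stop at the first matching rule
    return results


found = shallow_scan(["FIRSTNAME", "Email_Address", "Col1"])
print(found)
```

In this sketch, a generic name such as `Col1` matches no rule and is therefore not reported, which is the situation the deep scan is meant to address.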

Deep scan (uses metadata + data)

On the PII tab, select the dropdown icon to the right of the Start scan button and select Deep scan.

In some cases, Syntho might not detect PII entities with a shallow scan, especially if the column names aren't descriptive of their content. Creating an exhaustive list of rules is also not always practical. Therefore, Syntho offers an option to scan not just the metadata but also the data within the columns to pinpoint potential PII entities.

Initiating a PII scan first launches a metadata scan. Columns not identified as PII and of type "string" or "text" are then considered for the deep scan. We restrict the scan to these types because our natural language processing (NLP) models are trained to identify and extract PII from textual data, relying on word context for predictions.

Caution: The data PII scanner examines the content in each column, meaning the scan duration increases with the size of the database. To cut down on scanning time, you can limit the number of rows read per column. However, this might adversely affect the scan results.
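The selection of columns for the data scan, and the optional row limit, can be sketched as follows. This is a minimal sketch under assumptions: the column dictionaries, the `deep_scan_candidates` and `sample_rows` helpers and the `max_rows` parameter are hypothetical names, not part of Syntho's API.

```python
from itertools import islice

# Column types the data scan considers, as described above.
TEXT_TYPES = {"string", "text"}


def deep_scan_candidates(columns, metadata_hits):
    """Return names of text columns that the metadata scan did not flag."""
    return [
        col["name"]
        for col in columns
        if col["name"] not in metadata_hits and col["type"] in TEXT_TYPES
    ]


def sample_rows(rows, max_rows=None):
    """Read at most max_rows values from a column (all rows if None)."""
    if max_rows is None:
        return list(rows)
    return list(islice(rows, max_rows))


columns = [
    {"name": "FIRSTNAME", "type": "string"},  # already flagged by metadata
    {"name": "Col1", "type": "text"},
    {"name": "Col2", "type": "integer"},      # not a text type, skipped
]
candidates = deep_scan_candidates(columns, metadata_hits={"FIRSTNAME"})
limited = sample_rows(iter(["a", "b", "c", "d"]), max_rows=2)
print(candidates, limited)
```

Reading fewer rows shortens the scan, but a PII value that only occurs in the unread rows cannot be detected.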

In comparison with the metadata scan, data scan results may have a lower confidence level. If a column contains multiple PII types, our software calculates the confidence of the column being of a specific PII type based on how frequently that PII type is detected relative to the total number of rows scanned for that column.
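The frequency-based confidence described above can be illustrated with a short sketch. It assumes, for simplicity, at most one detected PII type per scanned row; the `column_confidence` function and the entity labels are hypothetical.

```python
from collections import Counter


def column_confidence(detections):
    """Compute per-type confidence for one column.

    detections holds one entry per scanned row: the PII type detected
    in that row, or None if nothing was detected (a simplification).
    """
    total = len(detections)
    if total == 0:
        return {}
    counts = Counter(d for d in detections if d is not None)
    # Confidence = rows where the type was detected / total rows scanned.
    return {entity: n / total for entity, n in counts.items()}


scores = column_confidence(["EMAIL", "EMAIL", "PHONE_NUMBER", None])
print(scores)  # EMAIL: 2 of 4 rows, PHONE_NUMBER: 1 of 4 rows
```

In this example the column would be reported as `EMAIL` with a confidence of 0.5, which shows why the top-scored type for a mixed column still needs manual review.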

Limitations (Deep scan only)

  • The deep scan examines each column of data using natural language processing (NLP) models, which rely on surrounding context to produce accurate results. However, columns containing Personally Identifiable Information (PII), like a First_Name column, typically lack this context. For instance, a First_Name column contains only first names, making it challenging for NLP models to accurately identify them as such without additional context.

Supported PII entities

PII scanner parameters

  • The Cardinality toggle, if turned on, checks whether a column has as many unique values as it has rows. If so, the column most likely contains PII.

  • When selecting Add, the generation method / column modal will first appear and the user has to adjust/confirm the settings. Afterwards, the Add button will disappear and the wheel icon will appear.

  • The Allowlist enables users to define a list of tokens that should not be marked as PII even if we want to identify other tokens of that entity type.

  • The Add new PII entity button launches a modal that allows the user to create a new PII entity by filling in three fields:

    1. a name for the user-defined entity,

    2. a RegEx (or list of words),

    3. a confidence percentage.

  • The PII entities to look for parameter is a multi-select dropdown (same as the schema dropdown) showing all available PII entities, including entities created by the user.

  • The PII scan acceptance threshold slider can be used to control the PII entities that are shown to the user.

  • The Learn more about PII button forwards the user to the PII section in Syntho's User Documentation.
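To make several of the parameters above concrete, the following sketch models a user-defined entity (name, RegEx, confidence percentage), an allowlist, an acceptance threshold and the cardinality check. All names here are hypothetical, and the comparison used for the threshold is an assumption; this is not Syntho's implementation.

```python
import re
from dataclasses import dataclass


@dataclass
class CustomEntity:
    name: str          # name of the user-defined entity
    pattern: str       # RegEx (a word list could be joined with "|")
    confidence: float  # confidence percentage, 0 to 100


def find_tokens(entity, text, allowlist=()):
    """Return tokens matching the entity's RegEx, skipping allowlisted ones."""
    matches = (m.group(0) for m in re.finditer(entity.pattern, text))
    return [token for token in matches if token not in allowlist]


def passes_threshold(entity, threshold):
    """Assumption: a result is shown when its confidence meets the threshold."""
    return entity.confidence >= threshold


def is_fully_unique(values):
    """Cardinality check: as many unique values as there are rows."""
    return len(values) > 0 and len(set(values)) == len(values)


employee_id = CustomEntity(name="EMPLOYEE_ID", pattern=r"EMP-\d{4}", confidence=80)
tokens = find_tokens(employee_id, "EMP-0001 met EMP-9999", allowlist={"EMP-0001"})
print(tokens, passes_threshold(employee_id, 75), is_fully_unique(["a", "b", "a"]))
```

Here `EMP-0001` is skipped because it is on the allowlist, while other tokens of the same entity type are still reported.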

Moreover:

  • When defining the locale in the PII scan, please also use that locale as the default for all mockers suggested by the PII scan.

Additionally, take the following points into consideration:

  • If the column header of a PII column is red, it means that no Mocker, Mask, Calculated Column or Exclude has been applied to it.

  • An exclamation mark (!) appears next to a table in the table overview panel on the left if that table has PII-labeled columns set to Duplicate (with no Mocker or Exclude applied). The (!) mark informs the user that the table has columns labeled as PII and that, if the user proceeds, this PII will be duplicated, which could lead to unintentional sharing of sensitive data. To avoid this, the user has two options:

    1. Apply a Mocker.

    2. Exclude the PII column(s).

The exclamation mark (!) helps the user understand that tables marked as de-identify must be de-identified. Please note that de-identification here means excluding or mocking PII columns. A PII column that is not handled by applying a mocker or by excluding it is at risk. Hence, the PII label is red and the table has an exclamation mark symbol next to it.
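The two indicators described above can be summarized as a small rule sketch. The column dictionaries and the `header_is_red` and `table_has_warning` functions are hypothetical and only restate the rules in this section; they do not reflect Syntho's internal logic.

```python
# Generators that, per the text above, clear the red PII column header.
DEIDENTIFYING = {"Mocker", "Mask", "Calculated Column", "Exclude"}


def header_is_red(column):
    """Red header: PII column without a Mocker, Mask, Calculated Column or Exclude."""
    return column["is_pii"] and column["generator"] not in DEIDENTIFYING


def table_has_warning(columns):
    """Exclamation mark: at least one PII column is still set to Duplicate."""
    return any(c["is_pii"] and c["generator"] == "Duplicate" for c in columns)


table = [
    {"name": "FIRSTNAME", "is_pii": True, "generator": "Duplicate"},
    {"name": "MAIDENNAME", "is_pii": True, "generator": "Mocker"},
    {"name": "ORDER_TOTAL", "is_pii": False, "generator": "Duplicate"},
]
red_headers = [c["name"] for c in table if header_is_red(c)]
warning = table_has_warning(table)
print(red_headers, warning)
```

Applying a Mocker to `FIRSTNAME`, or excluding it, would clear both the red header and the table warning in this sketch.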

Limitations

  • Scanner Accuracy: The accuracy of the PII scanner depends on the metadata, data, and PII type. For more accurate PII detection, it is most effective to use descriptive column names like FirstName instead of generic names like Col1.

  • Multiple PII Detections: The PII scanner can identify several possible PII entries in a single column. Be aware that the top-scored entry might not always be correct and this could lead to either misidentifying a non-PII item or wrongly categorizing the PII type.

Understanding these points will help you better utilize the PII scanner and be aware of its limitations.

Caution: The PII scanner is an excellent tool for initial PII detection, but it may not catch all sensitive data. Users are advised to conduct a thorough review to ensure all PII is properly identified.

For more information about the PII entities that Syntho supports, see Supported PII entities.

Selecting PII Types: Currently, users cannot specify which types of PII entities to scan for. For a comprehensive list of the types of PII entities that Syntho scans for, please refer to the section Supported PII entities.

Shallow scan in Scan mode dropdown
Columns "FIRSTNAME" and "MAIDENNAME" are detected as PII, but no Mocker, Mask, Calculated Column or Exclude is applied