Automatic PII discovery with PII scanner

On the PII tab of the Job Configuration panel, you can launch a personally identifiable information (PII) scan that checks all columns in your database for PII.

All positive scan results will show up in the list of PII entities on the PII tab.

On the Job Settings tab, PII entities listed on the PII tab are also labeled PII on the column header.

You can launch a metadata PII scan (shallow scan) or a data PII scan (deep scan). The metadata scan runs faster because it only applies regular expression rules to the name of each column. The data scan, on the other hand, is likely to detect PII columns more accurately because it analyzes the data inside each column using state-of-the-art natural language processing models.

Hint: When using the PII scanner, always validate the columns that are marked as PII, because the scanner might mistakenly flag columns that do not contain PII. It might also miss certain PII elements.

Shallow scan (uses metadata)

On the PII tab, select the dropdown icon to the right of the Start scan button and select Shallow scan.

The shallow scan evaluates all columns available in the database and uses regular expression rules to deduce the type of PII each column might contain. This process is optimized for speed and runs in parallel. Because it looks only at column names, its predictions can sometimes be less accurate than those of a deep scan.

Due to the nature of the metadata scan, results generally have a high confidence score. This is because they rely on rules established by Syntho. It is possible to add new rules to detect custom-defined PII entities. For more details, please contact your Syntho representative.
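To illustrate the idea behind a metadata scan, the following is a minimal Python sketch, not Syntho's implementation. The rule set, the entity labels, and the function name shallow_scan are all hypothetical; the only point it demonstrates is that regular expressions are applied to column names rather than to the data.

```python
import re

# Hypothetical rules: each maps an assumed PII entity label to a pattern
# that is applied to the column NAME only, never to the column's data.
RULES = {
    "FIRST_NAME": re.compile(r"first[_\s-]?name|given[_\s-]?name", re.IGNORECASE),
    "EMAIL": re.compile(r"e[_\s-]?mail", re.IGNORECASE),
    "PHONE_NUMBER": re.compile(r"phone|mobile", re.IGNORECASE),
}


def shallow_scan(column_names):
    """Return {column_name: entity} for every column whose name matches a rule."""
    hits = {}
    for name in column_names:
        for entity, pattern in RULES.items():
            if pattern.search(name):
                hits[name] = entity
                break  # the first matching rule wins in this sketch
    return hits


result = shallow_scan(["First_Name", "email_address", "Col1", "order_total"])
print(result)
```

In this sketch a generic name such as Col1 is never flagged, whatever the column contains, which is why descriptive column names matter for a metadata scan.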

Deep scan (uses metadata + data)

On the PII tab, select the dropdown icon to the right of the Start scan button and select Deep scan.

In some cases, Syntho might not detect PII entities with a shallow scan, especially if the column names aren't descriptive of their content. Creating an exhaustive list of rules is also not always practical. Therefore, Syntho offers an option to scan not just the metadata but also the data within the columns to pinpoint potential PII entities.

Initiating a PII scan first launches a metadata scan. Columns not identified as PII and of type "string" or "text" are then considered for the deep scan. We restrict the scan to these types because our natural language processing (NLP) models are trained to identify and extract PII from textual data, relying on word context for predictions.

Caution: The data PII scanner examines the content in each column, meaning the scan duration increases with the size of the database. To cut down on scanning time, you can limit the number of rows read per column. However, this might adversely affect the scan results.

In comparison with the metadata scan, data scan results may have a lower confidence level. If a column contains multiple PII types, our software calculates the confidence of the column being of a specific PII type based on how frequently that PII type is detected relative to the total number of rows scanned for that column.
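The confidence calculation described above can be sketched as follows. This is an illustration under stated assumptions: the regular expressions in DETECTORS are simple stand-ins for the NLP models, and the names column_confidence and max_rows are hypothetical. Only the ratio itself (rows in which a PII type is detected, divided by rows scanned) comes from the description.

```python
import re
from collections import Counter

# Stand-in detectors (assumption): the real deep scan uses NLP models,
# not these simple patterns.
DETECTORS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def column_confidence(values, max_rows=None):
    """Per PII type: rows where the type was detected / rows scanned."""
    rows = values if max_rows is None else values[:max_rows]
    if not rows:
        return {}
    counts = Counter()
    for value in rows:
        for entity, pattern in DETECTORS.items():
            if pattern.search(str(value)):
                counts[entity] += 1  # count each type at most once per row
    return {entity: n / len(rows) for entity, n in counts.items()}


notes = [
    "contact: anna@example.com",
    "call +31 20 123 4567",
    "no contact details",
    "bob@example.org or +31 6 1234 5678",
]
scores = column_confidence(notes)
limited = column_confidence(notes, max_rows=1)
print(scores, limited)
```

Scanning only the first row yields a different picture than scanning all four, which mirrors the caution that limiting the number of rows read per column can affect the scan results.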

Limitations (Deep scan only)

  • The deep scan examines each column of data using natural language processing (NLP) models, which rely on surrounding context to produce accurate results. However, columns containing Personally Identifiable Information (PII), like a First_Name column, typically lack this context. For instance, a First_Name column contains only first names, making it challenging for NLP models to accurately identify them as such without additional context.

Supported PII entities

For more information about the PII entities that Syntho supports, see Supported PII entities.

PII scanner parameters

  • The Cardinality toggle, if turned on, helps the user check whether a column has as many unique values as it has rows. If so, the column most likely contains PII.

  • When the user selects Add, the generation method / column modal appears first, and the user has to adjust or confirm the settings. Afterwards, the Add button disappears and the wheel icon appears.

  • The Allowlist enables users to define a list of tokens that should not be marked as PII even if we want to identify other tokens of that entity type.

  • Add new PII entity launches a modal that allows the user to create a new PII entity by filling in three fields:

    1. a name for the user-defined entity,

    2. a RegEx (or list of words),

    3. a confidence percentage.

  • PII entities to look for is a multi-select dropdown (the same as the schema dropdown) that shows all available PII entities, including entities created by the user.

  • The PII scan acceptance threshold slider can be used to control the PII entities that are shown to the user.

  • The Learn more about PII button forwards the user to the PII section in Syntho's User Documentation.
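The following Python sketch shows how several of these parameters could interact. It is an illustration only: the class CustomEntity, the functions all_unique, find_tokens and accepted, and the example entity EMPLOYEE_ID are hypothetical and do not describe Syntho's internal API.

```python
import re
from dataclasses import dataclass


@dataclass
class CustomEntity:
    """A user-defined entity: the three fields of the modal (names assumed)."""
    name: str
    pattern: str          # a RegEx; a word list could be joined with "|"
    confidence_pct: int   # confidence percentage, 0-100


def all_unique(values):
    """Cardinality check: as many unique values as rows."""
    values = list(values)
    return len(values) > 0 and len(set(values)) == len(values)


def find_tokens(text, entity, allowlist=()):
    """Find tokens of an entity, skipping tokens that are on the allowlist."""
    allowed = {token.lower() for token in allowlist}
    return [m for m in re.findall(entity.pattern, text) if m.lower() not in allowed]


def accepted(findings, threshold_pct):
    """Keep only (entity_name, confidence_pct) pairs at or above the threshold."""
    return [f for f in findings if f[1] >= threshold_pct]


employee_id = CustomEntity("EMPLOYEE_ID", r"\bEMP-\d{4}\b", 90)
tokens = find_tokens(
    "EMP-0001 reported to EMP-0000 and EMP-1234",
    employee_id,
    allowlist=["EMP-0000"],  # e.g. a placeholder ID that is not PII
)
shown = accepted([("EMPLOYEE_ID", 90), ("EMAIL", 40)], threshold_pct=50)
print(tokens, shown, all_unique(["a1", "b2", "c3"]))
```

Here the allowlisted token is skipped while other tokens of the same entity type are still found, and the threshold hides the low-confidence finding.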

Moreover:

  • When you define a locale for the PII scan, also use that locale as the default for all mockers that the PII scan suggests.

Additionally, take the following points into consideration:

  • If the column headers of PII columns under "de-identify" are red, no Mocker or Exclude has been applied to those columns.

  • An exclamation mark (!) appears next to a table in the table overview panel on the left if that table is under "de-identify" and has columns with PII labels that are set to Duplicate (with no Mocker or Exclude applied). The (!) mark informs the user that this table has columns labelled as personally identifiable information (PII) and that, if the user proceeds, this PII will be duplicated, which could lead to unintentional sharing of sensitive data. To avoid this, the user has two options:

    1. Apply a Mocker.

    2. Exclude the PII column(s).

The exclamation mark (!) helps the user understand that tables marked as de-identify must actually be de-identified. Note that de-identification means excluding or mocking the PII columns. A PII column that is not handled by applying a mocker or excluding it remains at risk. Hence, its PII label is red and the table has an exclamation mark next to it.
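The rule behind the red label and the exclamation mark can be summarized in a short Python sketch. The Column class, the method values "duplicate", "mocker" and "exclude", and the function names are hypothetical and only restate the logic described above.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    is_pii: bool
    method: str  # assumed values: "duplicate", "mocker", "exclude"


def unhandled_pii(columns):
    """PII columns with neither a Mocker nor Exclude applied (shown in red)."""
    return [c.name for c in columns if c.is_pii and c.method not in ("mocker", "exclude")]


def needs_warning(table_mode, columns):
    """True if the table would show the (!) mark under the rule described above."""
    return table_mode == "de-identify" and bool(unhandled_pii(columns))


customers = [
    Column("first_name", is_pii=True, method="duplicate"),
    Column("email", is_pii=True, method="mocker"),
    Column("order_total", is_pii=False, method="duplicate"),
]
red_columns = unhandled_pii(customers)
warn = needs_warning("de-identify", customers)
print(red_columns, warn)
```

Applying a mocker to first_name, or excluding it, would clear both the red label and the warning in this sketch.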

Limitations

  • Scanner Accuracy: The accuracy of the PII scanner depends on the metadata, data, and PII type. For more accurate PII detection, it is most effective to use descriptive column names like FirstName instead of generic names like Col1.

  • Multiple PII Detections: The PII scanner can identify several possible PII entries in a single column. Be aware that the top-scored entry might not always be correct and this could lead to either misidentifying a non-PII item or wrongly categorizing the PII type.

  • Selecting PII Types: Currently, users cannot specify which types of PII entities to scan for. For a comprehensive list of the types of PII entities that Syntho scans for, please refer to the section Supported PII entities.

Understanding these points will help you better utilize the PII scanner and be aware of its limitations.

Caution: The PII scanner is an excellent tool for initial PII detection, but it may not catch all sensitive data. Users are advised to conduct a thorough review to ensure all PII is properly identified.
