Duplicate

Last updated 15 days ago

Duplicate can be especially useful in the following situations:

  1. When data does not contain personally identifiable information (PII) or sensitive elements, duplicating it allows for efficient replication without modification.

Apply duplicate

  1. Open your Workspace.

  2. On the Job Configuration tab, select the column icon at the top left of the column you want to duplicate.

  3. Under Column settings > Generation Method, select Duplicate to copy the column from the source table to the destination table as-is.

  4. Set the relevant duplicate parameters.

  5. Select Confirm.

Note: When you duplicate a column, the column is still used during the training process, as it can contain valuable information.

This means, however, that duplicating columns cannot be used to reduce hardware requirements or increase the speed of your synthetic data jobs.

Shuffle data

Enable the Shuffle button to shuffle the generated values while maintaining the overall frequency of values. For example, if you have 4 High, 3 Medium and 5 Low values in the source database, the same counts of values will exist in the destination database, except that they appear in a different order.

Note that shuffling works batch-wise: each batch, sized according to the Generation Batch Size setting (the default value is 100k), is shuffled independently.

Note that NULL values are also considered a distinct value, and will be shuffled like any other value.
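
The behaviour described above can be illustrated with a minimal Python sketch. It is not Syntho's implementation: the function name `shuffle_in_batches` and its parameters are hypothetical, and NULL is represented as Python's `None`. It only shows that shuffling each batch independently keeps the value counts unchanged, both overall and within each batch.

```python
import random
from collections import Counter


def shuffle_in_batches(values, batch_size=100_000, seed=None):
    """Shuffle values batch by batch; each batch is permuted independently.

    Hypothetical helper for illustration only, not part of Syntho.
    """
    rng = random.Random(seed)
    shuffled = []
    for start in range(0, len(values), batch_size):
        batch = list(values[start:start + batch_size])
        rng.shuffle(batch)  # None (NULL) is moved around like any other value
        shuffled.extend(batch)
    return shuffled


# 4 High, 3 Medium, 5 Low and 2 NULL values, shuffled in batches of 7
source = ["High"] * 4 + ["Medium"] * 3 + ["Low"] * 5 + [None] * 2
result = shuffle_in_batches(source, batch_size=7, seed=42)

print(Counter(source))
print(Counter(result))
```

Because values never cross a batch boundary in this sketch, a smaller batch size keeps shuffled values closer to their original position.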

Detect and obfuscate PII

When enabled, select the Locale that matches the language of the data in your text column, so that Syntho uses the appropriate language models to identify and obfuscate PII in that column.

After you enable this option and set the right locale, any identified PII entities are obfuscated and then copied to the destination table.
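
As a rough illustration of the detect, obfuscate, then copy flow, here is a toy Python sketch. It is an assumption-heavy stand-in: Syntho uses language models selected by locale, whereas this example uses two simple regular expressions, and the placeholder tokens `<EMAIL>` and `<PHONE>` as well as the function name `obfuscate_pii` are hypothetical.

```python
import re

# Toy detectors, standing in for locale-specific language models (assumption)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def obfuscate_pii(text):
    """Replace detected entities with placeholder tokens (hypothetical format)."""
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)


source_rows = [
    "Contact jane.doe@example.com or +31 6 1234 5678.",
    "No personal data here.",
]

# Every row is copied to the destination; only detected entities are changed
destination_rows = [obfuscate_pii(row) for row in source_rows]

for row in destination_rows:
    print(row)
```

Text without detected entities is copied unchanged, which matches the as-is behaviour of Duplicate for the rest of the column.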

Rare category protection

Syntho automatically replaces any infrequent categorical values in a column with a user-defined value, ensuring that sensitive data does not appear in the synthetic output.

  • Rare category protection threshold: Column values that appear with a frequency at or below this threshold are automatically replaced to prevent data leakage.

  • Rare category replacement value: Values that appear with a frequency at or below the threshold are substituted with this user-specified replacement value.

By default, the rare category protection threshold is set to 10, meaning any value that appears 10 times or fewer will be replaced. The default replacement value is an asterisk (*), so all values at or below the threshold are replaced with (*).
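
The threshold rule can be sketched in a few lines of Python. This is an illustrative approximation rather than Syntho's code: the function name `protect_rare_categories` is hypothetical, and the defaults mirror the values stated above (threshold 10, replacement `*`).

```python
from collections import Counter


def protect_rare_categories(values, threshold=10, replacement="*"):
    """Replace every value that occurs `threshold` times or fewer.

    Hypothetical helper for illustration only, not part of Syntho.
    """
    counts = Counter(values)
    return [replacement if counts[v] <= threshold else v for v in values]


# A occurs 25 times, B exactly 10 times, C 11 times, D once
column = ["A"] * 25 + ["B"] * 10 + ["C"] * 11 + ["D"]
protected = protect_rare_categories(column)

print(Counter(protected))
```

In this example B (exactly at the threshold) and D are replaced, while A and C are kept, since the comparison is "at or below" rather than "below".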

Ordering and indexing considerations

To ensure accurate ordering, please see ordering and indexing considerations.

Supported data types

| Generator | Supported data types |
| --- | --- |
| Duplicate | Categorical, Continuous, Discrete, Datetime, Bytes, Bool, UUID, JSON, XML, Geo, Sets, Unknown |

Enable the Detect and obfuscate PII toggle to use Syntho's PII text obfuscation module to detect and obfuscate PII entities in columns containing free text.

Caution: Because it uses the same underlying modelling techniques as the PII text obfuscation module, the Detect and obfuscate PII feature can take a very long time to run.