AI-generated synthetic data

This guide provides the step-by-step procedures for AI-generated synthetic data for a single entity table.

Last updated 8 days ago


The diagram below illustrates a workflow for AI-generated synthetic data. Detailed information about each step shown in the diagram is provided throughout this page.

Before starting the AI-generated synthetic data use case, watch the video below, which gives a short introduction to data generators.

For this key use case, a single table named census, containing census-collected data, needs to be synthesized using Syntho's AI-powered generation. Maximum privacy, combined with highly realistic data that statistically reflects the original dataset, is crucial for AI and analytics. The first step is to create a workspace in Syntho linked to a census database. Once created, this workspace contains only the census table. The table's columns and a sample of its data are shown below.



Preparing your data

For AI-powered synthetic data generation, ensure your data is fit to synthesize. Syntho expects your data to be stored in an entity table that adheres to specific guidelines:

  1. Maintain a minimum column-to-row ratio of 1:500 for privacy and algorithmic generalization. With 15 columns, aim for at least 7,500 rows; our example database exceeds this with 48,842 rows (see the illustration below).

  2. Describe each entity in a single row, ensuring row independence without sequential information.

  3. Use generic column names that do not expose sensitive information. For example, replace a patient-specific column name such as “patient_a_medications” with a generic patient column that holds the names.

  4. Eliminate columns derived from others to improve modelling and enhance synthetic data quality.

Our example table fully meets these criteria.
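The row-count guideline above is simple arithmetic, and a minimal sketch of the check might look like this. The function names are hypothetical and only illustrate the 1:500 rule; they are not part of Syntho.

```python
def minimum_rows(n_columns: int, rows_per_column: int = 500) -> int:
    """Return the smallest row count that satisfies a 1:500 column-to-row ratio."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    return n_columns * rows_per_column


def meets_ratio(n_columns: int, n_rows: int, rows_per_column: int = 500) -> bool:
    """Check whether a table has enough rows for its number of columns."""
    return n_rows >= minimum_rows(n_columns, rows_per_column)


# The census example from this guide: 15 columns and 48,842 rows.
required = minimum_rows(15)
census_ok = meets_ratio(15, 48_842)
```

For the census table, `required` is 7,500 and `census_ok` is true, matching the figures quoted in the guideline.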

Configuring Column Settings

Under Column settings > Generation Method, select AI-powered generator for Syntho's ML models to synthesize data.

Rare category protection

Rare category protection helps hide sensitive or rare observations, such as specific occupations in the census table, by replacing them with a user-defined value, which enhances data privacy. Adjust the rare category protection threshold and replacement value in Column settings > Encoding type > Advanced settings.

  • Rare category protection threshold: The frequency at or below which a column value counts as rare. All column values that occur this many times or fewer are automatically replaced.

  • Rare category replacement value: The value that replaces every rare column value.

For example, with a threshold of 15, occupations that appear 15 times or fewer are replaced with an asterisk (“*”). Both the threshold and the replacement value are optional and can be set as needed (see the illustration below).
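To make the threshold rule concrete, here is a minimal sketch of rare category replacement in plain Python. It assumes the rule is "replace every value whose count is at or below the threshold"; the function name and the sample occupations are hypothetical, and this is not Syntho's implementation.

```python
from collections import Counter


def protect_rare_categories(values, threshold=15, replacement="*"):
    """Replace every value that occurs `threshold` times or fewer."""
    counts = Counter(values)
    return [replacement if counts[v] <= threshold else v for v in values]


# Hypothetical occupation column: one frequent, one borderline, one rare value.
occupations = ["Sales"] * 20 + ["Clerk"] * 16 + ["Armed-Forces"] * 15
protected = protect_rare_categories(occupations, threshold=15, replacement="*")
```

With a threshold of 15, the value that occurs exactly 15 times is replaced, while the value that occurs 16 times is kept.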

PII scanner and Mockers

The PII scanner provides a starting point for PII detection. Users should perform additional reviews to identify and handle any other sensitive data that the scanner may not detect.

In the PII tab, you can add new columns to the list of PII columns, either manually or by using Syntho's PII scanner. To label a column manually, select the column name and optionally choose a mocker to apply. Clicking "Confirm" marks the column as containing PII and confirms the mocker selection.

Alternatively, use automatic PII discovery with the PII scanner. Launch a scan to detect PII across all database columns from the PII tab in the Job Configuration panel. The scanner offers both Shallow and Deep scan modes:

  • The shallow scan assesses columns using regular expression rules to identify PII. It is optimized for speed, but its accuracy varies.

  • The deep scan examines both metadata and data within columns for more thorough PII identification.
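As an illustration of how a regular-expression-based scan can work, the sketch below matches column names against a few patterns. The rules, labels, and column names are hypothetical, and applying the rules to column names only is an assumption; Syntho's actual shallow scan rules are not documented here.

```python
import re

# Hypothetical name-based rules; a real scanner would use a much larger set.
PII_NAME_RULES = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "name": re.compile(r"(^|_)(first|last|full)?_?name($|_)", re.IGNORECASE),
}


def shallow_scan(column_names):
    """Return {column: label} for columns whose name matches a PII rule."""
    findings = {}
    for column in column_names:
        for label, pattern in PII_NAME_RULES.items():
            if pattern.search(column):
                findings[column] = label
                break  # the first matching rule wins
    return findings


columns = ["age", "occupation", "email_address", "phone_number", "first_name"]
findings = shallow_scan(columns)
```

A scan like this is fast because it never reads the data, which is also why it can miss PII stored in columns with uninformative names.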

Advanced generator settings

In the job settings, under table settings, you can adjust generator-level settings, including the maximum number of rows used for training, which you can lower to speed up training. Leaving this setting as None uses all rows. The Take random sample option controls how the training rows are chosen:

  • On: Random rows are selected for training.

  • Off: Top rows as per the database are used.
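The interaction between the row limit and the Take random sample option can be sketched as follows. The function and parameter names are hypothetical and only mirror the behaviour described above; the `seed` argument is an addition for reproducibility in this example.

```python
import random


def select_training_rows(rows, max_rows=None, take_random_sample=False, seed=None):
    """Pick the rows used for training, mirroring the settings described above."""
    rows = list(rows)
    if max_rows is None or max_rows >= len(rows):
        return rows  # None (or a limit above the table size) uses all rows
    if take_random_sample:
        return random.Random(seed).sample(rows, max_rows)  # On: random rows
    return rows[:max_rows]  # Off: top rows in the order the source returns them


table = list(range(100))  # stand-in for 100 source rows
top_rows = select_training_rows(table, max_rows=10)
random_rows = select_training_rows(table, max_rows=10, take_random_sample=True, seed=0)
```

Taking the top rows is cheaper but can bias training if the table is ordered, for example by date; random sampling avoids that.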

Start data generation process

To start data generation, you can do the following:

  1. On the Job configuration panel, select Generate.

  2. On the Job configuration summary panel, adjust generation parameters as desired.

  3. Finally, select Start generating.

Model parameters

Before starting the generation process, you can modify the model parameters:

  • Read batch size: The number of rows read from each source table per batch.

  • Write batch size: The number of rows inserted into each destination table per batch.

  • N connections: Specifies the number of connections.
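To show what the two batch sizes control, here is a minimal sketch that splits rows into batches of a fixed size. It is a generic illustration of batched reads and writes, not Syntho's internal data path, and the names are hypothetical.

```python
from typing import Iterable, Iterator, List


def batched(rows: Iterable, batch_size: int) -> Iterator[List]:
    """Yield consecutive lists of at most `batch_size` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


source_rows = list(range(10))  # stand-in for 10 rows in a source table
read_batches = list(batched(source_rows, batch_size=4))   # read batch size 4
write_batches = list(batched(source_rows, batch_size=3))  # write batch size 3
```

Larger batches mean fewer round trips to the database at the cost of more memory per batch, so the two values are usually tuned against the available resources.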

Truncate tables before each new data generation job

You must manually TRUNCATE the tables in the DESTINATION database before starting each new data generation job. If existing constraints prevent truncation, temporarily disable them before truncating and re-enable them afterwards. For instance, when foreign key constraints block truncation in MySQL or MariaDB:

  1. Disable the checks with SET FOREIGN_KEY_CHECKS = 0;

  2. TRUNCATE the table.

  3. Re-enable the checks with SET FOREIGN_KEY_CHECKS = 1;

This sequence prepares the tables for data generation without constraint violations.
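The truncation sequence can be sketched as a small helper that builds the statements in the right order. `SET FOREIGN_KEY_CHECKS` is MySQL/MariaDB syntax, so other databases need their own mechanism for disabling constraints; the helper name and the table names are hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def truncate_statements(tables):
    """Build the MySQL/MariaDB statements to truncate destination tables."""
    for table in tables:
        if not _IDENTIFIER.match(table):
            raise ValueError(f"unsafe table name: {table!r}")
    statements = ["SET FOREIGN_KEY_CHECKS = 0;"]  # disable the checks first
    statements += [f"TRUNCATE TABLE `{table}`;" for table in tables]
    statements.append("SET FOREIGN_KEY_CHECKS = 1;")  # always re-enable last
    return statements


script = "\n".join(truncate_statements(["census"]))
```

Run the resulting statements against the destination database only, in a single session, so that the re-enable step applies to the same connection that disabled the checks.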

Evaluation

In the notebook below, we compare real and synthetic demo data using SDMetrics, based on the original SDMetrics notebook. You will also find a shareable report that you can use to explore insights and create visualizations.

For prerequisites, see the Prerequisites page or watch the video below.

As with de-identification, you can use mockers or exclude to replace any PII columns. Otherwise, those PII columns are treated as categorical columns and processed by Syntho's “Rare category protection”.

SDMetrics is an open-source Python library for evaluating tabular synthetic data. It measures how closely the synthetic data mimics the mathematical properties of the real data, known as synthetic data fidelity, and supports selective metric evaluation, detailed explanations of results, score visualization, and report sharing.

For full metrics and information, see the SDMetrics documentation.
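To give a feel for what a fidelity metric computes, the sketch below hand-rolls a total variation distance complement for one categorical column, the idea behind the TVComplement metric in SDMetrics. This is a standalone illustration in plain Python, not the SDMetrics API, and the sample values are made up.

```python
from collections import Counter


def tv_complement(real, synthetic):
    """Return 1 minus the total variation distance between category frequencies.

    A score of 1.0 means identical category distributions; 0.0 means no overlap.
    """
    if not real or not synthetic:
        raise ValueError("both columns must contain at least one value")
    real_freq = Counter(real)
    synthetic_freq = Counter(synthetic)
    categories = set(real_freq) | set(synthetic_freq)
    distance = 0.5 * sum(
        abs(real_freq[c] / len(real) - synthetic_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance


real_column = ["a", "a", "b", "b"]
synthetic_column = ["a", "a", "a", "b"]
score = tv_complement(real_column, synthetic_column)
```

Here the real column is split 50/50 and the synthetic column 75/25, so the distance is 0.25 and the score is 0.75. The notebook relies on SDMetrics itself for the full set of metrics and reports.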

Notebook: SDV-evaluation-notebook-metrics.ipynb (Jupyter Notebook, 73KB)
Figures: Workflow of AI-generated synthetic data process; Table census; Calculating rows; Column occupation; Rare category protection; Manual PII labelling; Automatic PII labelling; Model parameters.