10. AI synthesis: Data pre-processing when using

Preparing your data

Proper data preparation is essential for high-quality synthetic data when using AI synthesis. The guidelines below describe how to prepare your dataset before starting a generation job.


Preparing your data – entity table

When working with standalone or flat tables, consider the following practices:

  1. Maintain a column-to-row ratio of at least 1:500. This minimizes privacy risks and improves generalization. For example, a table with 6 columns should ideally have a minimum of 3,000 rows.

  2. Each entity should be described in one row. One row per unique entity avoids data fragmentation.

  3. Ensure each row is independent. The order of rows should not affect the dataset. Each row must be self-contained and analyzable on its own.

  4. Avoid privacy-sensitive column names. For instance, do not use names like patient_a_medications. Instead, consolidate sensitive names under generic columns like patient.

  5. Remove derived or redundant columns. If one column is a direct function of another (e.g., duration = end_time - start_time), remove the derived column. This also includes categorical redundancies, such as having both treatment and treatment_category.


Preparing your data – entity & linked table

For sequence-based or time-series datasets involving relationships between entities and events:

  1. Use two structured tables.

    • An entity table meeting the criteria listed above

    • A linked table containing references to the entity table

  2. Entity table must contain unique IDs. These IDs will act as primary keys.

  3. Linked table must include foreign key references. Each record in the linked table should refer to a record in the entity table using a foreign key.

  4. Remove directly derived columns. Follow the same guideline as for entity tables to avoid redundant or dependent variables.

  5. Avoid row dependencies. For example, if each start_date in one row equals the end_date of the previous row, remove one of those to prevent implicit relationships across rows.


By adhering to these data preparation guidelines, you ensure that your AI model learns from meaningful patterns, avoids overfitting on redundant information, and respects privacy constraints, thereby enabling robust and reliable synthetic data generation.
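
Several of the guidelines on this page can be checked before a generation job is started. The following is a minimal sketch in plain Python and is not part of Syntho. The function names, the toy tables (patients, visits) and their column names are hypothetical, and the ratio check assumes that every column in the table counts toward the 1:500 guideline.

```python
from collections import Counter

# Assumption: 500 rows per column, taken from the 1:500 guideline.
MIN_ROWS_PER_COLUMN = 500


def has_enough_rows(rows, columns):
    """True if the table has at least 500 rows for each column."""
    return len(rows) >= MIN_ROWS_PER_COLUMN * len(columns)


def duplicate_ids(rows, id_column):
    """Return ID values that appear in more than one row."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def is_derived(rows, target, func):
    """True if `target` equals func(row) in every row (a redundant column)."""
    return all(row[target] == func(row) for row in rows)


def orphan_foreign_keys(linked_rows, fk_column, entity_rows, id_column):
    """Return foreign key values that match no ID in the entity table."""
    known = {row[id_column] for row in entity_rows}
    return sorted({row[fk_column] for row in linked_rows} - known)


# Hypothetical toy data: an entity table and a linked table.
patients = [
    {"patient_id": 1, "age": 34},
    {"patient_id": 2, "age": 58},
]
visits = [
    {"patient_id": 1, "start_time": 9, "end_time": 11, "duration": 2},
    {"patient_id": 2, "start_time": 13, "end_time": 14, "duration": 1},
    {"patient_id": 3, "start_time": 8, "end_time": 12, "duration": 4},
]

# Two columns would need at least 1,000 rows.
print(has_enough_rows(patients, ["patient_id", "age"]))  # False
# Every patient_id occurs once, so nothing is reported.
print(duplicate_ids(patients, "patient_id"))  # []
# duration is end_time - start_time in every row, so it should be removed.
print(is_derived(visits, "duration",
                 lambda r: r["end_time"] - r["start_time"]))  # True
# patient_id 3 has no matching row in the entity table.
print(orphan_foreign_keys(visits, "patient_id", patients, "patient_id"))  # [3]
```

Each check maps to one guideline: the row count to the column-to-row ratio, the duplicate search to unique entity IDs, the derived-column test to redundant columns, and the orphan search to foreign key references in the linked table.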