10. Data pre-processing when using AI synthesis

Preparing your data

Proper data preparation is essential for getting the highest-quality synthetic data from AI synthesis. The guidelines below describe how to prepare your dataset before starting a generation job.


Preparing your data – entity table

When working with standalone or flat tables, consider the following practices:

  1. Maintain a column-to-row ratio of at least 1:500. This minimizes privacy risks and improves generalization. For example, a table with 6 columns should ideally have a minimum of 3,000 rows.

  2. Describe each entity in one row. One row per unique entity avoids data fragmentation.

  3. Ensure each row is independent. The order of rows should not affect the dataset. Each row must be self-contained and analyzable on its own.

  4. Avoid privacy-sensitive column names. For instance, do not use names like patient_a_medications. Instead, consolidate sensitive names under generic columns like patient.

  5. Remove derived or redundant columns. If one column is a direct function of another (e.g., duration = end_time - start_time), remove the derived column. This also applies to categorical redundancies, such as having both treatment and treatment_category.
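Some of these checks can be automated before a generation job is started. The following is a minimal sketch in plain Python, not part of the product: the function names, the toy rows, and the exact-equality test for the derived column are all assumptions made for illustration. It covers guideline 1 (row count), guideline 2 (one row per entity), and guideline 5 (a column derived from two others).

```python
from collections import Counter

# Guideline 1: a column-to-row ratio of 1:500 means 500 rows per column.
MIN_ROWS_PER_COLUMN = 500


def min_rows_required(n_columns):
    """Minimum row count suggested by the 1:500 guideline."""
    return n_columns * MIN_ROWS_PER_COLUMN


def duplicated_entities(rows, id_column):
    """Guideline 2: return entity IDs that appear in more than one row."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def is_derived_duration(rows, start="start_time", end="end_time",
                        derived="duration"):
    """Guideline 5: True if `derived` equals `end - start` in every row."""
    return all(row[derived] == row[end] - row[start] for row in rows)


# Hypothetical toy rows; a real table would be loaded from your own source.
rows = [
    {"patient": "p1", "start_time": 0, "end_time": 5, "duration": 5},
    {"patient": "p2", "start_time": 2, "end_time": 9, "duration": 7},
    {"patient": "p1", "start_time": 4, "end_time": 6, "duration": 2},
]

print(min_rows_required(6))                  # 3000
print(duplicated_entities(rows, "patient"))  # ['p1']
print(is_derived_duration(rows))             # True
```

In this toy example, the patient p1 appears in two rows, so the table would need to be aggregated to one row per patient, and the duration column could be dropped because it is fully determined by start_time and end_time. With floating-point or timestamp columns, a tolerance-based comparison would likely be more appropriate than exact equality.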


Following these data preparation guidelines helps the AI model learn from meaningful patterns, avoid overfitting on redundant information, and respect privacy constraints. The result is more reliable synthetic data generation.
