10. Data pre-processing when using AI synthesis

Preparing your data

Proper data preparation is essential for high-quality synthetic data when using AI synthesis. The guidelines below describe how to prepare your dataset before starting a generation job.


Preparing your data – entity table

When working with standalone or flat tables, consider the following practices:

  1. Maintain a column-to-row ratio of at least 1:500. Having at least 500 rows per column minimizes privacy risks and improves generalization. For example, a table with 6 columns should ideally have a minimum of 3,000 rows.

  2. Describe each entity in one row. One row per unique entity avoids data fragmentation.

  3. Ensure each row is independent. The order of rows should not affect the dataset. Each row must be self-contained and analyzable on its own.

  4. Avoid privacy-sensitive column names. For instance, do not use names that refer to a specific individual, such as patient_a_medications. Instead, consolidate sensitive names under generic columns like patient.

  5. Remove derived or redundant columns. If one column is a direct function of another (e.g., duration = end_time - start_time), remove the derived column. This also applies to categorical redundancies, such as having both treatment and treatment_category.
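Some of the entity-table guidelines above can be checked mechanically before a generation job is started. The following is a minimal sketch in plain Python, not part of the product: the function names, the `patient_id` column, and the sample rows are hypothetical and only illustrate guidelines 1, 2, and 5.

```python
from datetime import datetime, timedelta

MIN_ROWS_PER_COLUMN = 500  # taken from the 1:500 column-to-row guideline


def meets_row_ratio(n_rows, n_columns):
    """Return True if the table has at least 500 rows per column."""
    return n_rows >= n_columns * MIN_ROWS_PER_COLUMN


def duplicated_entities(rows, id_column):
    """Return the IDs that appear in more than one row."""
    seen, duplicates = set(), set()
    for row in rows:
        key = row[id_column]
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


def duration_is_derived(rows):
    """Return True if duration equals end_time - start_time in every row."""
    return bool(rows) and all(
        row["duration"] == row["end_time"] - row["start_time"] for row in rows
    )


# Hypothetical sample data, for illustration only.
patients = [{"patient_id": 1}, {"patient_id": 2}, {"patient_id": 1}]
sessions = [
    {
        "start_time": datetime(2024, 1, 1, 9, 0),
        "end_time": datetime(2024, 1, 1, 11, 0),
        "duration": timedelta(hours=2),
    }
]

ratio_ok = meets_row_ratio(n_rows=3000, n_columns=6)
repeated_ids = duplicated_entities(patients, "patient_id")
drop_duration = duration_is_derived(sessions)
```

Here a 6-column table with 3,000 rows passes the ratio check, `patient_id` 1 is flagged because it spans two rows, and `duration` is flagged as a candidate for removal. Row independence and column naming (guidelines 3 and 4) usually need a manual review rather than an automated check.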


For sequence-based or time-series datasets involving relationships between entities and events:

  1. Use two structured tables:

    • An entity table meeting the criteria listed above

    • A linked table containing references to the entity table

  2. The entity table must contain unique IDs. These IDs act as primary keys.

  3. The linked table must include foreign key references. Each record in the linked table should refer to a record in the entity table using a foreign key.

  4. Remove directly derived columns. Follow the same guideline as for entity tables to avoid redundant or dependent variables.

  5. Avoid row dependencies. For example, if the start_date in each row equals the end_date of the previous row, remove one of the two columns to prevent implicit relationships across rows.
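The key and row-dependency guidelines for linked tables can also be sketched as simple checks. This is an illustrative example in plain Python under stated assumptions: the table contents, the `patient_id` key, and the helper names are hypothetical, and rows are assumed to be already sorted chronologically within each entity.

```python
def ids_are_unique(entity_rows, id_column):
    """Return True if every entity ID occurs exactly once (valid primary key)."""
    ids = [row[id_column] for row in entity_rows]
    return len(ids) == len(set(ids))


def orphaned_foreign_keys(entity_rows, linked_rows, id_column, fk_column):
    """Return foreign key values that match no record in the entity table."""
    valid_ids = {row[id_column] for row in entity_rows}
    return {row[fk_column] for row in linked_rows if row[fk_column] not in valid_ids}


def has_chained_dates(linked_rows, fk_column):
    """Return True if, per entity, each start_date equals the previous end_date."""
    previous_end = {}  # last end_date seen for each entity
    compared_pairs = 0
    for row in linked_rows:
        key = row[fk_column]
        if key in previous_end:
            compared_pairs += 1
            if row["start_date"] != previous_end[key]:
                return False
        previous_end[key] = row["end_date"]
    return compared_pairs > 0


# Hypothetical sample data, for illustration only.
patients = [{"patient_id": 1}, {"patient_id": 2}]
visits = [
    {"patient_id": 1, "start_date": "2024-01-01", "end_date": "2024-01-05"},
    {"patient_id": 1, "start_date": "2024-01-05", "end_date": "2024-01-09"},
    {"patient_id": 3, "start_date": "2024-02-01", "end_date": "2024-02-02"},
]

keys_ok = ids_are_unique(patients, "patient_id")
orphans = orphaned_foreign_keys(patients, visits, "patient_id", "patient_id")
chained = has_chained_dates(visits, "patient_id")
```

In this sample the entity IDs are unique, the visit for `patient_id` 3 has no matching entity record, and the visits for `patient_id` 1 are chained, which suggests dropping either start_date or end_date before synthesis.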


Following these data preparation guidelines helps the AI model learn from meaningful patterns, avoid overfitting on redundant information, and respect privacy constraints, which leads to more robust and reliable synthetic data generation.
