10. Data pre-processing when using AI synthesis
Preparing your data
Proper data preparation is essential for getting high-quality synthetic data from AI synthesis. The guidelines below describe how to prepare your dataset before starting a generation job.
Preparing your data – entity table
When working with standalone or flat tables, consider the following practices:
Maintain a column-to-row ratio of at least 1:500: This minimizes privacy risks and improves generalization. For example, a table with 6 columns should ideally have a minimum of 3,000 rows.
Describe each entity in one row: One row per unique entity avoids data fragmentation.
Ensure each row is independent: The order of rows should not affect the dataset. Each row must be self-contained and analyzable on its own.
Avoid privacy-sensitive column names: For instance, do not use names like patient_a_medications. Instead, consolidate sensitive names under generic columns like patient.
Remove derived or redundant columns: If one column is a direct function of another (e.g., duration = end_time - start_time), remove the derived column. This also applies to categorical redundancies, such as having both treatment and treatment_category.
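Several of the entity-table guidelines above can be checked mechanically before a generation job is started. The following is a minimal sketch in plain Python, assuming the table is already loaded as a list of dictionaries. The function names, the patient_id column, and the sample rows are hypothetical illustrations and are not part of any AI synthesis API.

```python
from collections import Counter

# Taken from the 1:500 column-to-row guideline above.
MIN_ROWS_PER_COLUMN = 500


def min_rows_required(n_columns):
    """Smallest row count that satisfies the 1:500 column-to-row ratio."""
    return n_columns * MIN_ROWS_PER_COLUMN


def meets_ratio(n_columns, n_rows):
    """True if the table has at least 500 rows per column."""
    return n_rows >= min_rows_required(n_columns)


def duplicate_entities(rows, id_column):
    """Return the IDs that appear in more than one row."""
    counts = Counter(row[id_column] for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)


def is_derived_difference(rows, target, minuend, subtrahend):
    """True if target == minuend - subtrahend holds in every row."""
    return all(row[target] == row[minuend] - row[subtrahend] for row in rows)


def drop_columns(rows, columns):
    """Return a copy of the table without the given columns."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


# Hypothetical sample table (far too small for real use).
sample_rows = [
    {"patient_id": 1, "start_time": 10, "end_time": 25, "duration": 15},
    {"patient_id": 2, "start_time": 5, "end_time": 9, "duration": 4},
    {"patient_id": 2, "start_time": 7, "end_time": 8, "duration": 1},
]

# patient_id 2 occurs twice, so the table is not yet one row per entity.
repeated_ids = duplicate_entities(sample_rows, "patient_id")

# duration is fully determined by end_time - start_time, so it can be removed.
if is_derived_difference(sample_rows, "duration", "end_time", "start_time"):
    cleaned_rows = drop_columns(sample_rows, {"duration"})
else:
    cleaned_rows = sample_rows
```

The derived-column check only covers the simple difference case named in the guideline. Categorical redundancies such as treatment and treatment_category would need a separate check, for example confirming that each treatment value always maps to the same category.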
Sequence-based or time-series datasets, which involve relationships between entities and events, require two tables:
An entity table meeting the criteria listed above
A linked table containing references to the entity table
Entity table must contain unique IDs: These IDs will act as primary keys.
Linked table must include foreign key references: Each record in the linked table should refer to a record in the entity table using a foreign key.
Remove directly derived columns: Follow the same guideline as for entity tables to avoid redundant or dependent variables.
Avoid row dependencies: For example, if the start_date in each row equals the end_date of the previous row, remove one of those columns to prevent implicit relationships across rows.
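The linked-table requirements above can also be verified with a short script. The following is a minimal sketch in plain Python, assuming both tables are loaded as lists of dictionaries and that the linked rows are already in event order. All function names, column names, and sample rows are hypothetical. Comparing rows per entity rather than across the whole table is an assumption of this sketch.

```python
def has_duplicate_ids(entity_rows, primary_key):
    """True if any primary key value occurs more than once."""
    ids = [row[primary_key] for row in entity_rows]
    return len(ids) != len(set(ids))


def orphaned_foreign_keys(entity_rows, linked_rows, primary_key, foreign_key):
    """Return foreign key values that match no row in the entity table."""
    known_ids = {row[primary_key] for row in entity_rows}
    referenced_ids = {row[foreign_key] for row in linked_rows}
    return sorted(referenced_ids - known_ids)


def count_chained_rows(linked_rows, foreign_key, start_col, end_col):
    """Count rows whose start value equals the previous row's end value.

    Rows are compared per entity and are assumed to be in event order.
    """
    previous_end = {}
    chained = 0
    for row in linked_rows:
        key = row[foreign_key]
        if key in previous_end and row[start_col] == previous_end[key]:
            chained += 1
        previous_end[key] = row[end_col]
    return chained


# Hypothetical sample tables.
patients = [{"id": 1}, {"id": 2}]
visits = [
    {"patient_id": 1, "start_date": "2024-01-01", "end_date": "2024-01-05"},
    {"patient_id": 1, "start_date": "2024-01-05", "end_date": "2024-01-09"},
    {"patient_id": 2, "start_date": "2024-02-01", "end_date": "2024-02-03"},
    {"patient_id": 3, "start_date": "2024-03-01", "end_date": "2024-03-02"},
]

# patient_id 3 has no matching entity row.
orphans = orphaned_foreign_keys(patients, visits, "id", "patient_id")

# The second visit of patient 1 starts exactly when the first one ends.
chained = count_chained_rows(visits, "patient_id", "start_date", "end_date")
```

If most rows are chained in this way, one of the two date columns carries no independent information and is a candidate for removal, in line with the guideline above.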
By adhering to these data preparation guidelines, you ensure that your AI model learns from meaningful patterns, avoids overfitting on redundant information, and respects privacy constraints, thereby enabling robust and reliable synthetic data generation.