10. Data pre-processing
To ensure the highest quality of synthetic data output, proper data preparation is essential—especially when using AI-powered generation. This step helps reduce noise, improve generalization, and minimize risks associated with identifiable or redundant data patterns. Below are guidelines on how to best prepare your dataset before initiating a generation job.
When working with standalone or flat tables, consider the following practices:

- Provide enough rows for the number of columns. This minimizes privacy risks and improves generalization; for example, a table with 6 columns should ideally have at least 3,000 rows.
- Use one row per unique entity. This avoids data fragmentation.
- Keep rows independent. The order of rows should not affect the dataset, and each row must be self-contained and analyzable on its own.
- Keep identifying values out of column names. For instance, do not use names like patient_a_medications; instead, consolidate sensitive names under generic columns like patient.
- Remove derived or redundant columns. If one column is a direct function of another (e.g., duration = end_time - start_time), remove the derived column. This also includes categorical redundancies, such as keeping both treatment and treatment_category.
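The checks above can be sketched in a few lines of pandas. This is an illustrative example, not part of any product API; the column names (duration, treatment_category, and so on) are hypothetical, and the 500-rows-per-column threshold is derived from the 6-columns/3,000-rows rule of thumb above.

```python
import pandas as pd

# Toy flat table with a derived column and a categorical redundancy
# (hypothetical column names, matching the examples above).
df = pd.DataFrame({
    "patient": ["a", "b", "c"],
    "start_time": [1, 3, 7],
    "end_time": [2, 6, 9],
    "duration": [1, 3, 2],                  # derived: end_time - start_time
    "treatment": ["x1", "x2", "y1"],
    "treatment_category": ["x", "x", "y"],  # redundant with treatment
})

# Rule of thumb from above: roughly 500 rows per column.
min_rows = 500 * len(df.columns)
if len(df) < min_rows:
    print(f"Only {len(df)} rows for {len(df.columns)} columns; "
          f"consider providing at least {min_rows} rows.")

# Confirm the column really is a direct function of the others,
# then drop it along with the redundant category column.
assert (df["duration"] == df["end_time"] - df["start_time"]).all()
df = df.drop(columns=["duration", "treatment_category"])
```

A check like this is worth running before every generation job, since derived columns often creep back in when upstream pipelines change.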
For sequence-based or time-series datasets involving relationships between entities and events, structure the data as two tables:

- An entity table meeting the criteria listed above. Give each entity a unique ID; these IDs will act as primary keys.
- A linked table containing references to the entity table. Each record in the linked table should refer to a record in the entity table using a foreign key.

Follow the same guidelines as for entity tables to avoid redundant or dependent variables. For example, if each start_date in one row equals the end_date of the previous row, remove one of those columns to prevent implicit relationships across rows.
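The two-table layout and the cross-row check can be sketched as follows. Again, this is a hypothetical example: the table and column names (patients, events, patient_id, start_date) are illustrative only.

```python
import pandas as pd

# Entity table: one row per unique entity, keyed by a primary key.
patients = pd.DataFrame({
    "patient_id": [1, 2],   # primary key
    "age": [54, 37],
})

# Linked table: each record references an entity via a foreign key.
events = pd.DataFrame({
    "patient_id": [1, 1, 2],  # foreign key into patients
    "start_date": ["2024-01-01", "2024-01-05", "2024-02-01"],
    "end_date": ["2024-01-05", "2024-01-09", "2024-02-03"],
    "event": ["admit", "transfer", "admit"],
})

# Every foreign key should resolve to a row in the entity table.
assert events["patient_id"].isin(patients["patient_id"]).all()

# If, within each entity, every start_date merely repeats the previous
# row's end_date, drop it to avoid implicit cross-row relationships.
chained = all(
    (g["start_date"].iloc[1:].values == g["end_date"].iloc[:-1].values).all()
    for _, g in events.sort_values("start_date").groupby("patient_id")
)
if chained:
    events = events.drop(columns=["start_date"])
```

The foreign-key assertion also doubles as a referential-integrity check: any event row pointing at a nonexistent entity will surface here rather than during generation.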
By adhering to these data preparation guidelines, you ensure that your AI model learns from meaningful patterns, avoids overfitting on redundant information, and respects privacy constraints, thereby enabling robust and reliable synthetic data generation.