Differences between key generators

This page illustrates the differences between the key generators and how each one affects referential integrity and cross-table relationships. Simplified sample data (a Patients table and a Medications table) is used for illustration.

De-identification with key generator: duplicate

In this method, the primary keys (PKs) and foreign keys (FKs) are duplicated exactly as they are in the source data, preserving the original relationships.

Preservation of Keys: The primary keys (ID) and foreign keys (Patient ID) in the de-identified data are exact duplicates of those in the original data.

Referential Integrity: Since the keys are duplicated, the referential integrity is maintained, ensuring that each foreign key in the Medications table corresponds to an existing primary key in the Patients table.
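The referential integrity property described above can be expressed as a simple check. The following is a minimal sketch, not the product's implementation: the sample rows and the helper name `has_referential_integrity` are hypothetical, and only the key columns matter here (a real de-identification job would also transform the non-key columns).

```python
# Hypothetical sample rows; the column names follow the examples in this section.
patients = [
    {"ID": 1, "Gender": "F", "Country": "NL"},
    {"ID": 2, "Gender": "M", "Country": "DE"},
    {"ID": 3, "Gender": "F", "Country": "FR"},
]
medications = [
    {"Patient ID": 1, "Medication": "Aspirin", "Reason": "Headache"},
    {"Patient ID": 3, "Medication": "Insulin", "Reason": "Diabetes"},
    {"Patient ID": 3, "Medication": "Metformin", "Reason": "Diabetes"},
]


def has_referential_integrity(parent_rows, child_rows, pk="ID", fk="Patient ID"):
    """Return True if every foreign key value exists among the primary key values."""
    pk_values = {row[pk] for row in parent_rows}
    return all(row[fk] in pk_values for row in child_rows)


# The "duplicate" generator leaves key values untouched, so copying the rows
# unchanged is enough to model it in this sketch.
deid_patients = [dict(row) for row in patients]
deid_medications = [dict(row) for row in medications]

print(has_referential_integrity(deid_patients, deid_medications))  # prints True
```

Because the keys are copied as-is, the check passes for the de-identified tables whenever it passes for the source tables.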

De-identification with key generator: hash

In this method, the primary keys (PKs) and foreign keys (FKs) are transformed using a hash function. Because the same function is applied to both, the relationships between tables are preserved while the key values themselves are anonymized.

Preservation of Keys: The primary keys (ID) and foreign keys (Patient ID) are transformed using a hash function, ensuring they are anonymized while preserving their referential integrity. The form of the hashed value depends on the data type of the key column.

Referential Integrity: The relationships between the Patients and Medications tables are preserved because the hashed foreign keys in the Medications table match the hashed primary keys in the Patients table.

Consistency: The hash function consistently maps the same original key to the same hashed key, ensuring consistency of hashed values across tables, databases and data generation jobs.
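The consistency property above is what keeps hashed foreign keys aligned with hashed primary keys. The sketch below illustrates the idea with SHA-256 from Python's standard library. The choice of SHA-256, the salt, the truncation length, and the function name `hash_key` are all assumptions made for illustration; the source does not specify which hash function the product uses, and this sketch always returns a string regardless of the key's data type.

```python
import hashlib


def hash_key(value, salt="example-salt"):
    """Deterministically map a key value to a short hex string (illustrative only)."""
    digest = hashlib.sha256(f"{salt}:{value}".encode("utf-8")).hexdigest()
    return digest[:12]  # truncated for readability in this sketch


# Hypothetical key columns from the Patients and Medications tables.
patient_ids = [1, 2, 3]
medication_patient_ids = [1, 3, 3]

hashed_pks = [hash_key(v) for v in patient_ids]
hashed_fks = [hash_key(v) for v in medication_patient_ids]

# The same input always yields the same output, so every hashed foreign key
# still matches a hashed primary key.
print(set(hashed_fks) <= set(hashed_pks))  # prints True
```

Since the mapping depends only on the input value (and the fixed salt), running it in a separate job or against a different table produces the same hashed key for the same original key.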

De-identification with key generator: generate (not recommended)

In this method, entirely new key values are generated for both primary keys and foreign keys. This preserves referential integrity but does not maintain the original order of the key values.

Generation of New Keys: New primary keys (ID) and foreign keys (Patient ID) are generated, ensuring they are unique but not maintaining their original order. This key generator is generally not recommended in combination with de-identification.

Referential Integrity: The relationships between the Patients and Medications tables are preserved because the foreign keys in the Medications table correspond to the new primary keys in the Patients table.

Order of Keys: The new keys do not maintain the original order. The foreign keys in the Medications table are generated based on the primary keys in the Patients table and then uniformly assigned to the Medications table using a "tiling" method. This means the IDs will repeat in a uniform pattern, such as 101, 102, 103, 104, 105, 101, 102, 103, 104, 105, etc.
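The "tiling" assignment described above can be sketched as cycling through the new primary keys until every child row has a foreign key. This is an illustration of the pattern only; the function names and the starting value 101 are hypothetical and taken from the example sequence in the text.

```python
from itertools import cycle, islice


def generate_primary_keys(n_rows, start=101):
    """Generate n_rows new, unique, consecutive primary key values."""
    return list(range(start, start + n_rows))


def tile_foreign_keys(primary_keys, n_child_rows):
    """Assign primary keys to child rows by repeating them in order."""
    return list(islice(cycle(primary_keys), n_child_rows))


new_pks = generate_primary_keys(5)
new_fks = tile_foreign_keys(new_pks, 12)

print(new_fks)
# prints [101, 102, 103, 104, 105, 101, 102, 103, 104, 105, 101, 102]
```

Every foreign key comes from the list of new primary keys, so referential integrity holds, but which child rows belong to which parent no longer reflects the source data.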

Synthesize with sequence model with key generator: generate

In this method, entirely new key values and new combinations of non-key values are generated. Referential integrity is preserved, as are the statistical properties of the columns and the relationships between all columns.

Generation of New Values: New primary keys (ID) and foreign keys (Patient ID) are generated. The generated rows (i.e., combinations of column values) are new, so there is no 1-to-1 relationship to original records. This is a key privacy benefit of AI-generated synthetic data.

Referential Integrity: The referential integrity between the Patients and Medications tables is preserved because the foreign keys in the Medications table correspond to the new primary keys in the Patients table.

Statistical Properties: Although the combinations of non-key values are entirely new, the statistical properties (e.g., frequency distribution, variance) are preserved, as are their relationships with other columns.

Relationships Between Non-Key Columns: Relationships between all columns, such as the connection between Medication and Reason and between foreign keys and other columns, are preserved in the generated data.

Synthesize with sequence model with key generator: duplicate / hash (not recommended)

In this method, key values are either duplicated or hashed while entirely new data is generated for the non-key columns. This preserves referential integrity, statistical properties, and relationships between non-key columns, but not the original order of non-key values or the relationships between key and non-key columns. For this reason, the Duplicate and Hash key generators are generally not recommended for columns that use the AI-powered generation generator.

Duplication/Hashing of Keys: The primary keys (ID) and foreign keys (Patient ID) are either duplicated or hashed, ensuring they are unique and preserving referential integrity. However, the combinations of non-key values are newly generated by the trained generative model.

Generation of New Non-Key Values: The combinations of non-key values (Gender, Country, Medication, Reason) are generated from scratch by the generative AI model, based on the learned patterns in the original data.

Referential Integrity: The referential integrity is preserved, because each foreign key value in the synthetic Medications table corresponds to a primary key value in the synthetic Patients table.

Statistical Properties: Although the combinations of non-key values are entirely new, their statistical properties (e.g., distributions, variances) are preserved, as are their relationships with other non-key columns.

Relationships Between Non-Key Columns: Relationships between non-key columns such as Medication and Reason are preserved in the generated data. However, relationships between key columns and non-key columns are not preserved, because the key columns keep exactly the same order as in the original data while the non-key values are newly generated.

Synthesize with single table model with key generator: generate

In this method, a single table model generates entirely new key values and new combinations of non-key values. Relationships between non-key columns and statistical properties are preserved, but the original order of non-key values is not.

Generation of New Values: New primary keys (ID), foreign keys (Patient ID), and combinations of non-key values (Gender, Country, Medication, Reason) are generated. While some values from the original columns may still exist in the synthetic column, the generated rows (i.e., combinations of values) are entirely newly generated by the trained generative model.

Order of Keys: The new keys do not maintain the original order. The foreign keys in the Medications table are generated based on the primary keys in the Patients table and then uniformly assigned to the Medications table using a "tiling" method, such as 201, 202, 203, 204, 205, 201, 202, 203, 204, 205, etc., illustrating that there is no 1-to-1 relationship with the original rows.

Statistical Properties: Although the combinations of non-key values are entirely new, their statistical properties (e.g., frequency distribution, variance) are preserved, as are their relationships with other non-key columns.

Relationships Between Non-Key Columns: Relationships such as the connection between Medication and Reason are preserved in the generated data.

Synthesize with single table model with key generator: hash / duplicate

In this method, key values are either hashed or duplicated while entirely new data is generated for the non-key columns. This preserves referential integrity, statistical properties, and relationships between non-key columns, but the original order of the non-key values is not maintained.

Duplication/Hashing of Keys: The primary keys (ID) and foreign keys (Patient ID) are either duplicated or hashed, ensuring referential integrity.

Generation of New Non-Key Values: New rows are generated based on the learned patterns within and across non-key values (Gender, Country, Medication, Reason).

Statistical Properties: Although the non-key values are entirely new, their statistical properties (e.g., frequency distribution, variance) are preserved.

Relationships Between Non-Key Columns: Statistical relationships between columns are typically preserved in the generated data.
