Additional privacy controls

AI-generated synthetic data already offers a very high level of privacy. To strengthen it further, Syntho provides the following additional privacy controls for AI synthesis:

1. Overfitting Prevention

Prevents the model from memorizing specific patterns or properties of the original data, thus enhancing data confidentiality. During the training phase, Syntho minimizes overfitting by applying a so-called sample noise ratio, which ensures that the synthetic data reflects general patterns rather than specific entries. The privacy evaluation metrics can also be used to verify that the model has not overfitted.

2. Rare Category Protection

Protects against identification through rare categorical values by substituting them. Rare categories, defined by a user-set threshold, are replaced with a placeholder (default: "*"). This prevents overfitting on unique, infrequent categories and protects against potential identification based on rare data points.
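The substitution described above can be sketched in a few lines of Python. This is only an illustration, not Syntho's implementation: the function name protect_rare_categories is hypothetical, and the sketch assumes the threshold is a minimum occurrence count, which the text above does not specify.

```python
from collections import Counter


def protect_rare_categories(values, threshold, placeholder="*"):
    """Replace categories seen fewer than `threshold` times with a placeholder.

    Assumption: the threshold is a minimum occurrence count.
    """
    counts = Counter(values)
    return [v if counts[v] >= threshold else placeholder for v in values]


jobs = ["nurse", "nurse", "nurse", "teacher", "teacher", "astronaut"]
protected = protect_rare_categories(jobs, threshold=2)
print(protected)
# ['nurse', 'nurse', 'nurse', 'teacher', 'teacher', '*']
```

In this example the single "astronaut" record, which could point to one individual, never reaches the training data in its original form.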

3. Extreme Value Protection

Removes outliers in numerical and date-time data to prevent re-identification based on extreme values. Outliers are detected and removed during the preprocessing phase, ensuring that potentially sensitive or identifiable extreme values do not appear in the synthetic data.
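As a rough illustration of outlier removal during preprocessing, the sketch below drops values that fall outside 1.5 times the interquartile range. The detection rule is an assumption chosen for the example; the text above does not say which method Syntho uses, and the function name remove_extreme_values is hypothetical.

```python
import statistics


def remove_extreme_values(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the IQR rule stands in for whatever outlier
    detection the platform actually applies.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


ages = [30, 32, 35, 31, 33, 34, 36, 29, 37, 500]
cleaned = remove_extreme_values(ages)
print(cleaned)
# [30, 32, 35, 31, 33, 34, 36, 29, 37]
```

The same idea applies to date-time columns once they are converted to a numeric representation such as a timestamp.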

4. Extreme Sequence Length Protection

Limits the inclusion of unusually long sequences in subject-based data to prevent potential re-identification. Sequence lengths are capped at a threshold, filtering out excessively long sequences that could lead to confidentiality risks.
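A minimal sketch of this filter is shown below. It assumes that subjects whose sequence exceeds the threshold are dropped entirely, which is one reading of the description above; truncating the sequence instead would be another. The function name filter_long_sequences and the sample data are hypothetical.

```python
def filter_long_sequences(sequences, max_length):
    """Keep only subjects whose sequence has at most `max_length` records.

    Assumption: over-long sequences are dropped rather than truncated.
    """
    return {
        subject: records
        for subject, records in sequences.items()
        if len(records) <= max_length
    }


visits = {
    "patient_1": ["2024-01-03", "2024-02-10"],
    "patient_2": ["2024-01-15", "2024-03-01", "2024-04-20"],
    "patient_3": [f"visit_{i}" for i in range(40)],  # unusually long
}
kept = filter_long_sequences(visits, max_length=5)
print(sorted(kept))
# ['patient_1', 'patient_2']
```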

5. Random Noise Injection

Adds random noise to the generated synthetic values, introducing slight variations that further enhance privacy while maintaining data utility. The noise factor (either absolute or relative) is configurable within the Workspace default settings.
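The difference between an absolute and a relative noise factor can be sketched as follows. This example assumes uniformly distributed noise and a hypothetical function inject_noise; the distribution and the parameter names used by Syntho are not stated above.

```python
import random


def inject_noise(values, noise_factor, mode="relative", seed=None):
    """Add bounded random noise to each value.

    mode="absolute": noise is drawn from [-noise_factor, +noise_factor].
    mode="relative": noise is drawn from [-noise_factor * |v|, +noise_factor * |v|].
    Assumption: uniform noise; the real distribution may differ.
    """
    if mode not in ("absolute", "relative"):
        raise ValueError("mode must be 'absolute' or 'relative'")
    rng = random.Random(seed)
    noisy = []
    for v in values:
        scale = noise_factor * abs(v) if mode == "relative" else noise_factor
        noisy.append(v + rng.uniform(-scale, scale))
    return noisy


salaries = [42000.0, 58000.0, 61000.0]
# Each value moves by at most 1% of its own magnitude.
print(inject_noise(salaries, noise_factor=0.01, mode="relative", seed=7))
# Each value moves by at most 100 units, regardless of magnitude.
print(inject_noise(salaries, noise_factor=100, mode="absolute", seed=7))
```

A relative factor scales with each value, so small and large values are perturbed proportionally, while an absolute factor applies the same bound everywhere.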

Privacy evaluation

Syntho leverages open-source synthetic data evaluation libraries like SDMetrics to provide a transparent assessment of synthetic data quality and privacy. The platform includes an evaluation notebook that contains data quality and privacy metrics, allowing you to see how your synthetic data performs against industry standards for confidentiality and utility.
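To give a feel for what a privacy metric measures, the sketch below computes the share of synthetic rows that exactly copy a row from the original data. It is a simplified, standard-library stand-in written for illustration; it does not use the SDMetrics API and is not taken from the evaluation notebook. The function name exact_match_rate and the sample rows are hypothetical.

```python
def exact_match_rate(original_rows, synthetic_rows):
    """Return the fraction of synthetic rows that also occur in the original data.

    A value near 0 suggests the generator is not copying original records.
    """
    if not synthetic_rows:
        return 0.0
    original = {tuple(row) for row in original_rows}
    copies = sum(1 for row in synthetic_rows if tuple(row) in original)
    return copies / len(synthetic_rows)


original = [("nurse", 34), ("teacher", 51)]
synthetic = [("nurse", 36), ("teacher", 51), ("nurse", 29), ("teacher", 48)]
rate = exact_match_rate(original, synthetic)
print(rate)
# 0.25
```

A high rate would be one signal of overfitting, since it means the model is reproducing original entries rather than general patterns.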
