Additional privacy controls
AI-powered data generation already offers a high level of privacy. To strengthen it further, Syntho provides the following additional privacy controls:
1. Overfitting Prevention
Prevents the model from memorizing specific patterns or properties of the original data, thus enhancing data confidentiality. During the training phase, Syntho minimizes overfitting by applying a noise ratio that ensures synthetic data reflects general patterns rather than specific entries. This safeguard prevents individual data points from appearing in the synthetic dataset.
2. Rare Category Protection
Protects against re-identification through rare categorical values by substituting them. Rare categories, defined by a user-set threshold, are replaced with a placeholder (default: "*"). This prevents overfitting on unique, infrequent categories and protects against potential identification based on rare data points.
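The replacement of rare categories can be illustrated with a minimal sketch. It is not Syntho's implementation: the function name `protect_rare_categories` is hypothetical, and it assumes the threshold is a minimum number of occurrences (the platform may define it differently, for example as a fraction of rows).

```python
from collections import Counter

def protect_rare_categories(values, threshold, placeholder="*"):
    """Replace every category that occurs fewer than `threshold` times.

    Assumption: `threshold` is an absolute occurrence count.
    """
    counts = Counter(values)
    return [v if counts[v] >= threshold else placeholder for v in values]

# "Giethoorn" occurs only once, so it falls below the threshold of 3.
cities = ["Amsterdam"] * 5 + ["Utrecht"] * 4 + ["Giethoorn"]
protected = protect_rare_categories(cities, threshold=3)
```

In this example the single "Giethoorn" entry becomes "*", while the frequent categories are left unchanged.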
3. Outlier Protection
Removes outliers in numerical and date-time data to prevent re-identification based on extreme values. Outliers are detected and removed during the preprocessing phase, ensuring that potentially sensitive or identifiable extreme values do not appear in the synthetic data.
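The text above does not state how outliers are detected. As an illustration only, the sketch below uses the common interquartile range (IQR) rule, which is an assumption and not necessarily the method Syntho applies. The function name `remove_outliers` and the factor `k` are hypothetical.

```python
import statistics

def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: the IQR rule stands in for whatever detection
    method the platform actually uses.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]

# One extreme value (400) among otherwise typical values.
ages = [21, 23, 25, 27, 29, 31, 33, 35, 400]
kept = remove_outliers(ages)
```

Date-time columns could be handled the same way by first converting them to numeric timestamps.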
4. Extreme Sequence Length Protection
Limits the inclusion of unusually long sequences in subject-based data to prevent potential re-identification. Sequence lengths are capped to a threshold, filtering out excessively long sequences that could lead to confidentiality risks.
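A minimal sketch of this idea follows. It assumes that a "sequence" is the set of rows belonging to one subject and that subjects whose sequences exceed the threshold are filtered out entirely; truncating long sequences to the threshold would be an alternative reading. The function name `filter_long_sequences` and the column names are hypothetical.

```python
from collections import Counter

def filter_long_sequences(rows, subject_key, max_length):
    """Keep only rows of subjects with at most `max_length` rows.

    Assumption: over-long sequences are dropped, not truncated.
    """
    lengths = Counter(row[subject_key] for row in rows)
    return [row for row in rows if lengths[row[subject_key]] <= max_length]

# Patient "p3" has 6 visits, far more than the others.
visits = (
    [{"patient": "p1", "visit": i} for i in range(2)]
    + [{"patient": "p2", "visit": i} for i in range(3)]
    + [{"patient": "p3", "visit": i} for i in range(6)]
)
filtered = filter_long_sequences(visits, subject_key="patient", max_length=3)
```

Here the unusually long history of "p3" is excluded, so its length cannot be used to single that subject out.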
5. Random Noise Addition
Adds random noise to synthetic values to further enhance privacy. The noise introduces slight variations in the generated synthetic data while maintaining data utility. This optional feature is configurable within Advanced Settings.
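The effect of noise injection can be sketched as follows. The noise distribution and its scaling are not specified above, so this example assumes uniform multiplicative noise bounded by a hypothetical `noise_ratio` parameter; the function name `add_noise` is also illustrative.

```python
import random

def add_noise(values, noise_ratio, seed=None):
    """Perturb each value by a random relative amount.

    Assumption: noise is uniform in [-noise_ratio, +noise_ratio]
    relative to the value itself.
    """
    rng = random.Random(seed)
    return [v * (1 + rng.uniform(-noise_ratio, noise_ratio)) for v in values]

salaries = [32000.0, 45000.0, 58000.0, 71000.0]
noisy = add_noise(salaries, noise_ratio=0.02, seed=7)
```

With a ratio of 0.02, every value moves by at most 2 percent, which keeps distributions close to the original while making exact values harder to match.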
6. Evaluation and Transparency through Syntho QA Report
Provides transparency and confidence in synthetic data quality and privacy. Syntho leverages open-source synthetic data evaluation libraries like SDMetrics to provide a transparent assessment of synthetic data quality and privacy. The platform includes an evaluation notebook that contains quality and privacy metrics, allowing you to see how your synthetic data performs against industry standards for confidentiality and utility.