Privacy Controls
Syntho prioritizes privacy at every stage of synthetic data generation and offers a suite of configurable privacy controls designed to safeguard sensitive information. These settings are available across various generators, so you can select the privacy features best suited to your data use case. Here’s an overview of the privacy options available in Syntho:
1. Overfitting Prevention (AI-generated synthetic data only)
Prevents the model from memorizing specific patterns or properties of the original data, thus enhancing data confidentiality. During the training phase, Syntho minimizes overfitting by applying a noise ratio that ensures synthetic data reflects general patterns rather than specific entries. This safeguard prevents individual data points from appearing in the synthetic dataset.
2. Rare Category Protection (AI-generated synthetic data only)
Protects against identification through rare categorical values by substituting them. Categories that fall below a user-set threshold are replaced with a placeholder (default: "*"). This prevents the model from overfitting on unique, infrequent categories and protects against potential identification based on rare data points.
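As an illustration only, the substitution can be sketched in a few lines of Python. This is not Syntho's implementation: the function name `protect_rare_categories` is hypothetical, and the sketch assumes the threshold is a minimum occurrence count, which may differ from how Syntho defines it.

```python
from collections import Counter


def protect_rare_categories(values, threshold=2, placeholder="*"):
    """Replace every category seen fewer than `threshold` times (assumed semantics)."""
    counts = Counter(values)
    return [v if counts[v] >= threshold else placeholder for v in values]


cities = ["Amsterdam", "Amsterdam", "Utrecht", "Utrecht", "Utrecht", "Giethoorn"]
protected = protect_rare_categories(cities, threshold=2)
print(protected)  # "Giethoorn" occurs once, so it becomes "*"
```

A higher threshold masks more categories, trading some utility for stronger protection.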
3. Extreme Value Protection (AI-generated synthetic data only)
Removes outliers in numerical and date-time data to prevent re-identification based on extreme values. Outliers are detected and removed during the preprocessing phase, ensuring that potentially sensitive or identifiable extreme values do not appear in the synthetic data.
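The idea of removing extreme values before training can be sketched as follows. The detection rule Syntho uses is not described here, so this example assumes a common interquartile range (IQR) rule, and the function name `remove_extreme_values` and the multiplier `k` are hypothetical. Date-time values could be handled the same way after converting them to numeric timestamps.

```python
import statistics


def remove_extreme_values(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule, for illustration)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


ages = [21, 25, 29, 31, 34, 38, 41, 45, 52, 118]
kept = remove_extreme_values(ages)
print(kept)  # 118 lies far above the upper fence and is dropped
```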
4. Extreme Sequence Length Protection
Limits the inclusion of unusually long sequences in subject-based data to prevent potential re-identification. Sequence lengths are capped at a threshold, and sequences that exceed it are filtered out because they could pose confidentiality risks.
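A minimal sketch of this filter is shown below. It assumes that subjects whose sequences exceed the threshold are dropped entirely. Truncating the sequence to the threshold is another plausible reading, and the text does not say which one Syntho applies. The function name `drop_long_sequences`, the parameter `max_len`, and the sample data are hypothetical.

```python
from collections import Counter


def drop_long_sequences(rows, subject_key, max_len):
    """Keep only rows of subjects that have at most `max_len` rows."""
    lengths = Counter(row[subject_key] for row in rows)
    return [row for row in rows if lengths[row[subject_key]] <= max_len]


visits = (
    [{"patient_id": "A", "visit": i} for i in range(2)]
    + [{"patient_id": "B", "visit": i} for i in range(3)]
    + [{"patient_id": "C", "visit": i} for i in range(40)]  # unusually long
)
kept = drop_long_sequences(visits, "patient_id", max_len=5)
kept_ids = sorted({row["patient_id"] for row in kept})
print(kept_ids)  # patient "C" has 40 visits and is filtered out
```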
5. Random Noise Injection (AI-generated synthetic data only)
Adds random noise to generated synthetic values to further enhance privacy. The noise introduces slight variations that make individual values harder to trace while maintaining data utility. This optional feature is configurable within Advanced Settings.
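To show what injecting noise into a numeric column can look like, here is a small sketch. It assumes bounded uniform noise scaled to a fraction of the column's standard deviation. The noise distribution and scaling Syntho uses are not specified, and `inject_noise`, `noise_level`, and `seed` are hypothetical names.

```python
import random
import statistics


def inject_noise(values, noise_level=0.05, seed=None):
    """Add uniform noise in [-scale, +scale], where scale = noise_level * std."""
    rng = random.Random(seed)
    scale = noise_level * statistics.pstdev(values)
    return [v + rng.uniform(-scale, scale) for v in values]


salaries = [42000.0, 48500.0, 51000.0, 56250.0, 61000.0]
noisy = inject_noise(salaries, noise_level=0.05, seed=7)
for original, perturbed in zip(salaries, noisy):
    print(f"{original:>10.2f} -> {perturbed:>10.2f}")
```

A larger `noise_level` gives stronger perturbation at the cost of utility, which is the trade-off this setting controls.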
6. Privacy-First Default Settings
Ensures that privacy measures are applied by default, reducing the risk of accidental data exposure. Privacy controls, such as overfitting prevention, rare category protection, and extreme value protection, are enabled by default in Syntho’s configuration. This helps ensure data privacy is protected automatically, even if no additional customization is applied.
7. Evaluation and Transparency through Syntho QA Report
Provides transparency and confidence in synthetic data quality and privacy. Syntho uses open-source synthetic data evaluation libraries such as SDMetrics to assess the generated data. The platform includes an evaluation notebook that contains quality and privacy metrics, allowing you to see how your synthetic data performs against industry standards for confidentiality and utility.