
Frequently asked questions


Can real data be found in synthetic data?

No, synthetic data does not directly replicate the real data it is based on. It uses patterns and statistical properties to generate new data that is independent of the original dataset. Privacy control mechanisms such as rare category protection and clipping thresholds are employed to further ensure that sensitive data points, like unique values or outliers, are not exposed.
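The idea behind rare category protection can be illustrated with a short sketch. This is a simplified illustration of the general technique, not Syntho's actual implementation; the `protect_rare_categories` helper, its threshold, and the placeholder value are all hypothetical:

```python
from collections import Counter

def protect_rare_categories(values, min_count=5, placeholder="OTHER"):
    """Replace categories that occur fewer than `min_count` times, so that
    rare (potentially identifying) values never appear verbatim in the
    synthetic output."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else placeholder for v in values]

# "Zwolle" occurs only once, so it is suppressed before training/generation.
cities = ["Amsterdam"] * 10 + ["Utrecht"] * 6 + ["Zwolle"]
print(protect_rare_categories(cities, min_count=5))
```

Clipping thresholds work analogously for numeric columns: extreme outliers are capped to a percentile bound before the model ever sees them.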

Are there scenarios where similarities might appear between synthetic and real data?

In rare cases, synthetic data may contain values that match those in the original dataset. For example, if several individuals share identical characteristics in the real data, those attributes might also appear in the synthetic dataset. However, this does not pose a privacy risk, as these instances are not linked to specific individuals. Methods like K-anonymity are used to minimize such risks.
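The K-anonymity property mentioned above can be checked with a small sketch: a dataset is k-anonymous with respect to a set of quasi-identifiers if every combination of those values is shared by at least k records. The `satisfies_k_anonymity` helper and the sample records below are illustrative, not part of the Syntho platform:

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination is shared by at
    least k records, i.e. no row is uniquely identifiable through them."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(combos.values()) >= k

people = [
    {"age": 34, "zip": "1011", "diagnosis": "A"},
    {"age": 34, "zip": "1011", "diagnosis": "B"},
    {"age": 52, "zip": "2022", "diagnosis": "A"},
    {"age": 52, "zip": "2022", "diagnosis": "C"},
]
print(satisfies_k_anonymity(people, ["age", "zip"], k=2))  # True
```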

What measures are in place to prevent the exposure of sensitive data?

Syntho implements various privacy-preserving techniques; for more information, see Privacy Controls.

What permissions do I need to deploy Syntho?

Syntho requires an administrative user for installation. If you use Docker for the deployment, the user needs sudo rights for docker/docker-compose.

How should I provision my databases for data generation?

The Syntho platform works with source and destination databases.

  • The source database stores the input data for your data generation job and can be read-only.

  • The destination database must have empty tables and columns that are structurally identical to the source database. It requires write access, as the generated data will be written to it.

The source and destination database can run on the same server, or be hosted on different servers.
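A destination database that is not empty, or whose structure has drifted from the source, is a common setup mistake. The pre-flight check described above could be sketched as follows; the `check_destination_ready` helper and the dictionary-based schema representation are illustrative assumptions, not part of the Syntho platform:

```python
def check_destination_ready(source_schema, dest_schema, dest_row_counts):
    """Verify the destination mirrors the source structurally and that
    every destination table is empty. Returns a list of problems."""
    problems = []
    for table, columns in source_schema.items():
        if dest_schema.get(table) != columns:
            problems.append(f"schema mismatch for table '{table}'")
        elif dest_row_counts.get(table, 0) != 0:
            problems.append(f"table '{table}' is not empty")
    return problems

source = {"customers": ["id", "name", "email"]}
dest = {"customers": ["id", "name", "email"]}
print(check_destination_ready(source, dest, {"customers": 0}))  # []
```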

How does Syntho handle constraints on my database?

Syntho uses relevant metadata stored in your source database for generating the destination database. For example, foreign key constraints are inferred to reproduce table relationships in the generated database.

Syntho is capable of handling scenarios where the constraints of the destination database are either enabled or disabled.
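If you choose to disable destination constraints yourself before a run, the statements can be generated per table along these lines. SQL Server (T-SQL) syntax is shown purely as an example; other engines use different commands, and the `toggle_fk_constraints` helper is our own illustration, not a Syntho feature:

```python
def toggle_fk_constraints(tables, enable):
    """Build T-SQL statements that disable or re-enable all foreign key
    constraints on the given tables (SQL Server syntax)."""
    action = "WITH CHECK CHECK CONSTRAINT ALL" if enable else "NOCHECK CONSTRAINT ALL"
    return [f"ALTER TABLE {t} {action};" for t in tables]

print(toggle_fk_constraints(["orders"], enable=False))
# ['ALTER TABLE orders NOCHECK CONSTRAINT ALL;']
```

Remember to re-enable (and re-validate) the constraints after the generation job completes.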

I have some non-sensitive tables that I still wish to include in my destination database. What do I do with those?

Non-sensitive tables (e.g. 'definition tables' such as language or product tables) can be copied over as-is when writing the data to the destination database. This can be done on the Job configuration panel by marking the table as De-identify.

How many training records do I need for AI-powered generation?

To minimize privacy risks and improve the training algorithm's generalization ability, as a rule of thumb, a minimum column-to-row ratio of 1:500 is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
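The 1:500 rule of thumb above is easy to check programmatically; the function names in this trivial sketch are our own:

```python
def min_training_rows(n_columns, ratio=500):
    """Minimum recommended number of training rows under the 1:500
    column-to-row rule of thumb."""
    return n_columns * ratio

def meets_rule_of_thumb(n_columns, n_rows, ratio=500):
    """True if the table has enough rows for its column count."""
    return n_rows >= n_columns * ratio

print(min_training_rows(6))  # 3000
```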

I do not have any data yet. Can I also generate data from scratch?

Yes. In some scenarios, where you do not have any data yet, you may want to generate data from scratch to fill your empty database. This can be done using the Syntho platform in the following way:

  1. First, set up an empty source database and an empty destination database with the tables and columns that you want to fill. The source database should be different from the destination database.

  2. Ensure primary and foreign key columns in your source database are correctly configured and already contain some initial key values.

  3. On the Job configuration panel, drag all relevant tables under Include.

  4. Use the shortcut CTRL + SHIFT + ALT + 0 and set the value under the entry "key_generation_method" to "generate".

  5. Configure the relevant Mocker settings: for each column, go to the Column settings and select Mocker as the generator. Note that Calculated columns are currently not supported for generating data from scratch.

  6. Select the number of rows to generate under the Table settings.

  7. Finally, select Generate to generate your database.
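Step 2 above requires the key columns in the source database to hold consistent values. Seeding matching primary and foreign keys could be sketched as follows; the `seed_keys` helper and the two-table parent/child layout are hypothetical examples, not Syntho functionality:

```python
def seed_keys(n_parents, children_per_parent):
    """Generate consistent primary/foreign key pairs to seed an empty
    source database: parent ids 1..n, each referenced by a fixed number
    of child rows."""
    parent_ids = list(range(1, n_parents + 1))
    child_rows = [
        {"id": cid, "parent_id": pid}
        for cid, pid in enumerate(
            (p for p in parent_ids for _ in range(children_per_parent)), start=1
        )
    ]
    return parent_ids, child_rows

parents, children = seed_keys(2, 2)
print(children)
# [{'id': 1, 'parent_id': 1}, {'id': 2, 'parent_id': 1},
#  {'id': 3, 'parent_id': 2}, {'id': 4, 'parent_id': 2}]
```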

What can I do to improve the synthetic data utility?

There are several ways to improve the utility (a.k.a. 'quality' or 'fidelity') of the generated synthetic data. Possible options include:

  • Ensure your data is prepared according to the data preparation requirements described in the Preparing your data section.

  • Increase the value for the Maximum rows used for training parameter, and consider including all rows. If the maximum number of rows used for training is set at a lower value than the total number of input rows, you can enable the Take random sample parameter to get a more representative subset.

  • If you need to synthesize multiple related tables, follow the instructions and limitations of the Synthesize table relationships with entity-table ranking feature.

Can I also use Syntho for data de-identification or pseudonymization?

Yes. Besides Syntho's core functionality, which is synthesizing data, Syntho also allows you to de-identify your most sensitive columns. For more information, visit Database de-identification and PII obfuscation.

How can I speed up my data generation jobs?

The Syntho platform and Syntho's generative AI models are optimized to generate your data as quickly and efficiently as possible. For example, task parallelization is applied at every step of the data generation process, from reading the source data to writing the destination data.

Here are some additional tips to boost the speed of your data generation jobs:

  1. Temporarily disable constraints and indexes on the destination database to bypass database checks and maximize parallel processing, enhancing writing speed.

  2. Prevent the write performance of the destination database from becoming a bottleneck, which can happen for some database types when tables grow larger. Options to mitigate database writing speed limitations include:

     • When working with multiple tables, increase the Number of simultaneous connections to allow parallel writing of generated data.

     • Write the generated data to (Parquet) files in Azure Data Lake Storage, Amazon Simple Storage Service (S3), or your Local File System.

     • Increasing the Write batch size could have a slight impact on writing speeds.

  3. Exclude any large tables (i.e. tables with many rows and columns) from the generation process.

  4. Reduce the number of generated rows for larger tables - the Syntho platform will ensure referential integrity remains intact (i.e. subsetting).

  5. Consider taking a representative subset of those larger tables before synthesizing.

  6. For columns with AI-powered generation enabled, consider adjusting the advanced generation settings and/or reducing the number of high-cardinality columns (i.e. columns with many distinct values) in your tables.

  7. For free text columns with PII obfuscation applied, disable the feature if it is not needed.
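To see why increasing the number of simultaneous connections matters, a back-of-the-envelope estimate can help. The figures and the near-linear-scaling assumption below are purely illustrative (scaling holds only until the database itself becomes the bottleneck); `estimated_write_minutes` is not a Syntho API:

```python
def estimated_write_minutes(total_rows, rows_per_sec_per_conn, n_connections):
    """Rough writing-time estimate, assuming throughput scales roughly
    linearly with the number of parallel connections."""
    return total_rows / (rows_per_sec_per_conn * n_connections) / 60

# 12M rows at an assumed 5,000 rows/s per connection:
print(estimated_write_minutes(12_000_000, 5_000, 4))  # 10.0 minutes
```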

How can I optimize memory utilization of my cluster?

The Syntho platform offers several features to optimize memory utilization during data generation. Understanding how these features work can help you manage memory more effectively and prevent issues such as out-of-memory (OOM) errors.

  • Batch size: The number of data points processed in a single batch. A higher batch size can increase the speed of data generation but requires more memory. Adjust the batch size based on the available memory in your cluster to find the optimal balance between performance and memory usage.

  • Parallel connections (N): This parameter controls the number of tables that can be read or written in parallel. Increasing N allows multiple tables to be processed simultaneously, which can speed up data generation. However, it also increases memory usage, so adjust this parameter according to the available memory and the complexity of your database schema (considering potential foreign key relationships).

  • Training rows (N): The number of rows from the source data used to train the generative model. Using more rows can improve the quality of the synthesized data but will require more memory. Monitor the memory usage and adjust N to avoid exceeding memory limits.

  • Ray dashboard: Allows you to observe real-time memory usage, CPU utilization, and other resource metrics. Regularly monitor the Ray dashboard to track memory consumption and make necessary adjustments to batch size, parallel connections, and training rows.

  • OOM error logs: Errors such as "Workers (tasks/actors) killed due to memory pressure (OOM)" indicate that the cluster has run out of memory. If you encounter OOM errors, reduce the batch size, decrease the number of parallel connections, or lower the number of training rows to mitigate memory pressure.
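The interaction between batch size and parallelism can be reasoned about with a back-of-the-envelope bound: each table processed in parallel holds roughly one batch in memory at a time. This is a simplified mental model, not Syntho's actual memory accounting, and the `max_batch_size` helper and byte figures are hypothetical:

```python
def max_batch_size(memory_budget_bytes, bytes_per_row, n_parallel_tables):
    """Rough upper bound on batch size: the memory budget divided across
    one in-flight batch per parallel table."""
    return memory_budget_bytes // (bytes_per_row * n_parallel_tables)

# e.g. an 8 GiB budget, ~2 KiB per row, 4 tables processed in parallel:
print(max_batch_size(8 * 1024**3, 2048, 4))  # 1048576
```

Halving the parallel connections doubles the affordable batch size, and vice versa, which is why OOM errors are typically resolved by lowering either parameter.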

Does Syntho collect or store any data?

Syntho does not collect or store any data. The Syntho application runs in the customer's secure environment; Syntho itself cannot access the deployed platform and does not collect any data from the application.

For more information, see Does Syntho collect any data?
