# Frequently asked questions

### AI synthesis

#### Can real data be found in AI-generated synthetic data?

No. Synthetic data does not directly replicate the real data it is based on.

It uses learned patterns and statistical properties to generate new records, independent of the source dataset.

[Privacy control mechanisms](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/privacy-controls) such as [rare category protection](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation#rare-category-protection) and [clipping thresholds](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation#advanced-column-settings) further reduce the risk of exposing unique values or outliers.

#### Are there scenarios where similarities might appear between synthetic and real data?

In rare cases, synthetic data may contain values that also occur in the original dataset. For example, if several individuals share identical characteristics in the real data, those attributes might also appear in the synthetic dataset. This does not pose a privacy risk, however, because these values are not linked to specific individuals. Methods such as k-anonymity further minimize such risks.
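For intuition, k-anonymity holds when every combination of quasi-identifying attributes (such as age and postal code) is shared by at least *k* records, so no record stands out. A minimal sketch of such a check, illustrative only and not Syntho's implementation:

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records."""
    combos = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in combos.values())

# Hypothetical records: the (52, "1013") group has only one member,
# so this dataset is 1-anonymous but not 2-anonymous.
people = [
    {"age": 34, "zip": "1011", "diagnosis": "flu"},
    {"age": 34, "zip": "1011", "diagnosis": "cold"},
    {"age": 52, "zip": "1013", "diagnosis": "flu"},
]
```

Generalizing or suppressing the rare combination (for example, bucketing ages) would restore 2-anonymity.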

#### What measures are in place to prevent exposure of sensitive data?

Syntho implements various privacy-preserving techniques.

See [Additional privacy controls](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/privacy-controls) for details.

#### How many training records do I need for AI synthesis?

As a rule of thumb, a minimum **column-to-row ratio of 1:500** is recommended to minimize privacy risks and improve the training algorithm's ability to generalize. For example, if your **source** table has 6 columns, it should contain at least 3000 rows.
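The rule of thumb can be expressed as a one-line helper (illustrative only; the function name and default are our own):

```python
def minimum_training_rows(num_columns, ratio=500):
    """Recommended minimum row count for a given column count,
    using the 1:500 column-to-row rule of thumb."""
    return num_columns * ratio

minimum_training_rows(6)   # 3000 rows for a 6-column table
```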

#### Does AI synthesis work for multiple tables?

AI synthesis works best on single tables.

For best utility with minimal resources, [prepare your data as a single entity table](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation#entity-tables).

If you have sequence data, AI synthesis can work with 2 tables using Syntho’s [sequence model](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/sequence-model).

If you need to anonymize multiple related tables while keeping consistency across tables, Syntho’s other [generators](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings) are typically more practical.

#### With AI synthesis, what can I do to improve synthetic data utility?

There are several ways to improve the utility (a.k.a. 'quality' or 'fidelity') of the generated synthetic data. A few options:

1. Prepare a single entity table. Follow [Preparing your data](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation#preparing-your-data).
2. Increase [Maximum rows used for training](https://docs.syntho.ai/configure-a-data-generation-job/configure-table-settings#advanced-table-settings). Consider using all rows.
   * If you limit training rows, enable [Take random sample](https://docs.syntho.ai/configure-a-data-generation-job/configure-table-settings#advanced-table-settings) for a more representative subset.
3. For sequence data, structure your data as 2 related tables. Follow [Sequence model](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/sequence-model).

### Deployment

#### What permissions do I need to deploy Syntho?

Syntho requires an administrative user for installation. If you use **Docker** for deployment, the user needs **sudo** rights for **docker**/**docker-compose**.

### Databases and generation jobs

#### How should I provision my databases for data generation?

The Syntho platform works with source and destination databases.

* The **source database** stores the input data for your data generation job and can be **read-only**.
* The **destination database** must contain empty tables and columns that are structurally identical to the **source** database. It requires **write access**, as the generated data is written to it.

The **source** and **destination** databases can run on the same server, or be hosted on different servers.

#### How does Syntho handle constraints on my database?

Syntho uses relevant metadata stored in your **source** database for generating the data that is written into the **destination** database. For example, foreign key constraints are inferred to reproduce table relationships and referential integrity in the generated database.

Syntho can handle scenarios where the constraints of the destination database are either enabled or disabled.
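To illustrate the kind of metadata involved (not Syntho's actual implementation), most databases expose foreign-key definitions that a tool can read programmatically. A minimal sketch using Python's built-in SQLite driver:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    );
""")

# Each row describes one foreign key:
# (id, seq, referenced table, local column, referenced column, ...)
fks = conn.execute("PRAGMA foreign_key_list(orders)").fetchall()
for fk in fks:
    print(fk[2], fk[3], fk[4])  # customers customer_id id
```

A generator that reads this metadata can reproduce the `orders → customers` relationship in the destination, preserving referential integrity.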

#### I have non-sensitive tables that I still want in my destination database. What should I do?

Non-sensitive tables (for example, definition tables such as language or product tables) can be copied as-is when writing to the destination database.

Use the **Duplicate** generator for the relevant columns.

For more information, see [Table view](https://docs.syntho.ai/configure-a-data-generation-job/configure-table-settings#table-modes).

#### I do not have any data yet. Can I generate data from scratch?

Yes. If you do not have any data yet, you can generate data from scratch to fill an empty database.

You can do this in the Syntho platform as follows:

1. First, set up an empty **source** database and **destination** database with the tables and columns that you want to fill. The source database should be different from the destination database.
2. Ensure primary and foreign key columns in your **source** database are correctly configured and already have **some key values present**.
3. On the **Job configuration** panel, drag all relevant tables under **Include**.
4. Use the shortcut `CTRL + SHIFT + ALT + 0` and set the value under the entry "**key\_generation\_method**" to "**generate**".
5. For each column, go to **Column settings** and select [Mock](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/mockers) as the generator.\
   Note: [Calculated columns](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/calculated-columns) are currently not supported for generating data from scratch.
6. Configure the relevant **Mock** settings.
7. Select the **number of rows** to generate under the Table settings.
8. Finally, select **Generate** to generate your database.

#### Can I use Syntho for data de-identification, masking, or pseudonymization?

Yes, Syntho also allows you to mask or de-identify your most sensitive columns. The following section describes how this can be done:

[Database de-identification](https://docs.syntho.ai/overview/get-started/database-de-identification)

### Performance and scaling

#### How can I speed up my data generation jobs?

The Syntho platform and its generative models are optimized to generate data quickly and efficiently.

For example, task parallelization is applied throughout the pipeline, from reading source data to writing destination data.

Here are some additional tips to boost the speed of your data generation jobs:

1. [Exclude](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/exclude) large tables from the generation process.
2. [Reduce the number of generated rows](https://docs.syntho.ai/configure-a-data-generation-job/configure-table-settings#adjust-the-number-of-rows-to-generate) for large tables.
   * This only works if the table is not referenced by other tables.
3. For columns with [AI synthesize](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation) enabled:
   * Adjust the [advanced settings](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation#advanced-settings).
   * Reduce the number of high-cardinality columns.
4. Disable [PII obfuscation](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/duplicate/automatic-pii-discovery-and-de-identification-in-free-text-columns) for columns containing free text.
5. In some database types, **write performance** can become a bottleneck when working with larger tables. Syntho offers several options to optimize write performance:
   1. **Use parallel writing** by increasing the [**Number of simultaneous connections**](https://docs.syntho.ai/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings#advanced-generation-settings) to allow parallel writing across multiple tables.
   2. [**Bypass database checks**](https://docs.syntho.ai/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings#data-generation-capabilities) (foreign key constraints, indexes, and identity/auto-increment behavior) to significantly reduce write time. This is especially effective with a [higher maximum number of connections](https://docs.syntho.ai/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings#advanced-generation-settings).
   3. **Increase the** [**Write batch size**](https://docs.syntho.ai/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings#advanced-generation-settings) to modestly improve write speed.
   4. **Write generated data directly to Parquet files** in [Azure Data Lake Storage (ADLS)](https://docs.syntho.ai/setup-workspaces/create-a-workspace/connect-to-a-database/azure-data-lake-storage-adls) or [Amazon Simple Storage Service (S3)](https://docs.syntho.ai/setup-workspaces/create-a-workspace/connect-to-a-database/amazon-simple-storage-service-s3) to skip database checks.
   5. **Consider synthesizing a representative subset** of tables instead of the full dataset. You can prepare these subsets in your source database as *(MATERIALIZED) VIEWS*, which simplifies preprocessing and reduces storage costs. Syntho supports reading and processing views directly.
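To illustrate the batching idea behind a write batch size setting, here is a generic sketch (not Syntho code) using Python's built-in `sqlite3` module. Rows are inserted per batch rather than one at a time, trading memory per write for fewer round trips; the table name and sizes are hypothetical:

```python
import sqlite3

def write_in_batches(conn, rows, batch_size):
    """Insert rows in fixed-size batches; larger batches mean fewer
    round trips and commits, at the cost of more memory per write."""
    cur = conn.cursor()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        cur.executemany("INSERT INTO events (value) VALUES (?)", batch)
        conn.commit()  # one commit per batch instead of per row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (value INTEGER)")
write_in_batches(conn, [(i,) for i in range(10_000)], batch_size=2_000)
```

Here 10,000 rows are written in five batches of 2,000; with a per-row loop the same insert would incur 10,000 round trips.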

#### How can I optimize memory utilization of my cluster?

The Syntho platform offers several features to optimize memory utilization during data generation. Understanding how these features work can help you manage memory more effectively and prevent issues such as out-of-memory (OOM) errors.

1. [Batch size](https://docs.syntho.ai/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings): The number of data points processed in a single batch. Higher values can speed up generation, but increase memory usage.
2. [Number of simultaneous connections](https://docs.syntho.ai/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings#advanced-generation-settings): The number of tables that can be read or written in parallel. Higher values speed up generation, but increase memory usage.
3. [Maximum rows used for training](https://docs.syntho.ai/configure-a-data-generation-job/configure-table-settings#advanced-table-settings): The number of source rows used to train the model. More rows can improve utility, but increase memory usage.
4. [Ray dashboard](https://docs.syntho.ai/deploy-syntho/logs-and-monitoring): Monitor memory usage, CPU utilization, and other runtime metrics.
5. [OOM error logs](https://docs.syntho.ai/deploy-syntho/logs-and-monitoring): Errors like `Workers (tasks/actors) killed due to memory pressure (OOM)` mean the cluster ran out of memory. Reduce batch size, reduce parallel connections, or lower training rows.
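As a rough mental model (an illustrative heuristic, not Syntho's actual memory accounting), peak working memory grows with the batch size, the number of parallel connections, and the row width, so lowering any of these settings relieves memory pressure:

```python
def estimated_peak_memory_mb(batch_size, connections, bytes_per_row):
    """Rough upper bound: each parallel connection may hold one
    batch in memory at a time. All inputs here are hypothetical."""
    return batch_size * connections * bytes_per_row / 1_000_000

# 50k-row batches, 4 parallel tables, ~200 bytes per row -> ~40 MB
estimated_peak_memory_mb(50_000, 4, 200)
```

If the Ray dashboard shows memory approaching the cluster limit, halving the batch size or the number of connections halves this estimate.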

### Data handling

#### Does Syntho collect or store any data?

Syntho does not collect or store any data.

Syntho runs entirely in the customer’s environment and cannot access customer data.

For more information, see [Does Syntho collect any data?](https://docs.syntho.ai/deploy-syntho/logs-and-monitoring/does-syntho-collect-any-data).
