# AI-generated synthetic data

The diagram below illustrates a workflow for AI-generated synthetic data. Detailed information about each step shown in the diagram is provided throughout this page.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/VYqtsEUu3kMmvoxPitCs/AI_single_table.png" alt=""><figcaption><p>Workflow of AI-generated synthetic data process</p></figcaption></figure>

Before starting the AI-generated synthetic data use case, watch the video below for a short introduction to data generators.

{% embed url="<https://youtu.be/668vK84zCD8>" %}
A short introduction to data generators
{% endembed %}

In this key use case, a single table named **census**, containing census data, is synthesized using Syntho's AI synthesize generator. For AI and analytics, it is crucial to maximize privacy while generating highly realistic data that statistically reflects the original dataset. The first step is to create a workspace in Syntho that is linked to the census database. This workspace contains only the **census** table. The table's columns and a sample of its data are shown below.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/REvuVVnOpjAkmr8N6S8O/pic2.png" alt=""><figcaption><p>Table census</p></figcaption></figure>

***

## [Prerequisites](https://docs.syntho.ai/overview/get-started/prerequisites)

For prerequisites, see [Prerequisites](https://docs.syntho.ai/overview/get-started/prerequisites) or watch the video below.

{% embed url="<https://youtu.be/BveahcOYYVk>" %}
Prerequisites for AI generated synthetic data
{% endembed %}

***

## Preparing your data

Before using AI synthesize, make sure your data is suitable for synthesis. Syntho expects your data to be stored in an **entity table** that adheres to the following guidelines:

1. Maintain a minimum **column-to-row ratio of 1:500**, that is, at least 500 rows per column, for privacy and algorithmic generalization. With 15 columns, aim for at least 7,500 rows; our example database exceeds this with 48,842 rows (see the illustration below).
2. Describe **each entity in a single row**, ensuring row independence without sequential information. This means that each row can be treated independently: the order of the rows does not convey any information, and the contents of one row do not affect other rows.
3. **Remove columns that are derived directly from other columns and do not contain additional information**. For example, you may have a redundant duration column that is derived from the start\_time and end\_time columns. For categorical columns, there could be hierarchical relationships, such as a redundant Treatment category column that can be inferred from a Treatment Type column. Removing such redundant columns simplifies the modeling process and leads to higher quality synthetic data. If you still need such columns, they can be constructed with the [calculated columns](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/calculated-columns) feature.
4. Avoid column names that contain privacy-sensitive information, such as patient\_a\_medications, patient\_b\_medications, etc. Instead, use a single patient column that contains the names. This prevents patient names from being exposed in metadata or from bypassing rare category protection (e.g., there is a patient\_a column, but this patient appears only five times in the whole dataset).

Our example table fully meets these criteria.
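The row-count guideline above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of Syntho; the function names `min_rows_required` and `meets_ratio` are hypothetical and used only for illustration.

```python
def min_rows_required(n_columns: int, rows_per_column: int = 500) -> int:
    """Minimum number of rows suggested by a 1:rows_per_column column-to-row ratio."""
    return n_columns * rows_per_column


def meets_ratio(n_columns: int, n_rows: int, rows_per_column: int = 500) -> bool:
    """Return True if the table has at least rows_per_column rows per column."""
    return n_rows >= min_rows_required(n_columns, rows_per_column)


# Values from the census example on this page: 15 columns, 48,842 rows.
print(min_rows_required(15))    # 7500
print(meets_ratio(15, 48_842))  # True
```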

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/GslLJTlP787XzsU7IkGM/pic4.png" alt="" width="375"><figcaption><p>Calculating rows</p></figcaption></figure>

{% embed url="<https://youtu.be/i--iXsJNWag>" %}
Preparing your data
{% endembed %}

## Configuring Column Settings

Under **Column settings > Generation Method**, select **AI synthesize** for Syntho's ML models to synthesize data.

{% embed url="<https://youtu.be/w0ABWEI51ss>" %}
Configure table settings
{% endembed %}

## [Rare category protection](#rare-category-protection)

This feature helps hide sensitive or rare observations, like specific occupations in the **census** table, by replacing them with a user-defined value, enhancing data privacy. Adjust the **rare category protection threshold** and **replacement value** in **Column settings > Encoding type > Advanced settings**.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/VAYZThIH8xr89KuR5ZqE/pic7.png" alt="" width="186"><figcaption><p>Column occupation</p></figcaption></figure>

* **Rare category protection threshold**: All column values that occur this many times or fewer are automatically replaced.
* **Rare category replacement value**: The value that replaces all column values occurring no more often than the rare category protection threshold.

For example, occupations that appear **15 times or fewer** are replaced with an asterisk (“\*”). Both the threshold and the replacement value are configurable and can be set by the user (see the illustration below).

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/4RgWo4CRjTyVdoZVnjRS/image.png" alt="" width="375"><figcaption><p><strong>Rare category protection</strong></p></figcaption></figure>
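To make the threshold rule concrete, here is a minimal sketch of the replacement logic as described above. It is an illustration only and not Syntho's implementation; the function name `protect_rare_categories` and the sample values are assumptions.

```python
from collections import Counter


def protect_rare_categories(values, threshold=15, replacement="*"):
    """Replace every value that occurs `threshold` times or fewer with `replacement`."""
    counts = Counter(values)
    return [replacement if counts[v] <= threshold else v for v in values]


# Made-up sample: one frequent occupation and one rare occupation.
occupations = ["Sales"] * 20 + ["Armed-Forces"] * 3
protected = protect_rare_categories(occupations, threshold=15, replacement="*")
print(Counter(protected))  # Counter({'Sales': 20, '*': 3})
```

Note that a value occurring exactly as often as the threshold is also replaced, which matches the "15 times or fewer" wording.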

{% embed url="<https://youtu.be/IK-9IFyvB80>" %}
Rare category protection
{% endembed %}

## [PII scanner](https://docs.syntho.ai/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner) and [Mockers](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/mockers)

{% hint style="info" %}
The PII scanner provides a starting point for PII detection. Users should perform additional reviews to identify and handle any other sensitive data that may not be detected by the scanner.
{% endhint %}

### [PII scanner](https://docs.syntho.ai/configure-a-data-generation-job/privacy-dashboard/automatic-pii-discovery-with-pii-scanner)

In the **PII tab**, you can add new columns to the list of PII columns, either manually or by using Syntho's **PII scanner**. You have the option to manually label columns containing PII by selecting the column name and optionally choosing a mocker to apply. Clicking "**Confirm**" will mark the column as containing PII and confirm the **mocker** selection.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/ovkmVnQoZgPgCPxPefJN/pic5_manual.png" alt="" width="563"><figcaption><p>Manual PII labelling</p></figcaption></figure>

Alternatively, deploy automatic PII discovery with the PII scanner. Launch a scan to detect PII across all database columns from the PII tab in the **Job Configuration** panel. Note that the scanner offers both **Shallow** and **Deep** scan modes:

* The **shallow scan** assesses columns using regular expression rules to identify PII. It is optimized for speed, but its accuracy varies.
* The **deep scan** examines both the metadata and the data within columns for more thorough PII identification.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/DhUSBW8ITWYuispugJ0o/pic5_PII.png" alt="" width="563"><figcaption><p>Automatic PII labelling</p></figcaption></figure>
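As a rough illustration of what a regular-expression-based scan can look like, the sketch below matches column names against a few patterns. The rules, labels, and the function name `shallow_scan` are hypothetical, and matching on column names only is an assumption; Syntho's actual rules are not documented on this page.

```python
import re

# Hypothetical rules: PII label -> pattern matched against column names.
PII_NAME_RULES = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "name": re.compile(r"(first|last|full)[-_]?name", re.IGNORECASE),
}


def shallow_scan(column_names):
    """Return {column_name: pii_label} for columns whose name matches a rule."""
    findings = {}
    for column in column_names:
        for label, pattern in PII_NAME_RULES.items():
            if pattern.search(column):
                findings[column] = label
                break  # keep the first matching label only
    return findings


print(shallow_scan(["age", "Email_Address", "last_name", "occupation"]))
# {'Email_Address': 'email', 'last_name': 'name'}
```

A name-based check like this is fast but can miss PII stored in generically named columns, which is why a manual review remains necessary.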

### [Mockers](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/mockers)

As with de-identification, you can replace PII columns using mockers or exclude them. Otherwise, those PII columns are treated as categorical columns and processed by Syntho's “[Rare category protection](#rare-category-protection)”.

{% embed url="<https://youtu.be/JwoiuoH6Abc>" %}
PII scanner and Mockers
{% endembed %}

## [Advanced generator settings](#advanced-generator-settings)

In the job settings, under table settings, you can adjust generator-level settings, including the maximum number of rows for training to optimize speed. Leaving this setting as **None** utilizes all rows. The **Take random sample** option allows for sampling:

* **On**: Random rows are selected for training.
* **Off**: Top rows as per the database are used.
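The difference between the two sampling options can be sketched as follows. This is an illustration under stated assumptions, not Syntho's code; `select_training_rows` and its parameters are hypothetical names that mirror the settings described above.

```python
import random


def select_training_rows(rows, max_rows=None, take_random_sample=False, seed=None):
    """Pick the rows used for training.

    max_rows=None uses all rows. Otherwise at most max_rows rows are used:
    a random subset if take_random_sample is True, else the first rows in order.
    """
    if max_rows is None or max_rows >= len(rows):
        return list(rows)
    if take_random_sample:
        return random.Random(seed).sample(rows, max_rows)
    return list(rows[:max_rows])


rows = list(range(10))
print(select_training_rows(rows))              # all 10 rows
print(select_training_rows(rows, max_rows=3))  # [0, 1, 2]
print(len(select_training_rows(rows, max_rows=3, take_random_sample=True)))  # 3
```

If the table is sorted in a meaningful way (for example, by date), taking the top rows may give a biased training set, so random sampling is usually the safer choice.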

## Start data generation process

To start data generation, you can do the following:

1. On the **Job configuration** panel, select **Generate.**
2. On the **Job configuration summary** panel, adjust generation parameters as desired.
3. Finally, select **Start generating**.

## [Model parameters](https://docs.syntho.ai/configure-a-data-generation-job/generation-and-validation/view-and-adjust-generation-settings)

Before initiating the generation process, you have the option to modify model parameters. Here's an overview:

* **Read batch size:** The number of rows read from each source table per batch.
* **Write batch size:** The number of rows inserted into each destination table per batch.
* **N connections:** Specifies the number of connections.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/31tze9pRC4sHOVJ4m2iE/image.png" alt="" width="375"><figcaption><p>Model parameters</p></figcaption></figure>
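To give a feel for what a batch size means, the sketch below splits a sequence of rows into fixed-size batches. It only illustrates the general idea of batched reading and writing; the helper `iter_batches` is hypothetical, and the batch size of 10,000 is an arbitrary example value, not a Syntho default.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# The census example has 48,842 rows; with a batch size of 10,000
# this gives four full batches and one final batch of 8,842 rows.
batches = list(iter_batches(range(48_842), 10_000))
print(len(batches))      # 5
print(len(batches[-1]))  # 8842
```

Larger batches generally mean fewer round trips to the database at the cost of more memory per batch.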

### Truncate tables before each new data generation job

Before starting each new data generation job, you must manually **TRUNCATE** the tables in the **DESTINATION** database. If existing constraints prevent truncation, temporarily disable them, truncate the tables, and then re-enable the constraints. For example, when foreign key constraints block truncation, you can use the following SQL commands (MySQL syntax; other databases use different commands): first, disable the checks with `SET FOREIGN_KEY_CHECKS = 0;`, then **TRUNCATE** the table, and finally re-enable the checks with `SET FOREIGN_KEY_CHECKS = 1;`. This sequence ensures that tables are properly prepared for data generation without constraint violations.
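When several destination tables need to be truncated, the sequence above can be generated programmatically. The following is a minimal sketch that only builds the SQL statements as strings, assuming a MySQL-compatible destination database; the function name `truncate_statements` and the table names are hypothetical.

```python
import re

# Accept only simple identifiers, since names are interpolated into SQL text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def truncate_statements(tables):
    """Build the MySQL statements that truncate `tables` with FK checks disabled."""
    for table in tables:
        if not _IDENTIFIER.match(table):
            raise ValueError(f"Unexpected table name: {table!r}")
    statements = ["SET FOREIGN_KEY_CHECKS = 0;"]
    statements += [f"TRUNCATE TABLE `{table}`;" for table in tables]
    statements.append("SET FOREIGN_KEY_CHECKS = 1;")
    return statements


for statement in truncate_statements(["census"]):
    print(statement)
# SET FOREIGN_KEY_CHECKS = 0;
# TRUNCATE TABLE `census`;
# SET FOREIGN_KEY_CHECKS = 1;
```

Run the generated statements in a single session against the destination database, because `FOREIGN_KEY_CHECKS` set this way applies to the current session.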

{% embed url="<https://youtu.be/0YCd5KQgpRw>" %}
Start a data generation process
{% endembed %}

## Evaluation

[SDMetrics](https://docs.sdv.dev/sdmetrics) is an [open-source](https://github.com/sdv-dev/SDMetrics) **Python** library for evaluating tabular synthetic data. It measures how closely the synthetic data mimics the mathematical properties of the real data, known as synthetic data fidelity. It lets you select which metrics to evaluate, explains the results in detail, visualizes scores, and produces shareable reports.

For the full list of metrics and more information, see the [SDMetrics documentation](https://docs.sdv.dev/sdmetrics).

In the notebook below, which is based on the original SDMetrics notebook, we compare real and synthetic demo data using SDMetrics. The notebook also includes a shareable report that you can use to explore insights and create visualizations.

{% file src="<https://1383248054-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FU61B9DqtWCNO3Z30vnjh%2Fuploads%2Fgit-blob-f038b69b3f398215e5c5122c57457f34d6237671%2FSDV-evaluation-notebook-metrics.ipynb?alt=media>" %}
Jupyter Notebook
{% endfile %}
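To illustrate the kind of fidelity score such an evaluation produces, the sketch below computes the complement of the total variation distance between the category frequencies of a real and a synthetic column. It follows the definition of SDMetrics' TVComplement metric but is a standalone re-implementation in plain Python, not a call into the SDMetrics library; the function name `tv_complement` and the sample data are assumptions.

```python
from collections import Counter


def tv_complement(real_values, synthetic_values):
    """Return 1 - total variation distance between two categorical columns.

    1.0 means identical category frequencies; 0.0 means completely different.
    """
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    distance = 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )
    return 1.0 - distance


# Made-up columns: 60/40 split in the real data, 50/50 in the synthetic data.
real = ["a"] * 6 + ["b"] * 4
synthetic = ["a"] * 5 + ["b"] * 5
print(round(tv_complement(real, synthetic), 3))  # 0.9
```

For a real evaluation, use the SDMetrics library and the notebook above, which cover many more metrics and both numerical and categorical columns.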

