# AI synthesize

AI synthesize can be especially useful in the following situations:

1. To generate a synthetic feature dataset for ML model development.
2. When statistical accuracy and maximum privacy are needed.
3. To expand dataset rows while maintaining original statistical properties.

## Apply AI synthesize

1. Open your **workspace**.
2. From the **Main hub** or **Table view** tab, select the column where you want to apply a generator.
3. Under **Column parameters** > **Generator**, select **AI synthesize** to enable Syntho's machine learning (ML) models to automatically synthesize the data in your tables.
4. Set the relevant AI synthesize parameters.
5. Select **Confirm**.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/vajHFp6cauATSmB11W62/image.png" alt="" width="563"><figcaption><p>Selecting generators in column parameters</p></figcaption></figure>

## Preparing your data

When using AI-powered synthetic data generation, make sure your data is structured in a way that is suitable for synthesis.

### Entity tables

Syntho expects your data to be stored in **entity tables** that satisfy the following:

* To minimize privacy risks and improve the training algorithm's generalization ability, as a rule of thumb, a minimum **column-to-row ratio of 1:500** is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
* Each entity is described in one row.
* Each row can be treated **independently.**\
  The order of the rows does not convey any information. The contents of one row also do not affect other rows.
* Avoid column names that contain **privacy-sensitive information**, like `patient_a_medications`, `patient_b_medications`, etc. Instead, have a patient column with the names. This prevents patient names from being exposed in metadata and from bypassing rare category protection (e.g., there’s a `patient_a` column, but this patient appears only five times in the whole dataset).
* Remove columns that are **derived directly from other columns**. For example, you may have a `net_amount` column that is derived from the `gross_amount` and `taxes` columns. For categorical columns, there could be hierarchical relationships, such as a redundant `Treatment category` column referring to a `Treatment` column. Removing such redundant columns will simplify the modeling process and will lead to higher quality synthetic data.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/dv8WEHy9id8LQrcDKS4R/image.png" alt=""><figcaption><p>Example of an entity table (each row describes an individual patient and can be treated independently)</p></figcaption></figure>
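
The column-to-row rule of thumb above is simple arithmetic, and it can be checked before a table is connected. The following is a minimal sketch in plain Python; the function names `min_rows_for` and `meets_ratio` are hypothetical helpers for illustration and are not part of the Syntho platform.

```python
def min_rows_for(n_columns: int, rows_per_column: int = 500) -> int:
    """Minimum row count suggested by the 1:500 column-to-row rule of thumb."""
    return n_columns * rows_per_column


def meets_ratio(n_columns: int, n_rows: int, rows_per_column: int = 500) -> bool:
    """True when the table has at least `rows_per_column` rows per column."""
    return n_rows >= min_rows_for(n_columns, rows_per_column)


# A source table with 6 columns should contain at least 3000 rows.
print(min_rows_for(6))       # 3000
print(meets_ratio(6, 2500))  # False
print(meets_ratio(6, 3000))  # True
```

Because the ratio is a recommendation rather than a hard limit, a check like this is best treated as a warning and not as a blocking validation.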

### Entity table-linked table dataset

Syntho can process data in the form of **lists**, **sequences**, or **time series** when it is structured as an entity table with a linked table. Ensure your data satisfies the following:

* The structure is tailored for handling **lists**, **sequences**, or **time-series data**.
* It includes two tables:
  * an **entity table** that satisfies the [**Entity tables requirements**](#entity-tables).
  * a **linked table**.
* Each record in the entity table needs a unique ID (**primary key**).
* Each record in the linked table must reference the unique ID from the entity table (**foreign key**).
* Similar to the requirements for [**Entity tables**](#entity-tables), eliminate columns whose values are **directly derived from other columns**.
* Remove row values that are derived directly from values in other rows. For instance, if your dataset includes sequences with `start_date` and `end_date` columns, and each `start_date` matches the `end_date` of the row before it, remove one of the two redundant columns, either `start_date` or `end_date`.
* For more information on preparing your data when synthesizing complex table relationships, see [sequence-model](https://docs.syntho.ai/configure-a-data-generation-job/configure-column-settings/ai-powered-generation/sequence-model "mention").

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/RvsKlVJ1I4u6ywHD6dng/image.png" alt=""><figcaption><p>Example of a linked table (multiple rows can be linked to the same patient, describing a series of time events for that patient)</p></figcaption></figure>
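
The primary key and foreign key requirements above can be verified on an export of the two tables before synthesis. The sketch below assumes each table is available as a list of dictionaries; the helpers `duplicate_keys` and `orphaned_rows`, and the `patients` and `visits` sample data, are hypothetical and only illustrate the checks.

```python
def duplicate_keys(entity_rows, primary_key):
    """Return primary key values that appear more than once, sorted."""
    seen, duplicates = set(), set()
    for row in entity_rows:
        key = row[primary_key]
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return sorted(duplicates)


def orphaned_rows(entity_rows, linked_rows, primary_key, foreign_key):
    """Return linked rows whose foreign key matches no entity primary key."""
    known_ids = {row[primary_key] for row in entity_rows}
    return [row for row in linked_rows if row[foreign_key] not in known_ids]


# Hypothetical sample data: one entity table and one linked table.
patients = [{"patient_id": 1}, {"patient_id": 2}]
visits = [
    {"patient_id": 1, "visit_date": "2024-01-05"},
    {"patient_id": 1, "visit_date": "2024-02-11"},
    {"patient_id": 3, "visit_date": "2024-03-02"},  # no matching patient
]

print(duplicate_keys(patients, "patient_id"))  # []
print(orphaned_rows(patients, visits, "patient_id", "patient_id"))
```

An empty result from both helpers indicates that every entity has a unique ID and every linked row references an existing entity.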

## Supported data types

The Syntho platform supports a wide variety of data types. Under the hood, Syntho uses an encoding scheme where each data type is mapped to one of the following encoding types.

| Encoding type               | Description                                  |
| --------------------------- | -------------------------------------------- |
| [Discrete](#discrete)       | Numerical counts (e.g. number of visits)     |
| [Continuous](#continuous)   | Continuous values (e.g. weight, temperature) |
| [Categorical](#categorical) | Predefined values (e.g. blood type, country) |
| [Datetime](#datetime)       | Timestamps and dates (e.g. created at)       |

### Discrete

Syntho uses a discrete encoding type to synthesize numerical values that have a countable number of values between any two values. For example, the number of customer complaints or the number of flaws or defects.

### Continuous

To synthesize numerical values that have an infinite number of values between any two values, such as weight and height, Syntho uses a continuous encoding type.

### Categorical

A categorical column takes one of a fixed set of possible values, such as a person's blood type (`A`, `B`, `AB`, or `O`). Categorical encoding prevents random values (for instance, `M`, `X`, or `Z`) from appearing in your synthetic dataset.

Under **Encoding type** > [**Advanced settings**](#advanced-column-settings), the [**Rare category protection settings**](#rare-category-protection) appear, which you can use to protect rare categories. Rare categories could otherwise be used to re-identify outliers in the synthetic data.

{% hint style="info" %}
**Note**: The categorical encoding type is the **default fallback encoding type** used by Syntho. This means that any database types that are unknown to Syntho will automatically be encoded as categorical.
{% endhint %}
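
To make the idea of a fallback encoding concrete, the sketch below guesses an encoding type from sample values and falls back to categorical for anything it does not recognize. This is an illustrative heuristic only: the function `guess_encoding_type` is hypothetical, and Syntho's actual mapping is based on database types and is not described here.

```python
from datetime import date, datetime, time


def guess_encoding_type(values):
    """Guess an encoding type from sample values (illustrative heuristic only)."""
    present = [v for v in values if v is not None]
    if not present:
        return "categorical"  # nothing to inspect, so use the fallback
    if all(isinstance(v, (datetime, date, time)) for v in present):
        return "datetime"
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "discrete"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "continuous"
    return "categorical"  # fallback for unknown or mixed types


print(guess_encoding_type([3, 0, 7]))            # discrete
print(guess_encoding_type([71.5, 80, None]))     # continuous
print(guess_encoding_type([date(2024, 1, 5)]))   # datetime
print(guess_encoding_type(["A", "AB", "O"]))     # categorical
print(guess_encoding_type([b"\x00\x01"]))        # categorical
```

A type-based guess like this would treat integer codes, such as postal codes, as discrete even though they are better handled as categorical, which is one reason to review the encoding type of each column.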

### Datetime

The **Datetime** encoding type is used for values that contain a date component, a time component, or both.

By using this encoding type, Syntho is able to synthesize these values and generate dates and times that are statistically valid and representative.

Syntho supports all date and datetime data types for the [**Syntho connectors**](https://docs.syntho.ai/setup-workspaces/create-a-workspace/connect-to-a-database).

#### Limitations

* Datetime columns support precision up to milliseconds. Nanosecond precision is not supported.
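
To see what millisecond precision means for a value, the sketch below drops the digits below one millisecond from a Python `datetime`. It is an illustration only: the helper `truncate_to_milliseconds` is hypothetical, and whether Syntho truncates or rounds finer-grained values is not stated in this documentation.

```python
from datetime import datetime


def truncate_to_milliseconds(value: datetime) -> datetime:
    """Drop sub-millisecond digits from a datetime (truncation, not rounding)."""
    return value.replace(microsecond=(value.microsecond // 1000) * 1000)


ts = datetime(2024, 3, 1, 12, 30, 45, 123456)  # microsecond precision
print(truncate_to_milliseconds(ts).isoformat(timespec="milliseconds"))
# 2024-03-01T12:30:45.123
```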

## Rare category protection

Following the privacy-by-design principle, Syntho automatically replaces all rare categorical observations with a user-defined value in a column encoded as a categorical column.

Replacing those rare categories helps prevent sensitive values from leaking into the synthetic data.

* **Rare category protection threshold**: All column values that occur this many times or fewer are automatically replaced.
* **Rare category replacement value**: The value that replaces all column values that occur no more often than the rare category protection threshold.

Under **Column parameters > Encoding type,** select **Advanced settings** to adjust the **Rare category protection threshold** and **Rare category replacement value**.

By default, the **rare category protection threshold** value is set at 10. This means that all column values that occur 10 times or fewer are automatically replaced by the user-defined value.

Under **Column parameters > Encoding type,** select **Advanced settings** to adjust the **Rare category replacement value**.

By default, the **rare category replacement value** is an asterisk (**\***). This means that all values that occur no more often than the **rare category protection threshold** value will be replaced with an asterisk.

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/XujASzN9YgEZ3XKSxRHu/image.png" alt="" width="563"><figcaption><p>Advanced settings for a rare category</p></figcaption></figure>
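
The replacement rule can be sketched in a few lines of plain Python, using the default threshold of 10 and the default replacement value `*` described above. The function `protect_rare_categories` and the sample data are hypothetical and only illustrate the rule; they are not Syntho's implementation.

```python
from collections import Counter


def protect_rare_categories(values, threshold=10, replacement="*"):
    """Replace every value that occurs `threshold` times or fewer."""
    counts = Counter(values)
    return [replacement if counts[v] <= threshold else v for v in values]


# Hypothetical column: "AB" occurs exactly 10 times and "B" only 3 times.
blood_types = ["O"] * 40 + ["A"] * 25 + ["AB"] * 10 + ["B"] * 3
protected = protect_rare_categories(blood_types)

print(Counter(protected))
# Counter({'O': 40, 'A': 25, '*': 13})
```

Note that `AB` is replaced even though it occurs exactly 10 times, because the threshold is inclusive.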

## Advanced settings

### Advanced generator settings

In the right panel, go to **Table settings** and scroll down to **Advanced settings** to view and adjust generator-level settings. Depending on the job configuration, a generator is applied to one or more columns.

You can adjust the following advanced generator settings:

1. **Maximum rows used for training**: The maximum number of rows to be used for training. Using fewer rows can speed up the process. Leave this value at None to use all rows for training.
2. **Take random sample:**
   * **On**: takes a random sample of rows used for training.
   * **Off**: takes the top rows as defined in the database.
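
The interaction between these two settings can be sketched as follows. The function `select_training_rows` and its parameter names are hypothetical stand-ins for the settings above, and the `seed` argument is added only to make the example reproducible.

```python
import random


def select_training_rows(rows, max_rows=None, take_random_sample=False, seed=None):
    """Pick the rows used for training from a list of rows."""
    if max_rows is None or max_rows >= len(rows):
        return list(rows)  # None means all rows are used
    if take_random_sample:
        # On: a random sample of max_rows rows, without replacement
        return random.Random(seed).sample(rows, max_rows)
    # Off: the top rows, in the order the database returns them
    return list(rows[:max_rows])


rows = list(range(100))  # stand-in for 100 table rows
print(len(select_training_rows(rows)))                     # 100
print(select_training_rows(rows, max_rows=5))              # [0, 1, 2, 3, 4]
print(len(select_training_rows(rows, max_rows=5, take_random_sample=True)))  # 5
```

Taking the top rows is only representative when the table has no meaningful ordering. If the rows are sorted, for example by date, a random sample usually reflects the full table better.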

### Advanced column settings

Select **Advanced settings** under **Encoding type** to view and adjust column-level settings.

You can adjust the following advanced column settings, depending on the selected encoding type:

#### Discrete | Continuous | Datetime

1. **Clipping threshold:** Sets the floor and ceiling of a column to its *`Nth`* lowest and *`Nth`* highest value, where *`N`* is the clipping threshold. Values are then limited so that they do not fall below the floor or exceed the ceiling.
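
As a worked example of the definition above, the sketch below takes the Nth lowest and Nth highest values of a column as the floor and ceiling, then limits every value to that range. The function `clip_column` is hypothetical, and this documentation does not state at which stage of the job clipping is applied, so the example shows only the arithmetic.

```python
def clip_column(values, clipping_threshold):
    """Clip values to the Nth lowest and Nth highest value in the column."""
    if not 1 <= clipping_threshold <= len(values):
        raise ValueError("clipping_threshold must be between 1 and len(values)")
    ordered = sorted(values)
    floor = ordered[clipping_threshold - 1]  # Nth lowest value
    ceiling = ordered[-clipping_threshold]   # Nth highest value
    if floor > ceiling:
        raise ValueError("clipping_threshold is too large for this column")
    return [min(max(v, floor), ceiling) for v in values]


ages = [2, 18, 21, 25, 30, 34, 41, 47, 52, 97]
print(clip_column(ages, 2))
# [18, 18, 21, 25, 30, 34, 41, 47, 52, 52]
```

With a threshold of 2, the floor is 18 and the ceiling is 52, so the outliers 2 and 97 are pulled to those bounds while all other values are unchanged.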

#### Categorical | Text containing PII

1. **Rare category protection threshold**: All column values that occur this many times or fewer are automatically replaced.
2. **Rare category replacement value**: The value that replaces all column values that occur no more often than the rare category protection threshold.
3. **Locale**: The locale used by the text processing models for columns with text containing PII.
