# AI synthesize

AI synthesize is built for one job: learning patterns from a single entity table.

### When to use AI synthesize

Use AI synthesize when all of these are true:

1. Your input is a single entity table.
2. One row describes one entity.
3. Rows are independent, and row order does not matter.
4. You need realistic statistical patterns, not exact record recovery.
5. You want a synthetic feature dataset for ML or analytics.
6. You want more rows that follow the source distribution.

### When not to use AI synthesize

Do not use AI synthesize for these cases:

1. You need multi-table logic, joins, or cross-table consistency.
2. You need sequence or time-series behavior where order carries meaning.
3. You must preserve rare events, edge cases, or low-frequency patterns.
4. You need hard business rules to hold with 100% certainty.
5. You need reconciliation, regression assertions, or 1:1 traceability.
6. You expect extra synthetic rows to create new real-world information.

{% hint style="warning" %}
AI synthesize is often misunderstood as a general-purpose data generator.

It is not designed to solve every data generation case.

It learns dominant patterns from one entity table.

It does not guarantee full relational logic, exact edge cases, or new signal that is absent from the source.

If you need strict rules or multi-table behavior, use [Mock](/configure-a-data-generation-job/configure-column-settings/mockers.md), [Mask](/configure-a-data-generation-job/configure-column-settings/mask.md), and [Calculated columns](/configure-a-data-generation-job/configure-column-settings/calculated-columns.md), or reshape the source into a single entity table first.
{% endhint %}

## Apply AI synthesize

1. Open your **workspace**.
2. From the **Main hub** or **Table view** tab, select the column where you want to apply a generator.
3. Under **Column parameters** > **Generator**, select **AI synthesize** to let Syntho's machine learning (ML) models synthesize the data in your tables.
4. Set the relevant AI synthesize parameters.
5. Select **Confirm**.

<figure><img src="/files/UlV98TNQHvJzuGcnxr2b" alt="" width="563"><figcaption><p>Selecting generators in column parameters</p></figcaption></figure>

## Preparing your data

Before you use AI synthesize, make sure your data is suitable for synthesis.

### Entity tables

Syntho expects your data to be stored in **entity tables** that satisfy the following:

* To minimize privacy risks and improve the training algorithm's generalization ability, as a rule of thumb, a minimum **column-to-row ratio of 1:500** is recommended. For example, if your source table has 6 columns, it should contain a minimum of 3000 rows.
* Each entity is described in one row.
* Each row can be treated **independently.**\
  The order of the rows does not convey any information. The contents of one row also do not affect other rows.
* Avoid column names that contain **privacy-sensitive information**, such as `patient_a_medications`, `patient_b_medications`, and so on. Instead, use a single patient column that holds the names. This prevents patient names from being exposed in metadata or from bypassing rare category protection (e.g., there is a `patient_a` column, but this patient appears only five times in the whole dataset).
* Remove columns that are **derived directly from other columns**. For example, you may have a `net_amount` column that is derived from the `gross_amount` and `taxes` columns. For categorical columns, there could be hierarchical relationships, such as a redundant `Treatment category` column referring to a `Treatment` column. Removing such redundant columns will simplify the modeling process and will lead to higher quality synthetic data.

<figure><img src="/files/1F5E2DgaFCqm3tORDCiG" alt=""><figcaption><p>Example of an entity table (each row describes an individual patient and can be treated independently)</p></figcaption></figure>
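The 1:500 rule of thumb above is simple arithmetic, and it can be checked before a job is configured. The sketch below is illustrative only: the function names are hypothetical and are not part of the Syntho platform.

```python
def min_rows_for_columns(n_columns: int, rows_per_column: int = 500) -> int:
    """Minimum row count implied by a 1:rows_per_column column-to-row ratio."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    return n_columns * rows_per_column


def meets_ratio(n_columns: int, n_rows: int, rows_per_column: int = 500) -> bool:
    """Return True when the table has at least rows_per_column rows per column."""
    return n_rows >= min_rows_for_columns(n_columns, rows_per_column)


# The example from the text: 6 columns call for at least 3000 rows.
print(min_rows_for_columns(6))
print(meets_ratio(n_columns=6, n_rows=2999))
```

A table that falls short of the ratio can still be synthesized; the ratio is a recommendation rather than a hard limit.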

The Syntho platform supports a wide variety of data types. Under the hood, Syntho uses an encoding scheme where each data type is mapped to one of the following encoding types.

| Encoding type                                                                                                  | Description                                  |
| -------------------------------------------------------------------------------------------------------------- | -------------------------------------------- |
| [Discrete](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation.md#discrete)       | Numerical counts (e.g. number of visits)     |
| [Continuous](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation.md#continuous)   | Continuous values (e.g. weight, temperature) |
| [Categorical](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation.md#categorical) | Predefined values (e.g. blood type, country) |
| [Datetime](/configure-a-data-generation-job/configure-column-settings/ai-powered-generation.md#datetime)       | Timestamps and dates (e.g. created at)       |

### Discrete

Syntho uses a discrete encoding type to synthesize numerical values that have a countable number of values between any two values. For example, the number of customer complaints or the number of flaws or defects.

### Continuous

To synthesize numerical values that have an infinite number of values between any two values, such as weight and height, Syntho uses a continuous encoding type.

### Categorical

A categorical column has one of a fixed number of possible values. These variables, like the blood type of a person (i.e., `A, B, AB or O`), have a fixed set of categories. Categorical encoding prevents random values (for instance, `M, X or Z`) from appearing in your synthetic dataset.

Under **Encoding type >** [**Advanced settings**](#advanced-column-settings), the [**Rare category protection** **settings**](#rare-category-protection) appear. Use them to protect rare categories, which could otherwise be used to re-identify outliers in the synthetic data.

{% hint style="info" %}
**Note**: The categorical encoding type is the **default fallback encoding type** used by Syntho. This means that any database types that are unknown by Syntho will automatically be encoded as categorical.
{% endhint %}
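The fallback behavior can be pictured as a lookup with a default. The sketch below is a hypothetical illustration: the database type names and their assignments are assumptions, not Syntho's actual mapping. Only the categorical default reflects the note above.

```python
# Hypothetical lookup table: these database type names and their assignments
# are assumptions for illustration, not Syntho's actual mapping.
KNOWN_TYPES = {
    "integer": "discrete",
    "bigint": "discrete",
    "float": "continuous",
    "numeric": "continuous",
    "date": "datetime",
    "timestamp": "datetime",
    "varchar": "categorical",
}


def encoding_for(db_type: str) -> str:
    """Return the encoding type for a database type, defaulting to categorical."""
    return KNOWN_TYPES.get(db_type.strip().lower(), "categorical")


print(encoding_for("INTEGER"))   # a known type in this sketch
print(encoding_for("geometry"))  # unknown here, so it falls back to categorical
```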

### Datetime

The **Datetime** encoding type is used for values that contain a date component, a time component, or both.

By using this encoding type, Syntho is able to synthesize these values and generate dates and times that are statistically valid and representative.

Syntho supports all date and datetime data types for the [**Syntho connectors**](/setup-workspaces/create-a-workspace/connect-to-a-database.md).

#### Limitations

* Datetime columns support precision up to milliseconds. Nanosecond precision is not supported.
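This page does not state what happens to sub-millisecond digits in the source data. If your source stores nanosecond timestamps, one option is to reduce them to millisecond precision yourself before synthesis. The sketch below assumes timestamps are stored as non-negative integer nanoseconds since the Unix epoch; the function names are hypothetical.

```python
from datetime import datetime, timezone

NS_PER_MS = 1_000_000


def truncate_ns_to_ms(epoch_ns: int) -> int:
    """Drop sub-millisecond digits from a nanosecond epoch timestamp."""
    return (epoch_ns // NS_PER_MS) * NS_PER_MS


def ns_to_datetime_ms(epoch_ns: int) -> datetime:
    """Convert a nanosecond epoch timestamp to a UTC datetime with millisecond precision."""
    epoch_ms = epoch_ns // NS_PER_MS
    seconds, millis = divmod(epoch_ms, 1000)
    # Build the datetime from whole seconds, then set the milliseconds explicitly
    # to avoid floating-point rounding.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=millis * 1000
    )


print(ns_to_datetime_ms(1_700_000_000_123_456_789).isoformat(timespec="milliseconds"))
```

Truncation is used here rather than rounding so that a value never moves forward into the next millisecond.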

## Rare category protection

Following the privacy-by-design principle, Syntho automatically replaces rare values in columns encoded as categorical with a user-defined value.

Replacing rare categories helps prevent sensitive values from leaking into the synthetic data.

* **Rare category protection threshold**: Column values that occur this many times or fewer are automatically replaced.
* **Rare category replacement value**: The value that replaces every column value occurring no more often than the rare category protection threshold.

Under **Column parameters > Encoding type**, select **Advanced settings** to adjust the **Rare category protection threshold** and **Rare category replacement value**.

By default, the **rare category protection threshold** is 10. This means that all column values that occur 10 times or fewer are automatically replaced by the replacement value.

By default, the **rare category replacement value** is an asterisk (**\***).

<figure><img src="/files/mJHZOk9Pe2XYkl36S7nv" alt="" width="563"><figcaption><p>Advanced settings for a rare category</p></figcaption></figure>
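The replacement rule can be sketched in a few lines. This is a minimal illustration of the rule as described above, using the default threshold of 10 and the default replacement value of an asterisk. It is not Syntho's implementation, and the function name is hypothetical.

```python
from collections import Counter


def protect_rare_categories(values, threshold=10, replacement="*"):
    """Replace every value that occurs `threshold` times or fewer."""
    counts = Counter(values)
    return [replacement if counts[v] <= threshold else v for v in values]


# "A" occurs 11 times and is kept; "B" (10 times) and "C" (once) are replaced.
column = ["A"] * 11 + ["B"] * 10 + ["C"]
protected = protect_rare_categories(column)
print(Counter(protected))
```

Note that a value occurring exactly as often as the threshold is still replaced, because the rule is "times or fewer".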

## Advanced settings

### Advanced generator settings

In the right panel, go to **Table settings** and scroll down to **Advanced settings** to view and adjust generator-level settings. Depending on the job configuration, a generator is applied to one or more columns.

You can adjust the following advanced generator settings:

1. **Maximum rows used for training**: The maximum number of rows to be used for training. Using fewer rows can speed up the process. Leave this value at None to use all rows for training.
2. **Take random sample:**
   * **On**: takes a random sample of rows used for training.
   * **Off**: takes the top rows as defined in the database.
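The interaction between the two settings can be sketched as follows. This is an illustration of the behavior described above under the assumption that rows are held in an in-memory list. The function and parameter names are hypothetical, and the `seed` argument exists only to make the sketch reproducible.

```python
import random


def select_training_rows(rows, max_rows=None, take_random_sample=True, seed=None):
    """Pick the rows used for training according to the two settings."""
    rows = list(rows)
    # None means no limit: train on every row.
    if max_rows is None or max_rows >= len(rows):
        return rows
    if take_random_sample:
        # On: draw max_rows distinct rows at random.
        return random.Random(seed).sample(rows, max_rows)
    # Off: take the first max_rows rows in their stored order.
    return rows[:max_rows]


rows = list(range(100))
print(select_training_rows(rows, max_rows=5, take_random_sample=False))
print(len(select_training_rows(rows, max_rows=5, take_random_sample=True, seed=0)))
```

When the stored order of a table correlates with its contents (for example, rows sorted by date), taking the top rows can skew what the model learns, which is one reason to prefer a random sample.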

### Advanced column settings

Select **Advanced settings** under **Encoding type** to view and adjust column-level settings.

You can adjust the following advanced column settings, depending on the selected encoding type:

#### Discrete | Continuous | Datetime

1. **Clipping threshold:** Sets the floor and ceiling of a column to the *`Nth`* lowest and highest values, where *`N`* is the clipping threshold. Values are clipped so that they do not fall below the floor or exceed the ceiling.
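Clipping to the Nth lowest and highest values can be sketched as below. This is an illustration, not Syntho's implementation. It assumes that duplicates count as separate positions when ranking values, which this page does not specify, and the function name is hypothetical.

```python
def clip_to_nth_extremes(values, n):
    """Clip values to the range [Nth lowest, Nth highest]."""
    # Require the floor to be at or below the ceiling.
    if n < 1 or 2 * n > len(values) + 1:
        raise ValueError("n must be between 1 and (len(values) + 1) // 2")
    ordered = sorted(values)
    floor, ceiling = ordered[n - 1], ordered[-n]
    return [min(max(v, floor), ceiling) for v in values]


# With n = 2 the floor is 2 and the ceiling is 4, so 1 and 100 are pulled in.
print(clip_to_nth_extremes([1, 2, 3, 4, 100], n=2))
```

The same comparison-based logic applies to any orderable values, so it covers discrete, continuous, and datetime columns alike.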

#### Categorical | Text containing PII

1. **Rare category protection threshold**: Column values that occur this many times or fewer are automatically replaced.
2. **Rare category replacement value**: The value that replaces every column value occurring no more often than the rare category protection threshold.
3. **Locale**: The locale used by the text processing models for columns with text containing PII.

