# Sample datasets

To give you practical examples for testing and analytics, we have selected sample datasets from well-known public repositories. For testing, there is a **multi-table dataset**; for analytics, a **single-table dataset**; and for sequence-based modeling and evaluation, a **two-table sequence dataset**. Each one is a practical starting point for exploring Syntho's features and is described below.

## **Census dataset**

* **Use Case**: Ideal for analytics and AI model training.
* **Description**: Contains demographic information, including age, education, occupation, and income classification.
* **Source**: [UCI Machine Learning Repository - Adult Dataset](https://archive.ics.uci.edu/dataset/2/adult).

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/hyXu75SB4XzcawHGUCyy/image.png" alt=""><figcaption><p>A screenshot of the census dataset</p></figcaption></figure>

Click the link below to download the `.csv` file.

{% file src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/ty1sE2KgaXvn0phrdrWl/census.csv" %}
Census dataset
{% endfile %}
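
Before using the downloaded file, it can help to check its columns and how many values are missing. The following is a minimal sketch using only the Python standard library. It assumes the file is saved locally as `census.csv` and has a header row; the helper name `summarize_csv` is hypothetical, and treating `?` as a missing-value placeholder is an assumption based on the original UCI Adult data, so verify it against the file you download.

```python
import csv
from collections import Counter


def summarize_csv(path):
    """Return (column_names, row_count, missing_counts) for a CSV with a header row."""
    missing = Counter()
    row_count = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        for row in reader:
            row_count += 1
            for name in columns:
                value = row.get(name)
                # Assumption: empty cells and "?" both mean "missing".
                if value is None or value.strip() in ("", "?"):
                    missing[name] += 1
    return columns, row_count, dict(missing)


# Example usage (assumes census.csv is in the current directory):
# columns, row_count, missing = summarize_csv("census.csv")
# print(row_count, "rows;", len(columns), "columns")
# print("Missing values per column:", missing)
```

Only columns with at least one missing value appear in the returned dictionary, which keeps the summary short for wide tables.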

## **COVID-19 dataset**

* **Use Case**: Useful for testing synthetic data generation on multi-table healthcare-related datasets.
* **Description**: Includes tables such as patients, conditions, and encounters, simulated for COVID-19 scenarios.
* **Source**: [Synthea COVID Patients Dataset](https://synthea.mitre.org/downloads).

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/9Ot3lS7SHlsKizNhUXMv/image.png" alt=""><figcaption><p>A screenshot of the patients table</p></figcaption></figure>

Click the link below to download a `.zip` file containing 10k COVID-19 patient records in CSV format. To download the version with 100k patient records, click [here](https://mitre.box.com/shared/static/wk3560f962ozlg7sd2oj1zxk73ayqvm0.zip).

{% file src="broken-reference" %}
COVID-19 dataset with 10k records
{% endfile %}
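
Because this dataset spans several tables, it can be useful to see which CSV files the archive contains and which columns each one has before loading them into a database. The sketch below reads only the header row of every CSV inside the `.zip` without extracting it. The helper name `list_csv_tables` and the archive path in the usage comment are hypothetical; the actual file names and columns depend on the archive you download.

```python
import csv
import io
import zipfile


def list_csv_tables(zip_path):
    """Map each CSV file name in the archive to its list of header columns."""
    tables = {}
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip directories and non-CSV files
            with archive.open(name) as raw:
                # Wrap the binary stream so csv.reader receives text.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                header = next(csv.reader(text), [])
            tables[name] = header
    return tables


# Example usage (the path is a placeholder for wherever you saved the archive):
# for table, columns in list_csv_tables("covid_10k.zip").items():
#     print(table, "->", columns)
```

Reading only the first row keeps this fast even for the larger 100k version, since no table is loaded in full.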

## **Baseball dataset**

* **Use Case**: Suitable for analytics and sequence-based data generation.
* **Description**: Features player statistics and seasonal performance data.
* **Source**: [Lahman Baseball Dataset](https://lahman.r-forge.r-project.org/).

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/dp56MLApOkIrMAfFkBaO/image.png" alt=""><figcaption><p>A screenshot of the players table</p></figcaption></figure>

<figure><img src="https://content.gitbook.com/content/U61B9DqtWCNO3Z30vnjh/blobs/q4Efo9oVca27J2WcqkPS/image.png" alt=""><figcaption><p>A screenshot of the seasons table</p></figcaption></figure>

Click the link below to download the `.zip` file.

{% file src="broken-reference" %}
Baseball dataset
{% endfile %}
