What is subsetting

Database subsetting refers to the practice of creating a smaller, representative subset of a larger database. It allows you to efficiently reduce the size of your destination database while maintaining the data's referential integrity. You can configure how Syntho should generate a subset. For instance, you may set up subsetting to only include 5% of all transactions or all data associated with patients in the Netherlands.

Subsetting can be relevant for several reasons:

  1. Testing: A subset of a database can be used for testing new features, changes or upgrades before implementing them on the production database. This can help identify potential issues or bugs before they affect the entire database.

  2. Development: Developers often require a subset of the database to work on new features or applications without interfering with the production database. This can speed up the development process and reduce the risk of data loss or corruption.

  3. Training: A (synthetic) subset of the database can be used for training purposes, allowing users to learn how to use the database in a safe and controlled environment.

  4. Compliance: Some regulatory requirements or data privacy laws may require the use of a subset of data for specific purposes, such as testing or auditing, to ensure that sensitive information is protected.

Subsetting modes

Subsetting in Syntho uses foreign keys to identify relationships in the data. These relationships are either already present in the source database or can be created using Syntho's foreign keys tool. The process of subsetting uses these relationships to navigate the database and build the subset.

Each table in the source database falls into one of the following categories for subsetting:

  1. Target table: The target table is the starting point for building the subset and provide the initial set of rows. Syntho retrieves data from the target table to then identify information from linked tables. A target table usually contains entities that are associated with most other information in the database, such as patients, customers, or transactions. To safeguard referential integrity, no more than a single target table can be set.

  2. Linked tables: These tables are connected directly or indirectly to a target table.

    • Downstream tables have data that is necessary to maintain referential integrity in the subset and have primary keys that are referenced by foreign keys in related tables.

    • Upstream tables contain data with a foreign key referencing a primary key in the target table and may contain useful information.

    Some related tables may be both downstream and upstream, in which case filtering can only be applied to upstream records. Syntho pulls data from linked tables to preserve referential integrity in the output subset. Links between tables may be direct, such as a target table listing allergies that references a patients table through a foreign key relationship, or indirect, such as a target table referencing a patients table, which in turn references a hospitals table. The subset includes all relevant information from linked tables.

  3. Duplicated tables: A duplicate table contains static values that are used in other tables in the subset. Syntho retrieves all data in a lookup table, regardless of whether or where the values are used. Relationships with duplicate tables are not considered during the subsetting process.

  4. Excluded tables: Tables that are not designated as target tables, linked tables or duplicated tables, are excluded from the subset.

How does Syntho subsetting work

The Syntho subsetting algorithm treats a source database as a direction acyclic graph where each table is a node and the relationships are the directed edges. With this abstraction in mind, we can traverse the whole graph from any point (target table) to all the other connected tables using the topological order.

Once the target table has been copied to the target database, the upstream and downstream process begins where data for the linked tables is copied in such a way that all rows are being referenced correctly such that referential integrity check do not fail. If any duplicate or excluded tables are set in the configuration, these are copied as following, but the referential integrity cannot be guaranteed.

A general view of the subsetting flow can be seen below:

  1. Remove cyclic relationships from database

  2. Copy target table from source to destination database

  3. Copy duplicate tables to destination database

  4. Copy upstream tables to destination database

  5. Copy downstream tables to destination database

Limitations & recommendations

  • The subsetting process takes as input an acyclic directed graph and as such if the source database contains cycles this would not work. Syntho can handle such cases by removing one of the edges that causes the cycle to exist. This will also mean that referential integrity cannot be preserved in the final product. It is recommended to manually remove cycles to get the most representative subset.

  • The subsetting process is very dependent on the configuration of the target table. Although the user can select 5% of the target table, that does not mean the final database will contain 5% of all data. There is a possibility that the 5% of the target table links to all the data in the other tables, meaning that to keep referential integrity all the other records will be copied. To achieve a final subset values close to one’s linkings, an adequate target table needs to be selected.

Caution: The subsetting process is not parallelizable by nature, meaning that for large database the process can take hours, or even a full day as it depends on the reading and writing speeds of the database.

Last updated