LogoLogo
Go to Syntho.AI
English
English
  • Welcome to Syntho
  • Overview
    • Get started
      • Syntho bootcamp
        • 1. What is Syntho?
        • 2. Introduction data anonymization
        • 3. Connectors & workspace creation
        • 4. PII scan
        • 5. Generators
          • Mockers
          • Maskers
          • AI synthesize
          • Calculated columns
          • Free text de-identification
        • 6. Referential integrity & foreign keys
        • 7. Workspace synchronization & validation
        • 8. Workspace & user management
        • 9. Large workloads​
        • 10. Data pre-processing
        • 11. Continuous Success
      • Prerequisites
      • Sample datasets
      • Introduction to data generators
    • Frequently asked questions
  • Setup Workspaces
    • View workspaces
    • Create a workspace
      • Connect to a database
        • PostgreSQL
        • MySQL / MariaDB
        • Oracle
        • Microsoft SQL Server
        • DB2
        • Databricks
          • Importing Data into Databricks
        • Hive
        • SAP Sybase
        • Azure Data Lake Storage (ADLS)
        • Amazon Simple Storage Service (S3)
      • Workspace modes
    • Edit a workspace
    • Duplicate a workspace
    • Transfer workspace ownership
    • Share a workspace
    • Delete a workspace
    • Workspace default settings
  • Configure a Data Generation Job
    • Configure table settings
    • Configure column settings
      • AI synthesize
        • Sequence model
          • Prepare your sequence data
        • QA report
        • Additional privacy controls
        • Cross-table relationships limitations
      • Mock
        • Text
          • Supported languages
        • Numeric (integer)
        • Numeric (decimal)
        • Datetime
        • Other
      • Mask
        • Text
        • Numeric (integer)
        • Numeric (decimal)
        • Datetime
        • UUID
      • Duplicate
      • Exclude
      • Consistent mapping
      • Calculated columns
      • Key generators
        • Differences between key generators
      • JSON de-identification
    • Manage personally identifiable information (PII)
      • Privacy dashboard
      • Discover and de-identify PII columns
        • Identify PII columns manually
        • Automatic PII discovery with PII scanner
      • Remove columns from PII list
      • Automatic PII discovery and de-identification in free text columns
      • Supported PII & PHI entities
    • Manage foreign keys
      • Foreign key inheritance
      • Add virtual foreign keys
        • Add virtual foreign keys
        • Use foreign key scanner
        • Import foreign keys via JSON
        • Export foreign keys via JSON
      • Delete foreign keys
    • Validate and Synchronize workspace
    • View and adjust generation settings
  • Deploy Syntho
    • Introduction
      • Syntho architecture
      • Requirements
        • Requirements for Docker deployments
        • Requirements for Kubernetes deployments
      • Access Docker images
        • Online
        • Offline
    • Deploy Syntho using Docker
      • Preparations
      • Deploy using Docker Compose
      • Run the application
      • Manually saving logs
      • Updating the application
    • Deploy Syntho using Kubernetes
      • Preparations
      • Deploy Ray using Helm
        • Upgrading Ray CRDs
        • Troubleshooting
      • Deploy Syntho using Helm
      • Validate the deployment
      • Troubleshooting
      • Saving logs
      • Upgrading the applications
    • Manage users and access
      • Single Sign-On (SSO) in Azure
      • Manage admin users
      • Manage non-admin users
    • Logs and monitoring
      • Does Syntho collect any data?
      • Temporary data storage by application
  • Syntho API
    • Syntho REST API
Powered by GitBook
On this page
  • Subsetting modes
  • How does Syntho subsetting work
  • Limitations & recommendations

Was this helpful?

  1. Subsetting

What is subsetting

Coming Soon

Syntho's improved subsetting feature will be re-introduced in the platform soon.

Database subsetting refers to the practice of creating a smaller, representative subset of a larger database. It allows you to efficiently reduce the size of your destination database while maintaining the data's referential integrity. You can configure how Syntho should generate a subset. For instance, you may set up subsetting to only include 5% of all transactions or all data associated with patients in the Netherlands.

Subsetting can be relevant for several reasons:

  1. Testing: A subset of a database can be used for testing new features, changes or upgrades before implementing them on the production database. This can help identify potential issues or bugs before they affect the entire database.

  2. Development: Developers often require a subset of the database to work on new features or applications without interfering with the production database. This can speed up the development process and reduce the risk of data loss or corruption.

  3. Training: A (synthetic) subset of the database can be used for training purposes, allowing users to learn how to use the database in a safe and controlled environment.

  4. Compliance: Some regulatory requirements or data privacy laws may require the use of a subset of data for specific purposes, such as testing or auditing, to ensure that sensitive information is protected.

Subsetting modes

Subsetting in Syntho uses foreign keys to identify relationships in the data. These relationships are either already present in the source database or can be created using Syntho's foreign keys tool. The process of subsetting uses these relationships to navigate the database and build the subset.

Each table in the source database falls into one of the following categories for subsetting:

  1. Target table: The target table is the starting point for building the subset and provide the initial set of rows. Syntho retrieves data from the target table to then identify information from linked tables. A target table usually contains entities that are associated with most other information in the database, such as patients, customers, or transactions. To safeguard referential integrity, no more than a single target table can be set.

  2. Linked tables: These tables are connected directly or indirectly to a target table.

    • Downstream tables have data that is necessary to maintain referential integrity in the subset and have primary keys that are referenced by foreign keys in related tables.

    • Upstream tables contain data with a foreign key referencing a primary key in the target table and may contain useful information.

    Some related tables may be both downstream and upstream, in which case filtering can only be applied to upstream records. Syntho pulls data from linked tables to preserve referential integrity in the output subset. Links between tables may be direct, such as a target table listing allergies that references a patients table through a foreign key relationship, or indirect, such as a target table referencing a patients table, which in turn references a hospitals table. The subset includes all relevant information from linked tables.

  3. Duplicated tables: A duplicate table contains static values that are used in other tables in the subset. Syntho retrieves all data in a lookup table, regardless of whether or where the values are used. Relationships with duplicate tables are not considered during the subsetting process.

  4. Excluded tables: Tables that are not designated as target tables, linked tables or duplicated tables, are excluded from the subset.

How does Syntho subsetting work

The Syntho subsetting algorithm treats a source database as a direction acyclic graph where each table is a node and the relationships are the directed edges. With this abstraction in mind, we can traverse the whole graph from any point (target table) to all the other connected tables using the topological order.

Once the target table has been copied to the target database, the upstream and downstream process begins where data for the linked tables is copied in such a way that all rows are being referenced correctly such that referential integrity check do not fail. If any duplicate or excluded tables are set in the configuration, these are copied as following, but the referential integrity cannot be guaranteed.

A general view of the subsetting flow can be seen below:

  1. Remove cyclic relationships from database

  2. Copy target table from source to destination database

  3. Copy duplicate tables to destination database

  4. Copy upstream tables to destination database

  5. Copy downstream tables to destination database

Limitations & recommendations

  • The subsetting process takes as input an acyclic directed graph and as such if the source database contains cycles this would not work. Syntho can handle such cases by removing one of the edges that causes the cycle to exist. This will also mean that referential integrity cannot be preserved in the final product. It is recommended to manually remove cycles to get the most representative subset.

  • The subsetting process is very dependent on the configuration of the target table. Although the user can select 5% of the target table, that does not mean the final database will contain 5% of all data. There is a possibility that the 5% of the target table links to all the data in the other tables, meaning that to keep referential integrity all the other records will be copied. To achieve a final subset values close to one’s linkings, an adequate target table needs to be selected.

Caution: The subsetting process is not parallelizable by nature, meaning that for large database the process can take hours, or even a full day as it depends on the reading and writing speeds of the database.

Last updated 11 months ago

Was this helpful?