Workspace default settings

The Workspace default settings menu lets you configure default parameters for a workspace. These settings ensure consistent behavior for data processing, privacy protection, and synthetic data generation. The available options are explained in detail below.

Access the workspace default settings

Note: you must be an Owner or Editor of the workspace to access the Workspace default settings.

  1. Create or open the workspace.

  2. Use the shortcut CTRL + SHIFT + ALT + 0 to open the Workspace default settings menu. If this shortcut is reserved on your system, add /global_settings to the end of the workspace URL instead.
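
The URL fallback from step 2 can be sketched as a small helper. The workspace URL below is a made-up placeholder, and the helper name is hypothetical; only the /global_settings suffix comes from this page.

```python
def global_settings_url(workspace_url: str) -> str:
    """Append the /global_settings suffix to a workspace URL."""
    # Strip any trailing slash so the result never contains "//".
    return workspace_url.rstrip("/") + "/global_settings"


# Hypothetical workspace URL, used for illustration only.
example_url = global_settings_url("https://syntho.example.com/workspace/123/")
print(example_url)
```

The sketch assumes the workspace URL carries no query string or fragment; if yours does, add the suffix to the path portion by hand.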

How to modify settings

  1. Access the Workspace default settings menu.

  2. Modify the required values directly.

  3. Save changes to apply them to the workspace.

Configuration options

The sections below list each default setting and what it does:

General settings

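| Parameter | Default | Possible Values | Description |
| --- | --- | --- | --- |
| seed_value | 42 | Any integer (e.g., 0, 42, 1234) | Sets the seed used by generators with consistent mapping enabled and by inherently consistent generators (e.g., hash). When use_seed is true, changing seed_value rotates the mapping so the same input maps to a different output, which lets you periodically change synthetic outputs for security. |
| use_seed | false | true or false (Boolean) | If true, all generators that support consistent mapping (mock, mask, hash) use the specified seed_value. If false, no seed is used for consistent mapping, so changes to seed_value have no effect on the mapping scheme. |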
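
To make the effect of seed_value and use_seed on consistent mapping concrete, here is a minimal sketch. It is not Syntho's implementation: the keyed hash (HMAC-SHA256), the function name, and the fixed fallback key are all assumptions chosen only to show the behavior described above.

```python
import hashlib
import hmac


def consistent_token(value: str, seed_value: int = 42, use_seed: bool = True) -> str:
    """Map a value to a stable token; the seed acts as the mapping key."""
    # Assumption: without a seed, a fixed key is used, so seed_value is ignored.
    key = str(seed_value).encode() if use_seed else b"unseeded"
    digest = hmac.new(key, value.encode(), hashlib.sha256).hexdigest()
    return digest[:12]


# Same input and same seed always give the same output.
same_seed = consistent_token("alice", 42) == consistent_token("alice", 42)
# Changing the seed rotates the mapping.
rotated = consistent_token("alice", 42) != consistent_token("alice", 43)
# With use_seed disabled, changing seed_value has no effect.
ignored = consistent_token("alice", 42, use_seed=False) == consistent_token("alice", 43, use_seed=False)
print(same_seed, rotated, ignored)
```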

PII model settings

  • Specifies the natural language processing (NLP) models for PII detection and mockers.

  • Example: language code "en" with model name "en_core_web_md".

  • Supported languages: English, Dutch, German, Japanese.

For more information, see "Configure to use other NLP models (limited support)".

| Parameter | Default | Possible Values | Description |
| --- | --- | --- | --- |
| nlp_engine_name | "spacy" | "spacy" or any other supported NLP engine name | Determines which NLP engine is used for PII scanning and detection. |
| models | N/A | A list of dictionaries with "lang_code" and "model_name", e.g. [{ "lang_code": "en", "model_name": "en_core_web_md" }] | Each item defines a specific language and its associated model for PII detection. |
| gpu | false | true or false (Boolean) | Toggles GPU acceleration. When true, models that support GPU run faster but require a compatible GPU setup. |
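
As an illustration, the three parameters above could be combined into one settings object like the one below. The key names and example values come from the table; the overall nesting and the check_pii_model_settings helper are assumptions, so treat this as a sketch and not as the exact format the application stores.

```python
import json

# Example values taken from the table above.
pii_model_settings = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_md"}],
    "gpu": False,
}


def check_pii_model_settings(settings: dict) -> list[str]:
    """Return a list of problems found in a settings dict (empty if none)."""
    problems = []
    if not isinstance(settings.get("nlp_engine_name"), str):
        problems.append("nlp_engine_name must be a string")
    if not isinstance(settings.get("gpu"), bool):
        problems.append("gpu must be true or false")
    models = settings.get("models")
    if not isinstance(models, list):
        problems.append("models must be a list")
    else:
        for index, model in enumerate(models):
            for key in ("lang_code", "model_name"):
                if not isinstance(model, dict) or not isinstance(model.get(key), str):
                    problems.append(f"models[{index}] needs a string {key}")
    return problems


print(json.dumps(pii_model_settings, indent=2))
print(check_pii_model_settings(pii_model_settings))
```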

Initialization and data handling

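| Parameter | Default | Possible Values | Description |
| --- | --- | --- | --- |
| initialization_mode | "SCRATCH" | "SCRATCH", "APPEND", or "READ_ONLY" | Defines how the workspace is initialized. "SCRATCH" starts an empty workspace; "APPEND" adds data to existing tables; "READ_ONLY" prevents any modifications. |
| key_generation_method | "duplicate" | "generate", "duplicate", "hash" | Determines the method for generating keys. "generate" creates new keys; "duplicate" copies source keys; "hash" applies a hash function. |
| n_parallel_pipeline_processes | 1 | Any integer (e.g., 1, 2, -1 for all CPUs) | Controls the number of column-processing jobs (fitting, transforming, inverse-transforming) that run in parallel. Higher values can speed up processing but use more system resources. |
| default_n_training_rows | 100000 | Any positive integer (e.g., 10, 1000, 100000) | Sets the default number of rows used to train synthetic data models. If the input dataset exceeds this number, only the specified number of rows is used for training (unless otherwise configured). |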
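
The sketch below shows one way to read n_parallel_pipeline_processes and default_n_training_rows. Both helper names are hypothetical, and taking the first N rows is an assumption: this page does not say how the training rows are selected.

```python
import os


def resolve_parallel_processes(n_parallel_pipeline_processes: int = 1) -> int:
    """Translate the setting into a concrete worker count."""
    if n_parallel_pipeline_processes == -1:
        # -1 means "all CPUs"; fall back to 1 if the count is unknown.
        return os.cpu_count() or 1
    if n_parallel_pipeline_processes < 1:
        raise ValueError("expected a positive integer or -1")
    return n_parallel_pipeline_processes


def cap_training_rows(rows: list, default_n_training_rows: int = 100_000) -> list:
    """Keep at most default_n_training_rows rows (assumption: the first N)."""
    return rows[:default_n_training_rows]


workers = resolve_parallel_processes(-1)
training_rows = cap_training_rows(list(range(250_000)))
print(workers, len(training_rows))
```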

Privacy control defaults

| Parameter | Default | Possible Values | Description |
| --- | --- | --- | --- |
| default_sample_noise_ratio | 0.001 | Any positive numeric value (float). 0 < ratio ≤ 1: relative noise; > 1: absolute std. dev. | Specifies the level of noise added to synthetic data. Between 0 and 1, noise is added as a relative ratio; above 1, it is treated as an absolute standard deviation. |
| default_min_sample_size | 5 | Any positive integer value | Minimum sample size used for model training. |
| default_cardinality_threshold | 10 | Any integer ≥ 1 (e.g., 5, 10, 20) | Any category in a categorical column with occurrences below this threshold is considered "rare" and gets replaced. |
| default_rare_category_replacement | "*" | Any string (e.g., "*" or "other") | Placeholder for rare categories to preserve privacy. |
| default_clip_threshold | 0 | Any numeric value (integer or float). 0: no clipping; positive number: outlier limit. | Limits extreme outliers in numeric columns by "clipping" values above/below a certain threshold. |
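
The following sketch illustrates two of these defaults: rare-category replacement and the two interpretations of the noise ratio. The function names are hypothetical, and reading "relative" as a fraction of the column's standard deviation is an assumption, since this page does not define what the ratio is relative to.

```python
from collections import Counter


def replace_rare_categories(values, cardinality_threshold=10, replacement="*"):
    """Replace categories that occur fewer than cardinality_threshold times."""
    counts = Counter(values)
    return [v if counts[v] >= cardinality_threshold else replacement for v in values]


def noise_std(sample_noise_ratio: float, column_std: float) -> float:
    """Turn the noise ratio into a standard deviation for added noise."""
    if sample_noise_ratio <= 0:
        raise ValueError("the ratio must be positive")
    if sample_noise_ratio <= 1:
        # Assumption: relative noise scales with the column's own spread.
        return sample_noise_ratio * column_std
    # Above 1, the value is used directly as an absolute standard deviation.
    return sample_noise_ratio


cities = ["Amsterdam"] * 12 + ["Zwolle"] * 3
masked = replace_rare_categories(cities)
print(Counter(masked))
print(noise_std(0.5, column_std=20.0), noise_std(3.0, column_std=20.0))
```

With the default threshold of 10, "Amsterdam" (12 occurrences) is kept and "Zwolle" (3 occurrences) is replaced with the placeholder.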

Text processing

| Parameter | Default | Possible Values | Description |
| --- | --- | --- | --- |
| default_text_processor_model_settings | No single value | Same format as pii_model_settings (language models, engine, GPU) | Specifies NLP models for advanced text processing tasks (non-PII or general text analytics). |
| default_textpii_parallel_jobs | 2 | Any integer ≥ 1; -1 to use all available processors | Defines how many parallel jobs are used when scanning text for PII. Increasing the number of jobs speeds up scanning but uses more CPU resources. |
| default_textpii_scan_batch_size | 1000 | Any integer ≥ 1 (e.g., 100, 1000, 5000) | Batch size for PII detection in text columns. Larger batches can be faster but may consume more memory. |
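
The batch size trade-off can be pictured with a simple batching helper. This is a generic sketch, not the scanner's code; iter_batches is a hypothetical name, and the rows are placeholder strings.

```python
from itertools import islice


def iter_batches(texts, batch_size=1000):
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 2,500 placeholder rows split with the default batch size of 1000.
rows = [f"free text row {i}" for i in range(2500)]
batch_sizes = [len(batch) for batch in iter_batches(rows)]
print(batch_sizes)
```

Only one batch is held in memory at a time, which is why a larger batch size can raise memory use while reducing per-batch overhead.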

Sequence model parameters

| Parameter | Default | Possible Values | Description |
| --- | --- | --- | --- |
| default_max_sequence_length | 10000 | Any integer ≥ 1 (e.g., 100, 1000, 10000) | Specifies the maximum sequence length for sequential data generation or processing. |
| default_end_of_sequence_token | -123456789.98765433 | Any numeric token unlikely to appear in real data | A special marker denoting the end of a sequence, ensuring it is not confused with real data values. |
| default_long_sequence_threshold | 10 | Any integer ≥ 1 (e.g., 10, 100) | Limits the length of data sequences used in training by adjusting the longest sequences to the length of the Nth sequence. This helps prevent large sequences from overwhelming memory or computational resources. |
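
One possible reading of default_long_sequence_threshold is sketched below. It assumes that "the Nth sequence" means the Nth longest sequence, that adjusting means truncating from the end, and that default_max_sequence_length acts as an additional upper bound; none of these details are confirmed on this page, and the function name is hypothetical.

```python
def cap_sequence_lengths(sequences, long_sequence_threshold=10, max_sequence_length=10_000):
    """Truncate sequences longer than the Nth longest sequence."""
    if not sequences:
        return []
    lengths = sorted((len(s) for s in sequences), reverse=True)
    # If there are fewer than N sequences, fall back to the shortest one.
    n = min(long_sequence_threshold, len(lengths))
    cap = min(lengths[n - 1], max_sequence_length)
    return [s[:cap] for s in sequences]


# Five sequences with lengths 100, 50, 8, 5 and 3; N = 3 gives a cap of 8.
sequences = [list(range(n)) for n in (100, 50, 8, 5, 3)]
capped = cap_sequence_lengths(sequences, long_sequence_threshold=3)
print([len(s) for s in capped])
```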

Optimization and advanced settings

| Parameter | Default | Possible Values | Description |
| --- | --- | --- | --- |
| default_ray_memory_optimization | true | true or false (Boolean) | When true, the system explicitly releases idle Ray workers between jobs, reducing memory usage. When false, workers remain alive, reducing overhead for frequent runs. |
| default_fast_executemany | false | true or false (Boolean) | Enables fast execution for bulk inserts. |
| default_drop_indexes | false | true or false (Boolean) | Temporarily drops indexes before inserting synthetic data and re-creates them afterward. This often speeds up inserts, but rebuilding indexes can be time-consuming for large tables. |
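
The drop-and-recreate pattern behind default_drop_indexes can be sketched with an in-memory SQLite database. The table, index, and function names are hypothetical, and the sketch shows the general technique only, not how the application handles your destination database.

```python
import sqlite3


def bulk_insert_with_dropped_index(conn: sqlite3.Connection, rows) -> None:
    """Drop the index, bulk insert the rows, then rebuild the index."""
    conn.execute("DROP INDEX IF EXISTS idx_customers_name")
    # One batched insert call instead of one statement per row.
    conn.executemany("INSERT INTO customers (id, name) VALUES (?, ?)", rows)
    # Rebuilding once at the end avoids updating the index on every insert.
    conn.execute("CREATE INDEX idx_customers_name ON customers (name)")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_customers_name ON customers (name)")

bulk_insert_with_dropped_index(conn, [(i, f"name_{i}") for i in range(1000)])

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
index_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'index' AND name = 'idx_customers_name'"
).fetchone()[0]
print(row_count, index_count)
```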

Other

| Parameter | Default | Possible Values | Description |
| --- | --- | --- | --- |
| default_locale | "en" | Supports various locales, including "en", "nl", "de", "ja" | Sets the default locale for language-based processing, such as date parsing or random text generation. |
| default_order_by_nr_columns | [3, 1] | A list of integers (e.g., [3, 1], [1, 2, 3]) | Defines the order in which columns are processed or modeled, which can be relevant for preserving data order in AI-powered generation or for certain backends. |
| default_max_pending_tasks | 5 | Any positive integer | Defines the number of tables that can be processed in parallel when using ranked scheduling. Increasing the value can improve performance through greater concurrency, but also increases memory usage. Adjust this setting gradually: start with the default, monitor system performance, and tune based on available memory and database connection limits to keep operation balanced and stable. |
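
The concurrency limit described for default_max_pending_tasks can be illustrated with a bounded worker pool. This is a generic sketch with hypothetical names: it does not model ranked scheduling, and the sleep call stands in for the real per-table work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def process_tables(table_names, max_pending_tasks=5):
    """Process tables with at most max_pending_tasks running at once."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def process(name):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # placeholder for reading, generating, and writing a table
        with lock:
            running -= 1
        return name

    # The pool size caps how many tables are in flight at any moment.
    with ThreadPoolExecutor(max_workers=max_pending_tasks) as pool:
        done = list(pool.map(process, table_names))
    return done, peak


tables = [f"table_{i}" for i in range(20)]
done, peak = process_tables(tables, max_pending_tasks=5)
print(len(done), peak)
```

Each in-flight table holds memory and typically a database connection, which is why raising the limit trades memory and connections for throughput.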


Last updated 7 days ago

