Hash

Syntho beta feature

A hash function maps data of arbitrary size to fixed-size values; the resulting values are commonly referred to as hash values. A default hashing approach is specified for each data type, with specific considerations for key and regular columns.

To utilize the new hashing feature for regular columns, follow these steps:

  1. Navigate to Column Settings in your database management interface.

  2. Under Generation Method, select Hashing to enable the hashing functionality for the selected column.

  3. Choose the appropriate data type for the column and ensure that the settings align with the default hashing approaches specified above:

    • For Discrete Values, the Hasty Pudding Cipher algorithm will be used.

    • For Categorical Values, Format-Preserving Encryption (FPE) using the FF3 algorithm will be applied.

    • For Datetime Values, a random offset will be added.

    • For UUIDs, the values will consistently map to new UUIDs.

  4. Save the settings and proceed with the data generation or transformation process.

Default Hashing Approaches by Data Type

  1. Discrete Values:

    • Algorithm: Hasty Pudding Cipher.

    • Behavior: A 1-to-1 mapping is created between all the discrete values in a range. This is used to consistently map the values in the source to new, hashed values.

    • Range: The fallback range for values is within the allowed range for a 32-bit integer, but the actual range depends on the datatype and database support.

      • Fallback Minimum: -2,147,483,647

      • Fallback Maximum: 2,147,483,647

    • Limitation: Negative numbers always hash to negative numbers, and positive numbers to positive numbers. Internally, the sign is encoded separately (using an additional bit): the absolute value is hashed, and the result is multiplied by -1 if the original number was negative. Note that 0 is never hashed.
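
The sign handling described above can be sketched as follows. This is an illustrative sketch only: the product uses the Hasty Pudding Cipher, which is not available in the standard library, so a simple keyed modular permutation stands in for it here to demonstrate the sign logic and the fallback range.

```python
FALLBACK_MAX = 2_147_483_647  # fallback range bound (32-bit signed integer)

def _permute(value: int, key: int = 0x9E3779B1) -> int:
    # Placeholder 1-to-1 mapping on [1, FALLBACK_MAX]; NOT the real cipher.
    # Multiplying by a constant coprime to the (prime) modulus is a bijection.
    return (value * key) % FALLBACK_MAX + 1

def hash_discrete(value: int) -> int:
    if value == 0:
        return 0  # 0 is never hashed
    sign = -1 if value < 0 else 1
    # Hash the absolute value, then restore the original sign.
    return sign * _permute(abs(value))
```

Because the sign is restored after hashing, positive inputs always map to positive outputs and negative inputs to negative outputs, matching the limitation noted above.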

  2. Categorical Values:

    • Algorithm: Format-Preserving Encryption (FPE) using the FF3 algorithm.

    • Reference: FF3 Algorithm

    • Note: The minimum number of input characters depends on the size of the alphabet used; with a typical alphabet this results in a minimum of 4 characters.

  3. Datetime Values:

    • Method: Add a random offset to the original value.
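
A minimal sketch of the datetime approach is shown below. The offset bounds, and whether the offset is drawn per value or per column, are assumptions for illustration and not documented behavior.

```python
import random
from datetime import datetime, timedelta

def hash_datetime(value: datetime, rng: random.Random,
                  max_days: int = 365) -> datetime:
    # Shift the original value by a random offset within +/- max_days.
    offset = timedelta(seconds=rng.randint(-max_days * 86400,
                                           max_days * 86400))
    return value + offset
```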

  4. UUIDs:

    • Method: Consistently map to a new UUID.
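
One way to achieve a consistent UUID mapping is a deterministic, name-based UUID keyed by a private namespace, as sketched below. This is an assumption for illustration; the mechanism the product actually uses is not specified here.

```python
import uuid

# In practice this namespace would be a fixed secret, so the same input
# UUID always maps to the same output UUID across runs.
NAMESPACE = uuid.uuid4()

def hash_uuid(value: uuid.UUID) -> uuid.UUID:
    # uuid5 derives a deterministic UUID from the namespace and input.
    return uuid.uuid5(NAMESPACE, str(value))
```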

Considerations

  • Use the shortcut CTRL + SHIFT + ALT + 0 to open Workspace Default Settings and change the key generation method to hashing by setting the key_generation_method value to hash. Note that this applies the generator across the entire workspace.

  • Alternatively, you can add /global_settings to the end of the workspace URL to open Workspace Default Settings.

  • Oversampling is not permitted when hashing or duplication is set on a key column. For regular columns, this restriction does not apply.

Ordering and Indexing Considerations

To ensure accurate ordering, the source table must have either an index or a primary key. In the absence of these, the application defaults to sorting by the first column of the table. If that column contains duplicate values, the ordering cannot be guaranteed, as it then depends on how the database's sorting algorithm handles duplicates. Adding an index to the source table resolves this issue.

Column Set for "ORDER BY" Clause

Hive only

In the Table Settings panel, a new dropdown field allows users to specify which columns should be used in the "ORDER BY" clause. This feature enables users to define a set of columns that ensure the uniqueness of the returned results for a given table. By selecting the appropriate columns, users can achieve deterministic ordering even in the absence of primary keys or indexes.

  • Order By Dropdown: Located in the Table Settings panel on the right side of the Table/Job Configuration screen, this dropdown lets users choose the columns for the "ORDER BY" clause.

By exposing hashing for regular columns, users gain enhanced flexibility in data processing, ensuring consistency and security across various datatypes while adhering to the specified constraints and behaviors.
