Reassign ZINGG ID

Reassign the ZINGG IDs for clusters from the original production model

Zingg Enterprise Feature

Keep Your IDs Stable Across Changes

When you evolve your Zingg setup—whether upgrading your model, migrating infrastructure, changing data schemas, or moving to new platforms—you want to preserve the ZINGG IDs already flowing through your production systems. The reassignZinggId phase maps clusters from your new model back to your original cluster assignments, reusing as many original ZINGG IDs as possible to ensure continuity and minimize downstream disruption.

Why Use Reassign?

  • Model Upgrades: Add nickname support, improve blocking strategies, or tune classifiers without losing IDs

  • Infrastructure Migration: Move from one platform to another (e.g., Spark to Snowflake) while maintaining ID continuity

  • Schema Changes: Update your data schema or field definitions while preserving established IDs

  • Data Migration: Migrate your data to new systems without disrupting operational workflows

  • Operational Continuity: Downstream systems continue to work with stable, consistent IDs

  • Intelligent Mapping: Automatically maximizes the reuse of original ZINGG IDs based on primary key overlap

  • Easy Integration: No need to update downstream systems when you improve your model

How It Works

The reassign phase compares two configurations:

  1. Original Configuration: Your production model with established ZINGG IDs

  2. New Configuration: Your updated model with improved features, new infrastructure, or schema changes

The reassign phase maximizes the assignment of original ZINGG IDs by:

  • Identifying clusters in the new output that overlap with original clusters (via primary key matching)

  • Reassigning the original ZINGG IDs to matching clusters wherever possible

  • Generating new ZINGG IDs only when no match is found in the original clusters
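The steps above can be sketched in a few lines of Python. This is a simplified, greedy illustration of the idea (not Zingg's actual implementation), assuming each cluster is represented as a set of primary keys and each original ID is reused at most once:

```python
from collections import Counter
from itertools import count

def reassign_ids(original, new, first_fresh_id=1000):
    """Map new cluster IDs to original ZINGG IDs via primary-key overlap.

    original, new: dicts of cluster id -> set of primary keys.
    Greedy sketch only; Zingg's real implementation may differ.
    """
    # Invert the original clustering: primary key -> original ZINGG ID.
    pk_to_orig = {pk: cid for cid, pks in original.items() for pk in pks}
    fresh = count(first_fresh_id)        # generator for brand-new IDs
    assigned, used = {}, set()
    for new_cid, pks in new.items():
        # Count which original cluster this new cluster overlaps most.
        votes = Counter(pk_to_orig[pk] for pk in pks if pk in pk_to_orig)
        best = next((cid for cid, _ in votes.most_common() if cid not in used), None)
        if best is not None:
            assigned[new_cid] = best         # reuse the original ZINGG ID
            used.add(best)
        else:
            assigned[new_cid] = next(fresh)  # no overlap: mint a new ID
    return assigned

original = {"A": {1, 2, 3}, "B": {4, 5}}
new = {"x": {1, 2}, "y": {3, 4, 5}, "z": {9}}
print(reassign_ids(original, new))  # {'x': 'A', 'y': 'B', 'z': 1000}
```

Cluster "z" shares no primary keys with any original cluster, so it receives a freshly minted ID, while "x" and "y" keep the original ones.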

Usage

Command Line
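The original command snippet is not shown here. A representative invocation, assuming the standard zingg.sh launcher and placeholder paths, might look like:

```shell
# Hypothetical invocation: script location and config paths are placeholders.
./scripts/zingg.sh --phase reassignZinggId \
    --conf examples/febrl/sparkIncremental/configReassign5M.json \
    --originalZinggId examples/febrl5M/config.json \
    --properties-file config/zingg.conf
```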

Key Parameters:

  • --conf: Points to a wrapper configuration that references your new/updated model config and specifies the output location for reassigned results

  • --originalZinggId: Points to your original production configuration

  • --properties-file: Zingg properties file (optional)

Note: The --conf parameter requires a wrapper configuration (see Configuration Wrapper section below) that includes a transformedOutputPath to specify where the reassigned output should be written. This is different from a plain model configuration file.

Python API

Python users can pass configuration objects directly instead of JSON files.
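No code sample survives in this section; a hypothetical sketch follows (the class and method names are assumptions for illustration, not the documented Enterprise API):

```python
# Hypothetical sketch: names here are illustrative, not the documented API.
from zingg.client import Arguments, ClientOptions, Zingg

args = Arguments.createArgumentsFromJSON(
    "examples/febrl/sparkIncremental/configReassign5M.json", "reassignZinggId")
options = ClientOptions(["--phase", "reassignZinggId",
                         "--originalZinggId", "examples/febrl5M/config.json"])
Zingg(args, options).initAndExecute()
```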

Configuration Files

Configuration Wrapper (examples/febrl/sparkIncremental/configReassign5M.json)

This wrapper configuration points to your new configuration and specifies where to write the reassigned output:
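The wrapper JSON itself is not reproduced here. A hypothetical sketch follows; only transformedOutputPath and its name field are taken from the text, and the other keys and values are assumptions:

```json
{
  "config": "examples/febrl5M/configUpdated.json",
  "transformedOutputPath": {
    "name": "reassignedOutput",
    "format": "csv",
    "props": {
      "location": "/tmp/zingg/reassignedOutput",
      "header": "true"
    }
  }
}
```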

Note:

  • This wrapper only points to the new configuration. The original configuration is specified via the --originalZinggId command-line parameter.

  • The name field in transformedOutputPath can be any arbitrary identifier for the output pipe; Zingg uses it internally to identify this output destination.

New Configuration (examples/febrl5M/configUpdated.json)

Your updated model configuration with changes—for example, switching some fields from fuzzy to exact matching:
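The updated configuration is not reproduced here; an illustrative fieldDefinition fragment (field names are placeholders) that switches a field from fuzzy to exact matching might look like:

```json
"fieldDefinition": [
  {
    "fieldName": "postcode",
    "fields": "postcode",
    "dataType": "string",
    "matchType": "exact"
  }
]
```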

Original Configuration (examples/febrl5M/config.json)

Your production configuration with established ZINGG IDs:
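Correspondingly, the production configuration would still carry the original match types for the same fields, for example:

```json
"fieldDefinition": [
  {
    "fieldName": "postcode",
    "fields": "postcode",
    "dataType": "string",
    "matchType": "fuzzy"
  }
]
```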

Output Format

The output is written to the transformedOutputPath specified in your configuration and contains the same structure as the new model's output, but with reassigned ZINGG IDs that maximize preservation of the original IDs.
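Illustratively, assuming the standard Zingg output columns (z_cluster, z_minScore, z_maxScore) and made-up values, a reassigned result might look like:

```
z_cluster  z_minScore  z_maxScore  fname  lname
A17        0.91        0.99        jane   doe
A17        0.91       0.99        jayne  doe
9001       1.00        1.00        new    record
```

Here A17 is an ID carried over from the original model, while 9001 was newly generated because its cluster had no primary-key overlap with any original cluster.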

Use Cases

1. Model Upgrades

When you add new features like nicknames, better blocking strategies, or improved classifiers, use reassignZinggId to maintain ID continuity.

2. Infrastructure Migration

Moving from Spark to Snowflake, or vice versa? Reassign helps you maintain the same ZINGG IDs across platforms.

3. Schema Evolution

If your data schema changes (new fields, removed fields, different field types), reassignZinggId preserves IDs for records that can still be matched via primary key.

4. Data Migration

Migrating from Databricks to Fabric, or vice versa? When moving data to new systems or consolidating data sources, reassignZinggId maintains ID consistency across the migration.

5. Platform Upgrades

Upgrading Spark versions, Snowflake features, or other platform components? Keep your IDs stable through the upgrade process.
