
Model Difference

A comparison of the outputs produced by two different models

Zingg Enterprise Feature

Understand Your Model Changes with Precision

When you evolve your Zingg model—changing field definitions from fuzzy to exact, adding new match types, introducing deterministic matching rules, or tuning blocking strategies—you need to understand the impact. The diff phase provides a detailed comparison between two model outputs, showing you exactly which records and clusters were affected by your changes. Make data-driven decisions about which model performs better for your use case.

Why Use Diff?

  • Model Comparison: Understand the impact of changing field types (fuzzy → exact, adding new match types)

  • Validation & Testing: Verify that model improvements actually produce better results

  • Impact Analysis: See which specific records and clusters changed between model versions

  • Confidence Building: Make informed decisions about deploying new models to production

  • A/B Testing: Compare different model configurations to find the optimal setup

  • Regression Detection: Ensure new model changes don't degrade match quality

  • Cluster Evolution: Track how clusters merge, split, or change as your model evolves

How It Works

The diff phase compares two configurations:

  1. Original Configuration: Your baseline model (e.g., with fuzzy matching)

  2. New Configuration: Your updated model (e.g., with exact matching or new rules)

The diff phase analyzes both outputs and identifies:

  • Records that moved between clusters due to model changes

  • New clusters created in the updated model

  • Merged or split clusters between the two models

Usage

Command Line
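Assuming the diff phase is launched like other Zingg phases through the standard zingg.sh script, an invocation against the febrl example files might look like this (the script path and properties file below are placeholders):

```bash
# Run the diff phase: evaluate the updated model (--conf)
# against the baseline model (--compareTo).
./scripts/zingg.sh --phase diff \
    --conf examples/febrl/sparkIncremental/configdiff.json \
    --compareTo examples/febrl/configBaseline.json \
    --properties-file config/zingg.conf
```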

Key Parameters:

  • --conf: Points to your new/updated configuration (the model you want to evaluate)

  • --compareTo: Points to your original/baseline configuration (the model you're comparing against)

  • --properties-file: Zingg properties file (optional)

Python API

Python users can pass configuration objects directly instead of JSON files; see the complete example shipped as diffExample.py.
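A minimal sketch of what such a script could look like, assuming the Enterprise client follows the open-source zingg.client pattern; the field definition shown and the raw --compareTo option are assumptions, so consult diffExample.py for the exact API:

```python
from zingg.client import *
from zingg.pipes import *

# Build the updated model's arguments in code instead of JSON.
args = Arguments()
fname = FieldDefinition("fname", "string", MatchType.EXACT)  # was FUZZY in the baseline
args.setFieldDefinition([fname])
# ... plus data pipes, output pipes, zinggDir, modelId, etc.

# Select the diff phase and point it at the baseline configuration.
# NOTE: passing --compareTo as a raw option is an assumption here.
options = ClientOptions([ClientOptions.PHASE, "diff",
                         "--compareTo", "examples/febrl/configBaseline.json"])

Zingg(args, options).initAndExecute()
```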

Configuration Files

Configuration Wrapper (examples/febrl/sparkIncremental/configdiff.json)

This wrapper configuration points to your new configuration and specifies where to write the diff output.
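A minimal sketch of the wrapper, assuming the output pipe uses Zingg's standard pipe format; the config path, location, and format values here are placeholders:

```json
{
    "config": "configdiffUpdated.json",
    "transformedOutputPath": {
        "name": "diffOutput",
        "format": "csv",
        "props": {
            "location": "/tmp/zinggDiff/output",
            "delimiter": ",",
            "header": "true"
        }
    }
}
```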

Note:

  • This wrapper only points to the new configuration. The original/baseline configuration is specified via the --compareTo command-line parameter.

  • The name field in transformedOutputPath can be any arbitrary identifier for the output pipe - it's used internally by Zingg to identify this output destination. Here we use diffOutput to clearly distinguish it from the new model's match output.

New Configuration (examples/febrl/configdiffUpdated.json)

Your updated model configuration with changes, for example switching some fields from fuzzy to exact matching.
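For instance, the relevant fieldDefinition entries in the updated configuration might look like this (an illustrative fragment using febrl field names; the rest of the configuration stays the same):

```json
"fieldDefinition": [
    {
        "fieldName": "fname",
        "fields": "fname",
        "dataType": "string",
        "matchType": "exact"
    },
    {
        "fieldName": "city",
        "fields": "city",
        "dataType": "string",
        "matchType": "exact"
    }
]
```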

Original Configuration (examples/febrl/configBaseline.json)

Your baseline model configuration (the one you're comparing against).
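The same fields in the baseline use fuzzy matching (again an illustrative fragment):

```json
"fieldDefinition": [
    {
        "fieldName": "fname",
        "fields": "fname",
        "dataType": "string",
        "matchType": "fuzzy"
    },
    {
        "fieldName": "city",
        "fields": "city",
        "dataType": "string",
        "matchType": "fuzzy"
    }
]
```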

Output Format

The diff output contains only the records that were impacted by the model changes, making it easy to focus on what actually changed.

Output Fields

The diff output includes:

  • Primary Keys of records from both configurations

  • ZINGG_ID_UPDATED: ZINGG ID from the new/updated model

  • ZINGG_ID_ORIGINAL: ZINGG ID from the original/baseline model

This allows you to see side-by-side how each record's cluster assignment changed between the two models.
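For illustration only (hypothetical IDs, not real output), the impacted rows could look like this, where records 101 and 102 left baseline cluster 3 together while record 205 split into a cluster of its own:

```
ID   ZINGG_ID_UPDATED   ZINGG_ID_ORIGINAL
101  7                  3
102  7                  3
205  12                 3
```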

Enhanced Diff Output with Outer Join

Coming soon. We're developing an enhanced diff with even more comprehensive comparison capabilities, including:

  • All records from both models, including those that appear in only one output

  • Enhanced null handling: Intelligent merging of primary key columns when records appear in only one model

Use Cases

1. Field Type Changes

Compare fuzzy vs. exact matching to find the right balance between precision and recall.

2. Deterministic Matching Evaluation

See how adding or changing deterministic matching rules affects your clusters.

3. Blocking Strategy Comparison

Compare DEFAULT vs. WIDER blocking strategies to understand the performance/accuracy trade-off.

4. Model Validation

Validate that model improvements (nicknames, better classifiers, etc.) actually improve results.

5. Regression Testing

Ensure that model updates don't accidentally degrade match quality for specific data patterns.

6. Configuration Tuning

Experiment with different labelDataSampleSize, numPartitions, or other parameters and see the impact.
