When you evolve your Zingg model—changing field definitions from fuzzy to exact, adding new match types, introducing deterministic matching rules, or tuning blocking strategies—you need to understand the impact. The diff phase provides a detailed comparison between two model outputs, showing you exactly which records and clusters were affected by your changes, so you can make data-driven decisions about which model performs better for your use case.
Why Use Diff?
Model Comparison: Understand the impact of changing field types (fuzzy → exact, adding new match types)
Validation & Testing: Verify that model improvements actually produce better results
Impact Analysis: See which specific records and clusters changed between model versions
Confidence Building: Make informed decisions about deploying new models to production
A/B Testing: Compare different model configurations to find the optimal setup
Regression Detection: Ensure new model changes don't degrade match quality
Cluster Evolution: Track how clusters merge, split, or change as your model evolves
How It Works
The diff phase compares two configurations:
Original Configuration: Your baseline model (e.g., with fuzzy matching)
New Configuration: Your updated model (e.g., with exact matching or new rules)
The diff phase analyzes both outputs and identifies:
Records that moved between clusters due to model changes
New clusters created in the updated model
Merged or split clusters between the two models
Usage
Command Line
Key Parameters:
--conf: Points to your new/updated configuration (the model you want to evaluate)
--compareTo: Points to your original/baseline configuration (the model you're comparing against)
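As a sketch, the invocation looks like the following, assuming the standard zingg.sh launcher and a phase named diff (the script path and config file paths are illustrative; the config files are described below):

```shell
# Compare the updated model (--conf) against the baseline model (--compareTo)
./scripts/zingg.sh --phase diff \
  --conf examples/febrl/configdiffUpdated.json \
  --compareTo examples/febrl/configBaseline.json
```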
This wrapper configuration points to your new configuration and specifies where to write the diff output.
Note:
This wrapper only points to the new configuration. The original/baseline configuration is specified via the --compareTo command-line parameter.
The name field in transformedOutputPath can be an arbitrary identifier for the output pipe - it's used internally by Zingg to identify this output destination. Here we use diffOutput to clearly distinguish it from the new model's match output.
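A wrapper of roughly this shape pairs the new configuration with a diffOutput pipe. Apart from transformedOutputPath and its name field, which the note above describes, the key names and values here are assumptions for illustration, not Zingg's confirmed schema:

```json
{
  "config": "examples/febrl/configdiffUpdated.json",
  "transformedOutputPath": {
    "name": "diffOutput",
    "format": "csv",
    "props": {
      "location": "/tmp/zingg/diffOutput",
      "delimiter": ",",
      "header": "true"
    }
  }
}
```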
New Configuration (examples/febrl/configdiffUpdated.json)
Your updated model configuration with changes—for example, switching some fields from fuzzy to exact matching:
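An illustrative fragment of such a change (not a complete Zingg configuration - the data, model, and output sections are omitted, and the field names follow the febrl example), where dob has been switched from FUZZY to EXACT:

```json
{
  "fieldDefinition": [
    { "fieldName": "fname", "fields": "fname", "dataType": "string", "matchType": "FUZZY" },
    { "fieldName": "lname", "fields": "lname", "dataType": "string", "matchType": "FUZZY" },
    { "fieldName": "dob",   "fields": "dob",   "dataType": "string", "matchType": "EXACT" }
  ]
}
```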
Original Configuration (examples/febrl/configBaseline.json)
Your baseline model configuration (the one you're comparing against):
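The corresponding baseline fragment (same caveats as above - an illustrative sketch, not a complete configuration) keeps dob as FUZZY:

```json
{
  "fieldDefinition": [
    { "fieldName": "fname", "fields": "fname", "dataType": "string", "matchType": "FUZZY" },
    { "fieldName": "lname", "fields": "lname", "dataType": "string", "matchType": "FUZZY" },
    { "fieldName": "dob",   "fields": "dob",   "dataType": "string", "matchType": "FUZZY" }
  ]
}
```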
Output Format
The diff output contains only the records that were impacted by the model changes, making it easy to focus on what actually changed.
Output Fields
The diff output includes:
Primary Keys of records from both configurations
ZINGG_ID_UPDATED: ZINGG ID from the new/updated model
ZINGG_ID_ORIGINAL: ZINGG ID from the original/baseline model
This allows you to see side-by-side how each record's cluster assignment changed between the two models.
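As a quick illustration of how to read this output, the snippet below builds a tiny made-up diff file with the columns described above and summarizes it; the file name, primary key column, and data are hypothetical:

```shell
# Hypothetical 3-row sample of a diff output file: each row is a record
# whose cluster assignment differs between the two models.
printf 'ID,ZINGG_ID_UPDATED,ZINGG_ID_ORIGINAL\n101,5,3\n102,8,3\n103,7,2\n' > diff_sample.csv

# Count the impacted records (skip the header row)
awk -F, 'NR > 1 { n++ } END { print n " impacted records" }' diff_sample.csv
```

Here, for instance, records 101 and 102 started in the same baseline cluster (3) but were split into clusters 5 and 8 by the updated model.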
Enhanced Diff Output with Outer Join
Coming Soon. We're developing an enhanced diff with more comprehensive comparison capabilities, including:
All records from both models, including those that appear in only one output
Enhanced null handling: Intelligent merging of primary key columns when records appear in only one model
Use Cases
1. Field Type Changes
Compare fuzzy vs. exact matching to find the right balance between precision and recall.
2. Deterministic Matching Evaluation
See how adding or changing deterministic matching rules affects your clusters.
3. Blocking Strategy Comparison
Compare DEFAULT vs. WIDER blocking strategies to understand the performance/accuracy trade-off.
4. Model Validation
Validate that model improvements (nicknames, better classifiers, etc.) actually improve results.
5. Regression Testing
Ensure that model updates don't accidentally degrade match quality for specific data patterns.
6. Configuration Tuning
Experiment with different labelDataSampleSize, numPartitions, or other parameters and see the impact.
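For example, two candidate configurations might differ only in parameters like these (labelDataSampleSize and numPartitions are real Zingg configuration keys; the values are illustrative). Run match with each variant, then diff the two outputs to see the impact:

```json
{
  "labelDataSampleSize": 0.1,
  "numPartitions": 8
}
```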