Zingg
  • Welcome To Zingg
  • Step-By-Step Guide
    • Installation
      • Docker
        • Sharing Custom Data And Config Files
        • Shared Locations
        • File Read/Write Permissions
        • Copying Files To And From The Container
      • Installing From Release
        • Single Machine Setup
        • Spark Cluster Checklist
        • Installing Zingg
        • Verifying The Installation
      • Enterprise Installation for Snowflake
        • Setting up Zingg
        • Snowflake Properties
        • Match Configuration
        • Running Asynchronously
        • Verifying The Installation
      • Compiling From Source
    • Hardware Sizing
    • Zingg Runtime Properties
    • Zingg Command Line
    • Configuration
      • Configuring Through Environment Variables
      • Data Input And Output
        • Input Data
        • Output
      • Field Definitions
      • User Defined Mapping Match Types
      • Deterministic Matching
      • Pass Thru Data
      • Model Location
      • Tuning Label, Match And Link Jobs
      • Telemetry
    • Working With Training Data
      • Finding Records For Training Set Creation
      • Labeling Records
      • Find And Label
      • Using Pre-existing Training Data
      • Updating Labeled Pairs
      • Exporting Labeled Data
    • Verification of Blocking Model
    • Building And Saving The Model
    • Finding The Matches
    • Adding Incremental Data
    • Linking Across Datasets
    • Explanation of Models
    • Approval of Clusters
    • Combining Different Match Models
    • Model Difference
    • Persistent ZINGG ID
  • Data Sources and Sinks
    • Zingg Pipes
    • Databricks
    • Snowflake
    • JDBC
      • Postgres
      • MySQL
    • AWS S3
    • Cassandra
    • MongoDB
    • Neo4j
    • Parquet
    • BigQuery
    • Exasol
  • Working With Python
    • Python API
  • Running Zingg On Cloud
    • Running On AWS
    • Running On Azure
    • Running On Databricks
    • Running on Fabric
  • Zingg Models
    • Pre-Trained Models
  • Improving Accuracy
    • Ignoring Commonly Occuring Words While Matching
    • Defining Domain Specific Blocking And Similarity Functions
  • Documenting The Model
  • Interpreting Output Scores
  • Reporting Bugs And Contributing
    • Setting Up Zingg Development Environment
  • Community
  • Frequently Asked Questions
  • Reading Material
  • Security And Privacy
Powered by GitBook

@2021 Zingg Labs, Inc.

On this page

Was this helpful?

Edit on GitHub

Interpreting Output Scores

For each field(FNAME, NAME, EMAIL ..), Zingg computes multiple features and feeds them to a classifier. These features are typically different ways to compare strings. We consider string lengths and their differences, character differences, and which characters actually differed.

  • The shorter string pair ABCD and ABCE will be less similar than ABCDEF and ABCEEF.

  • Common typos, for e.g. m instead of n, are penalized less severely and will be scored higher than a word with a replaced by non-obvious character b.

  • The more changes you need to make to the strings to make them match, the less similar they will be.

  • Differences in the middle are penalized more than prefixes and suffixes.

No individual feature is perfect, but the whole is greater than the sum of its parts. That's where the strength of Zingg matching and the accuracy comes from.

The classifier finds the best curve to represent the data(features) and gives a final score. This score is dependent on the individual features but is not a linear function but a curve fit. The deterioration in similarity does happen as scores get lower, but we try very hard to find as many possible matches, and hence you will continue to get good quality matches at lower scores. However, the number of good matches at lower scores will be lesser than those at higher scores.

The threshold is automatically chosen so that you can pick up most of the results, with a balance toward accuracy(what is a right match) and recall(not missing out on potential matches). Unlike traditional systems, we do not want the user to worry too much about the cut-off, We optimize the cutoff and hence you may see records with scores less than the conventional 0.5.

A few things to keep in mind while interpreting the scores:

  • Matching is transitive, so if record A matches Record B and Record B matches C, we put records A, B, and C in the same cluster. That is how different records in a cluster get matched at various confidence levels to the rest and all these records showed up together.

  • Our recommendation is to keep a threshold below x% of the max score as suspect records - manual review/to be visited later/flow through but will lessen confidence. x will depend on how accurate you find the results and how much you want to control the outcome.

  • Keep cluster size above 4 or 5 for inspection. You could keep it irrespective of the score, or look at only those clusters whose z_minScore is 0. This will depend on what the results look like to you.

PreviousDocumenting The ModelNextReporting Bugs And Contributing

Last updated 5 months ago

Was this helpful?