@2021 Zingg Labs, Inc.

Zingg Models

Last updated 6 months ago

Zingg learns two models from the data.

1. Blocking Model

One fundamental problem with scaling data mastering is that the number of comparisons increases quadratically as the number of input records increases.
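To make the quadratic growth concrete: comparing every record with every other record requires n(n-1)/2 comparisons for n records. The short sketch below only illustrates that arithmetic; the function name `possible_pairs` is made up for this example and is not part of Zingg.

```python
def possible_pairs(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Ten times more records means roughly a hundred times more comparisons.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {possible_pairs(n):>13,} possible pairs")
```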

Zingg learns a clustering/blocking model that indexes near-similar records together. As a result, Zingg does not compare every record with every other record. Typically, Zingg performs only 0.05-1% of the comparisons in the full problem space.
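The general idea of blocking can be sketched in a few lines of plain Python. This is not Zingg's implementation: Zingg learns its blocking model from data, whereas the sketch assumes a hand-written, hypothetical key (the first three letters of the name). The sample records and the names `blocking_key` and `candidate_pairs` are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Made-up sample records, for illustration only.
records = [
    {"id": 1, "name": "Thomas Clifford", "city": "sydney"},
    {"id": 2, "name": "Thomas Cliford", "city": "sydney"},
    {"id": 3, "name": "Jackson Eglinton", "city": "perth"},
    {"id": 4, "name": "Jacson Eglinton", "city": "perth"},
    {"id": 5, "name": "Rachel Wong", "city": "sydney"},
]


def blocking_key(record):
    # Hypothetical hand-written key; Zingg learns its blocking functions instead.
    return record["name"][:3].lower()


def candidate_pairs(records):
    # Group record ids by blocking key.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record["id"])
    # Only records that share a block are paired up for comparison.
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


pairs = candidate_pairs(records)
total = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {total}: {pairs}")
```

With these five records, only two of the ten possible pairs are compared. A poorly chosen key would also drop true matches, which is one reason to learn the blocking model rather than hand-code it.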

2. Similarity Model

The similarity model helps Zingg predict which record pairs match. To scale to larger datasets, it is run only on records within the same block/cluster. The similarity model is a classifier that predicts the similarity of records that are not exactly the same but could belong together.
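As a rough illustration of scoring candidate pairs, the sketch below uses a fixed threshold on a string-similarity ratio from Python's standard `difflib` module. This is a stand-in, not Zingg's method: Zingg's similarity model is a classifier learned from labeled training data. The threshold value `0.85`, the function names, and the sample names are assumptions made for this example.

```python
from difflib import SequenceMatcher

# Assumed cut-off for this sketch; a trained classifier would learn the decision boundary.
MATCH_THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str) -> bool:
    # Stand-in for a learned classifier: a single fixed threshold on one feature.
    return similarity(a, b) >= MATCH_THRESHOLD


# Made-up pairs, imagined as coming out of the same block.
candidates = [
    ("Thomas Clifford", "Thomas Cliford"),
    ("Thomas Clifford", "Rachel Wong"),
]

for left, right in candidates:
    print(f"{left!r} vs {right!r}: score={similarity(left, right):.3f}, match={is_match(left, right)}")
```

A real similarity model combines several such features across multiple fields, which is why it needs training data rather than a hand-picked threshold.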

To build these models, training data is needed. Zingg comes with an interactive learner to rapidly build training sets.

Figure: Data mastering at scale
Figure: Fuzzy matching comparisons
Figure: Shows records and asks the user to mark yes, no, or can't say on the CLI.