Zingg-0.3.4

Defining Domain Specific Blocking And Similarity Functions

To tune Zingg for more accurate matching with higher recall

Last updated 2 years ago

You can add your own blocking functions, which Zingg will evaluate when building the blocking tree.

The blocking tree is built from the matched records that the user provides during training. At every node, it selects a hash function and the field to apply it to so that the fewest matching pairs are eliminated. Say we have data like this:

| Pair 1   | firstname | lastname |
|----------|-----------|----------|
| Record A | john      | doe      |
| Record B | johnh     | d oe     |

| Pair 2   | firstname | lastname |
|----------|-----------|----------|
| Record A | mary      | ann      |
| Record B | marry     |          |
Let us assume we have the hash function first1char and want to check whether it is a good function to apply to firstname:

| Pair | Record   | Output |
|------|----------|--------|
| 1    | Record A | j      |
| 1    | Record B | j      |
| 2    | Record A | m      |
| 2    | Record B | m      |

Both records in each pair produce the same output, so no pair is eliminated. Hence first1char is a good function.

Now let us try last1char on firstname:

| Pair | Record   | Output |
|------|----------|--------|
| 1    | Record A | n      |
| 1    | Record B | h      |
| 2    | Record A | y      |
| 2    | Record B | y      |

Pair 1 is eliminated because its two records produce different outputs, so last1char is not a good function.

So, first1char(firstname) will be chosen. This brings near-similar records together and, in effect, clusters them so that the full Cartesian join is avoided.
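The selection step in the example above can be sketched in a few lines of Python. This is only an illustration of the "least elimination" idea on the two firstname pairs shown earlier, not Zingg's actual implementation; the helper names `eliminated_pairs` and `pick_hash_function` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

HashFn = Callable[[str], str]

# firstname values of the matched pairs from the example: (Record A, Record B).
MATCHED_PAIRS: List[Tuple[str, str]] = [
    ("john", "johnh"),
    ("mary", "marry"),
]


def first1char(value: str) -> str:
    """Hash a value to its first character."""
    return value[:1]


def last1char(value: str) -> str:
    """Hash a value to its last character."""
    return value[-1:]


def eliminated_pairs(hash_fn: HashFn, pairs: List[Tuple[str, str]]) -> int:
    """Count matched pairs whose two records hash to different outputs."""
    return sum(1 for a, b in pairs if hash_fn(a) != hash_fn(b))


def pick_hash_function(candidates: Dict[str, HashFn],
                       pairs: List[Tuple[str, str]]) -> str:
    """Return the name of the candidate that eliminates the fewest matched pairs."""
    return min(candidates, key=lambda name: eliminated_pairs(candidates[name], pairs))


# Hypothetical candidate set; a real blocking tree considers many more functions and fields.
CANDIDATES: Dict[str, HashFn] = {"first1char": first1char, "last1char": last1char}

if __name__ == "__main__":
    for name, fn in CANDIDATES.items():
        print(name, "eliminates", eliminated_pairs(fn, MATCHED_PAIRS), "pair(s)")
    print("chosen:", pick_hash_function(CANDIDATES, MATCHED_PAIRS))
```

Counting eliminated pairs keeps the sketch short. It does not model how the tree repeats the choice at every node or how block sizes are balanced.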

These business-specific blocking functions go into Hash Functions and must be added to HashFunctionRegistry and the hash functions config.
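The general pattern described here (implement a function, register it under a name, then enable that name in a config) can be illustrated with a small Python sketch. This is not Zingg code and does not reflect the real HashFunctionRegistry API. The registry dict, the `register` and `load_enabled` helpers, the `skuPrefix` function and the config list are all hypothetical names chosen for the illustration.

```python
from typing import Callable, Dict, List

HashFn = Callable[[str], str]

# Hypothetical registry: maps a function name to its implementation.
HASH_FUNCTION_REGISTRY: Dict[str, HashFn] = {}


def register(name: str, fn: HashFn) -> None:
    """Add a hash function to the registry, refusing duplicate names."""
    if name in HASH_FUNCTION_REGISTRY:
        raise ValueError(f"hash function already registered: {name}")
    HASH_FUNCTION_REGISTRY[name] = fn


def sku_prefix(value: str) -> str:
    """Example business-specific hash: the part of a SKU before the first hyphen."""
    return value.split("-", 1)[0]


register("skuPrefix", sku_prefix)

# Hypothetical config: the names of the hash functions that may be used for blocking.
ENABLED_HASH_FUNCTIONS: List[str] = ["skuPrefix"]


def load_enabled(names: List[str]) -> Dict[str, HashFn]:
    """Resolve configured names against the registry, failing on unknown names."""
    missing = [n for n in names if n not in HASH_FUNCTION_REGISTRY]
    if missing:
        raise KeyError(f"not in registry: {missing}")
    return {n: HASH_FUNCTION_REGISTRY[n] for n in names}


if __name__ == "__main__":
    enabled = load_enabled(ENABLED_HASH_FUNCTIONS)
    print(enabled["skuPrefix"]("AB12-0042"))
```

Keeping the registry and the config separate means a function can exist in the code without being considered for blocking until its name is listed.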

Also, for similarity, you can define your own measures. Each dataType has predefined features; for example, the String fuzzy type is configured for Affine and Jaro.

You can define your own comparisons and use them.
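As a rough idea of what a domain-specific similarity measure might look like, the Python sketch below scores two name strings after removing whitespace and case differences, so that values such as "doe" and "d oe" compare as identical. It is an assumption-laden illustration: it uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in measure, not Affine or Jaro, and the function names are hypothetical rather than part of Zingg.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase the value and drop all whitespace."""
    return "".join(value.lower().split())


def name_similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    for left, right in [("doe", "d oe"), ("john", "johnh"), ("john", "mary")]:
        print(left, right, round(name_similarity(left, right), 3))
```

A measure like this returns a continuous score rather than a yes/no answer, which is the kind of feature a match classifier can weigh alongside other field comparisons.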