
@2021 Zingg Labs, Inc.

Ignoring Commonly Occurring Words While Matching

Common words such as Mr, Pvt, Av, St, and Street add no distinguishing signal and confuse matching. These words are called stopwords, and matching is more accurate when they are ignored.
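To make the effect concrete, here is a minimal sketch that compares two address pairs with and without stopwords. It uses a plain token-overlap (Jaccard) score chosen purely for illustration; it is not Zingg's matching logic, and the helper names and sample addresses are hypothetical.

```python
# Illustrative only: a simple token-overlap (Jaccard) score, not Zingg's matcher.
STOPWORDS = {"mr", "pvt", "av", "st", "street"}  # the examples named in the text


def tokens(text, stopwords=frozenset()):
    """Lowercase, split on whitespace, strip '.' and ',', drop stopwords."""
    words = (w.strip(".,").lower() for w in text.split())
    return {w for w in words if w and w not in stopwords}


def jaccard(a, b, stopwords=frozenset()):
    """Share of distinct tokens that the two strings have in common."""
    ta, tb = tokens(a, stopwords), tokens(b, stopwords)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


different_places = ("12 Baker Street", "98 Elm Street")
same_place = ("12 Baker Street", "12 Baker St")

# Without stopword removal, "Street" makes two unrelated addresses look alike
# and "Street" vs "St" makes the same address look different.
raw_different = jaccard(*different_places)
raw_same = jaccard(*same_place)

# With stopwords ignored, only the distinguishing tokens are compared.
clean_different = jaccard(*different_places, stopwords=STOPWORDS)
clean_same = jaccard(*same_place, stopwords=STOPWORDS)

print(raw_different, clean_different)
print(raw_same, clean_same)
```

In this toy setup the unrelated pair drops to zero overlap and the matching pair rises to full overlap once the stopwords are ignored, which is the behaviour the paragraph above describes.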

To remove stopwords from a field, set the stopWords property in that field's definition to the path of a CSV file that lists them.

Zingg can recommend stopwords for you. Run the recommend phase:

./scripts/zingg.sh --phase recommend --conf <conf.json> --columns <list of columns to generate stop word recommendations>

By default, Zingg extracts 10% of the high-frequency unique words from a dataset. To select a different fraction, set the following property in the config file:

stopWordsCutoff: <a value between 0 and 1>
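The idea of a frequency cutoff can be sketched as follows. This is an approximation for intuition only: how Zingg counts, ranks, and rounds internally is not described here, so the function name `recommend_stopwords`, the rounding up, and the alphabetical tie-breaking are all assumptions.

```python
import math
from collections import Counter


def recommend_stopwords(values, cutoff=0.1):
    """Return the most frequent `cutoff` fraction of the unique words in `values`.

    Hypothetical helper: rounding up and tie-breaking are assumptions,
    not documented Zingg behaviour.
    """
    if not 0 < cutoff <= 1:
        raise ValueError("cutoff must be greater than 0 and at most 1")
    counts = Counter(word.lower() for value in values for word in value.split())
    keep = math.ceil(len(counts) * cutoff)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:keep]]


names = ["Mr John Smith", "Mr Alan Jones", "Mr Ravi Kumar", "Ms Ann Lee"]

# Ten unique words, so a cutoff of 0.1 keeps the single most frequent one.
default_pick = recommend_stopwords(names)
# A larger cutoff widens the selection.
wider_pick = recommend_stopwords(names, cutoff=0.2)

print(default_pick)
print(wider_pick)
```

A higher cutoff produces a longer candidate list, so it is worth reviewing the recommended words before using them, since rarer words are more likely to carry real matching signal.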

Once you have reviewed the recommended stopwords, set the stopWords property of the field definition to the path of the CSV file containing them. Whether you edit the CSV or build it manually, make sure it contains exactly one word per row.
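If you build or edit the file by hand, a small script can enforce the one-word-per-row rule. The sketch below assumes the file holds only the word in each row, with no header and no extra columns; if your recommended file looks different, adjust the check accordingly. The function names are hypothetical.

```python
import csv
import os
import tempfile


def write_stopwords(path, words):
    """Write one word per row, skipping blanks and case-insensitive duplicates."""
    seen = set()
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for word in words:
            cleaned = word.strip()
            if cleaned and cleaned.lower() not in seen:
                seen.add(cleaned.lower())
                writer.writerow([cleaned])


def rows_with_problems(path):
    """Return (line_number, row) for rows that do not hold exactly one word."""
    problems = []
    with open(path, newline="", encoding="utf-8") as handle:
        for number, row in enumerate(csv.reader(handle), start=1):
            cells = [cell.strip() for cell in row if cell.strip()]
            if len(cells) != 1 or len(cells[0].split()) != 1:
                problems.append((number, row))
    return problems


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "fname.csv")
    write_stopwords(path, ["Mr", "mr", " Pvt ", "", "St"])
    with open(path, encoding="utf-8") as handle:
        written = handle.read().splitlines()
    problems = rows_with_problems(path)

print(written)
print(problems)
```

Running `rows_with_problems` on an edited file before pointing the configuration at it catches blank rows, rows with several words, and rows with stray extra columns.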

"fieldDefinition":[
    {
        "fieldName": "fname",
        "matchType": "fuzzy",
        "fields": "fname",
        "dataType": "string",
        "stopWords": "models/100/stopWords/fname.csv"
    }
]
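Before starting a run, it can help to confirm that every stopWords path in the configuration points at a real file. The sketch below is a hypothetical pre-flight check and not part of Zingg. It assumes relative paths are resolved against a base directory you pass in; how Zingg itself resolves them is not stated here.

```python
import json
import os


def missing_stopword_files(config, base_dir="."):
    """List (fieldName, path) pairs whose stopWords file cannot be found."""
    missing = []
    for field in config.get("fieldDefinition", []):
        path = field.get("stopWords")
        # Fields without a stopWords entry are skipped.
        if path and not os.path.isfile(os.path.join(base_dir, path)):
            missing.append((field.get("fieldName"), path))
    return missing


# The field definition from the example above, wrapped in a JSON object.
config = json.loads("""
{
  "fieldDefinition": [
    {
      "fieldName": "fname",
      "matchType": "fuzzy",
      "fields": "fname",
      "dataType": "string",
      "stopWords": "models/100/stopWords/fname.csv"
    }
  ]
}
""")

for field_name, path in missing_stopword_files(config):
    print(f"stopWords file for field '{field_name}' not found: {path}")
```

The check only reports missing files; it does not validate their contents.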