Improving Accuracy
Ignoring Commonly Occurring Words While Matching

Common words like Mr, Pvt, Av, St, Street, etc. do not add any differentiating signal and can confuse matching. These words are called stopwords, and matching is more accurate when they are ignored.

Zingg can recommend stopwords for a column by invoking:

./scripts/zingg.sh --phase recommend --conf <conf.json> --column <name of column to generate stopword recommendations>

The generated stopwords are stored at models/100/stopWords/columnName. This gives you the list of stopwords along with their frequencies.

By default, Zingg extracts the top 10% of high-frequency unique words from the dataset. To use a different fraction, set the following property in the config file under the field for which you want stopword recommendations:

stopWordsCutoff: <a value between 0 and 1>
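To make the cutoff semantics concrete, here is a minimal Python sketch of selecting the top fraction of unique words by frequency, the way a stopWordsCutoff of 0.1 selects the top 10%. This is illustrative only, not Zingg's actual implementation, and select_stopwords is a hypothetical helper:

```python
from collections import Counter

def select_stopwords(tokens, cutoff=0.1):
    """Return the top `cutoff` fraction of unique words, ranked by
    frequency. Illustrative sketch of how a cutoff between 0 and 1
    maps to a share of the vocabulary; not Zingg's internal code."""
    counts = Counter(tokens)
    n_keep = max(1, int(len(counts) * cutoff))
    return [word for word, _ in counts.most_common(n_keep)]

tokens = "mr john street mr st avenue mr street st mr".split()
# 5 unique words; a cutoff of 0.4 keeps the top 2 by frequency.
print(select_stopwords(tokens, cutoff=0.4))
```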

Once you have verified the recommended stopwords, configure them by setting the JSON attribute stopWords to the path of the CSV file containing them. When editing the CSV or building it manually, ensure that it contains one word per row. Also give it a header row, such as StopWords; Zingg ignores the header by default and works on the remaining rows.

"fieldDefinition":[
   {
      "fieldName" : "fname",
      "matchType" : "fuzzy",
      "fields" : "fname",
      "dataType": "string",
      "stopWords": "models/100/stopWords/fname.csv"
   },
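For reference, a stopwords CSV referenced by the stopWords attribute above might look like this (hypothetical contents; the header row is ignored, and each remaining row holds one word):

```
StopWords
mr
mrs
pvt
st
street
```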

If the stopwords file is built manually, it may contain multiple columns; Zingg uses only the first column by default.

To recommend stopwords in Zingg Enterprise Snowflake, run:

./scripts/zingg.sh --phase recommend --conf <conf.json> --properties-file <path to Snowflake properties file> --column <name of column to generate stopword recommendations>

The generated stopwords are stored in the table zingg_stopWords_columnName_modelId, which lists the stopwords and their frequencies for the given column.
