Ignoring Commonly Occurring Words While Matching

Guide on omitting different types of words

Common words such as Mr, Pvt, Av, St, and Street carry little distinguishing signal and can confuse matching. These words are called stopwords, and matching is more accurate when they are ignored.
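To make the effect concrete, here is a minimal sketch in plain Python. It does not use Zingg or reflect how Zingg compares records internally; the `strip_stopwords` and `similarity` helpers and the sample names are hypothetical, and `difflib` stands in for a fuzzy matcher purely for illustration.

```python
from difflib import SequenceMatcher

# Illustrative stopword list taken from the examples in this guide.
STOPWORDS = {"mr", "pvt", "av", "st", "street"}


def strip_stopwords(text, stopwords=STOPWORDS):
    """Drop every word that appears in the stopword set (case-insensitive)."""
    kept = [w for w in text.split() if w.lower().strip(".") not in stopwords]
    return " ".join(kept)


def similarity(a, b):
    """Character-level similarity in [0, 1]; a stand-in for a fuzzy matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


raw_a, raw_b = "Mr John Smith", "John Smith Pvt"

# With stopwords left in, the two values look only partially similar.
print(similarity(raw_a, raw_b))

# With stopwords removed, both values reduce to "John Smith".
print(similarity(strip_stopwords(raw_a), strip_stopwords(raw_b)))
```

The second score is 1.0 because both values become identical once the stopwords are dropped, while the first is lower because "Mr" and "Pvt" count as differences.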

To remove stopwords from a field, list them in a CSV file and reference that file from the field's definition in the configuration.

Zingg can recommend stopwords when invoked as follows:

./scripts/zingg.sh --phase recommend --conf <conf.json> --columns <list of columns to generate stop word recommendations>

By default, Zingg extracts 10% of the high-frequency unique words from a dataset. To use a different cutoff, set the following property in the config file:

stopWordsCutoff: <a value between 0 and 1>
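As a rough illustration of what a cutoff like this means, the sketch below ranks unique words by frequency and keeps the top fraction. It is an assumption-laden approximation, not Zingg's actual recommendation logic; the `recommend_stopwords` function, its rounding rule, and the sample addresses are all hypothetical.

```python
import math
from collections import Counter


def recommend_stopwords(values, cutoff=0.1):
    """Return the most frequent unique words, keeping a `cutoff` fraction of them.

    Assumption: words are split on whitespace and compared case-insensitively.
    """
    if not 0 < cutoff <= 1:
        raise ValueError("cutoff must be between 0 and 1")
    counts = Counter(word.lower() for value in values for word in value.split())
    keep = math.ceil(len(counts) * cutoff)  # number of unique words to keep
    return [word for word, _ in counts.most_common(keep)]


addresses = [
    "Main Street", "Oak Street", "Park Street", "Hill Av",
    "Lake Street", "Elm Street", "Pine Street", "Cedar Av",
]

# 10 unique words, so a cutoff of 0.1 keeps the single most frequent one.
print(recommend_stopwords(addresses))              # ['street']
# Raising the cutoff to 0.2 keeps the two most frequent words.
print(recommend_stopwords(addresses, cutoff=0.2))  # ['street', 'av']
```

A higher cutoff sweeps in more words, so the recommended list should still be reviewed before it is used.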

Once you have verified the recommended stopwords, set the stopWords property in the field's JSON definition to the path of the CSV file containing them. Whether you edit the recommended CSV or build one manually, make sure it contains exactly one word per row.

"fieldDefinition":[
   	{
   		"fieldName" : "fname",
   		"matchType" : "fuzzy",
   		"fields" : "fname",
   		"dataType": "\"string\"",
   		"stopWords": "models/100/stopWords/fname.csv"
   	},
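The one-word-per-row requirement can be checked before pointing stopWords at a hand-edited file. The sketch below is a hypothetical helper, not part of Zingg; it assumes a headerless, single-column file, so adjust it if your CSV carries a header or extra columns.

```python
import csv
import os
import tempfile


def invalid_rows(path):
    """Return 1-based row numbers that do not hold exactly one word."""
    bad = []
    with open(path, newline="", encoding="utf-8") as handle:
        for number, row in enumerate(csv.reader(handle), start=1):
            cells = [cell.strip() for cell in row if cell.strip()]
            # Flag blank rows, rows with several cells, and multi-word cells.
            if len(cells) != 1 or len(cells[0].split()) != 1:
                bad.append(number)
    return bad


# Build a small sample file; the third row wrongly holds two words.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("mr\npvt\nmain street\nst\n")
    sample_path = tmp.name

print(invalid_rows(sample_path))  # [3]
os.remove(sample_path)
```

An empty list means every row passed the check.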
