Adding Incremental Data
Building a continuously updated identity graph with new, updated, and deleted records
Zingg Enterprise Feature
Rerunning matching on entire datasets is wasteful, and it loses the lineage of matched records against a persistent identifier. With the incremental flow feature in Zingg Enterprise, incremental loads can be matched against existing pre-resolved entities. New and updated records are matched to existing clusters, and new persistent ZINGG_IDs are generated for records that do not find a match. If a record is updated and Zingg Enterprise discovers that it is a better match for another cluster, the record is reassigned. Cluster assignment, merge, and unmerge happen automatically in the flow. Zingg Enterprise also takes care of human feedback on previously matched data, ensuring that approved records are not overridden.
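As a deliberately simplified sketch of the ID assignment described above (the assign helper and its last-name similarity rule are hypothetical stand-ins; Zingg's actual matching is model-based), consider:

# Conceptual sketch only, not Zingg Enterprise code: incoming records that
# match an existing cluster inherit its ZINGG_ID; unmatched records get a
# new persistent ID.
from itertools import count

existing = {101: "thomas george", 102: "sarah mathew"}  # ZINGG_ID -> representative
next_id = count(103)

def assign(record_name: str) -> int:
    """Return the ZINGG_ID of the matching cluster, or mint a new one."""
    for zingg_id, rep in existing.items():
        # Stand-in similarity check; Zingg uses learned fuzzy matching.
        if record_name.split()[-1] == rep.split()[-1]:
            return zingg_id
    new_id = next(next_id)
    existing[new_id] = record_name
    return new_id

print(assign("tom george"))   # matches an existing cluster -> 101
print(assign("peter kelly"))  # no match -> new ZINGG_ID 103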
The incremental phase is run as follows:
./scripts/zingg.sh --phase runIncremental --conf <location to incrementalConf.json>
Example incrementalConf.json:
{
    "config": "config.json",
    "incrementalData": [{
        "name": "customers_incr",
        "format": "csv",
        "props": {
            "path": "test-incr.csv",
            "delimiter": ",",
            "header": false
        },
        "schema": "recId string, fname string, lname string, stNo string, add1 string, add2 string, city string, state string, areacode string, dob string, ssn string"
    }],
    "outputTmp": {
        "name": "customers_incr_temp",
        "format": "csv",
        "props": {
            "location": "/tmp/zinggOutput_febrl_tmp",
            "delimiter": ",",
            "header": true
        }
    }
}

runIncremental can also be triggered using Python by invoking:
./scripts/zingg.sh --run examples/FebrlExample.py
Python Code Example:
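The following is a minimal sketch modeled on the open-source examples/FebrlExample.py. The field definitions, model location, and pipes mirror that example and the incrementalConf.json above; how the incremental input and outputTmp are wired through the Python API is an assumption here, so treat this as a template and refer to the example shipped with your Zingg Enterprise release.

from zingg.client import Arguments, ClientOptions, FieldDefinition, MatchType, Zingg
from zingg.pipes import CsvPipe

# Field definitions must mirror the ones used when the model was trained.
fname = FieldDefinition("fname", "string", MatchType.FUZZY)
lname = FieldDefinition("lname", "string", MatchType.FUZZY)

args = Arguments()
args.setFieldDefinition([fname, lname])
args.setModelId("100")
args.setZinggDir("models")
args.setNumPartitions(4)

# Incremental records to match against the pre-resolved entities
# (assumption: the incremental pipe is set as the data input here).
schema = ("recId string, fname string, lname string, stNo string, add1 string, "
          "add2 string, city string, state string, areacode string, dob string, ssn string")
args.setData(CsvPipe("customers_incr", "test-incr.csv", schema))

# Temporary output location, mirroring outputTmp in incrementalConf.json.
args.setOutput(CsvPipe("customers_incr_temp", "/tmp/zinggOutput_febrl_tmp"))

# Run the incremental phase.
options = ClientOptions([ClientOptions.PHASE, "runIncremental"])
zingg = Zingg(args, options)
zingg.initAndExecute()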