©2021 Zingg Labs, Inc.

Python API

Zingg Entity Resolution Python Package

Zingg Python APIs for entity resolution, identity resolution, record linkage, data mastering, and deduplication using ML.

NOTE

Requires Python 3.6+ and Spark 3.5.0; Zingg cannot be executed on other versions.
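The interpreter requirement can be verified before launching a job. A minimal pure-Python sketch (the `meets_minimum` helper is ours, not part of the Zingg API):

```python
import sys

# Zingg's Python client needs Python 3.6 or newer (see the note above)
MIN_PYTHON = (3, 6)

def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the running interpreter satisfies the minimum version."""
    return tuple(version_info[:2]) >= minimum

if not meets_minimum():
    raise RuntimeError(f"Zingg requires Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+")
```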

API Reference

Example API Usage

from zingg.client import *
from zingg.pipes import *

#build the arguments for zingg
args = Arguments()
#set field definitions
fname = FieldDefinition("fname", "string", MatchType.FUZZY)
lname = FieldDefinition("lname", "string", MatchType.FUZZY)
stNo = FieldDefinition("stNo", "string", MatchType.FUZZY)
add1 = FieldDefinition("add1","string", MatchType.FUZZY)
add2 = FieldDefinition("add2", "string", MatchType.FUZZY)
city = FieldDefinition("city", "string", MatchType.FUZZY)
areacode = FieldDefinition("areacode", "string", MatchType.FUZZY)
state = FieldDefinition("state", "string", MatchType.FUZZY)
dob = FieldDefinition("dob", "string", MatchType.FUZZY)
ssn = FieldDefinition("ssn", "string", MatchType.FUZZY)

fieldDefs = [fname, lname, stNo, add1, add2, city, areacode, state, dob, ssn]

args.setFieldDefinition(fieldDefs)
#set the modelid and the zingg dir
args.setModelId("100")
args.setZinggDir("models")
args.setNumPartitions(4)
args.setLabelDataSampleSize(0.5)

#read the input dataset into a CsvPipe and register it in 'args'
#the explicit schema below is only needed when reading from files;
#for an in-memory dataset, use an InMemoryPipe with the input DataFrame instead
schema = "id string, fname string, lname string, stNo string, add1 string, add2 string, city string, areacode string, state string, dob string, ssn string"
inputPipe = CsvPipe("testFebrl", "examples/febrl/test.csv", schema)
args.setData(inputPipe)
outputPipe = CsvPipe("resultFebrl", "/tmp/febrlOutput")

args.setOutput(outputPipe)

options = ClientOptions([ClientOptions.PHASE,"match"])

#Zingg execution for the given phase
zingg = Zingg(args, options)
zingg.initAndExecute()
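Since every column in the example shares the same string type, the DDL schema string above can be generated rather than typed by hand. A small pure-Python sketch (the `ddl_schema` helper is ours, not part of zingg.client or zingg.pipes):

```python
def ddl_schema(columns, default_type="string"):
    """Build a Spark-style DDL schema string, e.g. 'id string, fname string'.

    Hypothetical convenience helper, not part of the Zingg API.
    """
    return ", ".join(f"{col} {default_type}" for col in columns)

# the columns used in the example above
cols = ["id", "fname", "lname", "stNo", "add1", "add2",
        "city", "areacode", "state", "dob", "ssn"]
schema = ddl_schema(cols)
```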

  • zingg.client
    • Arguments — copyArgs(), createArgumentsFromJSON(), createArgumentsFromJSONString(), getArgs(), getModelId(), getZinggBaseModelDir(), getZinggBaseTrainingDataDir(), getZinggModelDir(), getZinggTrainingDataMarkedDir(), getZinggTrainingDataUnmarkedDir(), setArgs(), setColumn(), setData(), setFieldDefinition(), setLabelDataSampleSize(), setModelId(), setNumPartitions(), setOutput(), setStopWordsCutoff(), setTrainingSamples(), setZinggDir(), writeArgumentsToJSON(), writeArgumentsToJSONString()
    • ClientOptions — COLUMN, CONF, EMAIL, LICENSE, LOCATION, MODEL_ID, PHASE, REMOTE, ZINGG_DIR, getClientOptions(), getConf(), getLocation(), getOptionValue(), getPhase(), hasLocation(), setOptionValue(), setPhase()
    • FieldDefinition — getFieldDefinition(), setStopWords(), stringify()
    • Zingg — execute(), executeLabel(), executeLabelUpdate(), getArguments(), getMarkedRecords(), getMarkedRecordsStat(), getMatchedMarkedRecordsStat(), getOptions(), getUnmarkedRecords(), getUnmatchedMarkedRecordsStat(), getUnsureMarkedRecordsStat(), init(), initAndExecute(), processRecordsCli(), processRecordsCliLabelUpdate(), setArguments(), setOptions(), writeLabelledOutput(), writeLabelledOutputFromPandas()
    • ZinggWithSpark
    • Module functions — getDfFromDs(), getGateway(), getJVM(), getPandasDfFromDs(), getSparkContext(), getSparkSession(), getSqlContext(), initClient(), initDataBricksConectClient(), initSparkClient(), parseArguments()
  • zingg.pipes
    • BigQueryPipe — CREDENTIAL_FILE, TABLE, TEMP_GCS_BUCKET, VIEWS_ENABLED, setCredentialFile(), setTable(), setTemporaryGcsBucket(), setViewsEnabled()
    • CsvPipe — setDelimiter(), setHeader(), setLocation()
    • InMemoryPipe — getDataset(), setDataset()
    • Pipe — addProperty(), getPipe(), setSchema(), toString()
    • SnowflakePipe — DATABASE, DBTABLE, PASSWORD, SCHEMA, URL, USER, WAREHOUSE, setDatabase(), setDbTable(), setPassword(), setSFSchema(), setURL(), setUser(), setWarehouse()