Zingg-0.3.4
  • Welcome to Zingg
  • Step-By-Step Guide
    • Installation
      • Docker
        • Sharing custom data and config files
        • Shared locations
        • File read/write permissions
        • Copying Files To and From the Container
      • Installing From Release
        • Single Machine Setup
        • Spark Cluster Checklist
        • Installing Zingg
        • Verifying The Installation
      • Compiling From Source
    • Hardware Sizing
    • Zingg Runtime Properties
    • Zingg Command Line
    • Configuration
      • Configuring Through Environment Variables
      • Data Input and Output
        • Input Data
        • Output
      • Field Definitions
      • Model Location
      • Tuning Label, Match And Link Jobs
      • Telemetry
    • Working With Training Data
      • Finding Records For Training Set Creation
      • Labeling Records
      • Find And Label
      • Using pre-existing training data
      • Updating Labeled Pairs
      • Exporting Labeled Data
    • Building and saving the model
    • Finding the matches
    • Linking across datasets
  • Data Sources and Sinks
    • Zingg Pipes
    • Snowflake
    • JDBC
      • Postgres
      • MySQL
    • Cassandra
    • MongoDB
    • Neo4j
    • Parquet
    • BigQuery
  • Working With Python
  • Running Zingg on Cloud
    • Running on AWS
    • Running on Azure
    • Running on Databricks
  • Zingg Models
    • Pre-trained models
  • Improving Accuracy
    • Ignoring Commonly Occuring Words While Matching
    • Defining Domain Specific Blocking And Similarity Functions
  • Documenting The Model
  • Interpreting Output Scores
  • Reporting bugs and contributing
    • Setting Zingg Development Environment
  • Community
  • Frequently Asked Questions
  • Reading Material
  • Security And Privacy
Powered by GitBook
On this page
  1. Running Zingg on Cloud

Running on Databricks

Instructions on working with Databricks

PreviousRunning on AzureNextZingg Models

Last updated 2 years ago

The cloud environment does not have the system console for the labeler to work. Zingg is run as a Spark Submit Job along with a python notebook-based labeler specially created to run within the Databricks cloud.

Please refer to the for a detailed tutorial.

Zingg is run as a Spark Submit task with the following parameters :

{
"settings": {
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"spark_conf": {
"spark.databricks.cluster.profile": "singleNode",
"spark.master": "local[*, 4]"
},
"aws_attributes": {
"availability": "SPOT_WITH_FALLBACK",
"first_on_demand": 1,
"zone_id": "us-west-2a"
},
"node_type_id": "c5d.xlarge",
"driver_node_type_id": "c5d.xlarge",
"custom_tags": {
"ResourceClass": "SingleNode"
}
},
"spark_submit_task": {
"parameters": [
"--class",
"zingg.client.Client",
"dbfs:/FileStore/jars/e34dca2a_84a4_45fd_a9fd_fe0e1c7e283c-zingg_0_3_0_SNAPSHOT-aa6ea.jar",
"--phase=findTrainingData",
"--conf=/dbfs/FileStore/config.json",
"--license=abc"
]
},
"email_notifications": {},
"name": "test",
"max_concurrent_runs": 1
}
}

The config file for Databricks needs modifications to accept dbfs locations. Here is a sample config that worked :

{	
	"fieldDefinition":[
		{
			"fieldName" : "fname",
			"matchType" : "email",
			"fields" : "fname",
			"dataType": "\"string\"" 
		},
		{
			"fieldName" : "lname",
			"matchType" : "fuzzy",
			"fields" : "lname",
			"dataType": "\"string\"" 
		},
		{
			"fieldName" : "stNo",
			"matchType": "fuzzy",
			"fields" : "stNo",
			"dataType": "\"string\"" 
		},
		{
			"fieldName" : "add1",
			"matchType": "fuzzy",
			"fields" : "add1",
			"dataType": "\"string\"" 
		},
		{
			"fieldName" : "add2",
			"matchType": "fuzzy",
			"fields" : "add2",
			"dataType": "\"string\"" 
		},
		{
			"fieldName" : "city",
			"matchType": "fuzzy",
			"fields" : "city",
			"dataType": "\"string\"" 
		},
		{
			"fieldName" : "state",
			"matchType": "fuzzy",
			"fields" : "state",
			"dataType": "\"string\"" 
		},
		{
			"fieldName" : "dob",
			"matchType": "fuzzy",
			"fields" : "dob",
			"dataType": "\"string\"" 
		},
		{
			"fieldName" : "ssn",
			"matchType": "fuzzy",
			"fields" : "ssn",
			"dataType": "\"string\"" 
		}
		],
		"output" : [{
			"name":"output", 
			"format":"csv", 
			"props": {
				"location": "/febrl120/zinggOutput",
				"delimiter": ",",
				"header":true
			}
		}],
		"data" : [{
			"name":"test", 
			"format":"csv", 
			"props": {
				"location": "/FileStore/test.csv",
				"delimiter": ",",
				"header":false					
			},
			"schema": 
				"{\"type\" : \"struct\",
				\"fields\" : [ 
					{\"name\":\"id\", \"type\":\"string\", \"nullable\":false}, 
					{\"name\":\"fname\", \"type\":\"string\", \"nullable\":true},
					{\"name\":\"lname\",\"type\":\"string\",\"nullable\":true} ,
					{\"name\":\"stNo\", \"type\":\"string\", \"nullable\":true}, 
					{\"name\":\"add1\", \"type\":\"string\", \"nullable\":true},
					{\"name\":\"add2\",\"type\":\"string\",\"nullable\":true} ,
					{\"name\":\"city\", \"type\":\"string\", \"nullable\":true}, 
					{\"name\":\"state\", \"type\":\"string\", \"nullable\":true},
					{\"name\":\"dob\",\"type\":\"string\",\"nullable\":true} ,
					{\"name\":\"ssn\",\"type\":\"string\",\"nullable\":true}
				]
			}"
		}],
		"labelDataSampleSize" : 0.1,
		"numPartitions":5000,
		"modelId": 101,
		"zinggDir": "/models"

}
Databricks Zingg tutorial