Running on Databricks
Instructions on working with Databricks
The cloud environment does not provide a system console for the labeler to run in. Zingg is therefore run as a Spark Submit job, paired with a Python notebook-based labeler built specifically to run within the Databricks cloud.
Please refer to the Databricks Zingg tutorial for a detailed walkthrough.
Zingg is run as a Spark Submit task with the following parameters:
```json
{
    "settings": {
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "spark_conf": {
                "spark.databricks.cluster.profile": "singleNode",
                "spark.master": "local[*, 4]"
            },
            "aws_attributes": {
                "availability": "SPOT_WITH_FALLBACK",
                "first_on_demand": 1,
                "zone_id": "us-west-2a"
            },
            "node_type_id": "c5d.xlarge",
            "driver_node_type_id": "c5d.xlarge",
            "custom_tags": {
                "ResourceClass": "SingleNode"
            }
        },
        "spark_submit_task": {
            "parameters": [
                "--class",
                "zingg.client.Client",
                "dbfs:/FileStore/jars/e34dca2a_84a4_45fd_a9fd_fe0e1c7e283c-zingg_0_3_0_SNAPSHOT-aa6ea.jar",
                "--phase=findTrainingData",
                "--conf=/dbfs/FileStore/config.json",
                "--license=abc"
            ]
        },
        "email_notifications": {},
        "name": "test",
        "max_concurrent_runs": 1
    }
}
```

The config file for Databricks needs modifications to accept DBFS locations. Here is a sample config that worked:
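The sketch below illustrates the shape such a config can take; it is not the exact file from the source. The field names, schema, and file paths are placeholders, and the keys should be checked against the Zingg configuration documentation for your version. The point being illustrated is the location style: data and output paths use the `dbfs:/` scheme, while the model directory can use the `/dbfs/` fuse mount, matching the `--conf=/dbfs/...` path in the job above.

```json
{
    "fieldDefinition": [
        {
            "fieldName": "fname",
            "matchType": "fuzzy",
            "fields": "fname",
            "dataType": "\"string\""
        },
        {
            "fieldName": "city",
            "matchType": "fuzzy",
            "fields": "city",
            "dataType": "\"string\""
        }
    ],
    "data": [
        {
            "name": "customers",
            "format": "csv",
            "props": {
                "location": "dbfs:/FileStore/customers.csv",
                "delimiter": ",",
                "header": "true"
            },
            "schema": "id string, fname string, city string"
        }
    ],
    "output": [
        {
            "name": "customersUnified",
            "format": "csv",
            "props": {
                "location": "dbfs:/FileStore/zingg/output",
                "delimiter": ",",
                "header": "true"
            }
        }
    ],
    "labelDataSampleSize": 0.5,
    "numPartitions": 4,
    "modelId": 100,
    "zinggDir": "/dbfs/FileStore/zingg/models"
}
```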