BigQuery


Zingg works seamlessly with Google BigQuery. The properties that must be set are described below.

Two driver jars, namely spark-bigquery-with-dependencies_2.12-0.24.2.jar and gcs-connector-hadoop2-latest.jar, are required to work with BigQuery. To include these BigQuery drivers in the classpath, please configure the runtime properties to include them:

spark.jars=./spark-bigquery-with-dependencies_2.12-0.24.2.jar,./gcs-connector-hadoop2-latest.jar

In addition, the following property needs to be set:

spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem                                                      

If Zingg is run from outside Google Cloud, it requires authentication. Please set the following environment variable to the location of the file containing the service account key. A service account key can be created and downloaded in JSON format from the Google Cloud console.

export GOOGLE_APPLICATION_CREDENTIALS=<path to Google service account key file>

Connection properties for BigQuery as a data source and data sink are given below. If you are curious to know more about how Spark connects to BigQuery, you may look at the Spark BigQuery connector documentation.
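
If you want to sanity check the jars, credentials, and table outside Zingg first, a plain Spark read in Python uses the same options. The snippet below is only a rough sketch: it assumes the two jars above are already on the Spark classpath (for example via the runtime properties), and the key file path, project, and dataset names are placeholders.

# Rough sanity check with plain PySpark, outside Zingg.
# Assumes spark-bigquery-with-dependencies and the GCS connector are already on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-check").getOrCreate()

# Same options as the Zingg "data" pipe below; values are placeholders.
df = (spark.read.format("bigquery")
      .option("credentialsFile", "/path/to/service-account-key.json")
      .option("viewsEnabled", "true")
      .option("table", "myproject.zinggdataset.zinggtest")
      .load())

df.show(5)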

Properties for reading data from BigQuery:

The credentialsFile property should point to the Google service account key file location. This is the same path that is used to set the GOOGLE_APPLICATION_CREDENTIALS variable. The table property should point to the BigQuery table that contains the source data. The viewsEnabled property must be set to true.

    "data" : [{
        "name":"test", 
         "format":"bigquery", 
        "props": {
            "credentialsFile": "/home/work/product/final/zingg-1/mynotification-46566-905cbfd2723f.json",
            "table": "mynotification-46566.zinggdataset.zinggtest",
            "viewsEnabled": true
        }
    }],
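
The same input can also be defined through Zingg's Python API. The sketch below assumes the generic Pipe class and its addProperty method from zingg.pipes, as used for other connectors; the pipe name, key file path, project, and dataset names are illustrative only.

# Rough sketch of the BigQuery input defined via the Python API (values are placeholders).
from zingg.client import Arguments
from zingg.pipes import Pipe

args = Arguments()

# BigQuery input pipe mirroring the JSON "data" block above.
inputPipe = Pipe("test", "bigquery")
inputPipe.addProperty("credentialsFile", "/path/to/service-account-key.json")
inputPipe.addProperty("table", "myproject.zinggdataset.zinggtest")
inputPipe.addProperty("viewsEnabled", "true")

args.setData(inputPipe)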

Properties for writing data to BigQuery:

To write to BigQuery, a Google Cloud Storage bucket needs to be created and assigned to the temporaryGcsBucket property.

    "output" : [{
        "name":"output", 
        "format":"bigquery",
        "props": {
            "credentialsFile": "/home/work/product/final/zingg-1/mynotification-46566-905cbfd2723f.json",
            "table": "mynotification-46566.zinggdataset.zinggOutput",
            "temporaryGcsBucket":"zingg-test",
         }
    }],
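
The output side can be sketched the same way through the Python API, again assuming the generic Pipe class from zingg.pipes; the table and bucket names are placeholders.

# Rough sketch of the BigQuery output defined via the Python API (values are placeholders).
from zingg.client import Arguments
from zingg.pipes import Pipe

args = Arguments()  # in practice, the same Arguments object that holds the input pipe

# BigQuery output pipe mirroring the JSON "output" block above.
outputPipe = Pipe("output", "bigquery")
outputPipe.addProperty("credentialsFile", "/path/to/service-account-key.json")
outputPipe.addProperty("table", "myproject.zinggdataset.zinggOutput")
outputPipe.addProperty("temporaryGcsBucket", "zingg-test")

args.setOutput(outputPipe)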

Notes:

  • A typical service account key file (JSON) looks like the one below.

{
 "type": "service_account",
 "project_id": "mynotification-46566",
 "private_key_id": "905cbfd273ff9205d1cabfe06fa6908e54534",
 "private_key": "-----BEGIN PRIVATE KEY-----CERT.....",
 "client_id": "11143646541283115487",
 "auth_uri": "https://accounts.google.com/o/oauth2/auth",
 "token_uri": "https://oauth2.googleapis.com/token",
 "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
 "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/zingtest%44mynotification-46566.iam.gserviceaccount.com"
}

The library "gcs-connector-hadoop2-latest.jar" can be downloaded from Google and the library "spark-bigquery-with-dependencies_2.12-0.24.2" from the Maven repository.
