@2021 Zingg Labs, Inc.

Step-By-Step Guide

Instructions on how to install and use Zingg

Step 1: Install

Installation instructions for Docker, as well as for the GitHub release, are in the Installation section. If you need to build from source or compile for a different flavor of Spark, see Compiling From Source.

Step 2: Plan For Hardware

Decide your hardware based on the performance numbers (see Hardware Sizing).

Step 3: Build The Config For Your Data

Zingg needs a configuration file that defines the data and what kind of matching is needed. You can create the configuration file by following the instructions in the Configuration section.
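As a rough illustration of what such a configuration covers, the sketch below assembles a config as a Python dict and serializes it to JSON. The key names (fieldDefinition, data, output, modelId, zinggDir), the match type values, the file paths, and the helper build_config are assumptions made for this example; check the Configuration section for the exact schema your Zingg version expects.

```python
import json


def build_config(input_path, output_path, fields):
    """Assemble a Zingg-style configuration dict.

    All key names below are assumptions for illustration; verify them
    against the Configuration section before use.
    """
    return {
        # One entry per input column, with the kind of matching wanted.
        "fieldDefinition": [
            {
                "fieldName": name,
                "fields": name,
                "dataType": "string",
                "matchType": match_type,
            }
            for name, match_type in fields
        ],
        # Where the records to be matched are read from.
        "data": [
            {
                "name": "input",
                "format": "csv",
                "props": {"location": input_path, "header": True},
            }
        ],
        # Where the matched output is written.
        "output": [
            {
                "name": "output",
                "format": "csv",
                "props": {"location": output_path, "header": True},
            }
        ],
        "modelId": "100",      # hypothetical model identifier
        "zinggDir": "models",  # hypothetical model directory
    }


# Hypothetical columns and paths, purely for illustration.
config = build_config(
    "customers.csv",
    "/tmp/zinggOutput",
    [("fname", "fuzzy"), ("lname", "fuzzy"), ("ssn", "exact")],
)
config_json = json.dumps(config, indent=2)
print(config_json)
```

Writing config_json to a file gives a starting point that can then be edited by hand as the schema or match types change.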

Step 4: Create Training Data

Zingg builds a new set of models (blocking and similarity) for every new schema definition (columns and match types). This means running the findTrainingData and label phases multiple times to build the training dataset from which Zingg will learn. You can read more in Working With Training Data.
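The repeated findTrainingData and label rounds can be sketched as a list of commands to run in order. The script path ./scripts/zingg.sh, the --phase and --conf flags, and the file name config.json are assumptions here; see Zingg Command Line for the invocation that applies to your installation. The commands are only printed, not executed, since labeling is an interactive step.

```python
import shlex

# Assumed location of the Zingg launcher script; adjust to your install.
ZINGG_SCRIPT = "./scripts/zingg.sh"


def phase_command(phase, conf="config.json"):
    """Return the argument list for running one Zingg phase (assumed flags)."""
    return [ZINGG_SCRIPT, "--phase", phase, "--conf", conf]


def training_round_commands(rounds, conf="config.json"):
    """Alternate findTrainingData and label for the given number of rounds."""
    commands = []
    for _ in range(rounds):
        commands.append(phase_command("findTrainingData", conf))
        commands.append(phase_command("label", conf))
    return commands


# Print the commands for two rounds so they can be run by hand.
for cmd in training_round_commands(2):
    print(shlex.join(cmd))
```

How many rounds are needed depends on the data; the number 2 above is only a placeholder.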

Step 5: Build & Save The Model

The training data from Step 4 is used to train Zingg and to build and save the models. This is done by running the train phase. Read more in Building And Saving The Model.

Step 6: Voila, Let's Match!

As long as your input columns and the field types are not changing, the same model should work and you do not need to build a new model. If you change the match type, you can continue to use the training data and add more labeled pairs on top of it.

It's now time to apply the model to your data. This is done by running the match phase if you are matching within a single source, or the link phase if you are linking multiple sources. You can read more in Finding The Matches and Linking Across Datasets.
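The choice between the two phases can be summarized in a few lines. This is a minimal sketch: the helper phases_after_labeling is hypothetical and only encodes the rule stated above (train first, then match for one source or link for several).

```python
def phases_after_labeling(num_sources):
    """Return the phases to run once training data exists.

    Hypothetical helper: 'train' builds and saves the model, then 'match'
    is used for a single source and 'link' for multiple sources.
    """
    if num_sources < 1:
        raise ValueError("at least one data source is required")
    final_phase = "match" if num_sources == 1 else "link"
    return ["train", final_phase]


print(phases_after_labeling(1))
print(phases_after_labeling(2))
```

If the model has already been built and the input columns and field types have not changed, the train entry can be skipped and only the final phase rerun.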
