Zingg-0.3.3

findTrainingData

Finding pairs of records which could be similar to train Zingg

Zingg builds models to predict similarity, and training data is needed to build these models. In the findTrainingData phase, Zingg searches the data for edge cases: it combs through samples of the data and selects a limited number of pairs for the user to mark. Zingg is frugal about training so that user effort is minimized and models can be built and deployed quickly.

The findTrainingData job writes the edge cases to the folder configured through zinggDir/modelId in the config.

./zingg.sh --phase findTrainingData --conf config.json

The findTrainingData phase is run first, followed by the label phase. This cycle is repeated so that the Zingg models get smarter from user feedback.
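
The cycle described above can be sketched as a small shell helper. This is only an illustration, not part of Zingg: the function name run_training_rounds and the variables ZINGG_CMD, CONF, and ROUNDS are hypothetical names introduced here. The sketch assumes the label phase is invoked with the same --phase/--conf form as findTrainingData.

```shell
#!/usr/bin/env bash
# Hypothetical helper: alternate findTrainingData and label for a few rounds.
# ZINGG_CMD, CONF, and ROUNDS are illustrative names, not Zingg options.
ZINGG_CMD="${ZINGG_CMD:-./zingg.sh}"   # path to the Zingg launcher script
CONF="${CONF:-config.json}"            # config file passed to every phase
ROUNDS="${ROUNDS:-3}"                  # number of find/label cycles to run

run_training_rounds() {
  local round
  for ((round = 1; round <= ROUNDS; round++)); do
    echo "Round ${round}: selecting candidate pairs"
    "$ZINGG_CMD" --phase findTrainingData --conf "$CONF" || return 1

    echo "Round ${round}: labeling the selected pairs"
    # Assumed invocation: same flags as above, with the phase set to label.
    "$ZINGG_CMD" --phase label --conf "$CONF" || return 1
  done
}
```

After sourcing the file, calling run_training_rounds runs the configured number of cycles and stops at the first phase that exits with a non-zero status. The number of rounds is a judgment call; three is an arbitrary default here, not a Zingg recommendation.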

