Hardware Sizing


Zingg has been built to scale. Performance depends on:

  • The number of records to be matched.

  • The number of fields to be compared against each other.

  • The actual number of duplicates.

Here are some performance numbers you can use to determine the appropriate hardware for your data.

  • 120k records from examples/febrl120k/test.csv take 5 minutes to run on a 4-core, 10 GB RAM local Spark cluster (see the invocation sketch after this list).

  • 5m records of the North Carolina voters dataset take ~4 hours on a 4-core, 10 GB RAM local Spark cluster.

  • 9m records with 3 fields (first name, last name, email) take 45 minutes to run on an AWS m5.24xlarge instance with 96 cores and 384 GB RAM.
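
As a reference point, the 120k run above can be reproduced with Zingg's standard --phase/--conf invocation. The config file name for the febrl120k example is an assumption here, so check the examples directory shipped with your release:

    # Hypothetical invocation of Zingg's match phase on the bundled 120k
    # febrl example; the config path is assumed -- verify the file name
    # under examples/febrl120k in your installation.
    ./scripts/zingg.sh --phase match --conf examples/febrl120k/config.json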

If you have up to a few million records, it may be easier to run Zingg on a single machine in Spark local mode, as sketched below.
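
A minimal sketch of sizing such a single-machine run, assuming Zingg is submitted through a Spark installation you control: the properties below are standard Spark settings and can be placed in $SPARK_HOME/conf/spark-defaults.conf. In local mode the driver executes all tasks, so driver memory is the setting that matters.

    # Sketch of $SPARK_HOME/conf/spark-defaults.conf, sized like the
    # 4 core / 10 GB benchmarks above (standard Spark properties).
    spark.master           local[4]   # run Spark locally with 4 worker threads
    spark.driver.memory    10g        # driver holds all tasks in local mode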
