Zingg-0.3.4

Tuning Label, Match And Link Jobs

Two configuration settings have the largest effect on the performance of these jobs.

numPartitions

The number of Spark partitions over which the input data is distributed. Set it to 20 to 30 times the number of cores. This setting has a significant effect on performance.
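As a rough illustration of the 20x to 30x guideline, the following Python sketch computes a range for numPartitions from a core count. The helper name `suggested_num_partitions` and the example core count are hypothetical and not part of Zingg; the only thing taken from the text above is the multiplier range.

```python
def suggested_num_partitions(total_cores: int) -> tuple[int, int]:
    """Return the (low, high) range for numPartitions.

    Applies the guideline of 20 to 30 partitions per core.
    total_cores is assumed to be the number of cores available to the job.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return 20 * total_cores, 30 * total_cores


# Hypothetical example: a machine with 4 cores.
low, high = suggested_num_partitions(4)
print(f"numPartitions between {low} and {high}")
```

Any value inside the returned range follows the guideline; which end works better is likely to depend on the data and the cluster.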

labelDataSampleSize

The fraction of the data to be used for training the models. Adjust it between 0.0001 and 0.1 so that the sample stays small enough to surface enough edge cases quickly. If the sample is larger, the findTrainingData job spends more time combing through samples. If it is too small, Zingg may not find the right edge cases.
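To get a feel for what a given fraction means in practice, this Python sketch checks a candidate labelDataSampleSize against the 0.0001 to 0.1 range given above and estimates how many records the sample would contain. The function name `sample_size_rows` and the record count are hypothetical; the estimate is plain arithmetic, not a description of how Zingg samples internally.

```python
MIN_FRACTION = 0.0001  # lower bound of the range given in the text
MAX_FRACTION = 0.1     # upper bound of the range given in the text


def sample_size_rows(total_rows: int, fraction: float) -> int:
    """Estimate the number of sampled records for a labelDataSampleSize value.

    Raises ValueError if the fraction falls outside the documented range.
    """
    if not MIN_FRACTION <= fraction <= MAX_FRACTION:
        raise ValueError(
            f"labelDataSampleSize {fraction} is outside "
            f"[{MIN_FRACTION}, {MAX_FRACTION}]"
        )
    return round(total_rows * fraction)


# Hypothetical example: one million input records, 1% sample.
print(sample_size_rows(1_000_000, 0.01))
```

If the estimate looks too large for quick iterations of findTrainingData, try a smaller fraction; if too few edge cases turn up while labeling, try a larger one.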


Last updated 2 years ago