Data Management

Scalable Automated Config-Driven Data Validation with ValiData

As the world's largest professional network, LinkedIn processes and stores massive volumes of data daily. To keep up with fast-changing business requirements, these datasets evolve continuously: changes in business strategy, tech stack migrations of underlying platforms, the addition or removal of features, changes to specific field definitions, changes to the filter conditions used to generate the data, and so on. On average, approximately 150 changes are made to data pipelines every week. This makes it essential to ensure that the data is consistent, accurate, and meets the designated standards, or in other words, the data must be ‘valid.’

ValiData is a scalable, automated, config-driven data validation tool extensively used at LinkedIn that compares metric values of test datasets against production or source-of-truth datasets and highlights differences in metric values across dimensions. Users only need to specify the metrics and dimensions to validate in a configuration file and run ValiData. The tool automatically generates the validation queries, executes them, compares the results, and generates detailed reports highlighting the differences between the datasets being compared. These reports are then emailed to the users, who can quickly review them to determine whether the data is valid.

Typically, data validation is performed manually by running a set of queries on the datasets to fetch the number of records and certain aggregated metric values, and comparing them with those of existing source-of-truth datasets. Because these manual validations are usually time-consuming, repetitive, incomplete, and prone to human error, there is an inherent need to automate the data validation process. Manual validation of a single dataset can take approximately one hour, and if the validation queries contain errors, the user might need to perform additional runs that take another hour, meaning a typical data validation task can take ~2 hours to complete. With ValiData, the user only needs to analyze the schema, prepare a configuration, and submit it; the entire validation process for a typical dataset completes in ~15 minutes, reducing the manual effort by more than 85%.

Motivation

Let us consider the following scenario: Bob, an engineer at LinkedIn, creates a dataset tracking daily job applications per member. This dataset becomes the primary source for identifying active job seekers. To enrich it, Bob adds member details such as country, language, current company, industry, and education by joining with the respective dimension tables. However, the education table may have multiple records for members with degrees from different schools, so the join introduces duplicate records and inflates the metric values. To validate the changes, Bob manually writes queries for each metric and dimension, which is quite time-consuming. Moreover, while writing these queries, Bob might handpick a few members and check that their metrics are the same as before, and in doing so miss the discrepancy in the metric values altogether.

With ValiData, Bob simply has to specify the existing metrics and dimensions in the configuration and submit a ValiData job. It automatically generates all the queries, executes them, compares the corresponding results, and highlights possible discrepancies, thereby greatly reducing the manual effort involved and ensuring accuracy in performing data validations.

What is ValiData?

At its core, ValiData aims to automate the complete data validation process, regardless of the application that generated the data, the schema it follows, the storage format, or the location where it resides. Our goal is to provide a comprehensive, one-stop solution for all data validation needs, improving user experience by eliminating manual steps and boosting developer productivity. By automating the validation process and providing detailed reports on discrepancies, ValiData helps data engineers and developers save time and ensure the accuracy and consistency of their data.

How does it work?

ValiData accepts a configuration file that indicates what to validate and how to validate it. It contains information such as table names, schemas, filters to be applied to the tables, the metrics and dimensions to be compared, the type of validation to be performed, etc. Using this information, ValiData automatically generates multiple aggregation queries and executes them in sequence. The results of these queries are then compared, and detailed validation reports are generated based on the outcomes of these comparisons and mailed to the user. Spark is used for executing the queries, which provides fast query execution and makes ValiData interoperable with multiple storage formats and data sources.
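ValiData's configuration format is internal to LinkedIn and is not reproduced in this post. As a rough illustration only, the kind of information described above could be captured along the following lines, shown here as a Python dictionary; all table, metric, and dimension names are hypothetical.

# Hypothetical sketch of the information a ValiData configuration carries.
# Names below are illustrative only and do not reflect ValiData's actual format.
validation_config = {
    "prod_table": "prod_db.daily_metrics",         # source-of-truth dataset
    "test_table": "test_db.daily_metrics",         # dataset under validation
    "filter": "datepartition = '2023-02-01'",      # filter applied to both tables
    "metrics": ["no_of_records", "total_clicks"],  # aggregated numerical fields
    "dimensions": ["country", "platform"],         # categorical fields to group by
    "validation_type": "additive",                 # e.g., additive vs. non-additive
    "threshold_pct": 1.0,                          # max tolerated % difference
}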

In the following section, we will cover some of the salient features of ValiData and what can be accomplished using ValiData.

Features

  1. Extensible: At the heart of ValiData lies a Spark application that executes queries and compares the results. Although ValiData currently supports only Hive tables and views, Spark supports a wide variety of data formats such as JSON, CSV, Parquet, Avro, ORC, etc., and data sources such as Hive, DB2, MS SQL, S3, Azure Data Lake, etc., which allows us to extend ValiData to other data sources. Similarly, the validation reports are generated in Excel format (.xls) and emailed to the users but can be easily extended to other formats and delivery mechanisms.
  2. Lightweight interface: The user-facing ValiData client application is exceptionally light. It requires the user to provide a few inputs such as the configuration file path, cluster, email ID to send the report, etc., and makes API calls to the Azkaban server, which does all the heavy lifting including query generations, query executions, and report delivery.
  3. User-friendly: The lightweight user interface makes it possible to submit a ValiData job in under a minute. Once submitted, the user can focus on other tasks instead of waiting for the job to finish; the validation reports are automatically mailed to the user once the job is completed.
  4. Scalable: ValiData can perform bulk validation for hundreds of datasets within a single validation job, presenting a particularly advantageous feature in the context of tech stack migrations. Users can streamline the process by simply compiling a list of all the tables and the corresponding metrics to be validated in a single configuration file, which is then submitted as a ValiData job. ValiData efficiently breaks down these validations into smaller, manageable batches, and subsequently dispatches individual validation reports via email for each batch.
  5. Fault-tolerant: While executing bulk validations as explained above, ValiData places a premium on fault tolerance, ensuring that a single validation failure does not disrupt the entire validation process. While running a large number of validations, isolated validation failures may occur due to unforeseen issues such as runtime errors, incorrect user inputs, and malformed configurations. In response to such instances, ValiData proceeds to generate validation reports for the unaffected validations, while a distinct email communication reports the details of the failed validations, allowing for targeted attention and resolution (a minimal sketch of this behavior follows this list).
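As a rough illustration of the fault-tolerant bulk validation described in item 5, the sketch below runs each validation independently, collects failures, and reports them separately. The helper functions passed in are hypothetical placeholders, not ValiData's actual API.

# Minimal sketch: each validation runs in isolation so one failure does not
# abort the whole job; failures are collected and reported separately.
def run_bulk_validations(validations, run_validation, send_report, notify_failures):
    failures = []
    for validation in validations:
        try:
            report = run_validation(validation)    # may raise on runtime errors, bad config, etc.
            send_report(report)                    # report for this (unaffected) validation
        except Exception as exc:
            failures.append((validation, str(exc)))
    if failures:
        notify_failures(failures)                  # separate email listing the failed validations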

Common Usage Scenarios

  1. Compare production and test datasets: One of the most common use cases of ValiData is to compare production and test datasets to validate the effect of a tech stack migration, changes to the definition of one or more fields, or the addition or removal of one or more fields in the dataset.
  2. Statistical summary-based validation: For newly created datasets, where there is nothing to compare against, we can simply run ValiData to generate metrics across different dimension fields, without performing any comparisons. It is also possible to generate a statistical summary report containing the following values for the configured metrics: count, mean, minimum, maximum, standard deviation, and 25th, 50th, and 75th percentile values (see the sketch after this list). This gives an indication of the distribution of values for a given metric in the dataset, which in turn can help quickly detect anomalous metric values, for example, if all records have the same numerical value for a particular metric.
  3. Snapshot dataset validation: Snapshot datasets, i.e., datasets not containing any metrics, can also be validated with ValiData. This is done by comparing the total number of records and the number of unique records for a given dimensional value between the production and test datasets.
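To illustrate the statistical summary idea from scenario 2, the following PySpark sketch computes the same set of statistics for a couple of metric columns; the table and column names are hypothetical, and this is not ValiData's actual implementation.

# Sketch of a statistical summary over metric columns using PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metric-summary-sketch").enableHiveSupport().getOrCreate()
metrics_df = spark.table("some_db.some_metrics_table")  # hypothetical Hive table

summary = metrics_df.select("no_of_impressions", "no_of_apply_clicks").summary(
    "count", "mean", "min", "max", "stddev", "25%", "50%", "75%"
)
summary.show()
# A standard deviation of 0 for a metric, for example, means every record holds
# the same value for it, which is usually worth investigating.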

Impact

The latest major release of ValiData was introduced in February 2023. Since its inception, the tool has facilitated over 2000 validations conducted by a diverse group of 100+ unique users, resulting in an estimated 85% reduction in manual effort per validation. 

In a recent company-wide initiative to migrate data pipelines to the latest technology stack, ValiData has emerged as the go-to tool for conducting post-migration checks. This underscores ValiData's pivotal role in streamlining the validation process and safeguarding data integrity throughout the migration. Particularly noteworthy is ValiData's adeptness in bulk validation, enabling users to seamlessly validate hundreds of datasets at once. This distinctive capability, a recently added feature to specifically aid such migrations, reinforces ValiData's standing as an indispensable asset in navigating the intricacies of complex data migration endeavors.

Platform Architecture

The following diagram illustrates the end-to-end architecture of ValiData and its components.

Diagram of the ValiData Architecture
Figure 1. ValiData Architecture

Note: Components represented in dotted lines are not part of the current system, but are added to indicate the extensible nature of ValiData.

The following sections cover the functionalities of the individual components.

ValiData client tool

A lightweight customer-facing client application that accepts a validation configuration and makes an API call to submit a ValiData job with the encoded configuration as a parameter. 

For first-time users, the client also requires a few additional inputs such as a validation project name, cluster, paths to save validation outputs, etc. It remembers these inputs and reuses them in subsequent runs, thereby making the validation job submission process quick.

Configuration Readers

The first step in a ValiData job is to parse the ValiData configuration supplied by the user and prepare a list of validation objects. Each validation object contains attributes to perform a specific validation task. These attributes include names of tables to be compared, the list of metrics and dimensions, filter conditions to be applied, the type of validation to be performed, and the thresholds for metric differences.
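As an illustration, a parsed validation object along the lines described above could be modeled roughly as the following dataclass; the field names and defaults are assumptions, not ValiData's actual implementation.

# Hypothetical shape of a validation object produced by the Configuration Readers.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ValidationObject:
    test_table: str                     # table under validation
    prod_table: Optional[str]           # source-of-truth table (absent for new datasets)
    metrics: List[str]                  # aggregated numerical fields to compare
    dimensions: List[str]               # categorical fields to group by
    filter_condition: Optional[str]     # e.g., a partition filter
    validation_type: str = "additive"   # type of validation to perform
    threshold_pct: float = 1.0          # max tolerated % difference before flagging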

In a later section, we will go through a sample validation and a sample configuration with the individual fields explained.

Workflow Builder

The validation objects generated by the Configuration Readers are passed to the Workflow Builder. The Workflow Builder generates workflows based on the validation type, i.e., an additive metric validation creates an additive workflow and a non-additive validation creates a non-additive workflow. These workflows, when executed, generate a list of validation work units that contain the final aggregated metric values that need to be compared and/or reported.

To generate the validation work units, the Workflow Builder adds two additional modules related to query processing to the workflow:

Query Generator

Each validation work unit involves running queries on the selected execution engine to obtain the desired results. The Query Generator makes use of the table names, metrics, dimensions, and filter conditions available in the work unit and generates SQL queries that compute the metrics at the desired granularity for each of the dimensions.

If a production dataset is supplied, separate queries are generated for the production and test datasets. The results of these queries are then compared to obtain the percentage differences between the production and test datasets for each of the metric values.
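As a hedged sketch of what such query generation could look like, the helper below assembles one aggregation query per dimension from a table name, metric list, dimension, and optional filter; the function and its output shape are illustrative, not ValiData's actual generator.

# Sketch: build an aggregation query for one dimension; the same query shape is
# generated for both the production and test tables, only the table name changes.
def build_aggregation_query(table, metrics, dimension, filter_condition=None):
    metric_exprs = ", ".join(f"SUM({m}) AS {m}" for m in metrics)
    where_clause = f"WHERE {filter_condition} " if filter_condition else ""
    return (
        f"SELECT {dimension}, {metric_exprs} "
        f"FROM {table} {where_clause}"
        f"GROUP BY {dimension}"
    )

prod_query = build_aggregation_query(
    "jobs_db.job_applications",
    ["no_of_impressions", "no_of_apply_clicks"],
    "job_country",
    "datepartition = '2023-02-01'",
)
test_query = build_aggregation_query(
    "jobs_test_db.job_applications",
    ["no_of_impressions", "no_of_apply_clicks"],
    "job_country",
    "datepartition = '2023-02-01'",
)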

Query Executor

As the name suggests, the query executor executes the production and test queries generated by the Query Generator and generates the final Validation Work Unit.
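Continuing the illustrative sketch from the Query Generator section, and assuming an existing SparkSession named spark, execution reduces to submitting the generated queries to Spark and collecting the (small) aggregated results:

# Run the generated production and test queries and key the results by the
# dimension value so they can be compared row by row.
prod_rows = {row[0]: row.asDict() for row in spark.sql(prod_query).collect()}
test_rows = {row[0]: row.asDict() for row in spark.sql(test_query).collect()}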

Data Validator

The Data Validator executes the workflows, which results in queries being generated by the Query Generator, the generated queries being executed by the Query Executor, and finally the creation of Validation Work Units containing the results of the executed queries.

The Validation Work Units contain the aggregated metric values for both production and test datasets across the different dimensions provided. The Data Validator now simply has to compare these values and calculate the percentage differences between them.
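A minimal sketch of this comparison step is shown below; the aggregated values are hypothetical and serve only to illustrate the percentage-difference calculation.

# Compute the percentage difference of each test metric value relative to production.
def pct_difference(prod_value, test_value):
    if prod_value == 0:
        return 0.0 if test_value == 0 else float("inf")
    return (test_value - prod_value) / prod_value * 100.0

prod = {"US": {"no_of_impressions": 1000}, "IN": {"no_of_impressions": 800}}   # hypothetical values
test = {"US": {"no_of_impressions": 1000}, "IN": {"no_of_impressions": 808}}   # hypothetical values

diffs = {
    dim_value: {m: pct_difference(prod[dim_value][m], test[dim_value][m])
                for m in prod[dim_value]}
    for dim_value in prod
}
print(diffs)  # {'US': {'no_of_impressions': 0.0}, 'IN': {'no_of_impressions': 1.0}}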

Data Writer

The production and test metric values and the computed metric differences are then written to a report that is emailed to the users. Currently, ValiData supports Excel reports that contain separate sheets to display the corresponding metric values and differences for each dimension. It also uses cell background color to highlight metric differences that exceed the configured thresholds.
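As a rough illustration of this step, the sketch below uses openpyxl (which writes .xlsx files; ValiData's actual report format and layout may differ) to create one sheet per dimension and highlight percentage differences that exceed the configured threshold.

# Sketch: one sheet per dimension, highlighting out-of-threshold differences.
from openpyxl import Workbook
from openpyxl.styles import PatternFill

HIGHLIGHT = PatternFill(start_color="FFC7CE", end_color="FFC7CE", fill_type="solid")

def write_report(path, results_by_dimension, threshold_pct):
    wb = Workbook()
    wb.remove(wb.active)  # drop the default empty sheet
    for dimension, rows in results_by_dimension.items():
        ws = wb.create_sheet(title=dimension)
        ws.append(["dimension_value", "metric", "prod_value", "test_value", "pct_diff"])
        for row in rows:  # row = (dimension_value, metric, prod_value, test_value, pct_diff)
            ws.append(list(row))
            if abs(row[-1]) > threshold_pct:
                ws.cell(row=ws.max_row, column=5).fill = HIGHLIGHT
    wb.save(path)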

Report Sender

The last stage is to send the reports generated in the previous step as email attachments. In the case of multiple validations, the email body also lists the validations performed and any failures.
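A minimal sketch of this stage using Python's standard library is shown below; the SMTP host, sender address, and subject line are placeholders rather than ValiData's actual setup.

# Sketch: email the generated report as an attachment.
import os
import smtplib
from email.message import EmailMessage

def email_report(report_path, recipient, body_text, smtp_host="localhost"):
    msg = EmailMessage()
    msg["Subject"] = "ValiData validation report"
    msg["From"] = "validata-noreply@example.com"
    msg["To"] = recipient
    msg.set_content(body_text)  # e.g., list of validations and any failures
    with open(report_path, "rb") as f:
        msg.add_attachment(f.read(), maintype="application",
                           subtype="octet-stream",
                           filename=os.path.basename(report_path))
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)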

Limitations

Schema Detection

ValiData requires users to provide a configuration file that specifies the metrics, dimensions, and table names for the dataset. While there are some customized tools available to auto-generate the configuration file, they may not always meet the needs of every user. In some cases, manual editing of the auto-generated file may be necessary before submitting it to ValiData.

Summarization for large datasets

ValiData generates a detailed report that compares metric values across multiple dimensions. However, for datasets with more than 10 metrics and/or dimensions, it might be difficult to manually scan the report for inconsistencies. To address this issue, additional functionality could be added to ValiData that summarizes the key inconsistencies using GenAI and presents them in a more easily digestible format. This would allow users to quickly identify any potential issues and take appropriate action to resolve them.

Sample Data Validation

Scenario

Let us consider a sample Hive table jobs_db.job_applications, which stores job applications at a daily granularity. It contains the following fields:

  1. job_id: Unique ID for the job
  2. job_country: The country where the job is posted
  3. job_industry: The industry to which the job belongs (for example: Marketing, Technology, Education, etc.)
  4. job_company: The company to which the job belongs
  5. job_title: The title for the job (for example: Software Engineer)
  6. no_of_impressions: The number of times the job showed up on job listings and/or recommendations
  7. no_of_apply_clicks: The number of times members clicked on the ‘Apply’ button
  8. no_of_applicants: The unique number of applicants for the job
  9. datepartition: The date on which the application was submitted; this is also the field used for partitioning the dataset

The job_id field is a primary key. Fields 6-8 are metrics, as they are aggregated numerical values, and fields 2-5 are dimensions that contain categorical values.

Suppose the user adds a new dimension field to this dataset. Let this new field be:

platform: Platform used for the job application (for example: mobile or web)

To test these changes, the user creates a test dataset jobs_test_db.job_applications with a similar schema and a newly added field called platform. The user wishes to ensure that the newly added dimension field does not affect the existing metric values. This can be quickly validated using ValiData.

Interface

The following image shows the ValiData client tool in action:

Codeblock of ValiData client interface tool

Configuration

In the above execution, validata_config.conf is the configuration file that contains the validation details. For the example provided, the configuration would look like the following:

Image of code block for ValiData configuration

Note: All table and field names in the above configuration are for representation purposes only and do not represent actual datasets used at LinkedIn.
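Because the configuration above is shown only as an image, here is a hypothetical sketch of the kind of information it would carry for this scenario, expressed as a Python dictionary; the actual format and field names used by ValiData may differ.

# Hypothetical configuration sketch for the job_applications example; not the
# actual configuration shown in the image above.
job_applications_validation = {
    "prod_table": "jobs_db.job_applications",
    "test_table": "jobs_test_db.job_applications",
    "filter": "datepartition = '2023-02-01'",  # illustrative partition filter
    "metrics": ["no_of_impressions", "no_of_apply_clicks", "no_of_applicants"],
    "dimensions": ["job_country", "job_industry", "job_company", "job_title"],
    "validation_type": "additive",  # no_of_applicants (unique counts) would need the non-additive validation
    "threshold_pct": 1.0,
}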

Execution Output

Overall validation

Image of overall validation table

Dimension-wise validation - Additive

Image of dimension-wise validation table

Dimension-wise validation - Non-Additive

Image of dimension-wise validation table - non additive

Note: All metric values in the above example are artificially generated samples.

From the above reports, one can infer that although the overall metric values match, there are minor differences in the values for the SALES and FINANCIAL SERVICES industries, which need to be corrected.

Conclusion and Future Steps

ValiData started as a venture project with the sole aim of improving developer productivity and has since seen a wide range of enhancements and improvements. During its initial stages, users navigated the intricacies of modifying its codebase, manually submitting validation jobs, and retrieving reports from the servers where the application ran. However, fueled by continuous feedback and user interactions, ValiData has undergone a transformative evolution. Today, it stands as an effortlessly accessible tool, enabling users to execute one-click validations and receive comprehensive reports directly via email.

While considerable strides have been made, the journey towards an optimized user experience is ongoing. Exciting enhancements are on the horizon, with the imminent introduction of a user-friendly UI for ValiData. This upcoming feature will empower users to select various parameters effortlessly through dropdown options, marking a significant step forward in pursuing a more streamlined and intuitive validation process. Currently, the implementation is closely linked to Spark on Hadoop. Our objective is to enhance its versatility, making it more generic and adaptable to run seamlessly on various other platforms. Upon achieving this objective, ValiData stands poised to become an ideal candidate for open-sourcing, thereby catalyzing benefits for the broader data community.

The commitment to innovation persists, and the ValiData team looks forward to unveiling further improvements in the months ahead.

Acknowledgments

This tool would not have been possible without the guidance, support, and leadership provided by Shishir Sathe, and Raghu Yalamanchili. We are also grateful to the members of the Data Foundation Tools and Platforms team at LinkedIn for their continuous feedback and suggestions. We would like to extend a special thanks to Ajay Kutnikar for contributing valuable features that have simplified the usage of the tool. Finally, we would like to thank all the users of ValiData who have been instrumental in helping us reach the scale we have achieved today.