Data Engineering and Validation

From the Ground Up

Author

Michael Mullarkey

Published

September 27, 2023

Preface

Why Does This Book Exist?

Data engineering and validation is the practice of systematically putting together accurate, well-structured data.

Data engineering and validation is a multi-billion dollar industry.

Data engineering and validation is absent at many smaller organizations.¹

At first glance, these statements might seem contradictory. If the processes that help ensure accurate, usable data exude value then wouldn’t all organizations would jump to implement them?

However, the value of data engineering and validation is precisely what can put it out of reach for smaller organizations. Valuable tools are expensive ones, and common data engineering and validation tools are often outside smaller organizations’ budgets.²

Know-how of multiple indsutry-standard programming languages³ and technologies⁴ are possible to find within smaller organizations. And those skills are valuable enough that retaining people with those skills is far from a guarantee.

Beyond potentially prohibitive costs for tools and talent, most data engineering and validation systems necessary for larger-scale industry applications are overkill for smaller organizations. Many publicly available resources are great if your organization is processing millions of rows of data and over-engineering if your organization is processing thousands of rows.⁵

This resource gap is part of what drives the lack of systematic, repeatable processes for creating accurate datasets at smaller organizations. This book aims to help fill that gap.

What Will I Get Out of This Book?

We’ll work on data engineering and validation from the ground up using R, an open-source language that’s familiar to many people in smaller organizations. You won’t have to touch any other programming languages or technologies to implement everything in this book.

We’ll start with how to create a pipeline using one remote dataset,⁶ learn how to debug our new pipeline, add automated tests for data quality, extend the pipeline to multiple datasets, and perform certain automated tests on some of those datasets but not others.

This book will give you thte building blocks to create a systematic, repeatable, and scalable,⁷ pipeline for engineering accurate data within the R ecosystem.

Who is This Book For?

This book is primarily geared toward people working at smaller organizations⁸ with at least intermediate experience in R.

People interested in data engineering in general with a programming background outside R might also appreciate the “from the ground up” approach. Other data engineering resources sometimes jump straight to a very high complexity level or assume a computer science background.⁹ If you’re looking for an introduction to data engineering and validation at industry scale I’d recommend checking out the Seattle Data Guy’s Youtube Videos.

I don’t want to gatekeep anyone, and this book is unlikely to be a good way to acquaint yourself with R. If you’re looking for an introductory experience for the R programming language, I’d recommend Applied Data Skills by Emily Nordmann and Lisa DeBruine.

This website is and will always be free, licensed under the CC BY-NC-ND 3.0 License.

Who Wrote This Book?

I’m Michael Mullarkey, a clinical psychology PhD who solves problems where data science and product overlap.

I’ve also put together a lot of data engineering and validation pipelines for smaller organizations. I’m a former academic who has worked multiple types of data jobs in industry. I enjoy helping my cat see birds in the window, watching horror movies with my spouse, and bringing a social science lens to software engineering.

Or at least a systematic, repeatable process is absent↩︎
Sure, there are plenty of open source tools that enable data engineering + validation… and the most user-friendly ways to implement those tools are often in pay-to-use environments.↩︎
Like Python and SQL↩︎
Like Docker↩︎
Or even less!↩︎
Instead of having to download a file locally↩︎
At least to ~hundreds of thousands of rows↩︎
Think acaedmic labs, non-profits, local governments, etc.↩︎
This book does not assume a computer science background↩︎