Data Engineering on Azure RTM

My book, Data Engineering on Azure, which I announced in this blog post, is going to print soon. As I did with my previous book, Programming with Types, I'm writing another RTM post to talk about a few aspects of the process.

Title evolution

When I pitched the book to Manning, I used the title Production Data Engineering with Azure. The title was supposed to capture that this is a book about the practical aspects of data engineering, with examples on the Azure platform. In fact, here is how I described the book's topic in the proposal:

The same way Software Engineering brings engineering rigor to software development, Data Engineering aims to bring rigor to working with data in a reliable way.

This book is about implementing the various aspects of a big data platform - data ingestion, running analytics and ML, distributing data - in a real-world production system. The focus is on operational aspects like DevOps, monitoring, scale, and compliance. Examples will be provided using Azure services.

There is a big gap between what it takes, for example, to implement an ML model in Python and what it takes to run in a production environment, on a regular basis, with robust guardrails in place. The book focuses on the latter, which makes it different than other data platform books.

The Manning team has a lot of experience putting together books (and selling them). We iterated on the title quite a few times, trying to best capture the essence of the book. Once we started the project, we changed the name from Production Data Engineering with Azure to Practical Data Engineering on Azure.

Before launching the book as a Manning Early Access Preview (MEAP), we changed the name again, this time to Azure Data Engineering: the Practical part of the title made it a bit too long and not very clear.

As the manuscript was wrapping up, we took another look at the title: Azure Data Engineering implies the book is Azure-specific. While all the examples provided are built in the Azure cloud, my hope is the patterns and ideas discussed apply to any big data platform, in any cloud. We iterated on the title again, to emphasize the data engineering part, and ended up with Data Engineering on Azure. This is the final title of the book.

Articles and excerpts

Before starting the project, I wrote a few articles on the topic. The first one was Notes on Data Engineering. Soon after, my team launched the Data Science @ Microsoft Medium publication, where I contributed several articles:

How we built self-serve data environment tools with Azure.
Azure Data Explorer at the Azure business scale.
Running machine learning at scale.
Common data engineering challenges and their solution - which is a retake on that first article (Notes on Data Engineering).
Partnering for data quality.
Partnering for metadata management.
Data distribution.

Most of the ideas from these articles show up in the book. While the articles talk about the specific challenges my team encountered and the solutions we came up with to solve them, the book covers patterns - the general types of problems you would encounter while building a big data platform, and solutions you could apply. The articles helped me to clarify (for myself) the topics I wanted to cover in the book, and refine the proposed solutions.

Once the manuscript was well underway, I wrote a blog post on Data Quality Testing Patterns to clarify my thoughts as I was working on chapter 9 (Data quality), but otherwise switched my focus from articles to getting the book done. At this point, I started publishing excerpts from the book. So far I have written about Changing data classification through processing and Ingesting data, with more to come.

The speed of the cloud

Innovation in cloud computing moves at breakneck speed. The technology changes so fast that it is hard to pin things down in written form. For setting up various Azure services, I wanted to rely on command line scripts instead of the Azure Portal UI - walking readers through a series of screenshots is tedious, and the UI changes all the time. I used Azure CLI instead. That said, many of the extensions I used throughout the book are currently experimental, which means they might change in the future. I also found a couple of bugs, which I reported to the teams maintaining the Azure CLI extensions.

Another example of the speed of innovation is Azure Purview. When I started working on our data platform, there was no Azure Purview and my team had to develop a home-grown solution to address our data inventory needs. We then got to use a preview, in-development version of Azure Purview before it was publicly announced (one of the perks of working at Microsoft). Chapter 8 of my book covers metadata management, with the reference implementation on Azure Purview. That meant I wasn't able to start on this chapter until Azure Purview was officially announced, even though I knew what I wanted to write about. Things lined up pretty well: I finished chapter 7 and had to skip to chapter 9, but as I was working on that, Azure Purview went into public preview.

This was a very interesting experience, very different from writing my previous book. Writing my first book, I didn't feel like there were so many moving parts to get a handle on, and the speed with which things changed wasn't overwhelming. Even so, I'm confident the patterns I cover in the book will remain the same for quite some time, regardless of the technologies used to implement them. So even as new services launch and the ways we interact with the cloud evolve, the key takeaways should stay relevant.

Check out my book here: Data Engineering on Azure.