Vlad's Tech Blog

Mental Poker Part 10: Conclusions

Mon, 28 Oct 2024 00:00:00 -0700

This blog post will wrap up the Mental Poker series. I started thinking about this in 2021, and worked on a Mental Poker Toolkit library as a side-project. The blog posts in the series were written as I was exploring the tech. Here I aim to bring all the pieces together in a final recap.

Inception

This all started with Fluid Framework. As the team was building out the framework, we used hackathons to implement various applications of Fluid. Since Fluid powers real-time collaboration, team members came up with all sorts of ideas. For example, when I joined the team, I built a simple collaborative coloring app where multiple clients can simultaneously color a black and white drawing. A recurring theme was games - building multiplayer games on top of the framework. The challenge with building games is hiding information. In Fluid, all data is synchronized to all clients and there is no central authority. The Azure Fluid Relay isn't running app code, so there isn't an easy way to maintain hidden state for a game (e.g. cards in hand).

I was looking for a way to do this and learned about mental poker. Mental Poker is a way to play games with private information in a zero-trust environment, without relying on a central authority to, for example, deal cards. This is a good fit for Fluid. As a side-project, I decided to build a library to enable development of this type of game, with Fluid as the underlying communication mechanism.

So how do players agree on which cards they are dealt, without knowing their opponent's hand?

Cryptography

The first big piece I covered was cryptography. Mental Poker relies on commutative encryption, but most commonly used encryption algorithms are non-commutative. Commutative here means that if both Alice and Bob encrypt something with their keys, the order in which they apply their keys to decrypt doesn't matter.

Since I couldn't find a library that provides a commutative encryption algorithm, I implemented the SRA algorithm (SRA, not RSA - same people's initials, different algorithm). I also ended up implementing a bunch of BigInt math, all covered in Mental Poker Part 1: Cryptography. The blog post covers in detail how shuffling a deck of cards works and which cryptographic primitives are used.
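
To make the commutative property concrete, here is a minimal sketch of SRA-style encryption over a tiny prime, using plain numbers instead of BigInt. The prime 1009, the exponents 5 and 11, and all the function names are assumptions picked for illustration. This is not the toolkit's implementation, and a prime this small offers no security.

```typescript
// Toy SRA over a small shared prime. Real implementations use large primes
// and BigInt arithmetic; plain numbers are only safe here because P is tiny.
const P = 1009;

// Modular exponentiation by repeated squaring.
function modPow(base: number, exp: number, mod: number): number {
    let result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp % 2 === 1) result = (result * base) % mod;
        base = (base * base) % mod;
        exp = Math.floor(exp / 2);
    }
    return result;
}

function gcd(a: number, b: number): number {
    return b === 0 ? a : gcd(b, a % b);
}

// Brute-force modular inverse, fine for toy sizes only.
function modInverse(a: number, m: number): number {
    for (let x = 1; x < m; x++) {
        if ((a * x) % m === 1) return x;
    }
    throw new Error("No inverse exists");
}

type ToyKey = { enc: number; dec: number };

// The encryption exponent must be coprime with P - 1 so that it has an inverse.
function makeKey(enc: number): ToyKey {
    if (gcd(enc, P - 1) !== 1) throw new Error("enc must be coprime with P - 1");
    return { enc, dec: modInverse(enc, P - 1) };
}

const encrypt = (m: number, k: ToyKey): number => modPow(m, k.enc, P);
const decrypt = (c: number, k: ToyKey): number => modPow(c, k.dec, P);

const alice = makeKey(5);
const bob = makeKey(11);
const card = 42;

// Encrypting in either order yields the same ciphertext...
const both = encrypt(encrypt(card, alice), bob);
const sameEitherOrder = both === encrypt(encrypt(card, bob), alice);

// ...and the layers can be removed in any order too.
const recovered = decrypt(decrypt(both, alice), bob);
```

Alice removes her layer first here even though Bob encrypted last, and the card value still comes back. This is the property the shuffle relies on.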

Ledger

Next, looking at game modeling, I decided a good way to represent a turn-based game is an append-only list. Each game step is a node in the list.

Fluid Framework relies on Distributed Data Structures (DDSes) to maintain state and synchronize it across clients. I implemented this ledger as a Fluid Framework DDS. It lives outside of the mental-poker-toolkit repo, since it is generally useful outside of Mental Poker.

The DDS is the lowest-level representation of a game. I covered this in Mental Poker Part 2: Fluid Ledger.

Transport

Wrapping up the plumbing, I looked at a simple abstraction over the transport layer. This is a very simple interface:

// Transport interface
export declare interface ITransport<T> {
    // Get all the actions that have been posted so far
    getActions(): IterableIterator<T>;

    // Post an action
    postAction(value: T): Promise<void>;

    // Event emitter
    once(event: "actionPosted", listener: (value: T) => void): this;
    on(event: "actionPosted", listener: (value: T) => void): this;
    off(event: "actionPosted", listener: (value: T) => void): this;
}

Here, an action is an item on our ledger list. We can get all actions posted to the ledger so far, post a new action, and hook up event listeners.

Note the interface doesn't mention the ledger, so we can swap implementations if needed. The toolkit relies on Fluid (the FluidTransport implementation of this interface) but this could be swapped out for something else as long as this interface is satisfied.
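
As an illustration of swapping implementations, here is a minimal sketch of an in-memory transport that satisfies the same interface. InMemoryTransport is a hypothetical name, not part of the toolkit. It only works inside a single process, which makes it more of a test double than a real transport.

```typescript
// Same shape as the ITransport interface above, repeated so the sketch
// is self-contained.
interface ITransport<T> {
    getActions(): IterableIterator<T>;
    postAction(value: T): Promise<void>;
    once(event: "actionPosted", listener: (value: T) => void): this;
    on(event: "actionPosted", listener: (value: T) => void): this;
    off(event: "actionPosted", listener: (value: T) => void): this;
}

// Hypothetical single-process stand-in for the ledger.
class InMemoryTransport<T> implements ITransport<T> {
    private actions: T[] = [];
    private listeners: { fn: (value: T) => void; once: boolean }[] = [];

    getActions(): IterableIterator<T> {
        return this.actions[Symbol.iterator]();
    }

    postAction(value: T): Promise<void> {
        this.actions.push(value);
        // Iterate over a copy so one-shot listeners can be removed safely.
        for (const listener of [...this.listeners]) {
            if (listener.once) this.off("actionPosted", listener.fn);
            listener.fn(value);
        }
        return Promise.resolve();
    }

    once(_event: "actionPosted", listener: (value: T) => void): this {
        this.listeners.push({ fn: listener, once: true });
        return this;
    }

    on(_event: "actionPosted", listener: (value: T) => void): this {
        this.listeners.push({ fn: listener, once: false });
        return this;
    }

    off(_event: "actionPosted", listener: (value: T) => void): this {
        this.listeners = this.listeners.filter((l) => l.fn !== listener);
        return this;
    }
}

// Usage: every posted action is appended and announced to listeners.
const transport = new InMemoryTransport<string>();
const seen: string[] = [];
transport.on("actionPosted", (action) => { seen.push(action); });
void transport.postAction("first");
void transport.postAction("second");
const posted = Array.from(transport.getActions());
```

Code written against ITransport doesn't need to know whether it is talking to Fluid or to something like this.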

I also implemented a SignedTransport as a decorator, which adds signature verification for an existing ITransport. Since there is no central authority and multiple clients can be part of a session, to mitigate spoofing we want clients to exchange public keys as a first step, then sign all subsequent messages with private keys. This is a different algorithm than SRA: regular asymmetric cryptography signing and signature verification. I implemented this on top of crypto.subtle.

I covered all of this in Mental Poker Part 3: Transport.

Actions

I briefly mentioned actions in the Transport section. For the Mental Poker toolkit, all actions are supposed to contain a clientId property, identifying the client, and a type, which is a string literal used to identify the action, plus any additional payload the action might need.

export type ClientId = string;

export type BaseAction = {
    clientId: ClientId;
    type: unknown;
};

Async Queue

The async queue is something I hadn't considered when starting the project, but I realized using the ITransport interface directly is cumbersome. While it maps well over Fluid, using it to implement games is not ergonomic.

The async queue provides a better interface over the transport:

export interface IQueue<T extends BaseAction> {
    enqueue(value: T): Promise<void>;

    dequeue(): Promise<T>;
}

The implementation itself is fairly straightforward, relying on the ITransport APIs and events. With this, clients can enqueue and dequeue actions and await on the response.
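
Here is a rough sketch of how such a queue can sit on top of transport events. SketchQueue, ActionSource and Loopback are hypothetical names, and ActionSource is just the subset of ITransport this sketch needs. The toolkit's actual implementation may differ.

```typescript
type BaseAction = { clientId: string; type: unknown };

// Hypothetical subset of ITransport used by this sketch.
interface ActionSource<T> {
    postAction(value: T): Promise<void>;
    on(event: "actionPosted", listener: (value: T) => void): this;
}

class SketchQueue<T extends BaseAction> {
    private source: ActionSource<T>;
    private buffered: T[] = []; // received but not yet dequeued
    private waiting: ((value: T) => void)[] = []; // pending dequeue() calls

    constructor(source: ActionSource<T>) {
        this.source = source;
        source.on("actionPosted", (value) => {
            // Hand the action to the oldest waiter, or buffer it for later.
            const resolve = this.waiting.shift();
            if (resolve) {
                resolve(value);
            } else {
                this.buffered.push(value);
            }
        });
    }

    get bufferedCount(): number {
        return this.buffered.length;
    }

    enqueue(value: T): Promise<void> {
        return this.source.postAction(value);
    }

    dequeue(): Promise<T> {
        const next = this.buffered.shift();
        if (next !== undefined) {
            return Promise.resolve(next);
        }
        return new Promise<T>((resolve) => { this.waiting.push(resolve); });
    }
}

// Loopback source for demonstration: posting notifies listeners right away.
class Loopback<T> implements ActionSource<T> {
    private listeners: ((value: T) => void)[] = [];

    postAction(value: T): Promise<void> {
        this.listeners.forEach((listener) => listener(value));
        return Promise.resolve();
    }

    on(_event: "actionPosted", listener: (value: T) => void): this {
        this.listeners.push(listener);
        return this;
    }
}

const demoQueue = new SketchQueue<BaseAction>(new Loopback<BaseAction>());
void demoQueue.enqueue({ clientId: "alice", type: "Hello" });
void demoQueue.enqueue({ clientId: "bob", type: "Hello" });
const firstSender = demoQueue.dequeue().then((action) => action.clientId);
```

The key design point is that dequeue() can be called before or after an action arrives: early arrivals are buffered, early callers wait.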

Both actions and the queue implementation are covered in Mental Poker Part 4: Actions and Async Queue.

Note that by now, running a game using the toolkit can be done by just relying on actions and the two queue APIs: enqueue() and dequeue(). Very simple.

State Machine

Of course, we need a way to model games. Game rules are implemented as sequences of actions. An action is an atomic step. Note that a game move, for example drawing a card, doesn't necessarily map to a single action going over the transport. A game move, especially in the context of Mental Poker, can involve several steps (actions) taken by the players.

The state machine aims to facilitate game implementation.

Transitions

I implemented two core state machine pieces: local transitions and remote transitions.

A local transition means an action originates on our client. For example the player decides to discard a card or, in a game of rock-paper-scissors, the player picks between the 3 options. This means we will run some code and enqueue an action:

type LocalTransition<TAction extends BaseAction, TContext> = (
    actionQueue: IQueue<TAction>,
    context: TContext
) => void | Promise<void>;

We take the queue as a parameter. The context can be anything; it's a way to pass additional game state to the function.

A remote transition means we receive an action.

type Transition<TAction extends BaseAction, TContext> = (
    action: TAction,
    context: TContext
) => void | Promise<void>;

Here, we dequeue an action and invoke the transition, passing the action as an argument.

We need both of these transitions to implement a game, but we can provide a unified abstraction:

type RunnableTransition<TContext> = (
    actionQueue: IQueue<BaseAction>,
    context: TContext
) => Promise<void>;

We can adapt a Transition to this type by calling dequeue on the actionQueue and passing the resulting action to the Transition.

The state machine takes an array of RunnableTransitions and executes the code in sequence. It also provides several helper functions:

local(), to create a RunnableTransition from a LocalTransition.
transition(), to create a RunnableTransition from a (remote) Transition.
repeat(), to repeat a given RunnableTransition a number of times.
sequence(), to convert several RunnableTransition or RunnableTransition[] into a flat array of RunnableTransition.
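
A minimal sketch of how these helpers could be written, assuming the types from this section. The generics are simplified to BaseAction, the flattening helper is named sequence() to match the sm.sequence() calls used in this series, and run() and the fake queue are assumptions for the demo. See Part 5 for the actual implementation.

```typescript
type BaseAction = { clientId: string; type: unknown };

interface IQueue<T extends BaseAction> {
    enqueue(value: T): Promise<void>;
    dequeue(): Promise<T>;
}

type LocalTransition<TAction extends BaseAction, TContext> = (
    actionQueue: IQueue<TAction>, context: TContext) => void | Promise<void>;
type Transition<TAction extends BaseAction, TContext> = (
    action: TAction, context: TContext) => void | Promise<void>;
type RunnableTransition<TContext> = (
    actionQueue: IQueue<BaseAction>, context: TContext) => Promise<void>;

// A local transition already takes the queue, so we only normalize the result.
function local<TContext>(t: LocalTransition<BaseAction, TContext>): RunnableTransition<TContext> {
    return async (queue, context) => { await t(queue, context); };
}

// A remote transition first dequeues the next action, then handles it.
function transition<TContext>(t: Transition<BaseAction, TContext>): RunnableTransition<TContext> {
    return async (queue, context) => { await t(await queue.dequeue(), context); };
}

function repeat<TContext>(t: RunnableTransition<TContext>, times: number): RunnableTransition<TContext>[] {
    const result: RunnableTransition<TContext>[] = [];
    for (let i = 0; i < times; i++) result.push(t);
    return result;
}

// Flatten a mix of single transitions and arrays of transitions.
function sequence<TContext>(
    items: (RunnableTransition<TContext> | RunnableTransition<TContext>[])[]
): RunnableTransition<TContext>[] {
    return items.reduce<RunnableTransition<TContext>[]>((flat, item) => flat.concat(item), []);
}

async function run<TContext>(
    steps: RunnableTransition<TContext>[], queue: IQueue<BaseAction>, context: TContext
): Promise<void> {
    for (const step of steps) await step(queue, context);
}

// Demo with a fake queue backed by an array (an assumption for this sketch).
const inbox: BaseAction[] = [];
const fakeQueue: IQueue<BaseAction> = {
    enqueue: async (action) => { inbox.push(action); },
    dequeue: async () => {
        const action = inbox.shift();
        if (!action) throw new Error("Queue is empty");
        return action;
    },
};

const log: string[] = [];
const steps = sequence<string[]>([
    local<string[]>(async (q) => { await q.enqueue({ clientId: "me", type: "Play" }); }),
    local<string[]>(async (q) => { await q.enqueue({ clientId: "other", type: "Play" }); }),
    repeat(transition<string[]>((action, ctx) => {
        ctx.push(action.clientId + ":" + String(action.type));
    }), 2),
]);
const finished = run(steps, fakeQueue, log);
```

Because everything ends up as a flat array of functions with the same signature, running a game step is just a loop over that array.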

The post Mental Poker Part 5: State Machine covers the implementation in detail and also shows examples of modeling rules as transitions. Here's a rock-paper-scissors skeleton:

sm.sequence([
    sm.local(async (queue, context) => {
        // Post our play action
    }),
    sm.repeat(sm.transition(async (action, context) => {
        // Both player and opponent need to post their encrypted selection
    }), 2),
    sm.local(async (queue, context) => {
        // Post our reveal action
    }),
    sm.repeat(sm.transition(async (reveal: RevealAction, context: RootStore) => {
        // Both player and opponent need to reveal their selection
    }), 2)
]);

Primitives

We now have all the pieces we need to model games. The toolkit also provides common primitives - plug & play state machines to be integrated in games.

An example of this is card shuffling. Given a deck of cards, there is a state machine that shuffles this deck according to the Mental Poker steps and hides this behind a simple shuffle() function.

I cover the details of this in Mental Poker Part 6: Shuffling Implementation.

Shuffling cards is the canonical example of Mental Poker, but building a game requires several other common pieces. A few examples:

Creating a Fluid transport (abstracting the Fluid container and connection setup).
Enabling signature checking, in other words converting a given (unsigned) ITransport into a SignedTransport.
Establishing turn order for multiple players and agreeing on a large shared prime (required by SRA).

I covered all of these in Mental Poker Part 7: Primitives.

All implementations rely on the state machine and are expressed as sequences of transitions.

Games

Finally, I provided a couple of sample games.

The first is rock-paper-scissors. Rock-paper-scissors is interesting because it does require some cryptography, but it is much simpler than a card game. Players simply pick between rock, paper, or scissors, encrypt their choice, then post it (enqueue it). Once both players have shared their pick, they share a key the other player can use to decrypt their pick. Then we can see who won the game.

The implementation is covered in Mental Poker Part 8: Rock-Paper-Scissors.

Next, I implemented a more complex game: discard. In this game, players take turns discarding cards as long as they can match the value or suit on top of the discard pile. If they can't discard, they draw a card instead. The first player to discard their whole hand wins. This is again a fairly simple game in terms of rules, but requires more advanced semantics like card shuffling, drawing and discarding cards etc.

The implementation is covered in Mental Poker Part 9: Discard Game.

Zero-Trust

Mental Poker enables us to play games in a zero-trust environment without a centralized authority. Of course, there are some limitations.

Signature verification mitigates spoofing, but there is no way to guarantee other clients aren't colluding over a secondary channel. This isn't a limitation specific to Mental Poker but a general one: even if we play poker with a server handling the deal, players can cheat by talking to each other through a separate app.

Cryptography ensures certain types of cheating are impossible. For example, in rock-paper-scissors, a player can't pretend they picked something else once their encrypted pick was shared with the other player. Similarly, cryptography enables maintaining private state over a public channel, including card shuffling, cards in hand etc.

The state machine helps model games as a sequence of steps. As long as the clients agree on the rules and follow the steps, they can play a game. Once a player posts an action that the other player doesn't expect, in other words is not correct according to the game semantics, the other player can tell the game rules are not respected. That said, there is no simple way to recover from this. I call this the flip the table recourse. You can't really do much, since there's no central authority to arbitrate this, but cryptography and the state machine make it easy for you to tell when another player is cheating and, at the very least, you can refuse to continue playing.

This was a very fun side-project I worked on, intermittently, for 3 years. I learned a lot about Mental Poker and built a reusable toolkit for this type of game. All code discussed in the series is available on GitHub: https://github.com/vladris/mental-poker-toolkit/.

Mental Poker Part 9: Discard Game

Thu, 18 Jul 2024 00:00:00 -0700

For an overview on Mental Poker, see Mental Poker Part 0: An Overview. Other articles in this series: https://vladris.com/writings/index.html#mental-poker. In the previous post in the series we looked at building a simple game of rock-paper-scissors. In this post we'll look at implementing a card game.

Overview

We'll build a discard game - players take turns discarding a card that must match either the suit or the value of the card on top of the discard pile. The player who discards their whole hand first wins.

We're implementing a simple game as the focus is not on the game-specific logic, rather how to leverage the Mental Poker toolkit.

The full code for this is in the demos/discard app. The best way to read this post is side by side with the code.

We'll follow a similar structure to the rock-paper-scissors game we looked at in the previous post:

A model implementing the game logic.
A Redux store maintaining game state.
A React UI bound to the store.

Model

First, let's look at how we implement the deck of cards and associated logic.

Deck

We'll represent a card as a string, for example "9:hearts" is the 9 of hearts. The function getDeck() initializes an unshuffled deck of cards:

function getDeck() {
    const deck: string[] = [];

    for (const value of ["9", "10", "J", "Q", "K", "A"]) {
        for (const suit of ["hearts", "diamonds", "clubs", "spades"]) {
            deck.push(value + ":" + suit);
        }
    }

    return deck;
}

We're using fewer cards (from 9 to Aces) for this demo as the more cards we have the more prime numbers we need to find to encrypt them and it slows things down. Rather than implementing some loading UI, we'll just use fewer cards for the example.

We need a helper function to tell us whether two cards match either value or suit:

function matchSuitOrValue(a: string, b: string) {
    const [aValue, aSuit] = a.split(":");
    const [bValue, bSuit] = b.split(":");

    return aValue === bValue || aSuit === bSuit;
}

Finally, we want a class to wrap a deck and implement the functions needed for using it:

class Deck {
    private myCards: number[] = [];
    private othersCards: number[] = [];
    private drawPile: number[] = [];
    private discardPile: number[] = [];

    private decryptedCards: (string | undefined)[] = [];
    private othersKeys: SRAKeyPair[] = [];

    constructor(
        private encryptedCards: string[],
        private myKeys: SRAKeyPair[],
        private store: RootStore
    ) {
        this.drawPile = encryptedCards.map((_, i) => i);
    }
...

We initialize the class with an array of encrypted cards (the shuffled deck) as encryptedCards, our set of SRA keys (myKeys) and the Redux store (store).

We also need to track cards (by index):

The cards in our hand (myCards).
The cards in the other player's hand (othersCards).
The draw pile (drawPile).
The discard pile (discardPile).

As the other player shares their encryption keys (when they reveal a card to us), we'll store them in the othersKeys array. Similarly, as we decrypt cards, we'll store them in decryptedCards - this is just for convenience, so we don't have to keep decrypting the same values over and over.

We assume we're starting with a shuffled deck of cards as a draw pile, with no player having cards in hand - so we initialize drawPile to the indexes of encryptedCards.

Some helper functions:

...
    
    getKey(index: number) {
        return SRAKeySerializationHelper.serializeSRAKeyPair(
            this.myKeys[index]
        );
    }

    getKeyFromHand(index: number) {
        return SRAKeySerializationHelper.serializeSRAKeyPair(
            this.myKeys[this.myCards[index]]
        );
    }

    cardAt(index: number) {
        if (!this.decryptedCards[index]) {
            const partial = SRA.decryptString(
                this.encryptedCards[index],
                this.myKeys[index]
            );

            this.decryptedCards[index] = SRA.decryptString(
                partial,
                this.othersKeys[index]
            );
        }

        return this.decryptedCards[index]!;
    }

    getDrawIndex() {
        return this.drawPile[0];
    }
    
    canIMove() {
        if (this.discardPile.length === 0) {
            return true;
        }

        return (
            this.drawPile.length > 0 ||
            this.myCards.some((index) =>
                matchSuitOrValue(
                    this.cardAt(index),
                    this.cardAt(this.discardPile[this.discardPile.length - 1])
                )
            )
        );
    }
    ...

These are pretty self-explanatory:

getKey() returns our SRA key at index.
getKeyFromHand() returns our SRA key for a card in our hand (at index).
cardAt() returns the decrypted card at index. This assumes we can decrypt the card. If we are already storing it in decryptedCards, we return it from there; otherwise we decrypt it using our key and the other player's key, then store it in decryptedCards.
getDrawIndex() returns the index at the top of the draw pile.
canIMove() returns true if we can make a move. If the discard pile is empty, we can discard anything; otherwise we need either a non-empty draw pile (so we can draw a card) or at least one card in our hand that matches the suit or value of the card on top of the discard pile.

We also need to implement some functions that mutate the deck (in which case we also need to update our view-model so our UI reflects the changes):

...
    async myDraw(serializedSRAKeyPair: SerializedSRAKeyPair) {
        const index = this.drawPile.shift()!;
        this.myCards.push(index);
        this.othersKeys[index] =
            SRAKeySerializationHelper.deserializeSRAKeyPair(
                serializedSRAKeyPair
            );

        await this.updateViewModel();
    }

    async othersDraw() {
        this.othersCards.push(this.drawPile.shift()!);

        await this.updateViewModel();
    }

    async myDiscard(index: number) {
        const cardIndex = this.myCards.splice(index, 1)[0];
        this.discardPile.push(cardIndex);

        await this.updateViewModel();
    }

    async othersDiscard(
        index: number,
        serializedSRAKeyPair: SerializedSRAKeyPair
    ) {
        const cardIndex = this.othersCards.splice(index, 1)[0];
        this.othersKeys[cardIndex] =
            SRAKeySerializationHelper.deserializeSRAKeyPair(
                serializedSRAKeyPair
            );
        this.discardPile.push(cardIndex);

        await this.updateViewModel();
    }
    ...

The actions are:

myDraw() - we draw a card from the top of the draw pile. We need the other player's key for this card, given as the serializedSRAKeyPair argument.
othersDraw() - other player draws a card from the top of the draw pile. Note the Deck class just maintains state, so is not responsible for sharing our key for that card with the other player - rather we just update the state (othersCards and drawPile).
myDiscard() - we discard a card. We take the index of the card as an argument.
othersDiscard() - other player discards a card. We take the index of the card and the other player's SRA key as arguments.

Note all these functions end up calling updateViewModel(). That's because all of the functions change state, so we need to update our Redux store and reflect the changes on the UI:

...
    private async updateViewModel() {
        await this.store.dispatch(
            updateDeckViewModel({
                drawPile: this.drawPile.length,
                discardPile: this.discardPile.map((i) => this.cardAt(i)),
                myCards: this.myCards.map((i) => this.cardAt(i)),
                othersHand: this.othersCards.length,
            })
        );
    }
}

We haven't looked at the Redux store yet. We'll cover this later on but here we dispatch a deck view-model update. The deck view-model contains the size of the draw pile, the cards in the discard pile and our hand, and the number of cards in the other player's hand.

type DeckViewModel = {
    drawPile: number;
    discardPile: string[];
    myCards: string[];
    othersHand: number;
};

const defaultDeckViewModel: DeckViewModel = {
    drawPile: 0,
    discardPile: [],
    myCards: [],
    othersHand: 0,
};

This is all the deck management logic we need. Let's move on to game actions.

Dealing

We'll be using the library-provided shuffle. We covered this in part 6 so we won't go over it again. This is exposed as a shuffle() function. So assuming our deck is shuffled, the first action we need to handle is dealing cards. In Mental Poker, dealing a card to Bob means Alice needs to share her key to that card. Then Bob can use his key and Alice's key to see the card, while Alice cannot see it since she doesn't have Bob's key. This is the equivalent of Bob holding a card in his hand.
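
To illustrate who can see what after a deal, here is a toy sketch with tiny numbers. The prime, the key values and the names are assumptions for illustration only. The real game uses the toolkit's SRA implementation with large primes and one key pair per card.

```typescript
// Toy numbers only: a real game uses large primes and BigInt math.
const PRIME = 1009;

// Modular exponentiation by repeated squaring.
function powMod(base: number, exp: number, mod: number): number {
    let result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp % 2 === 1) result = (result * base) % mod;
        base = (base * base) % mod;
        exp = Math.floor(exp / 2);
    }
    return result;
}

// Hypothetical key pairs where enc * dec ≡ 1 (mod PRIME - 1).
const aliceKey = { enc: 5, dec: 605 };
const bobKey = { enc: 11, dec: 275 };

// A card in the shuffled deck is encrypted by both players.
const plainCard = 42;
const encryptedCard = powMod(powMod(plainCard, aliceKey.enc, PRIME), bobKey.enc, PRIME);

// Dealing to Bob: Alice shares her key for this card. Bob now holds both
// keys and can fully decrypt it.
const bobSees = powMod(powMod(encryptedCard, aliceKey.dec, PRIME), bobKey.dec, PRIME);

// Alice only has her own key, so the best she can do is strip her layer.
// What remains is still encrypted with Bob's key.
const aliceSees = powMod(encryptedCard, aliceKey.dec, PRIME);
```

Bob recovers the original card value, while Alice is left with a value that is still locked by Bob's key.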

We define a DealAction:

type DealAction = {
    clientId: ClientId;
    type: "DealAction";
    cards: number[];
    keys: SerializedSRAKeyPair[];
}

Here, cards are the indexes of the cards in the deck and keys are the corresponding SRA keys for each card. Here's the state machine for dealing cards to both players:

async function deal(imFirst: boolean, count: number) {
    const queue = store.getState().queue.value!;

    await store.dispatch(updateGameStatus("Dealing"));

    const cards = new Array(count).fill(0).map((_, i) => imFirst ? i + count : i);
    const keys = cards.map((card) => store.getState().deck.value!.getKey(card)!);

    await sm.run(sm.sequence([
        sm.local(async (queue: IQueue<Action>, context: RootStore) => {
            await queue.enqueue({ 
                clientId: context.getState().id.value, 
                type: "DealAction",
                cards,
                keys });
        }),
        sm.repeat(sm.transition(async (action: DealAction, context: RootStore) => {
            if (action.type !== "DealAction") {
                throw new Error("Invalid action type");
            }

            if (action.clientId === context.getState().id.value) {
                return;
            }

            const deck = context.getState().deck.value!;

            for (let i = 0; i < action.cards.length; i++) {
                if (imFirst) {
                    if (action.cards[i] !== i) {
                        throw new Error("Unexpected card index");
                    }
                    await deck.myDraw(action.keys[i]);
                } else {
                    await deck.othersDraw();
                }
            }

            for (let i = 0; i < action.cards.length; i++) {
                if (imFirst) {
                    await deck.othersDraw();
                } else {
                    if (action.cards[i] !== i + action.cards.length) {
                        throw new Error("Unexpected card index");
                    }
                    await deck.myDraw(action.keys[i]);
                }
            }
        }), 2)
    ]), queue, store);
}

In preparation of dealing, we:

Get the async queue from the store.
Update the game status to Dealing (more details on this later).
Determine which cards the other player needs: if we're first, we get the first count cards so the other player gets the next count ones; otherwise they get the first count cards and we get the next ones.
Get the set of keys we need to share with the other player so they can decrypt the cards they are dealt.

With this done, our state machine consists of:

A local transition: we enqueue a DealAction with the cards and keys we determined the other player gets.
A remote transition, repeated twice: we expect to see two DealAction actions. If we see the one we sent out (the clientId matches our clientId) we can ignore it. If we see the DealAction from the other player, we update the deck. If we are first to draw, then we call deck.myDraw() count times, then deck.othersDraw() count times; otherwise we do it the other way around - call deck.othersDraw() count times, then call deck.myDraw() count times.
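
The index bookkeeping is easy to get backwards, so here is a small sketch of it in isolation. cardsToShare() mirrors the expression used in deal(), while cardsToReceive() is a hypothetical helper added only to show the symmetry.

```typescript
// Indexes of the cards whose keys we send to the other player. These are the
// cards the other player is dealt.
function cardsToShare(imFirst: boolean, count: number): number[] {
    return Array.from({ length: count }, (_, i) => (imFirst ? i + count : i));
}

// Indexes we expect keys for from the other player. These are the cards we
// are dealt, which is exactly what the other player computes as their share.
function cardsToReceive(imFirst: boolean, count: number): number[] {
    return cardsToShare(!imFirst, count);
}

// With 3 cards each, the first player is dealt cards 0-2 and shares keys for 3-5.
const firstShares = cardsToShare(true, 3);     // [3, 4, 5]
const firstReceives = cardsToReceive(true, 3); // [0, 1, 2]
```

The checks in the remote transition (action.cards[i] !== i and action.cards[i] !== i + action.cards.length) verify the incoming DealAction against these expected indexes.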

Local transitions and remote transitions are explained in part 5, in which we talked about the state machine.

Drawing cards

Drawing a card is a two-step process. We need to tell the other player we intend to draw a card (from the draw pile), and they need to give us their key to that card. Similarly, if the other player tells us they want to draw a card, we give them our key to that card.

We need two actions:

type DrawRequestAction = {
    clientId: ClientId;
    type: "DrawRequest";
    cardIndex: number;
}

type DrawResponseAction = {
    clientId: ClientId;
    type: "DrawResponse";
    cardIndex: number;
    key: SerializedSRAKeyPair;
}

If we want to draw a card, here is our state machine:

async function drawCard() {
    const queue = store.getState().queue.value!;

    await store.dispatch(updateGameStatus("Waiting"));

    await sm.run([
        sm.local(async (queue: IQueue<Action>, context: RootStore) => {
            await queue.enqueue({ 
                clientId: context.getState().id.value, 
                type: "DrawRequest",
                cardIndex: context.getState().deck.value!.getDrawIndex() });
        }),
        sm.transition((action: DrawRequestAction) => {
            if (action.type !== "DrawRequest") {
                throw new Error("Invalid action type");
            }
        }),
        sm.transition(async (action: DrawResponseAction, context: RootStore) => {
            if (action.type !== "DrawResponse") {
                throw new Error("Invalid action type");
            }

            await context.getState().deck.value!.myDraw(action.key);
        }),
    ], queue, store);

    await store.dispatch(updateGameStatus("OthersTurn"));
    await waitForOpponent();
}

We again get the async queue from the store and update the game status. Then we run the state machine consisting of 3 transitions:

A local transition in which we post our DrawRequest action.
A remote transition in which we expect to see our DrawRequest.
A remote transition in which we expect the other player to respond with a DrawResponse action, giving us the key and allowing us to draw a card.

Finally, after running the state machine and drawing the card, we update the game status again to other player's turn and call waitForOpponent(), which we'll cover later.

This fully implements us drawing a card from the top of the draw pile and updating the deck.

Discarding cards

Similar to drawing cards, we need to implement discarding cards. Discarding a card is easier - we don't need a key from the other player, rather we just provide the key to the card we're discarding such that the other player can see it.

type DiscardRequestAction = {
    clientId: ClientId;
    type: "DiscardRequest";
    cardIndex: number;
    key: SerializedSRAKeyPair;
}

Our DiscardRequestAction contains the card index and our key.

The corresponding state machine:

async function discardCard(index: number) {
    const queue = store.getState().queue.value!;

    await store.dispatch(updateGameStatus("Waiting"));

    await sm.run([
        sm.local(async (queue: IQueue<Action>, context: RootStore) => {
            await queue.enqueue({
                clientId: context.getState().id.value, 
                type: "DiscardRequest",
                cardIndex: index,
                key: context.getState().deck.value!.getKeyFromHand(index)});
        }),
        sm.transition(async (action: DiscardRequestAction, context: RootStore) => {
            if (action.type !== "DiscardRequest") {
                throw new Error("Invalid action type");
            }

            await context.getState().deck.value!.myDiscard(action.cardIndex);
        }),
    ], queue, store);

    if (store.getState().deckViewModel.value.myCards.length === 0) {
        await store.dispatch(updateGameStatus("Win"));
    } else {
        await store.dispatch(updateGameStatus("OthersTurn"));
        await waitForOpponent();
    }
}

As usual, we get the queue and update the game status. Then we run the state machine:

A local transition posts a DiscardRequest with the card index and key.
A remote transition in which we should see our own DiscardRequest - since this round-tripped, we can now update the deck.

After running the state machine, we need to check whether we discarded the last card in our hand. If we did, we can update the game state to us winning. Otherwise we wait for the other player's move.

Can't move

The last action we need to look at is the situation in which we can't discard any card (no matching suit or value) and we also can't draw a card (draw pile is empty). In this case we lose the game. Since it is our turn, we need to let the other player know that we're not pondering our next move, rather that we can't do anything and we lose. We'll model this as a simple CantMoveAction:

type CantMoveAction = {
    clientId: ClientId;
    type: "CantMove";
}

This action has no payload. The state machine is also very simple:

async function cantMove() {
    const queue = store.getState().queue.value!;

    await queue.enqueue({ 
        clientId: store.getState().id.value, 
        type: "CantMove" });

    await store.dispatch(updateGameStatus("Loss"));
}

At the end of it, we update the game status to us losing.

So far, we have the 3 possible actions we can take when it is our turn:

Draw a card (via drawCard()).
Discard a card (via discardCard()).
Can't draw, can't discard (via cantMove()).

Next, we need to model responding to the other player's move.

Opponent's turn

The opponent can take the same actions as we can, so we don't need to declare any new action types, rather we need a state machine that responds to actions incoming from the other player:

async function waitForOpponent() {
    const queue = store.getState().queue.value!;

    const othersAction = await queue.dequeue();

    switch (othersAction.type) {
        case "DrawRequest":
            await sm.run([
                sm.local(async (queue: IQueue<Action>, context: RootStore) => {
                    if (othersAction.cardIndex !== store.getState().deck.value!.getDrawIndex()) {
                        throw new Error("Invalid card index for draw");
                    }

                    await queue.enqueue({
                        clientId: store.getState().id.value,
                        type: "DrawResponse",
                        cardIndex: othersAction.cardIndex,
                        key: store.getState().deck.value!.getKey(othersAction.cardIndex)
                    })}),
                sm.transition(async (action: DrawResponseAction, context: RootStore) => {
                    if (action.type !== "DrawResponse") {
                        throw new Error("Invalid action type");
                    }

                    await context.getState().deck.value!.othersDraw();
                })], queue, store);
            await store.dispatch(updateGameStatus("MyTurn"));
            break;
        case "DiscardRequest":
            await store.getState().deck.value!.othersDiscard(othersAction.cardIndex, othersAction.key);

            if (store.getState().deckViewModel.value.othersHand === 0) {
                await store.dispatch(updateGameStatus("Loss"));
            } else if (store.getState().deck.value?.canIMove()) {
                await store.dispatch(updateGameStatus("MyTurn"));
            } else {
                await cantMove();
            }

            break;
        case "CantMove":
            await store.dispatch(updateGameStatus("Win"));
            break;
        }
}

We dequeue an action, then we respond based on its type:

If this is a DrawRequest, we send a DrawResponse. We implement this as a simple state machine with a local transition (our DrawResponse) and a remote transition in which we expect to see our response round-tripped. We also check to ensure the draw request card index matches the top of the draw pile (otherwise the other player might trick us and draw some other card).
If this is a DiscardRequest, we update the deck. If the other player discarded their last card, we lose. Otherwise, if we can move, we update game status to MyTurn and let the user pick which card to discard etc. But if we can't move - can't discard anything, can't draw, then we automatically call cantMove() to mark the fact we lost.
If this is a CantMove, the other player lost so we update game status to Win.

Note for the discard request, to keep things simple, we aren't checking whether the move is legal. If we want to secure the implementation, we should check that the card the other player is discarding matches either the suit or value of the card on top of the discard pile.
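
A minimal sketch of what such a check could look like, assuming cards are encoded as "value:suit" strings (the format the UI code in this post splits on). The isLegalDiscard() name is hypothetical and not part of the toolkit:

```typescript
// Hypothetical helper: cards are assumed to be "value:suit" strings, e.g. "9:hearts".
function isLegalDiscard(card: string, discardPile: string[]): boolean {
    // Any card may start an empty discard pile.
    if (discardPile.length === 0) {
        return true;
    }

    const [value, suit] = card.split(":");
    const [topValue, topSuit] = discardPile[discardPile.length - 1].split(":");

    // Legal if either the value or the suit matches the top of the pile.
    return value === topValue || suit === topSuit;
}
```

Where exactly this would hook into the DiscardRequest handling depends on when the discarded card is decrypted, which this sketch does not cover.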

Actions and status

We already covered all possible actions:

type Action = DealAction | DrawRequestAction | DrawResponseAction | DiscardRequestAction | CantMoveAction;

The possible game status:

type GameStatus = "Waiting" | "Shuffling" | "Dealing" | "MyTurn" | "OthersTurn" | "Win" | "Loss" | "Draw";

We just implemented all the game logic - the possible actions a player can take, and the request/response needed to model the game of discard. We have the full model, so let's move on to the Redux store.

Store

Like in the previous post, we will be using Redux and the Redux Toolkit.

The state we'll be maintaining:

Our ID.
The other player's ID.
The Mental Poker async queue we implement the game on top of.
The game status (GameStatus in our model).
The deck (represented by an instance of Deck).
The deck view-model (providing just enough data to bind to the UI).

Using createAction from the Redux Toolkit:

const updateId = createAction<string>("id/update");
const updateOtherPlayer = createAction<string>("otherPlayer/update");
const updateQueue = createAction<IQueue<Action>>("queue/update");
const updateGameStatus = createAction<GameStatus>("gameStatus/update");
const updateDeck = createAction<Deck>("deck/update");
const updateDeckViewModel = createAction<DeckViewModel>("deckViewModel/update");

We'll also use the same helper to create Redux reducers as for rock-paper-scissors:

function makeUpdateReducer<T>(
    initialValue: T,
    updateAction: ReturnType<typeof createAction>
) {
    return createReducer({ value: initialValue }, (builder) => {
        builder.addCase(updateAction, (state, action) => {
            state.value = action.payload;
        });
    });
}

Our Redux store is:

const store = configureStore({
    reducer: {
        id: makeUpdateReducer("", updateId),
        otherPlayer: makeUpdateReducer("Not joined", updateOtherPlayer),
        queue: makeUpdateReducer<IQueue<Action> | undefined>(
            undefined,
            updateQueue
        ),
        gameStatus: makeUpdateReducer("Waiting", updateGameStatus),
        deck: makeUpdateReducer<Deck | undefined>(undefined, updateDeck),
        deckViewModel: makeUpdateReducer<DeckViewModel>(defaultDeckViewModel, updateDeckViewModel),
    },
    middleware: (getDefaultMiddleware) =>
        getDefaultMiddleware({
            serializableCheck: false,
        }),
});

This is all we need to connect the model with the view.

UI

We'll use React.

Card

The first component we need is a card:

type CardViewProps = {
    card: string | undefined;
    onClick?: () => void;
};

const suiteMap = new Map([
    ["hearts", "♥"],
    ["diamonds", "♦"],
    ["clubs", "♣"],
    ["spades", "♠"]
]);

const CardView: React.FC<CardViewProps> = ({ card, onClick }) => {
    const number = card?.split(":")[0];
    const suite = card ? suiteMap.get(card.split(":")[1]) : undefined;
    const color = suite === "♥" || suite === "♦" ? "red" : "black";

    return <div style={{ width: 70, height: 100, borderColor: "black", borderWidth: 1, borderStyle: "solid", borderRadius: 5, 
                    backgroundColor: card ? "white" : "darkred"}} onClick={onClick}>
        <div style={{ display: card ? "block" : "none", paddingLeft: 15, paddingRight: 15, color }}>
            <p style={{ marginTop: 20, marginBottom: 0, textAlign: "left", fontSize: 25 }}>{number}</p>
            <p style={{ marginTop: 0, textAlign: "right", fontSize: 30 }}>{suite}</p>
        </div>
    </div>
}

This renders a card which can be a string or undefined. If it is a string, we render the value and suit. Otherwise we render the back of the card - a dark red rectangle. Cards have an optional onClick() event.

Hand

A HandView renders several cards:

type HandViewProps = {
    prefix: string;
    cards: (string | undefined)[];
    onClick?: (index: number) => void;
};

const HandView: React.FC<HandViewProps> = ({ cards, prefix, onClick }) => {
    return <div style={{ display: "flex", flexDirection: "row", justifyContent: "center" }}>{
            cards.map((card, i) => <CardView key={prefix + ":" + i} card={ card } onClick={() => { if (onClick) { onClick(i) } }} />)
        }
    </div>
}

This can be the player's hand, where we should have string values for each card and an onClick() event hooked up for when the player clicks on a card to discard it. It can also be the other player's hand, in which case we should have undefined values for each card and just show their backs.

Table

MainView implements a view of the whole table:

const useSelector: TypedUseSelectorHook<RootState> = useReduxSelector;

const MainView = () => {
    const idSelector = useSelector((state) => state.id);
    const otherPlayer = useSelector((state) => state.otherPlayer);
    const gameStateSelector = useSelector((state) => state.gameStatus);
    const deckViewModel = useSelector((state) => state.deckViewModel);

    const myTurn = gameStateSelector.value === "MyTurn";

    const canDiscard = (index: number) => {
        if (deckViewModel.value.discardPile.length === 0) {
            return true;
        }

        return matchSuitOrValue(
            deckViewModel.value.myCards[index],
            deckViewModel.value.discardPile[deckViewModel.value.discardPile.length - 1]);
    }

    return <div>
        <div>
            <p>Id: {idSelector.value}</p>
            <p>Other player: {otherPlayer.value}</p>
            <p>Status: {gameStateSelector.value}</p>
        </div>
        <div style={{ height: 200, textAlign: "center" }}>
            <HandView prefix={"others"} cards={ new Array(deckViewModel.value.othersHand).fill(undefined) } />
        </div>
        <div style={{ height: 200, display: "flex", flexDirection: "row", justifyContent: "center" }}>
            <div style={{ display: deckViewModel.value.drawPile > 0 ? "block" : "none", margin: 5 }} onClick={() => { if (myTurn) { drawCard()} }}>
                <span>{deckViewModel.value.drawPile} card{deckViewModel.value.drawPile !== 1 ? "s" : ""}</span>
                <CardView card={ undefined } />
            </div>
            <div style={{ display: deckViewModel.value.discardPile.length > 0 ? "block" : "none", margin: 5 }}>
                <span>{deckViewModel.value.discardPile.length} card{deckViewModel.value.discardPile.length !== 1 ? "s" : ""}</span>
                <CardView card={ deckViewModel.value.discardPile[deckViewModel.value.discardPile.length - 1] } />
            </div>
        </div>
        <div style={{ height: 200, textAlign: "center" }}>
            <HandView
                prefix={"mine"}
                cards={ deckViewModel.value.myCards }
                onClick={(index) => { if (myTurn && canDiscard(index)) { discardCard(index) } }} />
        </div>
    </div>
}

This consists of:

A top display showing our ID, the other player's ID, and the game status.
The other player's hand (we'll only see the back of the cards).
The draw pile - if there are no more cards in the draw pile, we don't show anything; otherwise we show the back of a card and the number of cards in the pile.
The discard pile - if nothing has been discarded yet, we don't show anything; otherwise we show the card on top of the discard pile and the number of cards in the pile.
Our hand.

If it is our turn, we hook up drawCard() to the draw pile's onClick() and for each card we can discard, we hook up discardCard() to the card's onClick().

And that's it. Rendering it all on the page:

const root = ReactDOM.createRoot(document.getElementById("root")!);
root.render(
    <Provider store={store}>
        <MainView />
    </Provider>
);

Here, Provider comes from the react-redux package and makes the Redux store available to the React components.

Initialization

Like with rock-paper-scissors, let's look at how we initialize the game:

getLedger<Action>().then(async (ledger) => {
    const id = randomClientId();

    await store.dispatch(updateId(id));

    const queue = await upgradeTransport(2, id, ledger);

    await store.dispatch(updateQueue(queue));

    for (const action of ledger.getActions()) {
        if (action.clientId !== id) {
            await store.dispatch(updateOtherPlayer(action.clientId));
            break;
        }
    }

    const [sharedPrime, turnOrder] = await establishTurnOrder(2, id, queue);

    await store.dispatch(updateGameStatus("Shuffling"));

    const [keys, deck] = await shuffle(id, turnOrder, sharedPrime, getDeck(), queue, 64);
 
    const imFirst = turnOrder[0] === id;

    await store.dispatch(updateDeck(new Deck(deck, keys, store)));

    await deal(imFirst, 5);

    await store.dispatch(updateGameStatus(imFirst ? "MyTurn" : "OthersTurn"));

    if (!imFirst) {
        await waitForOpponent();
    }
});

We connect to the Fluid session and get a reference to the ledger, as we saw in part 7.
We generate a random client ID (using the implementation in packages/primitives/src/randomClientId.ts).
Update our ID in the Redux store.
We call upgradeTransport() (also discussed in part 7).
Update the Redux store with a reference to the async queue.
We retrieve and store the other player's ID.
We get the shared prime and establish turn order (also covered in part 7).
We update the game status to Shuffling.
We shuffle the deck using the shuffle() primitive and get back our keys and encrypted cards.
Determine whether we are first (based on established turn order) and store this in imFirst.
We instantiate a Deck and store in the Redux store.
Deal 5 cards to each player using deal().
Update state again, based on whether we are first or not to MyTurn or OthersTurn.
If we're not first to play, call waitForOpponent().

This initialization is a bit longer than the one for rock-paper-scissors, since we have to shuffle and deal cards, and the order in which the players go is important.

Summary

We looked at implementing a discard card game using the Mental Poker toolkit. The full source code for the demo is under demos/discard.

Instructions on how to run the game are in README.md.
Deck management is implemented in deck.ts.
The rest of the model is implemented in model.ts.
The Redux store is implemented in store.ts.
The React components are here: cardView.tsx, handView.tsx, mainView.tsx.
Initialization happens in index.tsx.

We finally put the whole toolkit to its intended use and built an end-to-end interactive, 2-player card game.

Mental Poker Part 8: Rock-Paper-Scissors

Mon, 24 Jun 2024 00:00:00 -0700

Mental Poker Part 8: Rock-Paper-Scissors

For an overview on Mental Poker, see Mental Poker Part 0: An Overview. Other articles in this series: https://vladris.com/writings/index.html#mental-poker. In the previous post in the series we looked at some low-level building blocks. In this post, we'll finally see how to implement a game end-to-end using the toolkit. We'll start with a simple game: rock-paper-scissors.

Overview

We'll build this game as a React app, using the toolkit. We'll be using Redux for state management - Redux provides a good way of binding game state to the UI, which works well with our toolkit.

The full code for this is in the demos/rock-paper-scissors app.

Model

Since we got a lot of the primitives out of the way in the previous post (Fluid connection, getting a SignedTransport etc.), in this post we can focus on the higher level semantics of modeling the game.

We'll play a round of rock-paper-scissors as follows:

Both players post their selection (rock or paper or scissors) encrypted.
Both players reveal their decryption key.

This 2-step approach protects against cheating: before the game proceeds, both players need to make a selection. But the other player doesn't know what the selection is until the decryption key is provided. Note for this particular game, turn order doesn't matter.
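
To make the two steps concrete, here is a rough sketch using a toy SRA-style cipher (encrypt with m^e mod p, decrypt with the inverse exponent). The tiny prime, the numeric encoding of the selections, and the commit() and reveal() names are all assumptions made for illustration; the toolkit's SRA implementation works with large primes and strings, and this toy version is not secure:

```typescript
// Toy SRA-style cipher: encrypt with m^e mod p, decrypt with c^d mod p,
// where e * d = 1 (mod p - 1). The tiny prime keeps the arithmetic within
// JavaScript's safe integer range; it is for illustration only and NOT secure.
const P = 65537;

function modPow(base: number, exp: number, mod: number): number {
    let result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp % 2 === 1) {
            result = (result * base) % mod;
        }
        base = (base * base) % mod;
        exp = Math.floor(exp / 2);
    }
    return result;
}

// Modular inverse via the extended Euclidean algorithm.
function modInverse(a: number, m: number): number {
    let [oldR, r] = [a, m];
    let [oldS, s] = [1, 0];
    while (r !== 0) {
        const q = Math.floor(oldR / r);
        [oldR, r] = [r, oldR - q * r];
        [oldS, s] = [s, oldS - q * s];
    }
    if (oldR !== 1) {
        throw new Error("Key is not invertible");
    }
    return ((oldS % m) + m) % m;
}

const selections = ["Rock", "Paper", "Scissors"] as const;
type Selection = (typeof selections)[number];

// Step 1: post only the encrypted selection.
function commit(selection: Selection, e: number): number {
    // Encode as 2, 3, 4 so no plaintext is 0 or 1 (which encrypt to themselves).
    return modPow(selections.indexOf(selection) + 2, e, P);
}

// Step 2: once the key is revealed, anyone can decrypt the earlier post.
function reveal(commitment: number, e: number): Selection {
    const d = modInverse(e, P - 1);
    return selections[modPow(commitment, d, P) - 2];
}
```

With a key such as e = 17 (any odd exponent is invertible modulo P - 1 here), a player posts commit("Paper", 17) first and shares 17 only after both posts are on the ledger, at which point reveal() recovers the selection.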

We'll start with a few type definitions:

type PlaySelection = "Rock" | "Paper" | "Scissors";

type EncryptedSelection = string;

PlaySelection represents the possible plays, EncryptedSelection is the string representation of an encrypted PlaySelection.

Our game model will have 2 actions:

type PlayAction = {
    clientId: ClientId;
    type: "PlayAction";
    encryptedSelection: EncryptedSelection;
};

type RevealAction = {
    clientId: ClientId;
    type: "RevealAction";
    key: SerializedSRAKeyPair;
};

type Action = PlayAction | RevealAction;

PlayAction is the first step, when players post their encrypted choice. RevealAction is the second step, revealing the encryption key. We'll use the SRA algorithm for encryption since we have it in our toolkit, but for this game any encryption algorithm would work.

We'll also need a couple more type definitions for the game state:

type GameStatus = "Waiting" | "Ready" | "Win" | "Loss" | "Draw";

type PlayValue =
    | { type: "Selection"; value: PlaySelection }
    | { type: "Encrypted"; value: EncryptedSelection }
    | { type: "None"; value: undefined };

The GameStatus represents the different states the client can be in:

Waiting for another player to connect or for round to finish.
Ready to play.
Win, Loss, Draw - the result after playing a round.

The PlayValue represents the current state of a player's pick. It can be either an encrypted selection, a revealed selection, or nothing (at the start of the game).
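
As a rough illustration of how the three states can be told apart, the type field lets TypeScript narrow value in each branch. The describePlay() helper below is hypothetical and only meant to show the narrowing; the types are repeated so the sketch stands alone:

```typescript
type PlaySelection = "Rock" | "Paper" | "Scissors";
type EncryptedSelection = string;

type PlayValue =
    | { type: "Selection"; value: PlaySelection }
    | { type: "Encrypted"; value: EncryptedSelection }
    | { type: "None"; value: undefined };

// Hypothetical helper: what a UI could show for each stage of a pick.
function describePlay(play: PlayValue): string {
    switch (play.type) {
        case "None":
            return "No pick yet";
        case "Encrypted":
            // play.value is an EncryptedSelection here; we don't display it.
            return "Picked (hidden)";
        case "Selection":
            // play.value is narrowed to PlaySelection here.
            return `Picked ${play.value}`;
    }
}
```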

Before implementing the game state machine, let's look at the Redux store.

Store

I won't go into the details of Redux in this post - please refer to the Redux documentation for that. We'll be using the Redux Toolkit to streamline setting up our store.

We will maintain 6 pieces of state:

Our ID.
The other player's ID.
The Mental Poker async queue we implement the game on top of.
The game status (GameStatus above).
Our play (PlayValue above).
The other player's play (also a PlayValue).

We'll use the Redux Toolkit createAction helper to define the update functions for these:

const updateId = createAction<string>("id/update");
const updateOtherPlayer = createAction<string>("otherPlayer/update");
const updateQueue = createAction<IQueue<Action>>("queue/update");
const updateGameStatus = createAction<GameStatus>("gameStatus/update");
const updateMyPlay = createAction<PlayValue>("myPlay/update");
const updateTheirPlay = createAction<PlayValue>("theirPlay/update");

We'll also need reducers (another Redux concept) for updating the values. We can implement a helper function to create these:

function makeUpdateReducer<T>(
    initialValue: T,
    updateAction: ReturnType<typeof createAction>
) {
    return createReducer({ value: initialValue }, (builder) => {
        builder.addCase(updateAction, (state, action) => {
            state.value = action.payload;
        });
    });
}
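
For readers less familiar with reducers, the behavior of each generated reducer is roughly "replace the stored value when the matching action arrives, otherwise leave the state alone". Here is a dependency-free sketch of that idea; makePlainUpdateReducer() and its types are hypothetical names, and this is not how Redux Toolkit implements createReducer():

```typescript
// Dependency-free sketch of an "update" reducer; names are hypothetical and
// this is not the Redux Toolkit implementation.
type UpdateAction<T> = { type: string; payload: T };
type ValueState<T> = { value: T };

function makePlainUpdateReducer<T>(initialValue: T, actionType: string) {
    return (
        state: ValueState<T> = { value: initialValue },
        action: UpdateAction<T>
    ): ValueState<T> =>
        // Replace the value when the action matches, otherwise keep the state.
        action.type === actionType ? { value: action.payload } : state;
}

// Usage: an undefined state falls back to the initial value.
const gameStatusReducer = makePlainUpdateReducer("Waiting", "gameStatus/update");
const nextState = gameStatusReducer(undefined, {
    type: "gameStatus/update",
    payload: "Ready",
});
```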

Finally, we set up our Redux store as:

const store = configureStore({
    reducer: {
        id: makeUpdateReducer("", updateId),
        otherPlayer: makeUpdateReducer("Not joined", updateOtherPlayer),
        queue: makeUpdateReducer<IQueue<Action> | undefined>(
            undefined,
            updateQueue
        ),
        myPlay: makeUpdateReducer<PlayValue>(
            { type: "None", value: undefined },
            updateMyPlay
        ),
        theirPlay: makeUpdateReducer<PlayValue>(
            { type: "None", value: undefined },
            updateTheirPlay
        ),
        gameStatus: makeUpdateReducer("Waiting", updateGameStatus),
    },
    middleware: (getDefaultMiddleware) =>
        getDefaultMiddleware({
            serializableCheck: false,
        }),
});

We initialize the store with the default values:

We don't have an ID.
The other player hasn't joined yet.
We don't have an async queue.
Neither player has any play.
The game state is Waiting (for other player to connect).

That's about it for Redux setup - again, I won't cover what reducers are, how Redux manages state changes etc.

Playing a round

We'll implement playing a round of rock-paper-scissors in async function playRound(selection: PlaySelection). We invoke this with our selection (rock, paper, or scissors).

First, we need to get a few references:

const context = store;

await context.dispatch(updateGameStatus("Waiting"));

const queue = context.getState().queue.value!;

const kp = SRA.generateKeyPair(BigIntUtils.randPrime());

First, we get a reference to the Redux store. Then we update the game status to Waiting. We get a reference to the async queue from the Redux store and, finally, we generate an SRA key pair. The generateKeyPair() and randPrime() functions we discussed all the way in part 1, when we covered cryptography. The dispatch() and getState() are standard Redux calls.

Now let's look at the state machine modeling a round. It consists of the following sequence:

Post our encrypted selection.
Expect to receive 2 encrypted selections (ours and the opponent's).
Post our encryption key to reveal our selection.
Expect to receive 2 encryption keys (ours and the opponent's).

We can run this state machine with the Redux store as context:

await sm.run(sm.sequence([
        sm.local(async (queue) => {
            const playAction = {
                clientId: context.getState().id.value,
                type: "PlayAction",
                encryptedSelection: SRA.encryptString(selection, kp),
            };

            await queue.enqueue(playAction);
        }),
        sm.repeat(sm.transition(async (play: PlayAction, context: RootStore) => {
            const action =
                play.clientId === context.getState().id.value
                    ? updateMyPlay
                    : updateTheirPlay;

            await context.dispatch(
                action({ type: "Encrypted", value: play.encryptedSelection })
            );
        }), 2),
        sm.local(async (queue) => {
            const revealAction = {
                clientId: context.getState().id.value,
                type: "RevealAction",
                key: SRAKeySerializationHelper.serializeSRAKeyPair(kp),
            };
            
            await queue.enqueue(revealAction);
        }),
        sm.repeat(sm.transition(async (reveal: RevealAction, context: RootStore) => {
            const action =
                reveal.clientId === context.getState().id.value
                    ? updateMyPlay
                    : updateTheirPlay;
            const originalValue =
                reveal.clientId === context.getState().id.value
                    ? context.getState().myPlay.value
                    : context.getState().theirPlay.value;

            await context.dispatch(
                action({
                    type: "Selection",
                    value: SRA.decryptString(
                        originalValue.value as EncryptedSelection,
                        SRAKeySerializationHelper.deserializeSRAKeyPair(reveal.key)
                    ) as PlaySelection,
                })
            );
        }), 2)
    ]), queue, context);

We first define a local transition - we enqueue our PlayAction.

We then repeat a transition 2 times. We update the Redux store accordingly: if the received client ID is ours, we call updateMyPlay(), otherwise we call updateTheirPlay() with the encrypted value.

Next, we enqueue our RevealAction.

We then again repeat a transition 2 times. If the incoming client ID is ours, we call updateMyPlay() and decrypt the originalValue (myPlay.value) with the received key, otherwise we call updateTheirPlay() and decrypt the originalValue (theirPlay.value) with the received key.

Note how this code updates the Redux store directly, by using it as the context for the state machine.

Once the state machine finishes, we should have both our play and the opponent's play, so we can determine the winner and update the game state accordingly:

const myPlay = context.getState().myPlay.value;
const theirPlay = context.getState().theirPlay.value;

if (myPlay.value === theirPlay.value) {
    await context.dispatch(updateGameStatus("Draw"));
} else if (
    (myPlay.value === "Rock" && theirPlay.value === "Scissors") ||
    (myPlay.value === "Paper" && theirPlay.value === "Rock") ||
    (myPlay.value === "Scissors" && theirPlay.value === "Paper")
) {
    await context.dispatch(updateGameStatus("Win"));
} else {
    await context.dispatch(updateGameStatus("Loss"));
}

And that's it in terms of game mechanics. Finally, let's look at a simple UI for the game.

UI

We'll build the UI using React. First, let's create a component that provides the rock-paper-scissors options as 3 buttons:

type ButtonsViewProps = {
    disabled: boolean;
    onPlay: (play: PlaySelection) => void;
}

const ButtonsView = ({ disabled, onPlay }: ButtonsViewProps) => {
    return <div>
        <button disabled={disabled} onClick={() => onPlay("Rock")} style={{ width: 200 }}>🪨</button>
        <button disabled={disabled} onClick={() => onPlay("Paper")} style={{ width: 200 }}>📄</button>
        <button disabled={disabled} onClick={() => onPlay("Scissors")} style={{ width: 200 }}>✂️</button>
    </div>
}

Our properties are a boolean that determines whether buttons should be enabled or disabled and an onPlay() callback.

Our view is also very simple:

const useSelector: TypedUseSelectorHook<RootState> = useReduxSelector;

const MainView = () => {
    const idSelector = useSelector((state) => state.id);
    const otherPlayer = useSelector((state) => state.otherPlayer);
    const gameStateSelector = useSelector((state) => state.gameStatus);

    return <div>
        <div>
        <p>Id: {idSelector.value}</p>
        <p>Other player: {otherPlayer.value}</p>
        <p>Status: {gameStateSelector.value}</p>
        </div>
        <ButtonsView disabled={gameStateSelector.value === "Waiting"} onPlay={playRound}></ButtonsView>
    </div>
}

The first line is some React-Redux plumbing (via the react-redux package), which allows us to grab data from the Redux store and put it in the UI.

We'll be showing our ID, the other player's ID, the game status, and the 3 buttons. The buttons are enabled as long as the game state is not Waiting. Once the user clicks a button, we simply call the playRound() function we looked at in the previous section.

Rendering all of this on the page:

const root = ReactDOM.createRoot(document.getElementById("root")!);
root.render(
    <Provider store={store}>
        <MainView />
    </Provider>
);

Here, Provider comes from the react-redux package and makes the Redux store available to the React components.

Initialization

We now have all the pieces in place. The only bit of code we haven't covered is initializing the game:

getLedger<Action>().then(async (ledger) => {
    const id = randomClientId();

    await store.dispatch(updateId(id));

    const queue = await upgradeTransport(2, id, ledger);
    
    await store.dispatch(updateQueue(queue));

    for (const action of ledger.getActions()) {
        if (action.clientId !== id) {
            store.dispatch(updateOtherPlayer(action.clientId));
            break;
        }
    }

    await store.dispatch(updateGameStatus("Ready"));
});

The steps are:

We connect to the Fluid session and get a reference to the ledger, as we saw in the previous post.
We generate a random client ID (I'm not covering the randomClientId() function in this post, but you can find the implementation in packages/primitives/src/randomClientId.ts).
Update our ID in the Redux store.
We call upgradeTransport() (also discussed in the previous post).
Update the Redux store with a reference to the async queue.
We retrieve and store the other player's ID.
We update the game status to Ready (from the default, which is Waiting).

The steps are pretty self-explanatory, maybe except getting the other player's ID. The way that works is as follows: getActions() returns all actions posted on the ledger so far. We look for an action where the client ID is different than our client ID and store that as the other player's ID. We are guaranteed to see at least one action from the other player, as we ran upgradeTransport(), which under the hood performs a public key exchange.

And that's it - we have an end-to-end game of rock-paper-scissors.

Summary

We looked at implementing rock-paper-scissors using the Mental Poker toolkit. The full source code for the demo is under demos/rock-paper-scissors.

Instructions on how to run the game are in README.md.
The game model is implemented in model.ts.
The Redux store is implemented in store.ts.
The two React components are buttonsView.tsx and mainView.tsx.
Initialization happens in index.tsx.

Note how easy it is to model a game if we rely on the toolkit's primitives. We implement the game logic in the model, relying on the toolkit's capabilities. We use Redux to store game state, which we can easily bind to a React view. That said, this was a very simple game. In the next post we'll look at implementing a card game.

Mental Poker Part 7: Primitives

Wed, 12 Jun 2024 00:00:00 -0700

Mental Poker Part 7: Primitives

For an overview on Mental Poker, see Mental Poker Part 0: An Overview. Other articles in this series here. In the previous post in the series we saw how to implement shuffling on top of our primitives.

In this post, we'll look at a few other primitives useful for implementing a game on top of this toolkit.

Creating a transport

We talked about Fluid Framework in previous posts. In part 2, we discussed the Fluid ledger, a distributed data structure which forms the basis of our game message exchange. In part 3, we talked about our ITransport interface and how we can implement it given a ledger. We haven't covered how to get a ledger.

Let's go back down the stack, all the way to Fluid Framework. Fluid Framework expects clients to agree on the basic layout of the distributed data structures they're working with. These data structures are packaged in a container. Note this container has nothing to do with Docker containers, it's simply a definition for a set of data structures.

We'll look at a simple implementation of joining a Fluid session and using a container that includes only a ledger. We won't even try to connect to an instance of the Azure Fluid Relay service, rather we'll use a local server. Instructions for connecting to a service hosted in Azure are here. For our local server, we need a stub user and an AzureLocalConnectionConfig including an InsecureTokenProvider - this is all plumbing to connect to a local instance of the Fluid Relay service:

const user = {
    id: "userId",
    name: "userName",
};

const localConnectionConfig: AzureLocalConnectionConfig = {
    type: "local",
    tokenProvider: new InsecureTokenProvider("", user),
    endpoint: "http://localhost:7070",
};

With this connection config, we can now define a simple container containing a Ledger:

export async function getLedger<T>(): Promise<ITransport<T>> {
    const client = new AzureClient({ connection: localConnectionConfig });

    const containerSchema = {
        initialObjects: { myLedger: Ledger },
    };

    let container: IFluidContainer;
    const containerId = window.location.hash.substring(1);
    if (containerId) {
        ({ container } = await client.getContainer(
            containerId,
            containerSchema
        ));
    } else {
        ({ container } = await client.createContainer(containerSchema));
        const id = await container.attach();
        window.location.hash = id;
    }

    const ledger = container.initialObjects.myLedger as Ledger<string>;

    return makeFluidClient(ledger);
}

We check the browser window's URL: if it ends with a GUID, we load the container; if not, we create a new container and add its GUID to the browser window's URL. This makes it easy to connect two local clients to the same session:

We launch our web app and the first client will create a container and get a GUID.
We then copy/paste the URL into a separate tab and the second client will connect to the same session and load the container identified by the GUID.
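
The create-or-join decision boils down to whether the URL carries a container ID after the "#". A small sketch of that check as a standalone function, where containerIdFromUrl() is a hypothetical helper that mirrors the window.location.hash logic in getLedger():

```typescript
// Hypothetical helper mirroring the check in getLedger(): returns the container
// ID stored after "#", or undefined when a new container should be created.
function containerIdFromUrl(url: string): string | undefined {
    const hash = new URL(url).hash.substring(1);
    return hash.length > 0 ? hash : undefined;
}
```

The first tab would see undefined and create a container; a second tab opened with the copied URL would get the ID back and load the same container.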

The code above can be found in the demos/transport package. This is used by the other demo apps. Note you need to run the Fluid Framework local service: npx @fluidframework/azure-local-service@latest.

We now have a simple abstraction, getLedger(), that wraps all the Fluid Framework-specifics and gives us back an ITransport interface (implemented as a FluidTransport).

Upgrading the transport

We are building a turn-based, cryptographically secure game, so the first step is to ensure our channel is secure and clients can't spoof each other.

In part 3 we looked at the ITransport interface, the FluidTransport implementation which leverages the Fluid protocol for communication, and the SignedTransport implementation which wraps the FluidTransport and enhances it with signature verification.

Recap of signing: in cryptography, we do signing using a public/private key pair. These are both generated from a shared seed. Alice can sign a message using her private key and anyone that has the public key, including Bob, can verify that the signature is indeed Alice's.

So given a public/private key pair and some payload $P$, signing is a function that produces a signature given the payload and private key: $sign(P, K_{private}) \rightarrow signature$. Signature verification is a function that takes a payload, signature, and public key and tells us whether the signature was indeed produced by the corresponding private key: $verify(P, signature, K_{public}) \rightarrow true/false$.

The neat thing about public/private key cryptography is that the public key, which is required for validation, is not a secret - only the private key is. Nobody can spoof a signature unless they have the private key (which isn't shared), but everyone with the public key can verify that the signature comes from the private key owner.
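
Here is a small sketch of the sign/verify pair in practice. It assumes Node.js and its built-in crypto module with Ed25519 keys, purely for illustration; the toolkit's own Signing helper may use a different algorithm and API:

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Assumption: Ed25519 via Node's crypto module, not the toolkit's Signing helper.
// Alice generates a key pair; only the public key is shared.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

const payload = Buffer.from("KeyExchange: hello from Alice");

// sign(P, K_private) -> signature
const signature = sign(null, payload, privateKey);

// verify(P, signature, K_public) -> true/false
const valid = verify(null, payload, publicKey, signature);

// A different payload fails verification against the same signature.
const tampered = verify(
    null,
    Buffer.from("KeyExchange: hello from Mallory"),
    publicKey,
    signature
);
```

Anyone holding publicKey can run the verify step, but only the holder of privateKey could have produced a signature that passes it.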

So if we start with a FluidTransport, we need our clients to exchange public keys. Each client generates a public/private key pair, and posts its client ID and public key. We use these to populate the key store.

We can implement this on top of the state machine we saw in part 5. First, we define our action and context. As a reminder, the action is what we send over the wire and expect to receive. The context is an object we make available to the code we run whenever an action appears over the transport.

type KeyExchangeAction = {
    clientId: ClientId;
    type: "KeyExchange";
    publicKey: Key;
};

type CryptoContext = {
    clientId: ClientId;
    me: PublicPrivateKeyPair;
    keyStore: KeyStore;
};

In our case our action contains the ClientId, the type (which is KeyExchange), and a public key. Each client is expected to post this over the transport. The context contains our ClientId (so we can tell whether the message came from us or someone else), our public/private key pair, and the KeyStore in which we put all ClientId-to-Key mappings.

A helper function to create the CryptoContext:

async function makeCryptoContext(clientId: ClientId): Promise<CryptoContext> {
    return {
        clientId,
        me: await Signing.generatePublicPrivateKeyPair(),
        keyStore: new Map<ClientId, Key>(),
    };
}

This leverages the cryptography primitives in our toolkit to generate a public/private key pair.

Our sequence to be executed by the state machine is:

function makeKeyExchangeSequence(players: number) {
    return sm.sequence([
        sm.local(
            async (
                actionQueue: IQueue<KeyExchangeAction>,
                context: CryptoContext
            ) => {
                await actionQueue.enqueue({
                    type: "KeyExchange",
                    clientId: context.clientId,
                    publicKey: context.me.publicKey,
                });
            }
        ),
        sm.repeat(
            sm.transition(
                (action: KeyExchangeAction, context: CryptoContext) => {
                    if (action.type !== "KeyExchange") {
                        throw new Error("Invalid action type");
                    }

                    if (action.clientId === undefined) {
                        throw new Error("Expected client ID");
                    }

                    if (context.keyStore.has(action.clientId)) {
                        throw new Error(
                            "Same client posted key multiple times"
                        );
                    }

                    context.keyStore.set(action.clientId, action.publicKey);
                }
            ),
            players
        ),
    ]);
}

Refer to part 5 for the state machine details and a more in-depth explanation of local actions/transitions etc. Our sequence starts with a local action, meaning originating from our client: we post our client ID and public key. Then, for the given number of players we expect in the session, we repeatedly expect an incoming action of type KeyExchangeAction.

In other words, our protocol requires each client to start by posting their public key, and each client should expect as many such key postings as there are clients in the game.

We handle some error cases:

If the incoming action type is not KeyExchange, one of the clients didn't respect the protocol, so we bail.
If we don't have a client ID, we also bail.
Same if we already saw a key for this client ID - this means either a malicious client is trying to pretend to be another client, or there is a bug in how the protocol was implemented. Regardless, we have to bail.

If we didn't hit any of these issues, then we store the client ID and key in the KeyStore instance. Once the state machine executes this sequence, each client has enough information to create a SignedTransport. Here is a helper function to perform the whole key exchange:

async function keyExchange(
    players: number,
    clientId: ClientId,
    actionQueue: IQueue<BaseAction>
) {
    const context = await makeCryptoContext(clientId);

    const keyExchangeSequence = makeKeyExchangeSequence(players);

    await sm.run(keyExchangeSequence, actionQueue, context);

    return [context.me, context.keyStore] as const;
}

This function takes as input the expected number of players, the ID of this client, and an action queue (as discussed in part 4). The implementation is straight-forward:

We create a context.
We generate a key exchange sequence by calling the function we just saw.
We use our state machine to run the sequence.
We return our public/private key pair and the KeyStore (the key store contains only public keys).

And here is a helper function that upgrades a transport to a signed one:

export async function upgradeTransport<T extends BaseAction>(
    players: number,
    clientId: ClientId,
    transport: ITransport<T>
): Promise<IQueue<T>> {
    const [keyPair, keyStore] = await keyExchange(
        players,
        clientId,
        new ActionQueue(
            transport as unknown as ITransport<BaseAction>,
            true
        )
    );

    return new ActionQueue(
        new SignedTransport(
            transport,
            { clientId, privateKey: keyPair.privateKey },
            keyStore,
            new SignatureProvider()
        )
    );
}

This function takes the number of players, our client ID, and an ITransport which doesn't support signature verification. It executes the key exchange, then creates a SignedTransport since it now has all the pieces needed for that. This function goes a step further, and also initializes an async queue on top of the signed transport.

A game that uses the toolkit can go from start to a queue over a signed transport in 3 steps:

const ledger = await getLedger<Action>();

const id = randomClientId();

const queue = await upgradeTransport(2, id, ledger);

In this example, we call getLedger(), which we discussed in the first part of this post, we generate a unique client ID, then we call upgradeTransport(). With these 3 lines of code, we get an ActionQueue over a SignedTransport.

Establishing turn order and shared large prime

The last primitive we'll look at in this post is another key component of Mental Poker: having clients agree who goes first, and agree on a shared large prime (this shared prime is used to generate SRA keys, as discussed in part 1).

These can be separate steps but we can combine them to be more efficient. To establish turn order, we can leverage the ledger distributed data structure which guarantees all clients get all ops in the same sequence: each client posts something, then we simply use the order in which clients see these posts as the turn order.

Here's a sketch of the state machine for this:

type EstablishTurnOrderAction = BaseAction;

type EstablishTurnOrderContext = {
    clientId: ClientId;
    turnOrder: ClientId[];
};

function makeEstablishTurnOrderSequence(players: number) {
    return sm.sequence([
        sm.local(async (actionQueue: IQueue<EstablishTurnOrderAction>, context: EstablishTurnOrderContext) => {
            await actionQueue.enqueue({
                type: "EstablishTurnOrder",
                clientId: context.clientId,
            });
        }),
        sm.repeat(sm.transition((action: EstablishTurnOrderAction, context: EstablishTurnOrderContext) => {
            if (action.type !== "EstablishTurnOrder") {
                throw new Error("Invalid action type");
            }

            if (context.turnOrder.includes(action.clientId)) {
                throw new Error("Same client posted multiple times");
            }

            context.turnOrder.push(action.clientId); 
        }), players)
    ]);
}

Our EstablishTurnOrderAction is an alias for BaseAction, as it doesn't contain any additional information, just the client ID. The context contains our clientId and the turn order array we need to populate.

The state machine posts our clientId as an action of type EstablishTurnOrder. Then, for the given number of players, we expect an action of this type. We check that the incoming action is of this type, then we check that we don't see the same action coming multiple times from the same client. Finally, we add the received clientId to the turnOrder array.

And that's it - once this executes, all clients will end up with the same turnOrder array and will know whether it is their turn to act, or they should be waiting for another client to take a turn.

We can extend this implementation to also establish a shared prime: each client posts a prime, then the first one to arrive "wins" and becomes the shared prime.

We'll update our EstablishTurnOrderAction to include a prime:

type SerializedPrime = string;

type EstablishTurnOrderAction = BaseAction & { prime: SerializedPrime };

We need to define a SerializedPrime (as a string) to work around the fact that we can't serialize BigInts using JSON.stringify(), which is what we're using to serialize actions.
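
A quick sketch of the problem and the workaround. The bigIntToString() and stringToBigInt() helpers below are hypothetical stand-ins for the toolkit's BigIntUtils functions (which may well encode the number differently), and the prime is just a well-known Mersenne prime used as sample data.

```typescript
type SerializedPrime = string;

// Hypothetical helpers: a plain base-10 round trip between bigint and string.
function bigIntToString(value: bigint): SerializedPrime {
    return value.toString(10);
}

function stringToBigInt(value: SerializedPrime): bigint {
    return BigInt(value);
}

// 2^127 - 1, a known Mersenne prime, used here only as sample data.
const prime = BigInt("170141183460469231731687303715884105727");

// JSON.stringify() throws a TypeError when it encounters a BigInt value.
let stringifyFailed = false;
try {
    JSON.stringify({ prime });
} catch {
    stringifyFailed = true;
}

// Serializing the prime as a string first makes the action JSON-friendly.
const wire = JSON.stringify({
    type: "EstablishTurnOrder",
    clientId: "alice",
    prime: bigIntToString(prime),
});

const received = JSON.parse(wire) as { prime: SerializedPrime };
const roundTripped = stringToBigInt(received.prime);
```

The direct JSON.stringify() call fails, while the string-encoded prime survives the round trip and converts back to the same bigint.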

We extend our context to also include the shared prime:

type EstablishTurnOrderContext = {
    clientId: ClientId;
    prime: bigint | undefined;
    turnOrder: ClientId[];
};

Our state machine also gets updated:

function makeEstablishTurnOrderSequence(players: number) {
    return sm.sequence([
        sm.local(async (actionQueue: IQueue<EstablishTurnOrderAction>, context: EstablishTurnOrderContext) => {
            await actionQueue.enqueue({
                type: "EstablishTurnOrder",
                clientId: context.clientId,
                prime: BigIntUtils.bigIntToString(BigIntUtils.randPrime()),
            });
        }),
        sm.repeat(sm.transition((action: EstablishTurnOrderAction, context: EstablishTurnOrderContext) => {
            if (action.type !== "EstablishTurnOrder") {
                throw new Error("Invalid action type");
            }

            if (context.turnOrder.length === 0) {
                context.prime = BigIntUtils.stringToBigInt(action.prime);
            }

            if (context.turnOrder.find((id) => id === action.clientId)) {
                throw new Error("Same client posted prime multiple times");
            }

            context.turnOrder.push(action.clientId); 
        }), players)
    ]);
}

The only changes are:

When we enqueue our action, we generate a random prime and serialize it (we have a utility function that does this, which I won't describe here).
If our turnOrder array is empty, meaning we just received the first action, we set the prime in the context.

With these changes, after we run this state machine we have both the turn order and a prime all clients agree on.
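
To see why every client ends up with the same result, here is a small simulation, assuming only that all clients receive the same actions in the same order. The replay() function and the sample log are hypothetical: replay() mirrors the transition logic above without the state machine or transport, and the primes are tiny placeholders kept as strings.

```typescript
type ClientId = string;

type EstablishTurnOrderAction = {
    type: "EstablishTurnOrder";
    clientId: ClientId;
    prime: string; // serialized prime
};

type Result = { prime: string; turnOrder: ClientId[] };

// Replays sequenced actions the same way the transition above processes them.
function replay(log: readonly EstablishTurnOrderAction[]): Result {
    let prime: string | undefined;
    const turnOrder: ClientId[] = [];

    for (const action of log) {
        if (turnOrder.length === 0) {
            prime = action.prime; // the first sequenced action "wins"
        }

        if (turnOrder.includes(action.clientId)) {
            throw new Error("Same client posted prime multiple times");
        }

        turnOrder.push(action.clientId);
    }

    if (prime === undefined) {
        throw new Error("No actions received");
    }

    return { prime, turnOrder };
}

// Hypothetical sequenced log: Bob's post happened to be ordered first.
const sequencedLog: EstablishTurnOrderAction[] = [
    { type: "EstablishTurnOrder", clientId: "bob", prime: "101" },
    { type: "EstablishTurnOrder", clientId: "alice", prime: "103" },
];

// Each client processes the same log independently and reaches the same result.
const aliceView = replay(sequencedLog);
const bobView = replay(sequencedLog);
```

Both views contain Bob's prime and the turn order bob, alice. Alice's prime is simply discarded.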

To make calling this easier, we provide a function to initialize the context:

function makeEstablishTurnOrderContext(
    clientId: ClientId
): EstablishTurnOrderContext {
    return {
        clientId,
        prime: undefined,
        turnOrder: [],
    };
}

Then putting it all together:

export async function establishTurnOrder(
    players: number,
    clientId: ClientId,
    actionQueue: IQueue<BaseAction>
) {
    const context = makeEstablishTurnOrderContext(clientId);

    const establishTurnOrderSequence = makeEstablishTurnOrderSequence(players);

    await sm.run(establishTurnOrderSequence, actionQueue, context);

    return [context.prime!, context.turnOrder] as const;
}

We create a context, we create the state machine, then we run it. The function returns the shared prime and the turn order.

Summary

In this post we covered a few primitives or building blocks we can use for building games:

Creating a Fluid transport, and abstracting all the details under a getLedger() function. The code for this is in the demo/transport package, in container.ts.
Upgrading the Fluid transport to a SignedTransport which signs outbound actions and verifies signatures of incoming actions. The code for this is in packages/primitives/upgradeTransport.ts.
Establishing turn order for the players and agreeing on a shared large prime. The code for this is in packages/primitives/establishTurnOrder.ts.

With the primitives out of the way, in the next post we'll look at modeling a game at a high level using the toolkit.

Mental Poker Part 6: Shuffling Implementation

Sun, 07 Apr 2024 00:00:00 -0700

Mental Poker Part 6: Shuffling Implementation

For an overview on Mental Poker, see Mental Poker Part 0: An Overview. Other articles in this series: https://vladris.com/writings/index.html#mental-poker. In the previous post in the series we covered the state machine we use to implement game logic.

We now have all the pieces in place to look at a card shuffling algorithm. Shuffling cards in a game of Mental Poker is one of the key innovations for this type of zero-trust game. We went over the cryptography aspects of shuffling in Part 1.

Let's review the algorithm:

Alice takes a deck of cards (an array), shuffles the deck, generates a secret key $K_A$, and encrypts each card with $K_A$.

Alice hands the shuffled and encrypted deck to Bob. At this point, Bob doesn't know what order the cards are in (since Alice encrypted the cards in the shuffled deck).

Bob takes the deck, shuffles it, generates a secret key $K_B$, and encrypts each card with $K_B$.

Bob hands the deck to Alice. At this point, neither Alice nor Bob knows what order the cards are in. Alice got the deck back reshuffled and re-encrypted by Bob, so she no longer knows where each card ended up. Bob reshuffled an encrypted deck, so he also doesn't know where each card is.

At this point the cards are shuffled. In order to play, Alice and Bob also need the capability to look at individual cards. In order to enable this, the following steps must happen:

Alice decrypts the shuffled deck with her secret key $K_A$. At this point she still doesn't know where each card is, as cards are still encrypted with $K_B$.

Alice generates a new set of secret keys, one for each card in the deck. Assuming a 52-card deck, she generates $K_{A_1} ... K_{A_{52}}$ and encrypts each card in the deck with one of the keys.

Alice hands the deck of cards to Bob. At this point, each card is encrypted by Bob's key, $K_B$, and one of Alice's keys, $K_{A_i}$.

Bob decrypts the cards using his key $K_B$. He still doesn't know where each card is, as now the cards are encrypted with Alice's keys.

Bob generates another set of secret keys, $K_{B_1} ... K_{B_{52}}$, and encrypts each card in the deck.

Now each card in the deck is encrypted with a unique key that only Alice knows and a unique key only Bob knows.

If Alice wants to look at a card, she asks Bob for his key for that card. For example, if Alice draws the first card, encrypted with $K_{A_1}$ and $K_{B_1}$, she asks Bob for $K_{B_1}$. If Bob sends her $K_{B_1}$, she now has both keys to decrypt the card and look at it. Bob still can't decrypt it because he doesn't have $K_{A_1}$.

This way, as long as both Alice and Bob agree that one of them is supposed to see a card, they exchange keys as needed to enable this.
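
As a refresher on why this key exchange works, here is a toy sketch of commutative encryption with tiny numbers. It assumes SRA-style keys where encryption is modular exponentiation over a shared prime. The prime 65537 and the exponents are illustrative assumptions, far too small to be secure, and ToyKey is not the toolkit's SRAKeyPair.

```typescript
// Toy shared prime; real games use a large prime agreed on by the players.
const P = 65537;

// Square-and-multiply modular exponentiation; safe for moduli this small.
function modPow(base: number, exp: number, mod: number): number {
    let result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp % 2 === 1) {
            result = (result * base) % mod;
        }
        base = (base * base) % mod;
        exp = Math.floor(exp / 2);
    }
    return result;
}

// Toy key: enc * dec leaves remainder 1 when divided by P - 1, so dec undoes enc.
type ToyKey = { enc: number; dec: number };

const encrypt = (card: number, key: ToyKey): number => modPow(card, key.enc, P);
const decrypt = (card: number, key: ToyKey): number => modPow(card, key.dec, P);

const aliceKey: ToyKey = { enc: 3, dec: 43691 }; // 3 * 43691 = 2 * 65536 + 1
const bobKey: ToyKey = { enc: 5, dec: 52429 }; // 5 * 52429 = 4 * 65536 + 1

const card = 42;
const doubleLocked = encrypt(encrypt(card, aliceKey), bobKey);

// With only his own key, Bob is left with a card still locked by Alice.
const stillLockedForBob = decrypt(doubleLocked, bobKey);

// Once Bob shares his key, Alice can remove both locks, in either order.
const bobThenAlice = decrypt(decrypt(doubleLocked, bobKey), aliceKey);
const aliceThenBob = decrypt(decrypt(doubleLocked, aliceKey), bobKey);
```

Both decryption orders recover 42, while stillLockedForBob equals encrypt(card, aliceKey) rather than the card itself.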

Implementation

While we covered the algorithm before, we didn't have the infrastructure in place to implement this. We now do.

Types

We'll start by describing our shuffle actions. As we just saw in the above recap, we have 2 steps:

type ShuffleAction1 = BaseAction & { type: "Shuffle1"; deck: string[] };
type ShuffleAction2 = BaseAction & { type: "Shuffle2"; deck: string[] };

We only need to pass around the deck of cards (encrypted or not), so we extend the BaseAction type (which includes ClientId and type) to pin the type and add the deck.

We need more data in the context though:

type ShuffleContext = {
    clientId: string;
    deck: string[];
    imFirst: boolean;
    keyProvider: KeyProvider;
    commonKey?: SRAKeyPair;
    privateKeys?: SRAKeyPair[];
};

We need to know our clientId, whether we are first or second in the turn order, we need a keyProvider to generate encryption keys, a commonKey (that's for the first encryption step) and privateKeys (for the second encryption step). We'll use the context later on, when we stitch everything together. Before that, let's look at the basic shuffling functions.

Shuffling primitives

First, we need a function that shuffles an array:

function shuffleArray<T>(arr: T[]): T[] {
    // Fisher-Yates shuffle, performed in place
    let currentIndex = arr.length;

    while (currentIndex > 0) {
        // Pick one of the elements that hasn't been placed yet
        const randomIndex = Math.floor(Math.random() * currentIndex);
        currentIndex--;

        // Swap it into its final position
        [arr[currentIndex], arr[randomIndex]] = [arr[randomIndex], arr[currentIndex]];
    }

    return arr;
}

We won't go into the details of this, as it's a generic shuffling function, not specific to Mental Poker, but a required piece.
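
One caveat: Math.random() is fine for a demo, but it is not a cryptographically secure random source, which matters if the shuffle is meant to be unpredictable to an opponent. One possible variation, sketched below, takes the random index generator as a parameter so the caller decides where randomness comes from. The names shuffleWith and RandomInt are hypothetical, not part of the toolkit, and unlike the function above this version returns a shuffled copy instead of shuffling in place. A real game could back randomInt with a secure source such as the Web Crypto API's crypto.getRandomValues().

```typescript
// Returns an integer in the range [0, maxExclusive).
type RandomInt = (maxExclusive: number) => number;

// Fisher-Yates shuffle over a copy of the input, using the supplied random source.
function shuffleWith<T>(items: readonly T[], randomInt: RandomInt): T[] {
    const result = [...items];

    for (let i = result.length - 1; i > 0; i--) {
        const j = randomInt(i + 1); // 0 <= j <= i
        [result[i], result[j]] = [result[j], result[i]];
    }

    return result;
}

// Same randomness as the original function; NOT suitable for real games.
const insecureRandomInt: RandomInt = (maxExclusive) =>
    Math.floor(Math.random() * maxExclusive);

const deck = ["9S", "10S", "JS", "QS", "KS", "AS"];
const shuffled = shuffleWith(deck, insecureRandomInt);

// A fixed source makes the algorithm deterministic, which is handy in tests.
const deterministic = shuffleWith(["a", "b", "c", "d"], () => 0);
```

Injecting the random source also makes the shuffle easy to test, since a fixed source always produces the same permutation.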

Let's look at the two shuffling steps next. First step, in which we shuffle and encrypt all cards with the same key:

async function shuffle1(keyProvider: KeyProvider, deck: string[]): Promise<[SRAKeyPair, string[]]> {
    const commonKey = keyProvider.make();

    deck = shuffleArray(deck.map((card) => SRA.encryptString(card, commonKey)));

    return [commonKey, deck];
};

The shuffle1() function takes a keyProvider, a deck, and returns a promise of a shuffled deck plus the key used to encrypt it.

The function is pretty straight-forward: we generate a new key, we encrypt each card with it, then we shuffle the deck. We return the key and the now shuffled and encrypted deck.

Both players need to perform the first step, after which both Alice and Bob have encrypted the deck with $K_A$ and $K_B$ respectively, so neither knows the order of the cards.

The next step, according to our algorithm, is for each player to decrypt the deck with their key and encrypt each card individually with a unique key:

async function shuffle2(commonKey: SRAKeyPair, keyProvider: KeyProvider, deck: string[]): Promise<[SRAKeyPair[], string[]]> {
    const privateKeys: SRAKeyPair[] = [];

    deck = deck.map((card) => SRA.decryptString(card, commonKey));

    for (let i = 0; i < deck.length; i++) {
        privateKeys.push(keyProvider.make());
        deck[i] = SRA.encryptString(deck[i], privateKeys[i]);
    }

    return [privateKeys, deck];
}

shuffle2() is also fairly straight-forward. It takes the commonKey from step 1, a keyProvider, and the encrypted deck.

First, it decrypts all cards using the commonKey (note the cards are still encrypted by the other player). Next, it uses the keyProvider to generate a key for each card, and encrypts each card with the key. The function returns the private keys generated, and the re-encrypted deck.
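
To check that the two steps fit together, here is a toy end-to-end run for two players. It is a sketch under loud assumptions: cards are small numbers, keys are tiny SRA-style exponents over the prime 65537, and the "shuffle" is a fixed permutation so the result is predictable. toyShuffle1(), toyShuffle2(), and ToyKey are hypothetical simplifications of shuffle1(), shuffle2(), and SRAKeyPair.

```typescript
const P = 65537; // toy shared prime; P - 1 = 2^16, so every odd exponent is invertible

function modPow(base: number, exp: number, mod: number): number {
    let result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp % 2 === 1) {
            result = (result * base) % mod;
        }
        base = (base * base) % mod;
        exp = Math.floor(exp / 2);
    }
    return result;
}

// Extended Euclid: returns x such that (a * x) % m === 1.
function modInverse(a: number, m: number): number {
    let [oldR, r] = [a, m];
    let [oldS, s] = [1, 0];
    while (r !== 0) {
        const q = Math.floor(oldR / r);
        [oldR, r] = [r, oldR - q * r];
        [oldS, s] = [s, oldS - q * s];
    }
    if (oldR !== 1) {
        throw new Error("Exponent is not invertible");
    }
    return ((oldS % m) + m) % m;
}

type ToyKey = { enc: number; dec: number };
const makeKey = (enc: number): ToyKey => ({ enc, dec: modInverse(enc, P - 1) });
const encrypt = (card: number, key: ToyKey): number => modPow(card, key.enc, P);
const decrypt = (card: number, key: ToyKey): number => modPow(card, key.dec, P);

// Step 1: encrypt every card with one common key, then permute the deck.
function toyShuffle1(deck: number[], commonKey: ToyKey, order: number[]): number[] {
    const encrypted = deck.map((card) => encrypt(card, commonKey));
    return order.map((i) => encrypted[i]);
}

// Step 2: remove the common key, then lock each position with its own key.
function toyShuffle2(deck: number[], commonKey: ToyKey, cardKeys: ToyKey[]): number[] {
    return deck.map((card, i) => encrypt(decrypt(card, commonKey), cardKeys[i]));
}

const cards = [2, 3, 4];
const aliceCommon = makeKey(3);
const bobCommon = makeKey(5);
const aliceCardKeys = [7, 11, 13].map((e) => makeKey(e));
const bobCardKeys = [17, 19, 23].map((e) => makeKey(e));

let deck = toyShuffle1(cards, aliceCommon, [2, 0, 1]); // Alice's secret permutation
deck = toyShuffle1(deck, bobCommon, [0, 2, 1]); // Bob's secret permutation
deck = toyShuffle2(deck, aliceCommon, aliceCardKeys);
deck = toyShuffle2(deck, bobCommon, bobCardKeys);

// Once both per-card keys for a position are known, that card can be opened.
const revealed = deck.map((card, i) =>
    decrypt(decrypt(card, bobCardKeys[i]), aliceCardKeys[i])
);
```

With these permutations the revealed deck is [4, 3, 2]: every original card is present, in an order neither player could have determined alone.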

We now have all the basics in place. Here's how we put it all together:

Shuffling state machine

Here is the state machine that describes the shuffling steps:

function makeShuffleSequence() {
    return sm.sequence([
        sm.local(async (queue: IQueue<ShuffleAction1>, context: ShuffleContext) => {
            if (!context.imFirst) {
                return;
            }

            [context.commonKey, context.deck] = await shuffle1(context.keyProvider, context.deck);

            await queue.enqueue({
                type: "Shuffle1",
                clientId: context.clientId,
                deck: context.deck,
            });
        }),
        sm.transition(async (action: ShuffleAction1, context: ShuffleContext) => {
            if (action.type !== "Shuffle1") {
                throw new Error("Invalid action type");
            }

            context.deck = action.deck;
        }),
        sm.local(async (queue: IQueue<ShuffleAction1>, context: ShuffleContext) => {
            if (context.imFirst) {
                return;
            }

            [context.commonKey, context.deck] = await shuffle1(context.keyProvider, context.deck);

            await queue.enqueue({
                type: "Shuffle1",
                clientId: context.clientId,
                deck: context.deck,
            });
        }),
        sm.transition(async (action: ShuffleAction1, context: ShuffleContext) => {
            if (action.type !== "Shuffle1") {
                throw new Error("Invalid action type");
            }

            context.deck = action.deck;
        }),
        sm.local(async (queue: IQueue<ShuffleAction2>, context: ShuffleContext) => {
            if (!context.imFirst) {
                return;
            }

            [context.privateKeys, context.deck] = await shuffle2(context.commonKey!, context.keyProvider, context.deck);

            await queue.enqueue({
                type: "Shuffle2",
                clientId: context.clientId,
                deck: context.deck,
            });
        }),
        sm.transition(async (action: ShuffleAction2, context: ShuffleContext) => {
            if (action.type !== "Shuffle2") {
                throw new Error("Invalid action type");
            }

            context.deck = action.deck;
        }),
        sm.local(async (queue: IQueue<ShuffleAction2>, context: ShuffleContext) => {
            if (context.imFirst) {
                return;
            }

            [context.privateKeys, context.deck] = await shuffle2(context.commonKey!, context.keyProvider, context.deck);

            await queue.enqueue({
                type: "Shuffle2",
                clientId: context.clientId,
                deck: context.deck,
            });
        }),
        sm.transition(async (action: ShuffleAction2, context: ShuffleContext) => {
            if (action.type !== "Shuffle2") {
                throw new Error("Invalid action type");
            }

            context.deck = action.deck;
        })
    ]);
}

Note we are limiting this to a 2-player game, though we can easily generalize to more players if needed.

This is a longer function so let's break it down:

We start with a local transition: if we are not the first player (based on some previously established turn order), we do nothing. Else we run shuffle1() and post the encrypted deck as a Shuffle1 action.
Next, we expect a Shuffle1 action to arrive - either the one we just posted (if imFirst is true) or incoming from the other player. We store the encrypted and shuffled deck.
Then, if we are not the first player, it is now our turn to shuffle: we call shuffle1() and post another Shuffle1 action.
We again expect a Shuffle1 action to arrive and update the deck.

At this point, both players performed the first step of the shuffle, so the deck is encrypted with $K_A$ and $K_B$ and neither player knows the order of the cards. We move on to the second step of the shuffle, where each player calls shuffle2() to decrypt the deck and re-encrypt each individual card. Again, depending on whether we are first or not, we take action or wait:

If imFirst is true, call shuffle2() and post a Shuffle2 action.
Expect a Shuffle2 action and update the deck.
If imFirst is not true, call shuffle2() and post a Shuffle2 action.
Expect a Shuffle2 action and update the deck.
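
The act-or-wait pattern in the steps above can be summarized as a schedule. The sketch below is one possible way to think about generalizing beyond two players, not something the toolkit implements: shuffleSchedule() and myRole() are hypothetical helpers, and the assumption is that every player performs step 1 in turn order before anyone starts step 2.

```typescript
type ClientId = string;
type ShuffleStep = { actor: ClientId; type: "Shuffle1" | "Shuffle2" };

// Every player performs step 1 in turn order, then every player performs step 2.
function shuffleSchedule(turnOrder: readonly ClientId[]): ShuffleStep[] {
    const step1 = turnOrder.map((actor): ShuffleStep => ({ actor, type: "Shuffle1" }));
    const step2 = turnOrder.map((actor): ShuffleStep => ({ actor, type: "Shuffle2" }));
    return [...step1, ...step2];
}

// For each scheduled step a client either acts (shuffles and posts) or waits.
// Either way, it then processes the action once it arrives over the transport.
function myRole(step: ShuffleStep, me: ClientId): "act" | "wait" {
    return step.actor === me ? "act" : "wait";
}

const schedule = shuffleSchedule(["alice", "bob"]);
const aliceRoles = schedule.map((step) => myRole(step, "alice"));
const bobRoles = schedule.map((step) => myRole(step, "bob"));
```

For two players this reproduces the sequence above: Alice acts on the first and third steps, Bob on the second and fourth.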

A helper function to run this state machine given an async queue:

async function shuffle(
    clientId: string,
    turnOrder: string[],
    sharedPrime: bigint,
    deck: string[],
    actionQueue: IQueue<BaseAction>,
    keySize: number = 128 // Key size, defaults to 128 bytes
): Promise<[SRAKeyPair[], string[]]> {
    if (turnOrder.length !== 2) {
        throw new Error("Shuffle only implemented for exactly two players");
    }

    const context: ShuffleContext = { 
        clientId, 
        deck, 
        imFirst: clientId === turnOrder[0],
        keyProvider: new KeyProvider(sharedPrime, keySize)
    };
    
    const shuffleSequence = makeShuffleSequence();

    await sm.run(shuffleSequence, actionQueue, context);

    return [context.privateKeys!, context.deck];
}

We need our clientId, the turn order (whether we go first or not), a shared large prime (to seed other encryption keys), an unshuffled deck, a queue, and, optionally, a keySize.

From the input, we create a ShuffleContext with the required data, then we generate the state machine by calling the function we discussed previously, and we run the state machine using the given actionQueue and generated context.

We return the private keys with which we encrypted each individual card, and the shuffled and encrypted deck.

Notes on performance

Shuffling a full deck of 52 cards with large enough key sizes gets noticeably slow. We need to generate an encryption key for each card, which involves searching for large prime numbers. The more secure we want the encryption to be, the more bits we want in each key, and the longer it takes to find one.

This can be mitigated with some loading/progress UI while shuffling. For the demo discard game in mental-poker-toolkit, I used a smaller deck (only cards from 9 to A) and a smaller key size (64 bits).

When implementing a game, it might be a good idea to start generating encryption keys asynchronously as soon as possible - note though that the players need to agree on a shared large prime before key generation can begin.

Summary

In this post we looked at an implementation of card shuffling.

We recapped the shuffling algorithm for Mental Poker, which enables playing zero-trust card games.
We implemented the two steps of shuffling as shuffle1() and shuffle2().
We defined the state machine that models 2-player shuffling.
We went over a helper function that runs the state machine and outputs the shuffled deck.
We briefly discussed performance of shuffling.

The Mental Poker Toolkit is here. This post covered card shuffling, which is implemented in the primitives package in shuffle.ts.

Mental Poker Part 5: State Machine

Fri, 22 Mar 2024 00:00:00 -0700

Mental Poker Part 5: State Machine

In this post, we'll finally look at the infrastructure on top of which we'll model games. The type of games we're considering can all be modeled as state machines¹. The challenge is we need a generic enough framework that works for any game, so let's consider what they all have in common.

Transitions

We can't tell what the exact states of a game are, as they depend on the specific game. But, in general, game play implies transitioning from one state to another.

Local transitions

In some cases, an action originates on our client. For example: we pick between rock, paper, or scissors; we want to draw a card etc. This means we need to run some logic on our client, then send an Action over our transport to other clients.

To keep things generic and unopinionated, the minimal interface for this is a function that takes an action queue and a context.

type LocalTransition<TAction extends BaseAction, TContext> = (
    actionQueue: IQueue<TAction>,
    context: TContext
) => void | Promise<void>;

We covered the queue in the previous post. We need this in a local transition because we will run some code then, in most cases, we'll want to enqueue an action and send it to other players. We'll look at an example of this later on in this post.

The context can be anything - this enables the game to pass-through whatever data the function needs. Our state machine implementation doesn't care about what that data is, this is just the mechanism to make it available to the code in the function.

The function can return either void or a Promise<void> in case it needs to be async.

Remote transitions

In other cases, an action arrives over the transport. This is an action that was sent either by another player, or by us and we receive it back from the server after it has been sequenced².

In this case, our interface is a function that takes the incoming Action and a context.

type Transition<TAction extends BaseAction, TContext> = (
    action: TAction,
    context: TContext
) => void | Promise<void>;

In this case, we don't necessarily need access to the queue, since we won't enqueue an action, rather we're processing one. The context is, again, up to the consumer of this API.

The function similarly returns void or a Promise<void> in case it needs to be async.

Runnable transition

Finally, we need an abstraction over both LocalTransition and Transition so when we specify our state machine we can treat them the same way. We'll use RunnableTransition for this:

type RunnableTransition<TContext> = (
    actionQueue: IQueue<BaseAction>,
    context: TContext
) => Promise<void>;

We expect users of our library to write code in terms of local transitions (LocalTransition) and remote transitions (Transition). This type is meant to be used internally. Note we are doing some type erasure here as we're going from a generic IQueue<TAction> to an IQueue<BaseAction>. That's because we need to work with the queue in our library code, but the exact Action types depend on the game.

For local transitions, we simply pass through the actionQueue. For remote transitions, we dequeue an action and pass that. We'll see how to do this next.

We're also normalizing the return type to be Promise<void> regardless of whether the transition function originally returned void or Promise<void>.

State Machine

Our state machine is implemented as a set of functions. First, we have a few factory functions. local() creates a RunnableTransition from a LocalTransition:

function local<TAction extends BaseAction, TContext>(
    transition: LocalTransition<TAction, TContext>
): RunnableTransition<TContext> {
    return async (queue: IQueue<BaseAction>, context: TContext) =>
        await Promise.resolve(
            transition(queue as IQueue<TAction>, context)
        );
}

We call Promise.resolve() to get a Promise regardless of whether the given transition is a synchronous or asynchronous function.

transition() converts a remote transition into a RunnableTransition:

function transition<TAction extends BaseAction, TContext>(
    transition: Transition<TAction, TContext>
): RunnableTransition<TContext> {
    return async (queue: IQueue<BaseAction>, context: TContext) => {
        const action = await queue.dequeue();
        await Promise.resolve(transition(action as TAction, context));
    };
}

Here, we dequeue an action, then pass it to the given transition.

In many cases, we expect multiple players to take the same action, for example each player picks between rock, paper, or scissors - in this case, we will expect one remote action coming in from each player (including us), of the same type. Most times we want to treat these actions the same way, which means we want to run the same Transition function for each. The repeat() function takes a RunnableTransition and repeats it a given number of times:

function repeat<TContext>(
    transition: RunnableTransition<TContext>,
    times: number
): RunnableTransition<TContext>[] {
    return Array(times).fill(transition);
}

This gives us an array of RunnableTransitions we can execute in sequence.

Finally, we might want to combine the output of calling local() with the output of calling repeat() into a longer sequence of RunnableTransitions we can run - the first function gives us a RunnableTransition, the second function gives us an array of RunnableTransitions. To address this, we provide sequence:

function sequence<TContext>(
    transitions: (
        | RunnableTransition<TContext>
        | RunnableTransition<TContext>[]
    )[]
): RunnableTransition<TContext>[] {
    return transitions.flat();
}

This function takes an array whose elements are either RunnableTransitions or arrays of RunnableTransitions, and calls flat() on it to flatten the nested arrays into a single, flat list.

Once we have a sequence of transitions, we can run them using run():

async function run<TContext>(
    sequence: RunnableTransition<TContext>[],
    queue: IQueue<BaseAction>,
    context: TContext
) {
    for (const transition of sequence) {
        await transition(queue, context);
    }
}

We simply execute each RunnableTransition in turn.
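
To see the pieces run together, here is a minimal self-contained sketch. The helper functions are condensed copies of the ones above. Everything else is an assumption made for the demo: InMemoryQueue is a hypothetical loopback queue (a real queue would wait for an action to arrive, this one throws if empty), and PickAction/PickContext model a made-up rock-paper-scissors round in which Bob's pick has already been sequenced.

```typescript
type BaseAction = { clientId: string; type: string };

interface IQueue<T extends BaseAction> {
    enqueue(action: T): Promise<void>;
    dequeue(): Promise<T>;
}

// Hypothetical loopback queue: what we enqueue comes back to us, like the
// sequenced echo from the server.
class InMemoryQueue<T extends BaseAction> implements IQueue<T> {
    private items: T[];
    constructor(preloaded: T[] = []) {
        this.items = [...preloaded];
    }
    async enqueue(action: T): Promise<void> {
        this.items.push(action);
    }
    async dequeue(): Promise<T> {
        const action = this.items.shift();
        if (action === undefined) {
            throw new Error("Queue is empty");
        }
        return action;
    }
}

type RunnableTransition<TContext> = (
    queue: IQueue<BaseAction>,
    context: TContext
) => Promise<void>;

function local<TAction extends BaseAction, TContext>(
    fn: (queue: IQueue<TAction>, context: TContext) => void | Promise<void>
): RunnableTransition<TContext> {
    return async (queue, context) => {
        await fn(queue as IQueue<TAction>, context);
    };
}

function transition<TAction extends BaseAction, TContext>(
    fn: (action: TAction, context: TContext) => void | Promise<void>
): RunnableTransition<TContext> {
    return async (queue, context) => {
        await fn((await queue.dequeue()) as TAction, context);
    };
}

function repeat<TContext>(t: RunnableTransition<TContext>, times: number): RunnableTransition<TContext>[] {
    return Array(times).fill(t);
}

function sequence<TContext>(
    items: (RunnableTransition<TContext> | RunnableTransition<TContext>[])[]
): RunnableTransition<TContext>[] {
    return items.flat();
}

async function run<TContext>(seq: RunnableTransition<TContext>[], queue: IQueue<BaseAction>, context: TContext) {
    for (const t of seq) {
        await t(queue, context);
    }
}

type PickAction = BaseAction & { type: "Pick"; pick: "rock" | "paper" | "scissors" };
type PickContext = { clientId: string; picks: Map<string, string> };

const pickSequence = sequence<PickContext>([
    // Local transition: post our own pick.
    local(async (queue: IQueue<PickAction>, context: PickContext) => {
        await queue.enqueue({ type: "Pick", clientId: context.clientId, pick: "rock" });
    }),
    // Remote transitions: expect one pick per player, including our own.
    repeat(
        transition((action: PickAction, context: PickContext) => {
            context.picks.set(action.clientId, action.pick);
        }),
        2
    ),
]);

async function demo(): Promise<PickContext> {
    const context: PickContext = { clientId: "alice", picks: new Map() };
    const bobsPick: PickAction = { type: "Pick", clientId: "bob", pick: "paper" };
    await run(pickSequence, new InMemoryQueue<BaseAction>([bobsPick]), context);
    return context;
}
```

Calling demo() resolves to a context whose picks map holds both players' choices, with Bob's processed first because it was already in the queue.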

Understandably, this has all been abstract. Let's now see how we can use these functions to model interactions.

Interactions

Let's look at a simple example, key exchange: in order to secure our transport, we want each client to share a public key, then sign each subsequent message with their corresponding private key.

Key exchange

We looked at securing the transport layer in a previous post, but we haven't discussed the key negotiation yet.

Let's create the following protocol: as each client joins the game, they post a public key. For an N player game, each client should expect N remote transitions consisting of clients publishing public keys. Once all of these were processed, we should have all public keys for all clients and can create a SignedTransport.

Let's sketch out the state machine:

function makeKeyExchangeSequence(players: number) {
    return sm.sequence([
        sm.local(async (actionQueue: IQueue<KeyExchangeAction>, context: CryptoContext) => {
            // Post public key ...
        }),
        sm.repeat(sm.transition((action: KeyExchangeAction, context: CryptoContext) => {
            // Store incoming public key ...
        }), players)
    ]);
}

Note we create a LocalTransition in which we post our own public key, and we repeat the remote transition handling an incoming public key (remember with Fluid we expect the server to also send us back whatever we post).

Clients can join the game at different times, so we don't know in what order the keys will come in but, luckily, each Action has a clientId so we know whose key it is.

We'll look at the implementation of the transitions, but first let's see what KeyExchangeAction and CryptoContext look like:

type KeyExchangeAction = {
    clientId: ClientId;
    type: "KeyExchange";
    publicKey: Key;
};

type CryptoContext = {
    clientId: ClientId;
    me: PublicPrivateKeyPair;
    keyStore: KeyStore;
};

KeyExchangeAction is an action consisting of a clientId and a publicKey, with the type set to "KeyExchange".

CryptoContext is the context needed by the transitions implementing the key exchange - that is we need to know our own clientId, our public-private key pair, and we need a keyStore, which is a map of clientId to public key. We looked at the KeyStore and the other key types in a previous blog post, but here they are again for reference:

type Key = string;

type PublicPrivateKeyPair = {
    publicKey: Key;
    privateKey: Key;
};

type KeyStore = Map<ClientId, Key>;

With these in place, let's look at the implementation of the transitions:

function makeKeyExchangeSequence(players: number) {
    return sm.sequence([
        sm.local(
            async (
                actionQueue: IQueue<KeyExchangeAction>,
                context: CryptoContext
            ) => {
                // Post public key
                await actionQueue.enqueue({
                    type: "KeyExchange",
                    clientId: context.clientId,
                    publicKey: context.me.publicKey,
                });
            }
        ),
        sm.repeat(
            sm.transition(
                (action: KeyExchangeAction, context: CryptoContext) => {
                    // This should be a KeyExchangeAction
                    if (action.type !== "KeyExchange") {
                        throw new Error("Invalid action type");
                    }

                    // Protocol expects clients to post an ID
                    if (action.clientId === undefined) {
                        throw new Error("Expected client ID");
                    }

                    // Protocol expects each client to only post once and to have a unique ID
                    if (context.keyStore.has(action.clientId)) {
                        throw new Error(
                            "Same client posted key multiple times"
                        );
                    }

                    context.keyStore.set(action.clientId, action.publicKey);
                }
            ),
            players
        ),
    ]);
}

sm stands for state machine. The functions described above live in a StateMachine namespace aliased to sm.

Our local transition is simple: we enqueue a KeyExchangeAction, sending our clientId and publicKey from the CryptoContext.

When a remote action comes in, we perform the required validations:

Ensure it is a KeyExchangeAction.
Ensure it has a clientId.
Ensure the same client doesn't post a public key more than once.

Finally, we store the clientId and publicKey.

The end-to-end implementation for key exchange, relying on the state machine, is here:

async function makeCryptoContext(clientId: ClientId): Promise<CryptoContext> {
    return {
        clientId,
        me: await Signing.generatePublicPrivateKeyPair(),
        keyStore: new Map<ClientId, Key>(),
    };
}

async function keyExchange(
    players: number,
    clientId: ClientId,
    actionQueue: IQueue<BaseAction>
) {
    const context = await makeCryptoContext(clientId);

    const keyExchangeSequence = makeKeyExchangeSequence(players);

    await sm.run(keyExchangeSequence, actionQueue, context);

    return [context.me, context.keyStore] as const;
}

makeCryptoContext() is a helper function to initialize a CryptoContext instance - it takes a clientId, generates a public-private key pair, and initializes an empty key store.

keyExchange() calls the functions we defined previously to get a CryptoContext, the key exchange sequence, and calls the state machine's run() to execute the key exchange.

Once done, it returns the client's public-private key pair, and the key store.

From a caller's perspective, the protocol handling key exchange is now abstracted away behind the keyExchange() function. The caller doesn't have to worry about the mechanics of exchanging keys and can simply call this function to get back all the data required to create a SignedTransport.

Rock-paper-scissors

As a second example, we'll sketch out the state machine for a game of rock-paper-scissors. We won't dive into all the implementation details. At a high level, here is how we play a game of rock-paper-scissors:

Each player picks from rock, paper, or scissors, encrypts their selection, and posts it.
Once selections are posted, each player posts the key they used to encrypt.

This two-step process ensures players are committed to a selection and can't cheat by observing what the other player picked and picking afterwards.

The state machine for this game is:

A local transition in which we make our local selection.
Two remote transitions, getting from the server our selection and the other player's.
A local transition in which we share our encryption key.
Two remote transitions, getting the encryption keys from the server.

Sketched with our library, it looks like this:

sm.sequence([
    sm.local(async (queue, context) => {
        // Post our play action
    }),
    sm.repeat(sm.transition(async (action, context) => {
        // Both player and opponent need to post their encrypted selection
    }), 2),
    sm.local(async (queue, context) => {
        // Post our reveal action
    }),
    sm.repeat(sm.transition(async (reveal: RevealAction, context: RootStore) => {
        // Both player and opponent need to reveal their selection
    }), 2)
]);

We won't fill in the functions in this post but this gives you an idea of how we can model a more complex set of steps using our library.

Summary

In this post we looked at a state machine we can use to implement games:

The state machine needs to be very unopinionated as each game implements its own logic, defines its own Action types, and has its own relevant context.
Local transitions are functions we initiate locally and they usually end with an action being posted.
Remote transitions are functions we run in response to actions arriving from the server - these could have originated from us or from another client.
RunnableTransition is a common type that can wrap local or remote transitions.
We can combine transitions by repeating them, or concatenating them into sequences. Once we have a sequence of transitions, we can run it to implement a protocol.
We saw how key exchange can be implemented on top of a state machine and sketched out the state machine for a game of rock-paper-scissors.

The Mental Poker Toolkit is here. This post covered the state-machine package; the key exchange is implemented in the primitives package.

https://en.wikipedia.org/wiki/Finite-state_machine. ↩
Sequenced is a Fluid Framework term. Clients send messages to the Fluid relay service, which orders them in the order they came in and broadcasts them to all clients. This is to ensure all clients eventually see all the messages sent in the same order. ↩

Mental Poker Part 4: Actions and Async Queue

Sat, 16 Mar 2024 00:00:00 -0700

Mental Poker Part 4: Actions and Async Queue

As I was building up the library and looking at state machines that would run turns in a game, I realized an async queue would come in handy. The challenge with the raw ITransport interface built on top of the Fluid ledger is that if you are not the first client to join a session, you end up with a set of ops that already exist on the ledger. You need a way to consume both the ops that were already sequenced and new incoming ops. An async interface is also easier to consume than callbacks.

Before diving into that though, let's talk about actions.

Actions

As a reminder, op is the Fluid Framework term for data being sent/received. In Mental Poker we use actions. All actions should be subtypes of BaseAction:

export type ClientId = string;

export type BaseAction = {
    clientId: ClientId;
    type: unknown;
};

Every action should have a clientId showing which client it came from, and a type.

For example, here's how we would model a game of Rock/Paper/Scissors:

Both players pick rock or paper or scissors, encrypt their selection, and post it on the ledger.
Next, both players post their encryption key, so the other player can decrypt and see the selection.

We model the game in these two steps so regardless of which player moves first, the player choices are revealed only after both have been put on the ledger. If a player simply posted their unencrypted selection, the other player could cheat by looking at it before posting their own.

I will cover the Rock/Paper/Scissors implementation in detail in a future post. For now, let's just go over the actions:

export type PlayAction = {
    clientId: ClientId;
    type: "PlayAction";
    encryptedSelection: EncryptedSelection;
};

export type RevealAction = {
    clientId: ClientId;
    type: "RevealAction";
    key: SerializedSRAKeyPair;
};

export type Action = PlayAction | RevealAction;

The two actions described above are modeled as PlayAction and RevealAction. Both of these have a clientId and type, thus are subtypes of BaseAction. Finally, the Action type represents all possible actions in the game.

This becomes relevant as we move higher in the stack of the Mental Poker library. Once we start encoding some of the game semantics, we require generic types to extend BaseAction. This is what happens with the async queue.

Async queue

As I mentioned at the beginning of the article, queues aim to provide a nicer API over the transport. The interface is very simple:

export interface IQueue<T extends BaseAction> {
    enqueue(value: T): Promise<void>;

    dequeue(): Promise<T>;
}

For any type T extending BaseAction, we can enqueue() a value and we can dequeue() a value. Both of the operations are asynchronous.

I'll show the full implementation then go over the details:

export class ActionQueue<T extends BaseAction> implements IQueue<T> {
    private queue: T[] = [];

    constructor(
        private readonly transport: ITransport<T>,
        preseed: boolean = false
    ) {
        transport.on("actionPosted", (value) => {
            this.queue.push(value);
        });

        if (preseed) {
            for (const value of transport.getActions()) {
                this.queue.push(value);
            }
        }
    }

    async enqueue(value: T) {
        await this.transport.postAction(value);
    }

    async dequeue(): Promise<T> {
        const result = this.queue.shift();
        if (result) {
            return Promise.resolve(result);
        }

        return new Promise<T>((resolve) => {
            this.transport.once("actionPosted", async () => {
                resolve(await this.dequeue());
            });
        });
    }
}

The implementation maintains an array of Ts (actions). The constructor takes a transport argument of type ITransport and a preseed flag:

constructor(
    private readonly transport: ITransport<T>,
    preseed: boolean = false
) {
    transport.on("actionPosted", (value) => {
        this.queue.push(value);
    });

    if (preseed) {
        for (const value of transport.getActions()) {
            this.queue.push(value);
        }
    }
}

/* ... */

The queue starts listening to the actionPosted event and whenever we have an incoming value, we push it to the internal queue. If preseed is true, we also push all actions already posted to the queue.

The reason we make this optional is that we might end up using multiple queues in a game implementation but we only want to consume the actions posted on the ledger before we joined the session once. After we are "up to speed", new incoming actions fire events which we can consume in realtime. So we would usually create our first queue with preseed set to true and subsequent queues with preseed set to false.

Enqueuing a value is trivial - we leverage the transport's postAction API:

/* ... */

async enqueue(value: T) {
    await this.transport.postAction(value);
}

/* ... */

Dequeuing is a bit more interesting:

/* ... */

async dequeue(): Promise<T> {
    const result = this.queue.shift();
    if (result) {
        return Promise.resolve(result);
    }

    return new Promise<T>((resolve) => {
        this.transport.once("actionPosted", async () => {
            resolve(await this.dequeue());
        });
    });
}

/* ... */

First, we call shift() on the queue. This either returns a value or undefined if the queue is empty.

If we do get a value, we return a resolved promise right away.

If we don't have a value, we add a one-time listener to the actionPosted event. When a new action is posted, the underlying transport will fire the event. Since event listeners are called in the order they subscribed, we are guaranteed the listener we added in the constructor fires first, and adds the value to the queue. We resolve the promise by recursively calling dequeue() and awaiting the response.

The reason we do this is we might have multiple callers to dequeue() holding on to promises. In this case, we don't want to resolve all of them with the incoming value, rather just the first one. The first recursive call to dequeue() should grab the value from the internal queue and return it right away, while other recursive callers would end up awaiting again until a new value comes in. There's probably a more efficient non-recursive implementation but for our specific use-case (games), we don't expect many cases where we have multiple dequeues pending.

Using the queue

There are two main reasons for using this queue rather than relying directly on the underlying transport.

First, the underlying transport can have a set of actions (messages) that already arrived on the client (which we would retrieve with the getActions() method), and some which arrive in real time (which would fire events). The queue gives us a unified way to consume both, by calling await dequeue().

Besides a unified interface, we expect multiple spots in the code to wait for an incoming action. This depends on the game implementation, but usually at different game states we expect different messages to come in. This is harder to achieve waiting for event callbacks and much easier to do via the same await dequeue() call.

Summary

In this post we looked at actions, the key building blocks of Mental Poker games, and an async queue which provides a clean abstraction over the underlying transport.

The code covered in this post is available on GitHub in the mental-poker-toolkit repo. BaseAction and the ITransport and IQueue interfaces are part of the core types package packages/types. ActionQueue is implemented under packages/action-queue.

Notes on Advent of Code 2023

Wed, 10 Jan 2024 00:00:00 -0800

Notes on Advent of Code 2023

I always have fun with Advent of Code every December, and last year I did write a blog post covering some of the more interesting problems I worked through. I'll continue the tradition this year.

I'll repeat my disclaimer from last time:

Disclaimer on my solutions

I use Python because I find it easiest for this type of coding. I treat solving these as a write-only exercise. I do it for the problem-solving bit, so I don't comment the code & once I find the solution I consider it done - I don't revisit and try to optimize even though sometimes I strongly feel like there is a better solution. I don't even share code between part 1 and part 2 - once part 1 is solved, I copy/paste the solution and change it to solve part 2, so each can be run independently. I also rarely use libraries, and when I do it's some standard ones like re, itertools, or math. The code has no comments and is littered with magic numbers and strange variable names. This is not how I usually code, rather my decadent holiday indulgence. I wasn't thinking I will end up writing a blog post discussing my solutions so I would like to apologize for the code being hard to read.

All my solutions are on my GitHub here.

This time around, I did use GitHub Copilot, with mixed results. In general, it mostly helped with tedious work, like implementing the same thing to work in different directions - there are problems that require we do something while heading north, then same thing while heading east etc. I did also observe it produce buggy code that I had to manually edit.

I'll skip over the first few days as they tend to be very easy.

Day 9

Problem statement is here.

This is an easy problem, I just want to call out a shortcut: for part 2, the exact same algorithm as in part 1 works if you first reverse the input. This was a neat discovery that saved me a bunch of work.
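To make the shortcut concrete, here is a minimal sketch (not the code from the repo; extrapolate is a name picked for illustration). It predicts the next value of a history by repeatedly taking differences, then gets the previous value by feeding the reversed history to the very same function:

```python
def extrapolate(seq):
    """Return the next value of seq using repeated differences."""
    # Collect the last element of each difference row until a row is all zeros.
    lasts = []
    while any(seq):
        lasts.append(seq[-1])
        seq = [b - a for a, b in zip(seq, seq[1:])]
    # The next value is the sum of the last elements of all rows.
    return sum(lasts)


history = [10, 13, 16, 21, 30, 45]
next_value = extrapolate(history)            # part 1
previous_value = extrapolate(history[::-1])  # part 2: same function, reversed input
```

Reversing works because the differences of the reversed sequence are just the negated differences in reverse order, so extrapolating forward on the reversed input is the same as extrapolating backward on the original.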

Day 10

Problem statement is here.

Part 1 was again very straightforward. I found part 2 a bit more interesting, especially the fact that we can determine whether a tile is inside or outside our loop by only looking at a single row (or column). We always start outside, then scan each tile. If we hit a |, then we toggle from outside to inside and vice-versa. If we hit an L or an F, we continue while we're on a - (these are all parts of our loop), and we stop on the 7 or J. If we started on L and ended on J, or started on F and ended on 7, meaning the pipe bends and turns back the way we came, we don't change our state. On the other hand, if the pipe goes down from L to 7 or up from F to J, then we toggle outside/inside. For each non-pipe tile, if we're inside, we count it. Maybe this is obvious but it took me a bit to figure it out.

def scan_line(ln):
    total, i, inside, start = 0, -1, False, None
    while i < len(grid[0]) - 1:
        i += 1
        if (ln, i) not in visited:
            if inside:
                total += 1
        else:
            if grid[ln][i] == '|':
                inside = not inside
                continue
            
            # grid[ln][i] in 'LF'
            start = grid[ln][i]
            i += 1
            while grid[ln][i] == '-':
                i += 1

            if (start == 'L' and grid[ln][i] == '7') or \
               (start == 'F' and grid[ln][i] == 'J'):
                inside = not inside
    return total

In the code above, visited tracks pipe segments (as opposed to tiles that are not part of the pipe).

Day 11

Problem statement is here.

Day 11 was easy, so not much to discuss. Use Manhattan distance for part 1 and in part 2, just add 999999 for every row or column crossed that doesn't contain any galaxies.
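As a rough sketch of that idea (not the code from the repo; total_distance and its parameters are names made up for illustration), we can compute the plain Manhattan distance for each pair of galaxies and add factor - 1 for every empty row or column between them. With factor set to 1000000 this is the "add 999999" from above:

```python
def total_distance(image, factor):
    """Sum of Manhattan distances between all galaxy pairs after expansion."""
    galaxies = [(r, c) for r, row in enumerate(image)
                for c, ch in enumerate(row) if ch == '#']
    empty_rows = [r for r, row in enumerate(image) if '#' not in row]
    empty_cols = [c for c in range(len(image[0]))
                  if all(row[c] != '#' for row in image)]

    total = 0
    for i, (r1, c1) in enumerate(galaxies):
        for r2, c2 in galaxies[i + 1:]:
            lo_r, hi_r = sorted((r1, r2))
            lo_c, hi_c = sorted((c1, c2))
            # Each empty row/column crossed counts `factor` times instead of once.
            crossed = (sum(1 for r in empty_rows if lo_r < r < hi_r)
                       + sum(1 for c in empty_cols if lo_c < c < hi_c))
            total += (hi_r - lo_r) + (hi_c - lo_c) + crossed * (factor - 1)
    return total


# Sample image from the problem statement.
image = [
    '...#......',
    '.......#..',
    '#.........',
    '..........',
    '......#...',
    '.#........',
    '.........#',
    '..........',
    '.......#..',
    '#...#.....',
]
part1 = total_distance(image, 2)
part2 = total_distance(image, 1000000)
```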

Day 12

Problem statement is here.

Part 1 was very easy.

Part 2 was a bit harder because just trying out every combination takes forever to run. I initially tried to do something more clever around deciding when to turn a ? into # or . depending on what's around it, where we are in the sequence, etc. But ultimately it turns out just adding memoization made the combinatorial approach run very fast.
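Here is a minimal sketch of what memoizing the combinatorial approach can look like. It is not the code from the repo: count_arrangements is a hypothetical name, and it leans on functools.lru_cache from the standard library to cache results keyed on the position in the springs and the index of the next group to place.

```python
from functools import lru_cache


def count_arrangements(springs, groups):
    """Count the ways to replace each '?' so the damaged runs match groups."""

    @lru_cache(maxsize=None)
    def count(i, g):
        # i: position in springs, g: index of the next group to place
        if i >= len(springs):
            return 1 if g == len(groups) else 0

        total = 0
        if springs[i] in '.?':
            # Treat this spring as operational and move on.
            total += count(i + 1, g)
        if springs[i] in '#?' and g < len(groups):
            # Try to start the next damaged group here.
            end = i + groups[g]
            if (end <= len(springs)
                    and '.' not in springs[i:end]
                    and (end == len(springs) or springs[end] != '#')):
                # Skip the group plus the separator that must follow it.
                total += count(end + 1, g + 1)
        return total

    return count(0, 0)


springs, groups = '?###????????', (3, 2, 1)
part1 = count_arrangements(springs, groups)
# Part 2 unfolds each record: five copies of the springs joined by '?'.
part2 = count_arrangements('?'.join([springs] * 5), groups * 5)
```

Without the cache, the same (position, group) pairs get recomputed over and over, which is what makes the unfolded input take forever.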

Day 13

Problem statement is here.

This was a very easy one, so I won't cover it.

Day 14

Problem statement is here.

This was easy but part 2 was tedious, having to implement tilt functions for various directions. This is where Copilot saved me a bunch of typing.

Once we have the tilt functions, we can implement a cycle function that tilts things north, then west, then south, then east. Finally, we need a bit of math to figure out the final position: we save the state of the grid after each cycle and as soon as we find a configuration we encountered before, it means we found our cycle. Based on this, we know how many steps we have before the cycle, what the length of the cycle is, so we can compute the state after 1000000000 cycles:

pos = []
while (state := cycle()) not in pos:
    pos.append(state)

lead, loop = pos.index(state), len(pos) - pos.index(state)
d = (1000000000 - 1 - lead) % loop

Since pos[i] holds the state after i + 1 cycles, we need to count the load of the north support beams for the grid we have at pos[lead + d].

Day 15

Problem statement is here.

Another very easy one that I won't cover.

Day 16

Problem statement is here.

This one was also easy and tedious, as we have to handle the different types of reflections. Another one where Copilot saved me a lot of typing.

Day 17

Problem statement is here.

Part 1

This was a fairly straightforward depth-first search, where we keep a cache of how much heat loss we have up to a certain point. The one interesting complication is that we can only move forward 3 times. In the original implementation, I keyed the cache on grid coordinates + direction we're going in + how many steps we already took in that direction. This worked in reasonable time.

Part 2

In part 2, we now have to move at least 4 steps in one direction and at most 10. The cache I used in part 1 doesn't work that well anymore. On the other hand, I realized that rather than keeping track of direction and how many steps we took in that direction so far, I can model this differently: we are moving either horizontally or vertically. If we're at some point and moving horizontally, we can expand our search to all destination points (from 4 to 10 away horizontally or vertically) and flip the direction. For example, if we just moved horizontally to the right, we won't move further to the right as we already covered all those cases, and we won't move back left as the crucible can't turn 180 degrees. That means the only possible directions we can take are up or down in this case, meaning since we just moved horizontally, we now have to move vertically.

This makes our cache much smaller: our key is the coordinates of the cell and the direction we were moving in. This also makes the depth-first search complete very fast.

best, end = {}, 1000000

def search(x, y, d, p):
    global end

    if p >= end:
        return

    if x == len(grid) - 1 and y == len(grid[0]) - 1:
        if p < end:
            end = p
        return

    if (x, y, d) in best and best[(x, y, d)] <= p:
        return

    best[(x, y, d)] = p
    
    if d != 'H':
        if x + 3 < len(grid):
            pxr = p + grid[x + 1][y] + grid[x + 2][y] + grid[x + 3][y]
            for i in range(4, 11):
                if x + i < len(grid):
                    pxr += grid[x + i][y]
                    search(x + i, y, 'H', pxr)

        if x - 3 >= 0:
            pxl = p + grid[x - 1][y] + grid[x - 2][y] + grid[x - 3][y]
            for i in range(4, 11):
                if x - i >= 0:
                    pxl += grid[x - i][y]
                    search(x - i, y, 'H', pxl)

    if d != 'V':
        if y + 3 < len(grid[0]):
            pyd = p + grid[x][y + 1] + grid[x][y + 2] + grid[x][y + 3]
            for i in range(4, 11):
                if y + i < len(grid[0]):
                    pyd += grid[x][y + i]
                    search(x, y + i, 'V', pyd)

        if y - 3 >= 0:
            pyu = p + grid[x][y - 1] + grid[x][y - 2] + grid[x][y - 3]
            for i in range(4, 11):
                if y - i >= 0:
                    pyu += grid[x][y - i]
                    search(x, y - i, 'V', pyu)

I realized this approach actually applies well to part 1 too, and retrofitted it there. The only difference is instead of expanding to the cells +4 to +10 in a direction, we expand to the cells +1 to +3.

Day 18

Problem statement is here.

Part 1

The first part is easy - we plot the input on a grid, then flood fill to find the area.

In the code below, digs is the input, processed as a list of tuples of direction and number of steps:

x, y, grid = 0, 0, {(0, 0)}
for dig in digs:
    match dig[0]:
        case 'U':
            for i in range(dig[1]):
                y -= 1
                grid.add((x, y))
        case 'R':
            for i in range(dig[1]):
                x += 1
                grid.add((x, y))
        case 'D':
            for i in range(dig[1]):
                y += 1
                grid.add((x, y))
        case 'L':
            for i in range(dig[1]):
                x -= 1
                grid.add((x, y))

x, y = min([x for x, _ in grid]), min([y for _, y in grid])
while (x, y) not in grid:
    y += 1

queue = [(x + 1, y + 1)]
while queue:
    x, y = queue.pop(0)

    if (x, y) in grid:
        continue

    grid.add((x, y))
    queue += [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

print(len(grid))

Part 2

Part 2 is trickier, as the numbers are way larger and the same flood fill algorithm won't work. My approach was to divide the area into rectangles: as we process all movements, we end up with a set of (x, y) tuples of points where our line changes direction. If we sort all the x coordinates and all y coordinates independently, we end up with a grid where we can treat each pair of subsequent xs and ys as describing a rectangle on our grid.

x, y, points = 0, 0, [(0, 0)]
for dig in digs:
    match dig[0]:
        case 0: x += dig[1]
        case 1: y += dig[1]
        case 2: x -= dig[1]
        case 3: y -= dig[1]

    if dig[1] < 10:
        print(dig[1])
    points.append((x, y))

xs, ys = sorted({x for x, _ in points}), sorted({y for _, y in points})

Where digs above represents the input, processed as before into direction and number of steps tuples.

Now points contains all the connected points we get following the directions, which means a pair of subsequent points describes a line. Once we have this, we can start a flood fill in one of the rectangles and proceed as follows: if there is a north boundary, meaning we have a line between our top left and top right coordinates, then we don't recurse north; otherwise we go to the rectangle north of our current rectangle and repeat the algorithm there. Same for east, south, west.

Since we have to consider each point in the terrain in our area calculation, we need to be careful how we measure the boundaries of each rectangle so we don't double-count or omit points. To ensure this, my approach was that for each rectangle we count, we count an extra line north (if there is no boundary) and an extra line east (if there is no boundary). If there's neither a north nor an east boundary, then we add 1 for the north-east corner. This should ensure we don't double-count, as each rectangle only considers its north and east boundaries, and we don't miss anything, as any rectangle without a boundary will count the additional points. What remains is the perimeter of our surface, which we add at the end. The explanations might sound convoluted, but the code is very easy to understand:

queue, total, visited = [(1, 1)], 0, set()
while queue:
    x, y = queue.pop(0)

    e = min([i for i in xs if i > x])
    s = max([i for i in ys if i < y])
    w = max([i for i in xs if i < x])
    n = min([i for i in ys if i > y])

    if (n, e) in visited:
        continue
    visited.add((n, e))

    total += (e - w - 1) * (n - s - 1)

    found_n, found_s, found_e, found_w = False, False, False, False
    for l1, l2 in zip(points, points[1:]):
        if l1[1] == l2[1]:
            if l1[1] == n and (l1[0] < x < l2[0] or l2[0] < x < l1[0]):
                found_n = True
            if l1[1] == s and (l1[0] < x < l2[0] or l2[0] < x < l1[0]):
                found_s = True
        elif l1[0] == l2[0]:
            if l1[0] == e and (l1[1] < y < l2[1] or l2[1] < y < l1[1]):
                found_e = True
            if l1[0] == w and (l1[1] < y < l2[1] or l2[1] < y < l1[1]):
                found_w = True
                
    if not found_n:
        total += e - w - 1
        queue.append((x, n + 1))
    if not found_s:
        queue.append((x, s - 1))
    if not found_e:
        total += n - s - 1
        queue.append((e + 1, y))
    if not found_w:
        queue.append((w - 1, y))

    if not found_n and not found_e:
        if (e, n) not in points:
            total += 1

total += sum([dig[1] for dig in digs])
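A handy way to sanity-check the result of the rectangle approach is a completely different technique: the shoelace formula combined with Pick's theorem. The sketch below is not the approach described above and not the code from the repo; lagoon_area is a made-up name and it assumes directions are given as the letters R, D, L, U rather than the digits used in part 2.

```python
def lagoon_area(digs):
    """Total number of dug-out points for a closed loop of (direction, steps)."""
    moves = {'R': (1, 0), 'D': (0, 1), 'L': (-1, 0), 'U': (0, -1)}
    x, y, twice_area, perimeter = 0, 0, 0, 0
    for direction, steps in digs:
        dx, dy = moves[direction]
        nx, ny = x + dx * steps, y + dy * steps
        twice_area += x * ny - nx * y  # shoelace term for this edge
        perimeter += steps
        x, y = nx, ny
    # Pick's theorem: interior = area - perimeter / 2 + 1.
    # Total = interior + perimeter = area + perimeter / 2 + 1.
    return abs(twice_area) // 2 + perimeter // 2 + 1


# Sample dig plan from the problem statement (part 1 directions).
sample = [('R', 6), ('D', 5), ('L', 2), ('D', 2), ('R', 2), ('D', 2), ('L', 5),
          ('U', 2), ('L', 1), ('U', 2), ('R', 2), ('U', 3), ('L', 2), ('U', 2)]
area = lagoon_area(sample)
```

Since it only walks the list of moves once, it is a cheap cross-check for the flood fill over rectangles.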

Day 19

Problem statement is here.

Part 1

For the first part, we can process rule by rule.

Part 2

For the second part, start with bounds: (1, 4000) for all of xmas. Then at each decision point, recurse updating bounds. Whenever we hit an A, add the bounds to the list of accepted bounds.

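The accepted bounds are guaranteed to never overlap: each check splits the current bounds into two disjoint parts, one that follows the rule and one that falls through to the next rule.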

accepts = []

def execute_workflow(workflow_key, bounds):
    workflow = workflows[workflow_key]
    for rule in workflow:
        if rule == 'A':
            accepts.append(bounds)
            return
        if rule == 'R':
            return
        if rule in workflows:
            execute_workflow(rule, bounds)
            return

        check, next_workflow = rule.split(':')
        if '<' in check:
            key, val = check.split('<')
            nb = bounds.copy()
            nb[key] = (nb[key][0], int(val) - 1)
            bounds[key] = (int(val), bounds[key][1])
        elif '>' in check:
            key, val = check.split('>')
            nb = bounds.copy()
            nb[key] = (int(val) + 1, nb[key][1])
            bounds[key] = (bounds[key][0], int(val))

        execute_workflow(next_workflow, nb)

execute_workflow('in', {'x': (1, 4000), 'm': (1, 4000), 'a': (1, 4000), 's': (1, 4000)})

This gives us all accepted ranges for each of x, m, a, and s.
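The remaining step is turning the accepted bounds into a count. A minimal sketch, assuming accepts has the shape built above (a list of dictionaries mapping each of x, m, a, s to an inclusive (low, high) tuple); count_combinations is a name made up for illustration:

```python
from math import prod


def count_combinations(accepts):
    """Sum the number of (x, m, a, s) combinations covered by each bounds dict."""
    return sum(
        # An inclusive range (lo, hi) holds hi - lo + 1 values; a range that
        # ended up empty (hi < lo) contributes nothing.
        prod(max(0, hi - lo + 1) for lo, hi in bounds.values())
        for bounds in accepts
    )


everything = {'x': (1, 4000), 'm': (1, 4000), 'a': (1, 4000), 's': (1, 4000)}
total = count_combinations([everything])
```

Because the bounds never overlap, we can simply add up the products without worrying about counting a combination twice.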

Day 20

Problem statement is here.

Part 1

For the first part, we can model the various module types as classes with a common interface and different implementations. Since one of the requirements is to process pulses in the order they are sent, we will use a queue rather than have objects call each other based on connections. So rather than module A directly calling connected module B when it receives a signal (which would cause out-of-order processing), module A will just queue a signal for module B, which will be processed once the signals queued before this one are already processed.

I won't share the code here as it is straightforward. You can find it on my GitHub.
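For a rough idea of the queue-based processing, here is a minimal sketch. It is not the solution from the repo: it uses plain dictionaries instead of classes, and press_button, modules, and state are names made up for illustration.

```python
from collections import deque


def press_button(modules, state):
    """Process one button press and return (low_count, high_count).

    modules: name -> (kind, destinations); kind is '%', '&' or 'b' (broadcaster).
    state:   flip-flop name -> bool, conjunction name -> {input: last pulse}.
    """
    counts = [0, 0]  # counts[0] = low pulses, counts[1] = high pulses
    queue = deque([('button', 'broadcaster', False)])
    while queue:
        source, name, high = queue.popleft()
        counts[high] += 1
        if name not in modules:  # untyped output module
            continue
        kind, destinations = modules[name]
        if kind == '%':
            if high:  # flip-flops ignore high pulses
                continue
            state[name] = not state[name]
            out = state[name]
        elif kind == '&':
            state[name][source] = high
            out = not all(state[name].values())
        else:  # broadcaster relays the pulse as-is
            out = high
        # Queue the pulses instead of calling the destinations directly.
        for destination in destinations:
            queue.append((name, destination, out))
    return counts[0], counts[1]


# First example from the problem statement.
modules = {
    'broadcaster': ('b', ['a', 'b', 'c']),
    'a': ('%', ['b']),
    'b': ('%', ['c']),
    'c': ('%', ['inv']),
    'inv': ('&', ['a']),
}
state = {'a': False, 'b': False, 'c': False, 'inv': {'c': False}}
low, high = press_button(modules, state)
```

The deque is what guarantees pulses are handled in the order they were sent.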

Part 2

This one was one of the most interesting problems this year. Simply simulating button presses wouldn't work. I ended up dumping the diagram as a dependency graph and it looks like the only module that signals rx is a conjunction module with multiple inputs.

Conjunction modules emit a low pulse when they remember high pulses being sent by all their connected inputs. In this case, we can simulate button presses and keep track when each input to this conjunction module emits a high pulse. Then we compute the least common multiple of these to determine when the rx module will get a low signal.

My full solution is here, though I'm still pretty sure it is topology-dependent. Meaning we might have a different setup where the inputs to this conjunction module are not fully independent, which might make LCM not return the correct answer.
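The last step can be sketched in a couple of lines using math.lcm (available since Python 3.9). This assumes we already recorded, for each input of the conjunction module, the button press at which it first emitted a high pulse, and that each input then repeats with that same period. The function name and the module names in the example are hypothetical.

```python
from math import lcm


def presses_until_low(first_high):
    """first_high: input name -> button press at which it first sent a high pulse."""
    # All inputs are high at the same time on the first press that is a
    # multiple of every period.
    return lcm(*first_high.values())


presses = presses_until_low({'in1': 3, 'in2': 4, 'in3': 6})
```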

Day 21

Problem statement is here.

Part 1

Part 1 is trivial, we can easily simulate 64 steps and count reachable spots.

Part 2

The second part is much more tricky - this is actually the problem I spent the most time on. Since the garden is infinite, and we are looking for a very high number of steps, we can't use the same approach as in part 1 to simply simulate moves.

Let's now call a tile a repetition of the garden on our infinite grid. Say we start with the garden at (0, 0). Then as we expand beyond its bounds, we reach tiles (-1, 0), (1, 0), (0, -1), (0, 1), which are repetitions of our initial garden.

The two observations that helped here were:

Once we reach all possible spots in a garden, following steps just cycle between the same two sets of reachable spots. Meaning once we spend enough time in a garden, we know how many spots are reachable in that particular garden by just looking at the parity of the total number of steps.
As the number of steps increases over the infinitely repeating garden, there is a pattern to how the covered area grows. This is a diamond shape where the center is always fully covered garden tiles (see the first observation above) and the surrounding tiles are at various stages of being visited.
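The first observation is easy to see on a toy example. The sketch below is not the code from the repo; reachable_counts is a made-up name and it deliberately stays inside a single, tiny garden instead of the infinite grid:

```python
def reachable_counts(garden, start, steps):
    """Number of reachable plots after 1..steps steps, staying inside one garden."""
    rows, cols = len(garden), len(garden[0])
    spots, counts = {start}, []
    for _ in range(steps):
        # Every reachable spot moves to each of its open neighbors.
        spots = {(x + dx, y + dy)
                 for x, y in spots
                 for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                 if 0 <= x + dx < rows and 0 <= y + dy < cols
                 and garden[x + dx][y + dy] != '#'}
        counts.append(len(spots))
    return counts


garden = ['...',
          '...',
          '...']
counts = reachable_counts(garden, (1, 1), 6)
```

Starting from the center of this 3x3 garden, counts alternates between two values: odd step counts reach the four edge plots, even step counts reach the center and the four corners.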

In fact, after we grow beyond the first 4 surrounding tiles, it seems like the garden grows with a periodicity of the size of the garden. Meaning every len(grid) steps, we reach new tiles. There are a few cases to consider - north, east, south, west, diagonals.

My approach was to do a probe - simulate the first few steps and record the results.

def probe():
    dx, dy = len(grid) // 2, len(grid[0]) // 2
    tiles, progress = {(dx, dy)}, {(0, 0): {0: 1}}
    
    i = 0
    while len(progress) < 41:
        i += 1
        new_tiles = set()
        for x, y in tiles:
            if grid[(x - 1) % len(grid)][y % len(grid[0])] != '#':
                new_tiles.add((x - 1, y))
            if grid[(x + 1) % len(grid)][y % len(grid[0])] != '#':
                new_tiles.add((x + 1, y))
            if grid[x % len(grid)][(y - 1) % len(grid[0])] != '#':
                new_tiles.add((x, y - 1))
            if grid[x % len(grid)][(y + 1) % len(grid[0])] != '#':
                new_tiles.add((x, y + 1))

        tiles = new_tiles

        for x, y in tiles:
            sq_x, sq_y = x // len(grid), y // len(grid[0])
            if (sq_x, sq_y) not in progress:
                progress[(sq_x, sq_y)] = {}
            if i not in progress[(sq_x, sq_y)]:
                progress[(sq_x, sq_y)][i] = 0
            progress[(sq_x, sq_y)][i] += 1

    return progress

Here progress keeps track, for each tile (keyed by its (x, y) tile coordinates relative to the starting tile (0, 0)), of how many spots are reachable at a given step. I run this until progress grows enough for the repeating pattern to show - because we start from the center of a garden but in all other tiles we enter from a side, it takes a couple of iterations for the pattern to stabilize. My guess is this probe could be smaller with some better math, but that's what I have.

With this, given a number of steps, we can reduce it using steps % len(grid) to a smaller value we can look up in our progress record. The reasoning being, if the pattern repeats, it doesn't really matter whether we are 3 steps into tile (-1000, 0) or 3 steps into tile (-3, 0).

The tedious part was determining the right offsets and special cases when computing the total number of squares. For example, even for the tiles that are fully covered, we'll have a subset where tiles are in the "odd" state of squares and a subset where tiles are in the "even" state.

I ended up with the following formula (which might still be buggy, but seemed to have worked for my input):

def at(x, y, step):
    return progress[(x, y)][step] if step in progress[(x, y)] else 0


def count(steps):
    even, odd = (1, 0) if steps % 2 == 0 else (0, 1)

    for i in range(1, steps // len(grid)):
        if steps % 2 == 0:
            if i % 2 == 0:
                even += 4 * i
            else:
                odd += 4 * i
        else:
            if i % 2 == 0:
                odd += 4 * i
            else:
                even += 4 * i

    total = even * at(0, 0, len(grid) * 2) + odd * at(0, 0, len(grid) * 2 + 1)

    total += at(-3, 0, len(grid) * 3 + steps % len(grid))
    total += at(3, 0, len(grid) * 3 + steps % len(grid))
    total += at(0, -3, len(grid) * 3 + steps % len(grid))
    total += at(0, 3, len(grid) * 3 + steps % len(grid))

    i = steps // len(grid) - 1

    total += i * at(-1, -1, len(grid) * 2 + steps % len(grid))
    total += i * at(-1, 1, len(grid) * 2 + steps % len(grid))
    total += i * at(1, -1, len(grid) * 2 + steps % len(grid))
    total += i * at(1, 1, len(grid) * 2 + steps % len(grid))
    
    i += 1
    
    total += i * at(-2, -1, len(grid) * 2 + steps % len(grid))
    total += i * at(-2, 1, len(grid) * 2 + steps % len(grid))
    total += i * at(2, -1, len(grid) * 2 + steps % len(grid))
    total += i * at(2, 1, len(grid) * 2 + steps % len(grid))
    
    return total

I'm covering all inner "even" and "odd" tiles, then the directly north, east, south, and west tiles, then two layers of diagonals. Again, I have a feeling this could be simpler, but I didn't bother to optimize it further.
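
The 4 * i terms in the loop come from counting tiles ring by ring: the tiles at Manhattan distance i from the starting tile form a diamond outline with 4 * i tiles. Here is a small brute-force sketch of that count. The helper names are mine, and it assumes, like the loop in count(), that neighboring rings are in opposite even/odd states.

```python
def ring(i):
    """Tiles at Manhattan distance exactly i from tile (0, 0), by brute force."""
    return [
        (x, y)
        for x in range(-i, i + 1)
        for y in range(-i, i + 1)
        if abs(x) + abs(y) == i
    ]


def full_tiles_by_parity(n):
    """Count tiles in rings 0..n-1, split into even-index and odd-index rings."""
    even_rings = sum(len(ring(i)) for i in range(0, n, 2))
    odd_rings = sum(len(ring(i)) for i in range(1, n, 2))
    return even_rings, odd_rings


print([len(ring(i)) for i in range(1, 5)])
print(full_tiles_by_parity(4))
```

For n = 4 this gives the same split as the loop in count() when steps is even and steps // len(grid) is 4.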

Day 22

Problem statement is here.

Part 1

For part one, we sort bricks by z coordinate (ascending), then we make each brick fall. We do this by decrementing their z coordinate and checking whether they intersect with any other brick.

def intersect(brick1, brick2):
    if brick1[0].x > brick2[1].x or brick1[1].x < brick2[0].x:
        return False
    
    if brick1[0].y > brick2[1].y or brick1[1].y < brick2[0].y:
        return False
    
    if brick1[0].z > brick2[1].z or brick1[1].z < brick2[0].z:
        return False
    
    return True


def slide_down(brick, delta):
    return (Point(brick[0].x, brick[0].y, brick[0].z - delta), Point(brick[1].x, brick[1].y, brick[1].z - delta))


def fall(brick):
    if min(brick[0].z, brick[1].z) == 1:
        return 0

    result, orig = 0, brick
    while True:
        brick = slide_down(brick, 1)
        for b in bricks:
            if b == orig:
                continue

            if intersect(brick, b):
                return result

        result += 1
        if min(brick[0].z, brick[1].z) == 1:
            return result


bricks = sorted(bricks, key=lambda b: min(b[0].z, b[1].z))

for i, brick in enumerate(bricks):
    if delta := fall(brick):
        bricks[i] = slide_down(brick, delta)

Once every brick that could fall has fallen to its final position, we need to find the critical bricks - the bricks that are the only support for some other brick. We do this by shifting each brick down again by 1 z and determining how many bricks it intersects with. If a shifted brick only intersects with one other brick, that other brick is a "critical" brick, so we add it to our set of "critical" support bricks. All other bricks can be safely removed.

critical = set()
for brick in bricks:
    if brick[0].z == 1 or brick[1].z == 1:
        continue

    supported_by = []
    nb = slide_down(brick, 1)
    for i, b in enumerate(bricks):
        if brick == b:
            continue

        if intersect(nb, b):
            supported_by.append(i)

    if len(supported_by) == 1:
        critical.add(supported_by[0])

print(len(bricks) - len(critical))
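
To see the overlap test and the support check working together, here is a self-contained sketch on four already-settled bricks. The brick layout is made up, the helper critical_supports() is a hypothetical wrapper around the loop above, and, like the original intersect(), it assumes the first corner of each brick is not larger than the second on any axis.

```python
from collections import namedtuple

Point = namedtuple("Point", "x y z")


def intersect(a, b):
    """True if two axis-aligned bricks (pairs of corner Points) share a cube."""
    return all(
        getattr(a[0], axis) <= getattr(b[1], axis)
        and getattr(b[0], axis) <= getattr(a[1], axis)
        for axis in "xyz"
    )


def slide_down(brick, delta):
    low, high = brick
    return (low._replace(z=low.z - delta), high._replace(z=high.z - delta))


def critical_supports(bricks):
    """Indices of bricks that are the only support of some other brick."""
    critical = set()
    for i, brick in enumerate(bricks):
        if brick[0].z == 1:
            continue  # resting on the ground
        below = slide_down(brick, 1)
        supports = [
            j for j, other in enumerate(bricks)
            if j != i and intersect(below, other)
        ]
        if len(supports) == 1:
            critical.add(supports[0])
    return critical


settled = [
    (Point(0, 0, 1), Point(2, 0, 1)),  # A: on the ground
    (Point(0, 0, 2), Point(0, 0, 2)),  # B: rests on A
    (Point(2, 0, 2), Point(2, 0, 2)),  # C: rests on A
    (Point(0, 0, 3), Point(2, 0, 3)),  # D: rests on B and C
]
critical = critical_supports(settled)
print(critical, len(settled) - len(critical))
```

Only A is critical here: B and C each rest on A alone, while D has two supports.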

Part 2

In part 2, we need to figure out which bricks each brick is supported by. We can use a similar algorithm to part 1, where we shift z by 1 and check which bricks we intersect. Then we can build a dependency graph of which brick is supported by which other bricks.

supported_by = {}
for i, brick in enumerate(bricks):
    supported_by[i] = set()

    if brick[0].z == 1 or brick[1].z == 1:
        continue

    nb = slide_down(brick, 1)
    for j, b in enumerate(bricks):
        if i == j:
            continue

        if intersect(nb, b):
            supported_by[i].add(j)

Then for each brick we remove, we can walk the supported by dependencies to determine which bricks would fall and would, in turn, cause other bricks to fall, without having to actually simulate falling.

def count_falling(i):
    sup = {k: supported_by[k].copy() for k in supported_by.keys()}
    queue, removed = [i], set()
    while queue:
        i = queue.pop(0)
        
        if i in removed:
            continue
        removed.add(i)

        for j in sup:
            if i in sup[j]:
                sup[j].remove(i)
                if len(sup[j]) == 0:
                    queue.append(j)

    return len(removed) - 1


print(sum(count_falling(i) for i in range(len(supported_by))))
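
Here is the same walk on a tiny hand-built dependency graph. This is a sketch rather than my exact code: it takes the graph as a parameter and uses a deque instead of list.pop(0), and the four-brick graph is made up.

```python
from collections import deque


def count_falling(supported_by, removed_brick):
    """Number of other bricks that fall if `removed_brick` is disintegrated."""
    remaining = {k: set(v) for k, v in supported_by.items()}
    queue, gone = deque([removed_brick]), set()
    while queue:
        i = queue.popleft()
        if i in gone:
            continue
        gone.add(i)
        for j, supports in remaining.items():
            if i in supports:
                supports.remove(i)
                if not supports:
                    # j just lost its last support, so it falls too.
                    queue.append(j)
    return len(gone) - 1


# Brick 0 is on the ground, 1 and 2 rest on 0, and 3 rests on both 1 and 2.
supported_by = {0: set(), 1: {0}, 2: {0}, 3: {1, 2}}
print([count_falling(supported_by, i) for i in supported_by])
```

Removing brick 0 takes everything down, while removing brick 1 or 2 alone leaves brick 3 with one remaining support.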

Day 23

Problem statement is here.

The main insight here for both part 1 and part 2 is that we can model the paths as a graph where each intersection (decision point) is a vertex and the paths between intersections are edges. With this representation, we simply need to find the longest path between our starting point and our end point.

In part 1, we have a directed graph, as right before hitting each intersection, we have a ><^v constraint, making the path one-way. In part 2, we have an undirected graph.

Note that the longest path problem in a graph is harder than the shortest path problem. That said, we are dealing with extremely small graphs.
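
Since the graphs are so small, an exhaustive depth-first search is enough. The sketch below is a generic version, not my actual solution: it assumes the graph is a dict mapping each intersection to its neighbors and the path length to reach them, and the sample graphs are made up.

```python
def longest_path(graph, start, end, seen=None):
    """Length of the longest simple path from start to end, or None if none."""
    if start == end:
        return 0
    seen = (seen or set()) | {start}
    best = None
    for neighbor, length in graph[start].items():
        if neighbor in seen:
            continue  # never visit the same intersection twice
        rest = longest_path(graph, neighbor, end, seen)
        if rest is not None and (best is None or length + rest > best):
            best = length + rest
    return best


# Directed graph (as in part 1): edges can only be walked one way.
directed = {'A': {'B': 3, 'C': 1}, 'B': {'D': 2}, 'C': {'B': 5, 'D': 4}, 'D': {}}
# Undirected graph (as in part 2): every edge is listed in both directions.
undirected = {
    'A': {'B': 3, 'C': 1},
    'B': {'A': 3, 'C': 5, 'D': 2},
    'C': {'A': 1, 'B': 5, 'D': 4},
    'D': {'B': 2, 'C': 4},
}
print(longest_path(directed, 'A', 'D'), longest_path(undirected, 'A', 'D'))
```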

Day 24

Problem statement is here.

Part 1

Part 1 was fairly straightforward: for each pair of lines, solve the equation to find where they meet and check if within bounds (when lines are not parallel).

Since each line is described by a point $(x_{origin}, y_{origin})$ and a vector $(dx, dy)$, we can represent them as

\[\begin{cases} x = x_{origin} + dx * t \\ y = y_{origin} + dy * t \end{cases}\]

Then the lines intersect when

\[\begin{cases} x_1 + dx_1 * t_1 = x_2 + dx_2 * t_2 \\ y_1 + dy_1 * t_1 = y_2 + dy_2 * t_2 \end{cases}\]

We know all of $(x_1, y_1), (dx_1, dy_1), (x_2, y_2), (dx_2, dy_2)$ so we solve for $t_1$ and $t_2$.

def intersect(p1, v1, p2, v2):
    # Cross-multiplied parallel check: avoids dividing by zero when a dy is 0.
    if v1.dx * v2.dy == v2.dx * v1.dy:
        return None, None

    t2 = (v1.dx * (p2.y - p1.y) + v1.dy * (p1.x - p2.x)) / (v2.dx * v1.dy - v2.dy * v1.dx)
    t1 = (p2.y + v2.dy * t2 - p1.y) / v1.dy

    return t1, t2

Once we have t1 and t2, we need to check both are positive (so the intersection didn't happen in the past), and make sure the intersection point, which is either (x1 + dx1 * t1, y1 + dy1 * t1) or (x2 + dx2 * t2, y2 + dy2 * t2), is within our bounds (at least 200000000000000 and at most 400000000000000).

If that's the case, then we found an intersection and we can add it to the total.
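
Putting the solve and the two checks together, here is a self-contained sketch with small illustrative numbers and a test area of 7 to 27. It is an assumption-laden variant of the code above, not the code itself: it takes plain tuples, uses Fraction to avoid floating point rounding, and the names crossing() and inside() are hypothetical.

```python
from fractions import Fraction


def crossing(p1, v1, p2, v2):
    """Point where two 2D paths cross in the future, or None."""
    (x1, y1), (dx1, dy1) = p1, v1
    (x2, y2), (dx2, dy2) = p2, v2
    denominator = dx2 * dy1 - dy2 * dx1
    if denominator == 0:
        return None  # parallel paths
    t2 = Fraction(dx1 * (y2 - y1) + dy1 * (x1 - x2), denominator)
    # Recover t1 from whichever axis the first path actually moves along.
    if dx1 != 0:
        t1 = (x2 + dx2 * t2 - x1) / dx1
    else:
        t1 = (y2 + dy2 * t2 - y1) / dy1
    if t1 < 0 or t2 < 0:
        return None  # the paths crossed in the past
    return x1 + dx1 * t1, y1 + dy1 * t1


def inside(point, low, high):
    return all(low <= coordinate <= high for coordinate in point)


point = crossing((19, 13), (-2, 1), (18, 19), (-1, -1))
print(point, inside(point, 7, 27))
```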

Part 2

Part 2 was really fun. We now have 3 dimensions, so a line is represented as

\[\begin{cases} x = x_{origin} + dx * t \\ y = y_{origin} + dy * t \\ z = z_{origin} + dz * t \end{cases}\]

We need to find a line (the trajectory of our rock) that intersects each line in our input at a different time, such that for some $t$ and line $l$, we have

\[\begin{cases} x_{origin_{l}} + dx_l * t = x_{origin_{rock}} + dx_{rock} * t \\ y_{origin_{l}} + dy_l * t = y_{origin_{rock}} + dy_{rock} * t \\ z_{origin_{l}} + dz_l * t = z_{origin_{rock}} + dz_{rock} * t \end{cases}\]

One way to solve this is using linear algebra. If we take 3 different hailstones and our rock, we end up with the following set of equations:

\[\begin{cases} x_{origin_{1}} + dx_1 * t_1 = x_{origin_{rock}} + dx_{rock} * t_1 \\ y_{origin_{1}} + dy_1 * t_1 = y_{origin_{rock}} + dy_{rock} * t_1 \\ z_{origin_{1}} + dz_1 * t_1 = z_{origin_{rock}} + dz_{rock} * t_1 \\ x_{origin_{2}} + dx_2 * t_2 = x_{origin_{rock}} + dx_{rock} * t_2 \\ y_{origin_{2}} + dy_2 * t_2 = y_{origin_{rock}} + dy_{rock} * t_2 \\ z_{origin_{2}} + dz_2 * t_2 = z_{origin_{rock}} + dz_{rock} * t_2 \\ x_{origin_{3}} + dx_3 * t_3 = x_{origin_{rock}} + dx_{rock} * t_3 \\ y_{origin_{3}} + dy_3 * t_3 = y_{origin_{rock}} + dy_{rock} * t_3 \\ z_{origin_{3}} + dz_3 * t_3 = z_{origin_{rock}} + dz_{rock} * t_3 \end{cases}\]

In the above system, we know all of the starting points and vectors of the hailstones. Our unknowns are $t_1, t_2, t_3, x_{origin_{rock}}, y_{origin_{rock}}, z_{origin_{rock}}, dx_{rock}, dy_{rock}, dz_{rock}$. That's 9 unknowns and 9 equations, so it should be solvable.

While this approach works, I didn't want to use a numerical library to solve this (I'm trying to keep dependencies at a minimum), and implementing the math from scratch was a bit too much for me. I thought of a different approach: as long as we can find a rock trajectory that intersects the first couple of hailstones at the right times, we most likely found our solution.

\[\begin{cases} x_{origin_{rock}} + dx_{rock} * t_1 = x_1 + dx_1 * t_1 \\ y_{origin_{rock}} + dy_{rock} * t_1 = y_1 + dy_1 * t_1 \\ x_{origin_{rock}} + dx_{rock} * t_2 = x_2 + dx_2 * t_2 \\ y_{origin_{rock}} + dy_{rock} * t_2 = y_2 + dy_2 * t_2 \end{cases}\]

If we solve this for $t_1$ and $t_2$, we can then easily determine $z_{origin_{rock}}$ and $dz_{rock}$.

In the above set of equations, we have too many unknowns: $x_{origin_{rock}}, dx_{rock}, y_{origin_{rock}}, dy_{rock}, t_1, t_2$. We can reduce this number by trying out different values for a couple of these unknowns. While the ranges of possible values for $x_{origin_{rock}}, y_{origin_{rock}}, t_1, t_2$ are very large, and so infeasible to cover, the ranges of $dx_{rock}$ and $dy_{rock}$ should be small - if these values are large, our rock will quickly shoot past all the other hailstones.

My approach was to try all possible values between -1000 and 1000 for both of these, then see if we can find $x_{origin_{rock}}, y_{origin_{rock}}, t_1, t_2$ such that the rock intersects the first two hailstones. If we do, we then find $z_{origin_{rock}}, dz_{rock}$ (easy to find since now we know $t_1, t_2$). We have an additional helpful constraint: the origin coordinates of the rock need to be integers.

Then we just need to check that indeed for the given $(x_{origin_{rock}}, y_{origin_{rock}}, z_{origin_{rock}})$ and $(dx_{rock}, dy_{rock}, dz_{rock})$, for each hailstone, there is a time $t_i$ when they intersect.

Here is the code:

def find(rng):
    for dx in range(-rng, rng):
        for dy in range(-rng, rng):
            x1, y1, z1 = hails[0][0]
            dx1, dy1, dz1 = hails[0][1]
            x2, y2, z2 = hails[1][0]
            dx2, dy2, dz2 = hails[1][1]

            # x + dx * t1 = x1 + dx1 * t1
            # y + dy * t1 = y1 + dy1 * t1
            # x + dx * t2 = x2 + dx2 * t2
            # y + dy * t2 = y2 + dy2 * t2

            # x = x1 + t1 * (dx1 - dx)        
            # t1 = (x2 - x1 + t2 * (dx2 - dx)) / (dx1 - dx)
            # y = y1 + (x2 - x1 + t2 * (dx2 - dx)) * (dy1 - dy) / (dx1 - dx)
            # t2 = ((y2 - y1) * (dx1 - dx) - (dy1 - dy) * (x2 - x1)) / ((dy1 - dy) * (dx2 - dx) + (dy - dy2) * (dx1 - dx))

            if (dy1 - dy) * (dx2 - dx) + (dy - dy2) * (dx1 - dx) == 0:
                continue

            t2 = ((y2 - y1) * (dx1 - dx) - (dy1 - dy) * (x2 - x1)) / ((dy1 - dy) * (dx2 - dx) + (dy - dy2) * (dx1 - dx))

            if not t2.is_integer() or t2 < 0:
                continue

            if (dx1 - dx) == 0:
                continue

            y = y1 + (x2 - x1 + t2 * (dx2 - dx)) * (dy1 - dy) / (dx1 - dx)

            if not y.is_integer():
                continue

            t1 = (x2 - x1 + t2 * (dx2 - dx)) / (dx1 - dx)

            if not t1.is_integer() or t1 < 0:
                continue

            x = x1 + t1 * (dx1 - dx)        

            # z + dz * t1 = z1 + dz1 * t1
            # z + dz * t2 = z2 + dz2 * t2        

            # dz = (z1 + dz1 * t1 - z2 - dz2 * t2) / (t1 - t2)
            # z = z1 + dz1 * t1 - dz * t1

            if t1 == t2:
                continue

            dz = (z1 + dz1 * t1 - z2 - dz2 * t2) / (t1 - t2)

            if not dz.is_integer():
                continue

            z = z1 + dz1 * t1 - dz * t1

In the above x, y, z, dx, dy, dz are the rock's origin and vector.

The final step (omitted from the code sample for brevity) is to confirm that for the given origin and vector, we end up eventually intersecting all other hailstones.
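
That omitted check could look roughly like the sketch below. It is not the code from my solution: hit_time() and hits_all() are hypothetical helpers, positions and velocities are plain integer tuples, and the sample values are small illustrative numbers. For each axis we solve rock + v_rock * t = hail + v_hail * t and require the same non-negative integer t on all three axes.

```python
def hit_time(rock_p, rock_v, hail_p, hail_v):
    """Integer time t >= 0 at which the rock meets the hailstone, or None."""
    t = None
    for rp, rv, hp, hv in zip(rock_p, rock_v, hail_p, hail_v):
        # rp + rv * t == hp + hv * t  is the same as  t * (rv - hv) == hp - rp
        if rv == hv:
            if rp != hp:
                return None  # same speed on this axis but never aligned
            continue  # aligned on this axis at every t
        quotient, remainder = divmod(hp - rp, rv - hv)
        if remainder != 0 or quotient < 0:
            return None
        if t is not None and t != quotient:
            return None  # the axes disagree on the collision time
        t = quotient
    return 0 if t is None else t


def hits_all(rock_p, rock_v, hails):
    return all(hit_time(rock_p, rock_v, p, v) is not None for p, v in hails)


rock_p, rock_v = (24, 13, 10), (-3, 1, 2)
hails = [((19, 13, 30), (-2, 1, -2)), ((18, 19, 22), (-1, -1, -2))]
print([hit_time(rock_p, rock_v, p, v) for p, v in hails])
```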

I really enjoyed this problem as it made me work through the math.

Day 25

Problem statement is here.

I liked this problem. It turned out to be a variation of the minimum cut problem. Trying out all possible permutations of nodes would take way too much time. The algorithm I used keeps track of a set of visited nodes - one of the two components. Then at each step, we add a new node to this set by selecting the most connected node to this component (meaning the node that has most edges incoming from visited nodes).

most_connected() determines which node we want to pick next:

def most_connected(visited):
    best_n, best_d = None, 0
    for n in graph:
        if n in visited:
            continue

        neighbors = sum(1 for v in graph[n] if v in visited)
        if neighbors > best_d:
            best_n, best_d = n, neighbors

    return best_n

Then we keep going until our component has exactly 3 outgoing edges to nodes that haven't been visited yet:

def find_components():
    start = list(graph.keys())[0]
    visited = {start}
    while len(visited) < len(graph):
        total = 0
        for n in visited:
            total += sum(1 for v in graph[n] if v not in visited)
        
        if total == 3:
            return visited

        n = most_connected(visited)
        visited.add(n)

That's where we need to make the cut. We just need to multiply len(visited) with len(graph) - len(visited) to find our answer.
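
To try the idea end to end, here is a self-contained sketch on a made-up graph: two groups of four fully connected nodes, joined by exactly three edges. The build_graph() helper and the edge list are my own illustration, and grow_component() folds the two functions above into one. Note this greedy growth is a heuristic, so on other graphs it can fail to find a component with exactly 3 outgoing edges, in which case it returns None.

```python
def build_graph(edges):
    """Undirected adjacency sets from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def grow_component(graph, cut_size=3):
    """Grow a component greedily until exactly `cut_size` edges leave it."""
    visited = {next(iter(graph))}
    while len(visited) < len(graph):
        outgoing = sum(1 for n in visited for v in graph[n] if v not in visited)
        if outgoing == cut_size:
            return visited
        # Pick the unvisited node with the most edges into the component.
        best_n, best_d = None, 0
        for n in graph:
            if n in visited:
                continue
            links = sum(1 for v in graph[n] if v in visited)
            if links > best_d:
                best_n, best_d = n, links
        if best_n is None:
            return None  # nothing left that touches the component
        visited.add(best_n)
    return None


edges = [
    ('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd'),
    ('e', 'f'), ('e', 'g'), ('e', 'h'), ('f', 'g'), ('f', 'h'), ('g', 'h'),
    ('a', 'e'), ('b', 'f'), ('c', 'g'),  # the three edges to cut
]
graph = build_graph(edges)
component = grow_component(graph)
print(sorted(component), len(component) * (len(graph) - len(component)))
```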

I personally found the most difficult problems to be part 2 of days 20, 21, and 24, and the one and only part of day 25. All of these took me a bit to figure out. That said, Advent of Code is always a nice holiday pastime and I can't wait for the 2024 iteration.

Notes on Platform Development

Thu, 28 Dec 2023 00:00:00 -0800

Notes on Platform Development

I spent the past few years building a platform for Loop components within the Microsoft 365 ecosystem. While some of the learnings might only apply to our particular scenario, I think some observations apply broadly.

We've been using 1P/2P/3P to mean our team (1P), other teams within Microsoft (2P), and external developers (3P). Loop started with a set of 1P components and we set out to extract a developer platform out of these that can be leveraged by other teams. We currently have a set of 2P components built on our platform, and a 3P developer story centered around Adaptive Cards.

In this blog post I'll cover some of my learnings with regard to platform development.

1P != 3P

Aspirationally, we set out with the stated goal of 1P equals 3P, meaning 3rd party developers should be building on the same platform as 1st party developers. Looking at it another way, if the platform is good enough for 1st party, it should be just as good for 3rd party - this is a statement of platform capabilities and maturity and a lofty goal.

That said, I don't think this is realistic, especially within a product like Office, where user experience is paramount. That is because we have two audiences to consider: we have the developer audience - users building on our platform, and we have Office users, people who get to use the end product. Mediating between the two is quite a challenge.

A simple example is the classic performance/security tradeoff. Especially as Loop components are embedded in other applications, what level of isolation do we provide? Loop components are built with web technology. An iframe provides great isolation (best security) but iframes add performance overhead (worse perf). If we host a Loop component without an iframe, we get better performance, but we open up the whole DOM to the component. If we threat model this, we immediately see that we don't necessarily need isolation for Loop components developed within Microsoft (we don't expect our partner teams to write malicious code) but we absolutely need to isolate code written by 3rd party developers. Of course, we could say "just isolate everything", which might even have other advantages, but do we want to take the perf hit? Our other audience, people who use our product, would be negatively impacted by an overhead we can technically avoid.

Another example in the same vein: overall user experience. The more we make Loop components feel like part of the hosting app, the smoother the end user experience is. On the other hand, we can't realistically test every single Loop component built by any 3rd party developer. The way Office services and products are deployed and administered, tenant admins can configure which 3rd party extensions are enabled within the tenant. The Microsoft tenant we use internally has some set of extensions available, but not all. That means there are always 3rd party extensions we never even see. Now if one of these extensions doesn't work properly (errors out, looks out of place, is slow etc.), end users might end up dissatisfied with the overall experience of using Office products. For internally developed components, we get to dogfood and keep a high bar, but this doesn't scale to a wide developer audience. Our current approach is to offer 3rd party development via Adaptive Cards. This way, we don't run 3rd party code on clients and we have a set of consistent UI controls. Ideally, we'd like to enable custom code but at the time of writing we're still thinking through the best approach considering all of the challenges listed above.

Finally, I think another key difference is the product goals. The platform audience are the developers, but the product audience are the users. There's usually a tension between these. For example, an internal team builds a Loop component. They come up with a requirement that is a "must" to deliver their scenario. For example, we had a component developed by a partner team that asked us to check the tenant's Cloud Policy service to see whether the component should be on or off. This makes perfect sense in this case, since the backing service might not be running in the tenant. We offer tenant admins a different way to control 3rd party extensions, so this platform capability would not make sense for a 3rd party. In general, a lot of our internal platform capability requests come from the desire to provide the best possible end user experience. If our only customer were the developers using the platform, we would probably say "no" to some of these - not general enough, doesn't benefit 3rd party etc. But, of course, Office has way more users than developers.

I think the 1P/3P challenge is common to most platforms built from within product teams (or supporting product teams within the same company). With Loop, this is compounded by the fact we are deeply integrated within other applications. I can think of some notable examples when the strong push for a "1P equals 3P" platform ended up disastrously - Windows Longhorn was supposed to be built on a version of .NET that was just not good enough for core OS pieces. I can also think of many platforms that provide sufficient capabilities for 3rd party developers but 1st/2nd party developers don't use. And I think this is OK - building a platform for 3P lets you focus on the developer community needs. Supporting 1P/2P might be best served by focusing on the product goals and unique scenario needs rather than trying to generalize to a public platform.

Life stages

A platform goes through several life stages, each with its own characteristics and challenges. Looking back at how our platform evolved (and how I foresee the future), a successful platform goes through 4 life stages: incubation, 0 to 1, stabilization, and commoditization.

Incubation

At this stage, it's all one team building both the what-will-become-a-platform and the product supported by this platform. During the incubation stage, the platform doesn't really have any users (meaning developers leveraging the platform). We are free to toy with ideas. If we want to make a breaking change to an API, we can easily do it and fix the handful of internal calls. At this point, everything is in flux - the canvas is blank and we have plenty of room to innovate.

On the other hand, we don't really have a clear idea of what developers would need out of the platform - we know what the main scenario we are supporting needs, but we don't have a feedback loop yet. At this stage, we need to rely on experience and intuition to set some initial direction.

0 to 1

This is the biggest growth stage. "0 to 1" is a nod to Peter Thiel's Zero to One book. The platform goes from no users to a few users - and by "users" here I mean developers. Taking the platform from 0 (or incubation) to 1 means supporting a handful of "serious" production scenarios.

We now have a feedback loop and developers able to give us requirements - we can now understand their needs rather than have to divine them ourselves. As a side note, this is the approach we took with Loop, where we worked closely with a set of 2P partners to light up scenarios and grow the platform to support these.

At this stage, it's already difficult to make breaking changes. Since there is already a set of dependencies on the platform, a breaking change requires a lot of coordination. Or some form of backwards compatibility. Or legacy support. There are different ways to go about this (maybe in another blog post), but the key point is we can no longer churn as fast as we could during the incubation stage. And added costs at the 0 to 1 stage are painful.

Another challenge is generalization. We have a handful of partners with a handful of requests for the platform. And we're in the growth stage, so we most likely need to move fast. There's a big tension between how fast we can light up new platform capabilities and how much time we spend thinking through design patterns and future-proofing. If we just say "yes" to every ask, we can move fast but risk ending up with a very gnarly platform that has many one-off pieces and a very inconsistent developer story. On the other hand, we can spend a lot of time iterating on design and predicting how an incoming requirement would scale when the platform is large, all the way until our partners give up on us or funding runs out. There is no silver bullet for this - you always end up somewhere in the middle, with parts of the platform that you wished were done differently, but hopefully still alive and kicking in the next stage.

Stabilization

At this point, enough developers depend on the platform that ad-hoc breaking changes are no longer possible. By "stabilization" I don't mean the platform stops growing - in fact, this is the stage where we get most feedback and requests. But while the platform continues to grow incrementally, changes become even more difficult as they can break the whole ecosystem.

There are now enough users that early design decisions that proved wrong become obvious, but it's too late to change them. This is a natural "if I knew then what I know now" point for any platform that can't really be avoided.

This is the point where most platforms start producing new major version numbers that aim to address large swaths of issues and add new bundles of functionality. But while during the incubation stage a change could land in a few days, and in the 0 to 1 stage maybe weeks or at most months, breaking changes at this stage take years to land - many developers means not all of them are ready right away to update their code to the newest patterns. The platform needs some form of long-term support for older versions and deprecation/removal becomes a long journey.

On the other hand, the core of the platform is stable by now and battle-tested. The final step is the platform becoming a commodity.

Commoditization

At this stage, the platform is mature and robust. A large developer community depends on it and the platform is mostly feature complete. Some new requirements might pop up from time to time, but not very often.

At this stage developers rely on existing behaviors and change is next to impossible. That's because a lot of the developer solutions are also "done" by now and people moved on. Nobody wants to go back and update things to support API changes. The platform is a useful commodity.

This is also the stage where active development slows down and fewer engineers are required to keep things going. We haven't reached this stage with Loop, we are still growing the platform and moving fast. But any successful platform should reach this stage - a low-churn state where its capabilities (and gotchas) are well understood and reliable.

Each of the stages requires a different approach to evolving the platform. The speed with which we add capabilities, churn, how updates are rolled out, how we design new features - all happen in different ways and at a different pace depending on where the platform is and its number of users.

Summary

In this post I covered two main aspects of platform development: the tension between supporting 3rd party developers and ensuring end users have the best possible experience; and the different stages of a platform. As usage increases, changes become more difficult and early decisions solidify, for better or worse.

If I look at other platforms, I can easily see how they went through the same growing pains and challenges.

I'll probably have more to write on the topic of platform development, since this has been my main job for a while now.

Mental Poker Part 3: Transport

Tue, 28 Nov 2023 00:00:00 -0800

Mental Poker Part 3: Transport

Now that my LLM book is done, I can get back to the Mental Poker series. A high-level overview can be found here. In the previous posts we covered cryptography and a Fluid append-only list data structure. We'll be using the append-only list (we called this fluid-ledger) to model games.

An append-only list should be all that is needed to model turn-based games: each turn is an element added to the list. In this post, we'll stitch things together and look at the transport layer for our games.

Transport

Our basic transport interface is very simple:

declare interface ITransport<T> {
    getActions(): IterableIterator<T>;

    postAction(value: T): Promise<void>;

    once(event: "actionPosted", listener: (value: T) => void): this;
    on(event: "actionPosted", listener: (value: T) => void): this;
    off(event: "actionPosted", listener: (value: T) => void): this;
}

For some type T, we have:

A getActions(), which returns an iterator over all values (of type T) posted so far.
A postAction(), which takes a value of type T and posts it, and an actionPosted event, which fires whenever any of the clients posts an action (this relies on the Fluid data synchronization).
And the standard EventEmitter methods.

We'll cover why we call these values actions in a future post.

The basic implementation of this on top of the fluid-ledger distributed data structure looks like this:

class FluidTransport<T> extends EventEmitter implements ITransport<T> {
    constructor(private readonly ledger: ILedger<string>) {
        super();
        ledger.on("append", (value) => {
            this.emit("actionPosted", JSON.parse(value) as T);
        });
    }

    *getActions() {
        for (const value of this.ledger.get()) {
            yield JSON.parse(value) as T;
        }
    }

    postAction(value: T) {
        return Promise.resolve(this.ledger.append(JSON.stringify(value)));
    }
}

The constructor takes an ILedger (this is the interface we looked at in the previous post).

It hooks up an event listener to the ledger's append event to in turn trigger an actionPosted event. We also convert the incoming value from string to T using JSON.parse().

Similarly, getActions() is a simple wrapper over the underlying ledger, doing the same conversion to T.

Finally, the postAction() does the reverse - it converts from T to a string and appends the value to the ledger.

With this in place, we abstracted away the Fluid-based transport details. We will separately set up a Fluid container and establish connection to other clients (in a future post), then take the ILedger instance, pass it to FluidTransport, and we are good to go.

We can model games on top of just these two primitives: postAction() and actionPosted. Whenever we take a turn, we call postAction(). Whenever any player takes a turn, the actionPosted event is fired.

Since weâre designing Mental Poker, which takes place in a zero-trust environment, letâs make sure our transport is secure.

Signature verification

Signature verification allows us to ensure that in a multiplayer game, players canât spoof each other, meaning Alice canât pretend she is Bob and post an action on Bobâs behalf for other clients to misinterpret.

Note in a 2-player game this is not strictly needed if we trust the channel: we know that if a payload was not sent by us, it was sent by the other player. But in games with more players, we need to protect against spoofing. Signatures are also useful in case we donât trust the channel - maybe itâs supposed to be a 2-player game but a third client gets access to the channel and starts sending messages.

We will implement this using public key cryptography. The way this works is each player generates (locally) a public/private key pair. They broadcast the public key to all other players. Then they can sign any message they send with their private key and other players can validate the signature using the public key. Nobody else can sign on their behalf, since the private key is kept private.

I won't go into deeper detail here, since this is very standard public key cryptography. In fact, I didn't even cover this in the blog post covering cryptography for Mental Poker for this reason. There, I focused on the commutative SRA encryption algorithm. Unlike SRA, which we had to implement by hand, signature verification is part of the standard Web Crypto API. Let's implement signature verification on top of this.

First, we need to model a public/private key pair:

// Keys are represented as strings
export type Key = string;

// Public/private key pair
export type PublicPrivateKeyPair = {
    publicKey: Key;
    privateKey: Key;
};

A key is a string. We model the key pair as PublicPrivateKeyPair, a type containing two keys. Here's how we generate the key pair using the Web Crypto API:

import { encode, decode } from "base64-arraybuffer";

async function generatePublicPrivateKeyPair(): Promise<PublicPrivateKeyPair> {
    const subtle = crypto.subtle;
    const keys = await subtle.generateKey(
        {
            name: "rsa-oaep",
            modulusLength: 4096,
            publicExponent: new Uint8Array([1, 0, 1]),
            hash: "sha-256",
        },
        true,
        ["encrypt", "decrypt"]
    );

    return {
        publicKey: encode(await subtle.exportKey("spki", keys.publicKey)),
        privateKey: encode(
            await subtle.exportKey("pkcs8", keys.privateKey)
        ),
    };
}

We use subtle to generate our key pair and return both public and private keys as base64-encoded strings.

We can similarly rely on subtle for signing. The following function takes a string payload and signs it with the given private key. The response is the base64-encoded signature.

async function sign(
    payload: string,
    privateKey: Key
): Promise<string> {
    const subtle = crypto.subtle;

    const pk = await subtle.importKey(
        "pkcs8",
        decode(privateKey),
        { name: "RSA-PSS", hash: "SHA-256" },
        true,
        ["sign"]
    );

    return encode(
        await subtle.sign(
            { name: "RSA-PSS", saltLength: 256 },
            pk,
            decode(payload)
        )
    );
}

First, we import the given privateKey, then we call subtle.sign() to sign the base64-decoded payload. We re-encode the signature to base64 and return it as a string.

Finally, this is how we verify signatures:

async function verifySignature(
    payload: string,
    signature: string,
    publicKey: Key
): Promise<boolean> {
    const subtle = crypto.subtle;

    const pk = await subtle.importKey(
        "spki",
        decode(publicKey),
        { name: "RSA-PSS", hash: "SHA-256" },
        true,
        ["verify"]
    );

    return subtle.verify(
        { name: "RSA-PSS", saltLength: 256 },
        pk,
        decode(signature),
        decode(payload)
    );
}

Here, we import the given publicKey, then we use subtle.verify(). For signature verification, we pass in a signature and the payload that was signed (decoded from base64). This API returns true if the signature matches, meaning it was indeed signed with the private key corresponding to the public key we provided.

Again, I won't go deep into the subtle APIs as they are standard and very well documented. The main takeaway is now we have 3 APIs:

generatePublicPrivateKeyPair() to generate key pairs.
sign() to sign a payload.
verifySignature() to validate the signature.

We'll put these in the Signing namespace.
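To see the sign/verify round trip end to end, here is a self-contained sketch. It does not call the functions above: it assumes a runtime with a global Web Crypto implementation, generates an RSA-PSS key pair directly, uses a 2048-bit modulus and a 32-byte salt to keep the demo fast, and encodes the payload with TextEncoder. signAndVerifyDemo is a hypothetical name used only for illustration.

```typescript
// Sketch only: a standalone round trip, not the Signing namespace itself.
async function signAndVerifyDemo(): Promise<{ genuine: boolean; tampered: boolean }> {
    const subtle = globalThis.crypto.subtle;

    // Generate a key pair that is usable for signing and verification.
    const keys = await subtle.generateKey(
        {
            name: "RSA-PSS",
            modulusLength: 2048, // smaller than 4096 to keep the demo fast
            publicExponent: new Uint8Array([1, 0, 1]),
            hash: "SHA-256",
        },
        false,
        ["sign", "verify"]
    );

    const encoder = new TextEncoder();
    const algorithm = { name: "RSA-PSS", saltLength: 32 };

    // Sign a stringified payload with the private key.
    const payload = encoder.encode(JSON.stringify({ move: "draw" }));
    const signature = await subtle.sign(algorithm, keys.privateKey, payload);

    // The untouched payload verifies against the public key.
    const genuine = await subtle.verify(algorithm, keys.publicKey, signature, payload);

    // A different payload with the same signature does not.
    const forged = encoder.encode(JSON.stringify({ move: "fold" }));
    const tampered = await subtle.verify(algorithm, keys.publicKey, signature, forged);

    return { genuine, tampered };
}
```

The second verify() call is the interesting one: changing even one field of the payload invalidates the signature, which is what lets other players reject altered or spoofed actions.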

Now let's layer this cryptography over our FluidTransport.

Signed transport

Now that we have our Fluid-based implementation of the ITransport interface and signature verification functions, we'll provide another implementation of this interface that handles signature verification.

First, we need a generic Signed type:

type ClientId = string;

type Signed<T> = T & { clientId?: ClientId; signature?: string };

This takes any type T and extends it with an optional clientId and signature. We'll represent client IDs as strings.

Now we can decorate any payload in our transport with an optional clientId and signature, which we can then validate using the functions we just implemented. These are optional because there is a state in which signing is unavailable: before clients exchange public keys. During the key exchange steps, no message can be signed, since no client knows the public key of any other client. Once keys are exchanged, all subsequent messages should be signed, and we'll enforce that in SignedTransport.

We also need a KeyStore. This keeps track of which public key belongs to each client, to help with our signature verification (meaning we keep track of which public key is Alice's, which one is Bob's, and when we get a message from Alice we know which key to use to verify authenticity).

type KeyStore = Map<ClientId, Key>;

We also need a ClientKey type, representing a single client ID/private key pair:

export type ClientKey = { clientId: ClientId; privateKey: Key };

With these additional type definitions in place, we can start building our SignedTransport. This is a decorator that takes an ITransport<Signed<T>>. We'll first look at the constructor:

class SignedTransport<T> extends EventEmitter implements ITransport<T> {
    constructor(
        private readonly transport: ITransport<Signed<T>>,
        private readonly clientKey: ClientKey,
        private readonly keyStore: KeyStore
    ) {
        super();
        transport.on("actionPosted", async (value) => {
            this.emit("actionPosted", await this.verifySignature(value));
        });
    }

/* ... */

This new class has 3 private properties. Let's discuss them in turn.

transport is our underlying ITransport<Signed<T>>. The idea is we can instantiate a FluidTransport (or another transport if needed, though for this project I have no plans of using a transport other than Fluid), then pass it in the constructor here. Then SignedTransport will use the provided instance for postAction() and actionPosted, simply adding signature verification over it.

The clientKey should be this client's ID and private key. This class is not concerned with key generation, just signing and verification, so we'll have to generate the key pair somewhere else and pass it in. We'll use this to sign our outgoing payloads.

We also pass in a keyStore. This should have the client ID to public key mapping for all players in the game. We use this to figure out which public key to use to validate each posted action.

Existing actions

getActions() simply calls the underlying transport - we are not doing signature verification on existing messages, since they were likely sent before the signed transport was created and cannot be verified.

*getActions() {
    for (const value of this.transport.getActions()) {
        yield value;
    }
}

We only validate incoming actions.

Incoming actions

The constructor body hooks up the actionPosted event to the transport's actionPosted. So whenever the underlying transport fires the event, the SignedTransport will also fire an actionPosted event. But instead of just passing value through, we call verifySignature() on the value first.

Let's look at verifySignature next (this is also part of the SignedTransport class):

private async verifySignature(value: Signed<T>): Promise<T> {
    if (!value.clientId || !value.signature) {
        throw Error("Message missing signature");
    }

    // Remove signature and client ID from object and store them
    const clientId = value.clientId;
    const signature = value.signature;

    delete value.clientId;
    delete value.signature;

    // Figure out which public key we need to use
    const publicKey = this.keyStore.get(clientId);

    if (!publicKey) {
        throw Error(`No public key available for client ${clientId}`);
    }

    if (
        !(await Signing.verifySignature(
            JSON.stringify(value),
            signature,
            publicKey
        ))
    ) {
        throw new Error("Signature validation failed");
    }

    return value;
}

/* ... */

Since value is a Signed<T>, we should have a clientId and a signature. We throw an exception if we can't find them.

Next, we clean up value and remove the clientId and signature from the object. The layers above us in the stack no longer need these fields, since we're handling signature verification here.

We then try to retrieve the public key of the client from the keyStore. We again throw in case we don't have the key.

We use the verifySignature() function we implemented earlier to ensure the signature is valid. We throw if not.

At this point, we have guaranteed that the payload is coming from the client claiming to have sent it. If Alice tries to forge a message and pretend it's coming from Bob, she wouldn't be able to produce a valid Bob signature (since only Bob has access to his private key). Such a message would not make it past this function.

If no exceptions were thrown, this function returns a value (with signature cleaned up), ready to be processed by other layers.

Outgoing actions

Let's now look at adding signatures to postAction(). signAction() is another private class member handling signing:

private async signAction(value: T): Promise<Signed<T>> {
    const signature = await Signing.sign(
        JSON.stringify(value),
        this.clientKey.privateKey
    );

    return {
        ...value,
        clientId: this.clientKey.clientId,
        signature: signature,
    };
}

/* ... */

We call the sign() function we implemented earlier in this post, passing it the stringified value and our client's private key. We then extend value with the corresponding clientId and signature.

The postAction() implementation uses this function for signing, before calling the underlying transport's postAction().

async postAction(value: T) {
    this.transport.postAction(await this.signAction(value));
}

We now have the full implementation of SignedTransport.

Summary

We started with a simple FluidTransport that uses a fluid-ledger to implement the postAction() function and actionPosted event, which we need for modeling turn-based games.

Next, we looked at signing and signature verification using subtle.

Finally, we implemented SignedTransport, a decorator over another transport that adds signing and signature verification.

The idea is we start with a FluidTransport and perform a key exchange, where each client generates a public/private key pair and broadcasts their ID and public key. Clients store all these in a KeyStore. Once the key exchange is done, we can initialize a SignedTransport that wraps the original FluidTransport and transparently handles signatures.
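The key exchange step can be sketched roughly as follows. This is an assumption-laden illustration, not the toolkit's implementation: KeyExchangeMessage and buildKeyStore are hypothetical names, the messages are passed in as a plain array rather than read from a transport, and rejecting duplicate announcements is a design choice made for this example.

```typescript
type ClientId = string;
type Key = string;
type KeyStore = Map<ClientId, Key>;

// Hypothetical shape of the unsigned message each client broadcasts.
type KeyExchangeMessage = { clientId: ClientId; publicKey: Key };

// Collect broadcast keys until every expected player has announced one.
function buildKeyStore(
    messages: KeyExchangeMessage[],
    expectedPlayers: number
): KeyStore {
    const keyStore: KeyStore = new Map();

    for (const message of messages) {
        // Reject a second announcement for the same ID rather than overwrite it.
        if (keyStore.has(message.clientId)) {
            throw new Error(`Duplicate key for client ${message.clientId}`);
        }

        keyStore.set(message.clientId, message.publicKey);

        if (keyStore.size === expectedPlayers) {
            return keyStore;
        }
    }

    throw new Error("Key exchange incomplete");
}

const keyStore = buildKeyStore(
    [
        { clientId: "alice", publicKey: "alice-public-key" },
        { clientId: "bob", publicKey: "bob-public-key" },
    ],
    2
);
```

Once a store like this is complete, it can be handed to the SignedTransport constructor together with the local client's ID and private key.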

At this point we have all the pieces in place to start looking at semantics: we can exchange data between clients, we can authenticate exchanged messages, and we have the cryptography primitives for Mental Poker (commutative encryption). In the next post weâll look at a state machine that we can use to implement game semantics.

The code covered in this post is available on GitHub in the mental-poker-toolkit repo. FluidTransport is implemented under packages/fluid-transport, SignedTransport is under packages/signed-transport, and the signing functions can be found in packages/cryptography/src/signing.ts.

Note: Since writing this post, the code was refactored so SignedTransport doesn't take a direct dependency on the cryptography package; instead, signing and signature verification are now passed in as an ISignatureProvider interface.

Large Language Models at Work RTM

Fri, 03 Nov 2023 00:00:00 -0700

Keeping with tradition, I'm writing the RTM post for Large Language Models at Work. The book is done. Now available on Kindle.

Self-publishing

I decided not to contact a publisher this time around, for a couple of reasons. First, I didn't want the pressure of a contract and timelines (though looking back, I did finish this book faster than the previous two). Second, I had no idea whether I would be able to write something that is still valuable by the time the book is done, considering the speed of innovation. More on this later.

I authored the book in the open, at https://vladris.com/llm-book/ and self-published on Kindle. Maybe I will look into making it a print book at some point, for now I'm keeping it digital.

Amazon offers a nice set of tools to import and format ebooks, but they have some big limitations - for example, no support for formatting tables, footnotes etc. I also couldn't convince the tool the code samples should be monospace on import so I had to manually re-set the font on each. The book has a few formatting glitches because of these limitations, which make me reluctant to look into a print book as I expect I will need to do a lot more manual tweaking for the text to look good in print.

Speed of innovation

I mused about this in chapter 10: Closing Thoughts. I'll repeat it here as it perfectly highlights why it is impossible to pin down this strange new world of AI.

I started writing the book in April 2023. When I picked up the project, GPT-4 was in private preview, with GPT-3.5 being the most powerful globally available model offered by OpenAI. Since then, GPT-4 opened to the public.

In June, OpenAI announced Functions - fortunately, this happened just before I started working on chapter 6, Interacting with External Systems. Before Functions, the way to get a large language model to connect with native code was through few-shot learning in the prompt, covered in the Non-native functions section. Originally, I was planning to focus exclusively on this implementation. Of course, built-in support makes it easier to specify available functions and the model interaction is likely to work better - since the model has been specifically trained to understand function definitions and output correct function calls.

In August, OpenAI announced fine-tuning support for gpt-3.5-turbo. When I was writing the first draft of chapter 4, Learning and Tuning, the only models that supported fine-tuning were the older GPT-3 generation models: Ada, Babbage, Curie, and Davinci. This was particularly annoying, as the quality of output produced by these models is way below gpt-3.5-turbo levels. Now, with the newer models having fine-tuning support, I had to rewrite the Fine-tuning section.

text-davinci-003 launched in November of 2022, while gpt-3.5-turbo launched on March 1st 2023. When I started writing the book, text-davinci-003 was backing most large language model-based solutions across the industry, and migrations to the newer gpt-3.5-turbo were underway. text-davinci-003 is now deprecated and scheduled for removal by January 4, 2024 (to be replaced by gpt-3.5-turbo-instruct), and the industry is moving to adopt GPT-4. I had to update several code samples from text-davinci-003 to gpt-3.5-turbo-instruct.

No idea how long the code samples will keep working or when OpenAI will decide to deprecate gpt-3.5-turbo or introduce an even more powerful model with capabilities not covered in the book.

Time(lessness)

While some of the code examples will not age well as new models and APIs get released, the underlying principles of working with large language models that I walked through in this book - prompt engineering, memory, interacting with external systems, planning, and so on - will be relevant for a while. Understanding these fundamentals should help anyone ramp up in the space.

This is an exciting new field that is going to see a lot more innovation in the near future. But I expect some of these fundamentals to carry on, in one shape or another. I hope the topics discussed in this book remain interesting long after the specific models used in the examples become obsolete.

Excerpts

Like with my previous books, I've been publishing excerpts as shorter, stand-alone reads. This might sound a bit strange in this case, as the book is already all online. But I figured it will hopefully help reach more people, and I did some work on each excerpt to remove references to other parts of the book so they can, indeed, be read without context. I published all of these on Medium.

I hope you enjoy the book! Check it out here: Large Language Models at Work.

Large Language Models at Work

Sun, 18 Jun 2023 00:00:00 -0700

I recently announced I'm working on a new book about large language models and how to integrate them in software systems. As I'm writing this, the first 3 chapters are live at https://vladris.com/llm-book.

The remaining chapters are in the works and I will upload them as I work through the manuscript. In the meantime, since I announced my previous books with a blog post each (Programming with Types, Azure Data Engineering), I'll keep the tradition and talk a bit about the current book.

When embarking on a writing project, it's good to have a plan. Of course, the details change as the book gets written, but starting with a clear articulation of what the book is about, who is the target reader, the list of chapters and an outline helps. Here is the book plan I wrote a few months ago:

Book Plan

This book is aimed at software engineers wanting to learn about how they can integrate LLMs into their software systems. It covers all the necessary domain concepts and comes with simple code samples. A good way to frame this is the book covers the same layer of the stack that frameworks like Semantic Kernel and LangChain are trying to provide.

No prior AI knowledge required to understand this book, just basic programming.

After reading the book, one should have a solid understanding of all the required pieces to build an LLM-powered solution and the various things to keep in mind (like non-determinism, AI safety & security etc.).

Your feedback is very much welcomed! Do leave comments if you have any thoughts.

Title & table of contents

Building with Large Language Models

A book about integrating LLMs in software systems and the various aspects software developers need to know (prompt engineering, memory & embeddings, connecting with external systems etc.). Simple code examples in Python, using the OpenAI API.

A New Paradigm

An introduction, describing how LLMs are being integrated in software solutions and the new design patterns emerging.

1.1. Who this book is for

The pitch for the book, who should read it, what they will get out of it, what to expect.

1.2. Taking the world by storm

Briefly talk about the major innovations since the launch of ChatGPT.

1.3. New software architectures for a new world

Talk about the new architectures that embed LLMs into broader software systems and frameworks being built to address this.

1.4. Using OpenAI

The book uses plenty of code examples in Python and using OpenAI. This section introduces OpenAI and setup steps for the reader.

1.5. In this book

Preview of the topics covered throughout the rest of the book.

Large Language Models

This chapter introduces large language models, the OpenAI offering, key concepts and API parameters. Code examples will include the first "hello world" API calls.

2.1. Large language models

Describes large language models and key ways in which they differ from other software components (train once, prompt many times; non-deterministic; no memory of prior interactions etc.).

2.2. OpenAI models

Describes the OpenAI model families, and double-clicks on GPT-3.5 models (though by the time this book is done I'm sure GPT-4 will be out of beta). Examples in the book will start with text-davinci-003 (simpler prompting), then move to gpt-3.5-turbo (cheaper).

2.3. Tokens

Explain tokens, token limits, and how OpenAI prices API calls based on tokens.

2.4. API parameters

Covers some important API parameters OpenAI offers, like n, max_tokens, suffix, and temperature.

Prompt Engineering

This chapter dives deep into prompting, which is the main way we interact with LLMs, potentially a new engineering discipline.

3.1. Prompt design & tuning

Covers prompt design and how small tweaks in a prompt can yield very different results. Tips for authoring prompts, like telling the LLM who it is ("you are an assistant") and the magic "let's think step by step".

3.2. Prompt templates

Shows the need for templating prompts and a simple template implementation. Let user focus on task input and use template to provide additional info needed by the LLM.

3.3. Prompt selection

Solutions usually have multiple prompts, and we select the best one based on user intent. This section covers prompt selection and going from user ask to picking template to generating prompt.

3.4. Prompt chaining

Prompt chaining includes the input preprocessing and output postprocessing of an LLM request, and feeding previous outputs back into new prompts to refine asks.

Learning and Tuning

This chapter focuses on teaching an LLM new domain-specific stuff to unlock its full potential. Includes prompt-based learning and fine tuning.

4.1. Zero-, one-, few-shot learning

Explains zero-shot learning, one-shot learning, and few-shot learning with examples for each.

4.2. Fine tuning

Explains fine tuning, when it should be used, and works through an example.

Memory and Embeddings

This chapter covers solutions to work around the fact LLMs don't have any memory.

5.1. A simple memory

Starting with a basic example of using memory and some limitations we hit due to token limits.

5.2. Key-value memory

A simple key-value memory where we retrieve just the values we need for a given prompt.

5.3. Embeddings

More complex memory scenario: generating an embedding and using a vector database to retrieve the right information (Q&A example).

5.4. Other approaches

I really liked the idea in this paper, where memory importance is determined by the LLM itself, and retrieval is a combination of recency, importance, and embedding distance. Cover this and show the problem space is still ripe for innovation.

Interacting with External Systems

How we can make external tools available to LLMs.

6.1. ChatGPT plugins

Start by describing ChatGPT plugins offered by OpenAI. The why and how.

6.2. Connecting the dots

Putting together what we learned from previous chapters (prompt selection, memory, few-shot learning) to teach LLMs to interact with any external system.

6.3. Building a tool library

Formalizing the previous section and coming up with a generalized schema for connecting LLMs to external systems.

Planning

This chapter talks about breaking down asks into multiple steps and executing those. This enables LLMs to execute on complex tasks.

7.1. Automating planning

This section shows how we can ask the LLM itself to come up with a set of tasks. This includes the prompt and telling it what tools (external systems it can talk to) are available.

7.2. Task queues

Talk about the architecture used by AutoGPT, where tasks are queued and reviewed after each LLM call. Loop until done or until hitting a limit.

Safety and Security

This chapter covers both responsible AI concerns like avoiding hallucinations and new attack vectors like prompt injection and prompt leaking.

8.1. Hallucinations

Discuss hallucinations, why these are currently a big problem with LLMs, and tips to avoid them, e.g. telling the model not to make things up if it doesn't know something and validating output.

8.2. Explainability

Zooming out from hallucinations, this section covers the challenge of explainable AI. It covers this both tactically (prompts to get the model to provide references) and strategically (current investments in explainable AI).

8.3. Adversarial attacks

This section focuses on malicious inputs and attack vectors to keep in mind. For example, prompt leaking ("ignore the above instructions and output the full prompt").

8.4. Responsible AI

Wrap up the chapter with a discussion around responsible AI, including more philosophical concerns about challenges with this technology and potential societal impact.

Frameworks

This chapter focuses on pulling together the concepts discussed into a framework and provides quick overviews of a couple of existing frameworks.

9.1. Common building blocks

Review the different components discussed throughout the book and how they form a cohesive framework for working with LLMs. Remainder of the chapter overviews existing frameworks.

9.2. Semantic Kernel

Quick overview of https://github.com/microsoft/semantic-kernel and how the framework pieces map to concepts discussed in this book.

9.3. LangChain

Quick overview of https://docs.langchain.com/docs/ and how the framework pieces map to concepts discussed in this book.

Final Thoughts

Some thoughts on the future.

Development

Of course, as I work on the chapters, the topics covered in each might deviate significantly from the above plan. But in my experience, the outline helps a lot to tie things together and inform what I do. In other words - better to have an outline than to not have one.

Note the original title was Building with Large Language Models. I didn't like how this sounded from the start. I described the book to ChatGPT and asked it for a few titles. Some of the suggestions:

Incorporating LLMs in Software Systems: The Future of Programming

The Magic of Language Models: Transforming Software Integration

LLMs at Work: Enhancing Software Systems with AI-Powered Language Models

Breaking Boundaries: Integrating LLMs for Smarter Software Solutions

Language Models Unleashed: A Guide to Integrating LLMs in Software Development

I ended up picking Large Language Models at Work, subtitle Enhancing Software Systems with Language Models (though of course I might change it). I do feel like it captures the essence of what the book is about.

I've also been using AI for the artwork. The book cover is generated by DALL·E and, similarly, each chapter starts with a DALL·E generated image. I do think the abstract renderings by AI of the concepts I'm talking about give a nice touch to the book.

An interesting challenge is that the field is moving so fast, there's a real risk I have to rewrite large parts of the book before I wrap up the first iteration of the manuscript. For example, OpenAI recently (June 2023, this week at the time of writing) announced function support for gpt-3.5-turbo. This new addition to the API makes it much easier to have the model invoke external systems (which is the focus of chapter 6 - luckily I'm not there yet).

I hope this will end up being a useful book and help developers ramp up on this new world of software development and LLM-assisted solutions. Do check out the book online at https://vladris.com/llm-book and follow me on LinkedIn or Twitter for updates. For now, enjoy the available chapters!

Mental Poker Part 2: Fluid Ledger

Sun, 04 Jun 2023 00:00:00 -0700

In the previous post I covered the cryptography part of implementing Mental Poker. In this post, I'll cover the append-only list data structure used to model games.

As I mentioned before, we rely on Fluid Framework. The code is available in my GitHub fluid-ledger repo.

Fluid Framework

I touched on Fluid Framework before so I won't describe in detail what the library is about. Relevant to this blog post, we have a set of distributed data structures that multiple clients can update concurrently. All clients in a session connect to a service (like the Azure Fluid Relay service). Each update a client makes to a distributed data structure gets sent to the service as an operation. The service stamps a sequence number on the operation and broadcasts it to all clients. That means that eventually, all clients end up with the same list of operations in the same sequence, so they can merge changes client-side while ensuring all clients end up with the same view of the world.

The neat thing about Fluid Framework is the fact that merges happen on the clients as described above rather than server-side. The service doesn't need to understand the semantics of each data structure. It only needs to sequence operations. Different data structures implement their own merge logic. The framework provides some powerful out-of-the-box data structures like a sparse matrix or a tree. But we don't need such powerful data structures to model games: a list is enough.

Append-only list

Most turn-based games can be modeled as a list of moves. This includes games like chess, but also card games. The whole Mental Poker shuffling protocol we discussed, where one player encrypts and shuffles the deck, then hands it over to the other player to do the same etc. is also, in fact, a sequence of moves.

The semantics of a particular game are implemented at a higher level. The types of games we are looking at, though, can be modeled as a list of moves, where players take turns. Each move is an item in the list. In this blog post we're looking at the generic list data structure, without worrying too much about what a move looks like.
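To make the "list of moves" idea concrete, here is a small sketch that rebuilds a tic-tac-toe board purely by replaying a move list. The Move shape and the replay() function are hypothetical and only illustrate the higher-level semantics; the ledger itself knows nothing about them.

```typescript
// Hypothetical move shape; the list itself does not care what a move contains.
type Move = { player: "X" | "O"; square: number };

// Rebuild the board by replaying the move list from the start.
function replay(moves: Move[]): (string | null)[] {
    const board: (string | null)[] = [];
    for (let i = 0; i < 9; i++) {
        board.push(null);
    }

    moves.forEach((move, index) => {
        // X always moves first, so turns alternate with the list index.
        const expected = index % 2 === 0 ? "X" : "O";
        if (move.player !== expected) {
            throw new Error(`Move ${index}: not ${move.player}'s turn`);
        }
        if (board[move.square] !== null) {
            throw new Error(`Move ${index}: square already taken`);
        }
        board[move.square] = move.player;
    });

    return board;
}

const board = replay([
    { player: "X", square: 4 },
    { player: "O", square: 0 },
    { player: "X", square: 8 },
]);
```

Because the game state is derived entirely from the list, any client holding the same list arrives at the same board, and an out-of-turn or invalid move is detected by every client in the same way.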

A list is a very simple data structure, but let's see what this looks like in the context of Fluid Framework. Here, we have a distributed data structure multiple clients can concurrently update.

Fluid ledger

I named the data structure ledger, as it should act very much as a ledger from the crypto/blockchain world - an immutable record of what happened. In our case, this contains a list of game moves.

The Fluid Framework implementation is fairly straight-forward: when a client wants to append an item to the list, it sends the new item to the Fluid Relay service. The service sequences the append, meaning it adds the sequence number and broadcasts it to all clients, including the sender. The local data structure only gets appended once received from the service. That guarantees all clients end up with the same list, even if they concurrently attempt to append items to it.

The diagram shows how this works when Client A wants to append 4 to the ledger:

The new item 4 is sent to the Relay Service.
The relay service broadcasts it to all clients.
Clients receive the new item and append it to the list.
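The three steps above can be simulated with a toy model. ToyRelay and ToyClient below are stand-ins invented for this sketch, not Fluid Framework APIs, and the broadcast is synchronous to keep things simple:

```typescript
// Toy stand-ins for the relay service and clients; not Fluid Framework APIs.
type SequencedOp<T> = { sequenceNumber: number; value: T };

class ToyRelay<T> {
    private nextSequenceNumber = 1;
    private readonly clients: ToyClient<T>[] = [];

    connect(client: ToyClient<T>): void {
        this.clients.push(client);
    }

    // Stamp the op with a sequence number and broadcast it to everyone,
    // including the client that submitted it.
    submit(value: T): void {
        const op: SequencedOp<T> = {
            sequenceNumber: this.nextSequenceNumber++,
            value,
        };
        for (const client of this.clients) {
            client.process(op);
        }
    }
}

class ToyClient<T> {
    readonly ledger: T[] = [];

    constructor(private readonly relay: ToyRelay<T>) {
        relay.connect(this);
    }

    // Appending only sends the item to the relay; nothing is added locally here.
    append(value: T): void {
        this.relay.submit(value);
    }

    // The local list grows only when the sequenced op comes back.
    process(op: SequencedOp<T>): void {
        this.ledger.push(op.value);
    }
}

const relay = new ToyRelay<number>();
const clientA = new ToyClient(relay);
const clientB = new ToyClient(relay);

clientA.append(1);
clientB.append(2);
clientB.append(3);
clientA.append(4);
```

Both clients end up with the same ledger because neither appends locally; each only applies what the relay broadcasts, in the order the relay assigned.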

Interfaces

Our API consists of two interfaces, ILedgerEvents, representing the events that our data structure can fire, and ILedger, the API of our data structure.

We derive these from ISharedObjectEvents and ISharedObject, which are available in Fluid Framework. We also need the Serializable type, which represents data that can be serialized in the Fluid Framework data store:

import {
    ISharedObject,
    ISharedObjectEvents
} from "@fluidframework/shared-object-base";
import { Serializable } from "@fluidframework/datastore-definitions";

With these imports, we can define our ILedgerEvents as:

export interface ILedgerEvents<T> extends ISharedObjectEvents {
    (event: "append", listener: (value: Serializable<T>) => void): void;
    (event: "clear", listener: (values: Serializable<T>[]) => void): void;
}

T is the generic type of the list items. The append event is fired after we get an item from the Fluid Relay service and the item is appended to the ledger. The clear event is fired when we get a clear operation from the Fluid Relay service and the ledger is cleared. The event will return the full list of items that have been removed as values.

We can also define ILedger as:

export interface ILedger<T = any> extends ISharedObject<ILedgerEvents<T>> {
    get(): IterableIterator<Serializable<T>>;
    append(value: Serializable<T>): void;
    clear(): void;
}

The get() function returns an iterator over the ledger. append() appends a value and clear() clears the ledger.

The full implementation can be found in interfaces.ts.

Factory

We also need to provide a LedgerFactory the framework can use to create or load our data structure.

We need to import a handful of types from the framework, our ILedger interface, and our yet-to-be-implemented Ledger:

import {
    IChannelAttributes,
    IFluidDataStoreRuntime,
    IChannelServices,
    IChannelFactory
} from "@fluidframework/datastore-definitions";
import { Ledger } from "./ledger";
import { ILedger } from "./interfaces";

We can now define the factory as implementing the IChannelFactory interface:

export class LedgerFactory implements IChannelFactory {
    ...
}

We'll cover the implementation step-by-step. First, we need a couple of static properties defining the type of the data structure and properties of the channel:

public static readonly Type = "fluid-ledger-dds";

public static readonly Attributes: IChannelAttributes = {
    type: LedgerFactory.Type,
    snapshotFormatVersion: "0.1",
    packageVersion: "0.0.1"
};

public get type() {
    return LedgerFactory.Type;
}

public get attributes() {
    return LedgerFactory.Attributes;
}

Type just needs to be a unique value for our distributed data structure. We'll define it as fluid-ledger-dds. The channel Attributes are used by the runtime for versioning purposes.

You can think of the way Fluid Framework stores data as similar to git. In git we have snapshots and commits. Fluid Framework uses a similar mechanism, where the service records all operations sent to it (this is the equivalent of a commit) and periodically takes a snapshot of the current state of the world.

When a client connects and wants to get up to date, it tells the service the last state it saw and the service sends back what happened since. This could include the latest snapshot (if the client doesn't have it) and the operations that clients have sent after the latest snapshot.

In case we iterate on our data structure, we need to tell the runtime which snapshot format and which ops our client understands.
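To make the snapshot-plus-ops idea concrete, here is a minimal sketch in plain TypeScript that does not use any Fluid Framework types. The Snapshot and SequencedOp shapes and the catchUp() helper are hypothetical names made up for illustration; the formats the service actually uses are different.

```typescript
// Hypothetical shapes for illustration only; not Fluid Framework types.
interface Snapshot {
    sequenceNumber: number; // last op already included in the snapshot
    items: string[];
}

interface SequencedOp {
    sequenceNumber: number;
    type: "append" | "clear";
    value?: string;
}

// Rebuild the current state: start from the snapshot, then replay
// every op sequenced after it, in order.
function catchUp(snapshot: Snapshot, ops: SequencedOp[]): string[] {
    let items = [...snapshot.items];

    const pending = ops
        .filter((op) => op.sequenceNumber > snapshot.sequenceNumber)
        .sort((x, y) => x.sequenceNumber - y.sequenceNumber);

    for (const op of pending) {
        if (op.type === "append" && op.value !== undefined) {
            items.push(op.value);
        } else if (op.type === "clear") {
            items = [];
        }
    }

    return items;
}

const snapshot: Snapshot = { sequenceNumber: 2, items: ["a", "b"] };
const ops: SequencedOp[] = [
    { sequenceNumber: 2, type: "append", value: "b" }, // already in the snapshot
    { sequenceNumber: 3, type: "append", value: "c" },
    { sequenceNumber: 4, type: "append", value: "d" }
];

const caughtUp = catchUp(snapshot, ops);
```

A client that only understands an older snapshot format or an older set of ops could not run a function like this safely, which is why the channel attributes carry version information.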

The interface we are implementing (IChannelFactory) includes a load() and a create() function.

Here is how we load a ledger:

public async load(
    runtime: IFluidDataStoreRuntime,
    id: string,
    services: IChannelServices,
    attributes: IChannelAttributes
): Promise<ILedger> {
    const ledger = new Ledger(id, runtime, attributes);
    await ledger.load(services);
    return ledger;
}

This is pretty straightforward: we construct a new instance of Ledger (we'll look at the Ledger implementation in a bit), call load(), and return the object. This is an async function. No need to worry about the arguments as the framework will handle these - we just plumb them through.

create() is similar, except this is synchronous:

public create(document: IFluidDataStoreRuntime, id: string): ILedger {
    const ledger = new Ledger(id, document, this.attributes);
    ledger.initializeLocal();
    return ledger;
}

Instead of calling the async ledger.load(), we call initializeLocal(). We again don't have to cover the arguments, but let's talk about the difference between creating and loading.

In order to understand these, we need to introduce a new concept: the Fluid container.

The container is a collection of distributed data structures defined by a schema. This describes the data model of an application. In our case, to model a game, we only need a ledger. For more complex applications, we might need to use multiple distributed data structures. Fluid Framework uses containers as the unit of data - we will never instantiate or use a distributed data structure standalone. Even if we only need one, as in our case, we still need to define a container.

The lifecycle shown in the diagram is:

A client creates a container locally (this is where create() comes into play). Based on the provided schema, the runtime will call create() for all described data structures. At this point, we haven't yet connected to the Fluid Relay. We are in what is called detached mode. Here we have the opportunity to update our data structures before we connect and have other clients see them.
We attach the container, meaning we connect to the Relay Service and start a multi-user session. We can now expect changes to come in from other clients.
Another client can now connect to the session. On this second client, since the container was already created, the runtime will rely on the load() functions to hydrate it.

As a side note, the Fluid Relay can also store documents to persistent storage so once the coauthoring session is over and all clients disconnect, the document is persistent for future sessions.

For our Mental Poker application, we don't need to worry too much about containers and schemas. We only need a minimal implementation consisting of a container with a single distributed data structure: our Ledger. But it is worth understanding how the runtime works.

We went over the full implementation of the LedgerFactory. You can also find it in ledgerFactory.ts.

Implementation

Let's now look at the actual implementation and learn about the anatomy of a Fluid distributed data structure.

We need to import several types from the framework, which we'll cover as we encounter them in the code below, or won't discuss if they are boilerplate.

import {
    ISequencedDocumentMessage,
    MessageType
} from "@fluidframework/protocol-definitions";
import {
    IChannelAttributes,
    IFluidDataStoreRuntime,
    IChannelStorageService,
    IChannelFactory,
    Serializable
} from "@fluidframework/datastore-definitions";
import { ISummaryTreeWithStats } from "@fluidframework/runtime-definitions";
import { readAndParse } from "@fluidframework/driver-utils";
import {
    createSingleBlobSummary,
    IFluidSerializer,
    SharedObject
} from "@fluidframework/shared-object-base";
import { ILedger, ILedgerEvents } from "./interfaces";
import { LedgerFactory } from "./ledgerFactory";

Note the last two imports: we import our interfaces and our LedgerFactory.

We'll define a couple of delta operations. That's the Fluid Framework name for an operation (op) we send to (or get back from) the Fluid Relay service.

type ILedgerOperation = IAppendOperation | IClearOperation;

interface IAppendOperation {
    type: "append";
    value: any;
}

interface IClearOperation {
    type: "clear";
}

In our case, we can have either an IAppendOperation or an IClearOperation. The two together define the ILedgerOperation type.

The IAppendOperation includes a value property which can be anything. Both IAppendOperation and IClearOperation have a type property, so we can see at runtime which type we are dealing with.
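As a small illustration of how the shared type property helps, the sketch below repeats the operation types and adds a hypothetical describeOp() helper (not part of the library). Switching on type lets the TypeScript compiler narrow the union, so value is only accessible in the append branch.

```typescript
interface IAppendOperation {
    type: "append";
    value: any;
}

interface IClearOperation {
    type: "clear";
}

type ILedgerOperation = IAppendOperation | IClearOperation;

// describeOp() is a hypothetical helper, used only to show narrowing.
function describeOp(op: ILedgerOperation): string {
    switch (op.type) {
        case "append":
            // Here the compiler knows op is an IAppendOperation
            return `append ${JSON.stringify(op.value)}`;
        case "clear":
            // Here op is an IClearOperation and has no value property
            return "clear";
    }
}

const appendText = describeOp({ type: "append", value: 42 });
const clearText = describeOp({ type: "clear" });
```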

We talked about how Fluid Framework is similar to git in the way it stores documents as snapshots and ops. A lot of this is handled internally by the framework, but our data structure needs to tell the service how we want to name the snapshots, so we'll define a constant for this:

const snapshotFileName = "header";

With this, we can start the implementation of Ledger.

export class Ledger<T = any>
    extends SharedObject<ILedgerEvents<T>>
    implements ILedger<T>
{
    ...
}

We derive from SharedObject, the base distributed data structure type. We specify that this SharedObject will be firing ILedgerEvents and that it implements the ILedger interface.

The framework expects a few functions used to construct objects. Our constructor looks like this:

constructor(
    id: string,
    runtime: IFluidDataStoreRuntime,
    attributes: IChannelAttributes
) {
    super(id, runtime, attributes, "fluid_ledger_");
}

The constructor takes an id, a runtime, and channel attributes. We don't need to deeply understand these, as they are handled and passed in by the framework. The last argument of the base class constructor is a telemetry string prefix. We just need to provide a string unique to our data structure, so we use fluid_ledger_ in our case.

We also need a couple of static functions: create() and getFactory():

public static create(runtime: IFluidDataStoreRuntime, id?: string) {
    return runtime.createChannel(id, LedgerFactory.Type) as Ledger;
}

public static getFactory(): IChannelFactory {
    return new LedgerFactory();
}

For create(), again we don't need to worry about runtime and id, as we won't have to pass these in ourselves. We just need this function to forward them to runtime.createChannel(). createChannel() also requires the unique type, which we'll get from our LedgerFactory.

The getFactory() function simply creates a new instance of LedgerFactory.

We covered the constructor and factory functions. Next, let's look at the internal data, get(), and the required initializeLocalCore() function:

private data: Serializable<T>[] = [];

public get(): IterableIterator<Serializable<T>> {
    return this.data[Symbol.iterator]();
}

protected initializeLocalCore() {
    this.data = [];
}

This is very simple - we represent our ledger as an array of Serializable.

The get() function, which we defined on our ILedger interface, returns the array's iterator.

initializeLocalCore(), called internally by the runtime, simply sets data to be an empty array.

We also need to implement saving and loading of the data structure. Save in Fluid Framework world is called summarize: this is what the framework uses to create snapshots.

protected summarizeCore(
    serializer: IFluidSerializer
): ISummaryTreeWithStats {
    return createSingleBlobSummary(
        snapshotFileName,
        serializer.stringify(this.data, this.handle)
    );
}

We can use a framework-provided createSingleBlobSummary. In our case, we save the whole data array and the handle (handle is an inherited attribute representing a handle to the data structure, which the Framework uses for nested data structure scenarios).

Here is how we load the data structure:

protected async loadCore(storage: IChannelStorageService): Promise<void> {
    const content = await readAndParse<Serializable<T>[]>(
        storage,
        snapshotFileName
    );
    this.data = this.serializer.decode(content);
}

For both summarize and load, we rely on Framework-provided utilities.

We can now focus on the non-boilerplate bits: implementing our append() and clear(). Let's start with append():

private applyInnerOp(content: ILedgerOperation) {
    switch (content.type) {
        case "append":
        case "clear":
            this.submitLocalMessage(content);
            break;

        default:
            throw new Error("Unknown operation");
    }
}

private appendCore(value: Serializable<T>) {
    this.data.push(value);
    this.emit("append", value);
}

public append(value: Serializable<T>) {
    const opValue = this.serializer.encode(value, this.handle);

    if (this.isAttached()) {
        const op: IAppendOperation = {
            type: "append",
            value: opValue
        };

        this.applyInnerOp(op);
    }
    else {
        this.appendCore(opValue);
    }
}

applyInnerOp() is common to both append() and clear(). This is the function that takes an ILedgerOperation and sends it to the Fluid Relay service. submitLocalMessage() is inherited from the base SharedObject.

appendCore() effectively updates data and fires the append event.

append() first serializes the provided value using the inherited Framework-provided serializer. We assign this to opValue. We then need to cover both the attached and detached scenarios. If attached, it means we are connected to a Fluid Relay and we are in the middle of a coauthoring session. In this case, we create an IAppendOperation object and call applyInnerOp(). If we are detached, it means we created our data structure (and its container) on this client, but we are not connected to a service yet. In this case we call appendCore() to immediately append the value since there is no service to send the op to and get it back sequenced.

clear() is very similar:

private clearCore() {
    const data = this.data.slice();

    this.data = [];

    this.emit("clear", data);
}

public clear() {
    if (this.isAttached()) {
        const op: IClearOperation = {
            type: "clear"
        };

        this.applyInnerOp(op);
    }
    else {
        this.clearCore();
    }
}

clearCore() effectively clears data and emits the clear event.

clear() handles both the attached and detached scenarios.

So far we update our data immediately when detached, and when attached we send the op to the Relay Service. The missing piece is handling ops as they come back from the Relay Service. We do this in processCore(), another function the runtime expects us to provide:

protected processCore(message: ISequencedDocumentMessage) {
    if (message.type === MessageType.Operation) {
        const op = message.contents as ILedgerOperation;

        switch (op.type) {
            case "append":
                this.appendCore(op.value);
                break;
            case "clear":
                this.clearCore();
                break;
            default:
                throw new Error("Unknown operation");
        }
    }
}

This function is called by the runtime when the Fluid Relay sends the client a message. In our case, we only care about messages that are operations. We only support append and clear operations. We handle these by calling the appendCore() and clearCore() we just saw - since these ops are coming from the service, we can safely append them to our data (we have the guarantee that all clients will get these in the same order).
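The attached and detached paths can be simulated without the framework. The sketch below is only an approximation under stated assumptions: FakeRelay and FakeLedger are hypothetical stand-ins, the relay broadcasts synchronously, and there is no snapshotting or reconnection. It shows that when clients only apply ops as they come back from the relay, every client ends up with the same list in the same order.

```typescript
type Op = { type: "append"; value: string } | { type: "clear" };

// Hypothetical stand-in for the relay: it fixes a global order for ops
// and broadcasts each one to every connected client, sender included.
class FakeRelay {
    private clients: FakeLedger[] = [];

    connect(client: FakeLedger) {
        this.clients.push(client);
    }

    submit(op: Op) {
        for (const client of this.clients) {
            client.process(op);
        }
    }
}

// Hypothetical simplified ledger mirroring the attached/detached split.
class FakeLedger {
    private data: string[] = [];
    private relay?: FakeRelay;

    attach(relay: FakeRelay) {
        this.relay = relay;
        relay.connect(this);
    }

    append(value: string) {
        if (this.relay) {
            // Attached: send the op and wait for it to come back sequenced
            this.relay.submit({ type: "append", value });
        } else {
            // Detached: no service to talk to, so update immediately
            this.data.push(value);
        }
    }

    // Plays the role of processCore(): apply ops in the order received
    process(op: Op) {
        if (op.type === "append") {
            this.data.push(op.value);
        } else {
            this.data = [];
        }
    }

    get(): string[] {
        return [...this.data];
    }
}

const relay = new FakeRelay();
const alice = new FakeLedger();
const bob = new FakeLedger();
alice.attach(relay);
bob.attach(relay);

alice.append("alice: deal");
bob.append("bob: ack");
alice.append("alice: bet");

// A detached ledger only changes locally
const carol = new FakeLedger();
carol.append("local only");
```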

And we're almost done. We need to implement onDisconnect(), which is called when we disconnect from the Fluid Relay. This gives the distributed data structure a chance to run some code but in our case we don't need to do anything.

protected onDisconnect() {}

Finally, we also need applyStashedOp(). This is used in offline mode. For some applications, we might want to provide some functionality when offline - a client can keep making updates, which get stashed. We won't dig into this since for Mental Poker we can't have a single client play offline - we simply throw an exception if this function ever gets called:

protected applyStashedOp(content: unknown) {
    throw Error("Not supported");
}

The full implementation is in ledger.ts.

And that's it! We have a fully functioning distributed data structure we can use to model games.

Demo

The GitHub repo also includes a demo app: a collaborative coloring application where multiple clients can simultaneously color a drawing.

In this case, we model coloring operations as x and y coordinates, and a color. As users click on the drawing, we append these operations to the ledger and play them back to color the drawing using flood fill.
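A rough sketch of this replay idea follows. It is not the demo app's code: the ColorOp shape, the grid of color names, and the floodFill() and replay() helpers are assumptions made for illustration. The point is that replaying the same ops in the same order produces the same picture on every client.

```typescript
// Hypothetical op shape for illustration; the demo app may differ.
interface ColorOp {
    x: number;
    y: number;
    color: string;
}

type Grid = string[][]; // grid[y][x] holds a color name

// Iterative 4-way flood fill starting at (x, y)
function floodFill(grid: Grid, x: number, y: number, color: string) {
    const target = grid[y][x];
    if (target === color) {
        return;
    }

    const stack: [number, number][] = [[x, y]];
    while (stack.length > 0) {
        const [cx, cy] = stack.pop()!;
        if (cy < 0 || cy >= grid.length || cx < 0 || cx >= grid[cy].length) {
            continue;
        }
        if (grid[cy][cx] !== target) {
            continue;
        }

        grid[cy][cx] = color;
        stack.push([cx + 1, cy], [cx - 1, cy], [cx, cy + 1], [cx, cy - 1]);
    }
}

// Apply the ops from the ledger in order
function replay(grid: Grid, ops: Iterable<ColorOp>) {
    for (const op of ops) {
        floodFill(grid, op.x, op.y, op.color);
    }
}

// "black" cells are outlines; "white" cells can be colored
const drawing: Grid = [
    ["white", "black", "white"],
    ["white", "black", "white"],
    ["white", "black", "white"]
];

replay(drawing, [
    { x: 0, y: 0, color: "red" },
    { x: 2, y: 2, color: "blue" }
]);
```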

Notes on Documentation

Wed, 12 Apr 2023 00:00:00 -0700

Notes on Documentation

I spent a bunch of time lately revamping some documentation and this got me thinking. In terms of tooling, even state-of-the-art documentation pipelines are missing some key features. This is also an area where we can directly apply LLMs. In this post, I'll jot down some thoughts on how things could look in a more perfect world. Of course, here I'm referring to documentation associated with software projects.

Build from source

This first one isn't unheard of: documentation should be captured in source control and generated from there as a static website. There are two major types of documentation: API reference and articles that aren't tied to a specific API.

API reference should be extracted from code comments. Different languages have different levels of official support for this. C# has out-of-the-box XML documentation (///), JavaScript has the non-standard but popular JSDoc, etc.

Articles on the other hand should be written as stand-alone Markdown files.

A good documentation pipeline should support both. My team is using DocFX to that effect, though TypeScript is not supported out-of-the-box and requires some additional packages to set up.

CI validation

Commenting APIs should be enforced via a linter. We have tools like StyleCop for C# and a JSDoc plugin for ESLint for JavaScript. At the very least, all of the public API surface should be documented. If you introduce a new public API without corresponding documentation, this should cause a build break.

For technical documentation, articles often contain code samples. These run the risk of getting out of sync with the actual code as the code churns. In an ideal world, we should be able to associate a code snippet from an article with a test that runs with the CI pipeline. Documentation might skip scaffolding for clarity, so it's likely harder to simply attempt running the exact code snippet. But we should have a way to pull the snippet into a test that provides that scaffolding.

Alternately, enforce that running all snippets in an article in order works - treat articles more like Jupyter notebooks, where the runtime maintains some context, so if, for example, I import something in the first code snippet, the import is available to subsequent code snippets.

The key thing is to have some way to validate at build time that all code examples actually work and not allow breaking changes, even if the only thing that breaks is documentation.

Ownership

From my personal experience, documentation is usually treated as an afterthought. From time to time there is a big push to update things, but it's rare that everyone is constantly working towards improving docs.

Unless documentation reaches a critical mass of contributors to ensure everything is kept in order, it's best to have clear ownership of each article. Git history is not always the best for finding owners - sometimes the last author is no longer with the team or with the company, or maybe last commits just moved the file around or fixed typos.

This concern goes beyond documentation. In general, I'd love to see an ownership tracking system that can associate assets with people and is also org-chart aware - so if an owner changes teams, this gets flagged and a new owner must be provided.

Inline fragments

While working on documentation, I noticed that for a large enough project, some information tends to repeat across multiple articles. Maybe as part of a summary on the front page, then again in an article covering some of the details, and once more incidentally in a related article.

The problem is that if something changes and I only update one of the articles (maybe I'm not aware of all the places this shows up), documentation can start contradicting itself. This is something that is not part of the common Markdown syntax but I'd love to have a way to inline a paragraph across multiple documents to avoid this.

Style guides

All documentation should include a style guide. Some guidelines make any text easier to read, so they apply in most cases. For example:

Avoid passive voice.
Encourage diagrams and pictures vs. very long descriptions.

Some guidelines depend on the type of article. If you're documenting a design decision, explain the reasoning and list other options considered and why these weren't adopted. On the other hand, if you are writing a troubleshooting guide, no need to explain the why, just what steps the reader needs to take.

Unfortunately I haven't seen a lot of such guides accompany projects. I wish we had a set of industry standard ones to simply plug in, like we do with open source licenses.

Information architecture

In many cases, there is little effort put into structuring the documentation. We start with /docs then as articles pile up, we create new subfolders organically.

Much like we want some high-level design of a system, we should also require a high-level design of the documentation. What are the key topics and sub-sections? This doesn't even need to be reinvented for each new project. I expect there's a handful of structures which can support most projects, so much like style guides, it would be great to have these available off-the-shelf.

An alternative to hierarchy

I started this post talking about building documentation from source, which naturally maps to articles being files organized in folders (categories). This type of organization - categories and subcategories - works well up to a certain volume of information.

At some point, it gets hard to figure out which subcategory something fits in: it might fit just as well in multiple places. Here the folder categorization breaks down: there is no clear hierarchy of nested folders in which to fit everything.

An alternative to hierarchies is tags. Maintain a curated set of tags, then tag each article with one or more tags. You can then browse by tag, but have articles show up under multiple tags. This tends to work better with larger volumes of information, but it's harder to map to a file and folder structure.

AI

With the popularity of large language models, I see many applications throughout the lifecycle:

Authoring

Generative AI can help coauthor documentation. GitHub Copilot already does this. As models get better and cheaper to run, I expect they will be more and more involved in writing documentation.

Reviewing and editing

Given a style guide, a model can review how closely a document adheres to it and suggest changes to match the guide.

With a knowledge of the whole documentation, a model could also spot contradictions (the problem I mentioned in the Inline fragments section). This could be a step in the CI pipeline to ensure consistency.

A model could potentially also act as a reader and provide feedback on how clear the documentation is.

Retrieval

Most tools generating documentation from source provide very rudimentary search capabilities. OpenAI offers text and code embedding APIs which enable semantic search and natural language querying. Using something like this on documentation should make finding things much easier.

Q&A

Models can also be used to answer questions, so instead of readers having to search the docs for what they need, they can simply ask questions. A model can provide answers based on the documentation (and the codebase). This takes retrieval a step further: users can simply get their questions answered by a model. In some cases articles might not even be needed, as the model can explain in real time how the code is supposed to be used.

Summary

I believe as of today, even the best tools available for documentation leave room for improvement and large language models have the potential to radically change the game.

In this post we looked at:

The two main types of documentation: API reference and articles, which should both live in source control.
API reference should be extracted from code comments.
Articles should be written as Markdown.
CI validation for documentation: enforce API documentation and ensure code samples still work.
Ownership tracking: ensuring someone feels responsible for every piece of documentation.
Inline fragments as a proposed solution to keep information in sync across multiple documents.
Style guides for documentation to ensure consistent & readable articles.
Information architecture to improve overall structure and navigation.
Potential AI applications throughout the lifecycle:
- Coauthoring documentation with generative AI.
- Reviewing documentation and providing suggestions (for style, consistency, readability).
- Finding the right documentation using embeddings.
- Answering natural language questions.

Some of these features exist and some of these practices are adopted in some projects, but most are not widely implemented. I'm curious to see what the landscape will look like in a few years and how AIs will change the way we learn and get our questions answered.

Mental Poker Part 1: Cryptography

Tue, 14 Mar 2023 00:00:00 -0700

Mental Poker Part 1: Cryptography

In the previous post I outlined some of the interesting bits of putting together a Mental Poker toolkit. In this post I will talk about cryptography.

The golden rule when it comes to cryptography code is to not roll your own, rather use something that's been battle-tested. That said, I could not find what I needed so had to implement some stuff. I urge you not to rely on my implementation for high-stakes poker, as it is likely buggy.

With the disclaimer out of the way, let's look at what we need to support Mental Poker.

Card shuffling

Recap from this old post when I first got interested in the subject:

Mental poker requires a commutative encryption function. If we encrypt $A$ using $Key_1$ and then encrypt the result using $Key_2$, we should be able to decrypt the result back to $A$ regardless of the order of decryption (first with $Key_1$ and then with $Key_2$, or vice-versa).

Here is how Alice and Bob play a game of mental poker:

Alice takes a deck of cards (an array), shuffles the deck, generates a secret key $K_A$, and encrypts each card with $K_A$.

Alice hands the shuffled and encrypted deck to Bob. At this point, Bob doesn't know what order the cards are in (since Alice encrypted the cards in the shuffled deck).

Bob takes the deck, shuffles it, generates a secret key $K_B$, and encrypts each card with $K_B$.

Bob hands the deck to Alice. At this point, neither Alice nor Bob knows what order the cards are in. Alice got the deck back reshuffled and re-encrypted by Bob, so she no longer knows where each card ended up. Bob reshuffled an encrypted deck, so he also doesn't know where each card is.

At this point the cards are shuffled. In order to play, Alice and Bob also need the capability to look at individual cards. In order to enable this, the following steps must happen:

Alice decrypts the shuffled deck with her secret key $K_A$. At this point she still doesn't know where each card is, as cards are still encrypted with $K_B$.

Alice generates a new set of secret keys, one for each card in the deck. Assuming a 52-card deck, she generates $K_{A_1} ... K_{A_{52}}$ and encrypts each card in the deck with one of the keys.

Alice hands the deck of cards to Bob. At this point, each card is encrypted by Bob's key, $K_B$, and one of Alice's keys, $K_{A_i}$.

Bob decrypts the cards using his key $K_B$. He still doesn't know where each card is, as now the cards are encrypted with Alice's keys.

Bob generates another set of secret keys, $K_{B_1} ... K_{B_{52}}$, and encrypts each card in the deck.

Now each card in the deck is encrypted with a unique key that only Alice knows and a unique key only Bob knows.

If Alice wants to look at a card, she asks Bob for his key for that card. For example, if Alice draws the first card, encrypted with $K_{A_1}$ and $K_{B_1}$, she asks Bob for $K_{B_1}$. If Bob sends her $K_{B_1}$, she now has both keys to decrypt the card and look at it. Bob still can't decrypt it because he doesn't have $K_{A_1}$.

This way, as long as both Alice and Bob agree that one of them is supposed to see a card, they exchange keys as needed to enable this.
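The whole exchange can be traced in code. The sketch below is a toy under several assumptions: a three-card deck, the small prime $2^{31} - 1$ (far too small to be secure), hardcoded keys, and fixed permutations in place of real shuffles so the result is repeatable. The helpers powMod(), invMod(), encrypt(), and decrypt() are hypothetical stand-ins for the commutative encryption described later in this post.

```typescript
// Toy parameters, far too small to be secure; for illustration only.
const P = BigInt(2147483647); // 2^31 - 1, a known prime
const PHI = P - BigInt(1);

// Square-and-multiply modular exponentiation
function powMod(base: bigint, e: bigint, m: bigint): bigint {
    let result = BigInt(1);
    base %= m;
    while (e > BigInt(0)) {
        if (e % BigInt(2) === BigInt(1)) {
            result = (result * base) % m;
        }
        base = (base * base) % m;
        e /= BigInt(2);
    }
    return result;
}

// Extended Euclid: returns d such that (e * d) % m === 1
function invMod(e: bigint, m: bigint): bigint {
    let [r0, r1] = [m, e % m];
    let [t0, t1] = [BigInt(0), BigInt(1)];
    while (r1 !== BigInt(0)) {
        const q = r0 / r1;
        [r0, r1] = [r1, r0 - q * r1];
        [t0, t1] = [t1, t0 - q * t1];
    }
    if (r0 !== BigInt(1)) {
        throw new Error("No inverse");
    }
    return ((t0 % m) + m) % m;
}

const encrypt = (card: bigint, key: bigint) => powMod(card, key, P);
const decrypt = (card: bigint, key: bigint) => powMod(card, invMod(key, PHI), P);

const deck = [BigInt(10), BigInt(11), BigInt(12)]; // three toy "cards"

// Each player "shuffles" (a fixed permutation here) and encrypts with one key
const KA = BigInt(5);
const KB = BigInt(13);
let cards = [deck[2], deck[0], deck[1]].map((c) => encrypt(c, KA)); // Alice
cards = [cards[2], cards[0], cards[1]].map((c) => encrypt(c, KB)); // Bob

// Alice swaps her deck key for per-card keys, then Bob does the same
const aliceKeys = [BigInt(17), BigInt(19), BigInt(23)];
const bobKeys = [BigInt(29), BigInt(37), BigInt(41)];
cards = cards.map((c, i) => encrypt(decrypt(c, KA), aliceKeys[i]));
cards = cards.map((c, i) => encrypt(decrypt(c, KB), bobKeys[i]));

// Alice draws card 0: Bob reveals bobKeys[0] and Alice applies both keys
const drawn = decrypt(decrypt(cards[0], bobKeys[0]), aliceKeys[0]);
```

All keys here are primes that do not divide $P - 1$, which is what makes each of them invertible. A real game would generate large random keys and use a real shuffle.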

The reason I ended up hand-rolling some cryptography is that off-the-shelf encryption algorithms are non-commutative. With a non-commutative algorithm, the above steps don't work: Alice cannot decrypt the deck with her secret key $K_A$ after Bob shuffled it and encrypted it with $K_B$.

The analogy I used in this tech talk is boxes and locks: if we have commutative encryption, we put the secret information in a box and both Alice (using $K_A$) and Bob (using $K_B$) put a lock on that box. It doesn't really matter in which order we unlock the two locks - as long as both are unlocked, we can get to the content. On the other hand, if we have non-commutative encryption, this is equivalent to Alice putting the secret in a box locked with $K_A$, and Bob putting the whole locked box in another box locked with $K_B$. Now Alice's key is useless as long as the outer box still has the $K_B$ lock on it.

There aren't as many applications for commutative encryption, so the popular libraries out there provide only non-commutative encryption algorithms. The commutative encryption algorithm we will look at is SRA.

SRA

The SRA encryption algorithm was designed by Shamir, Rivest, and Adleman of RSA fame. Both algorithms use their initials, but the industry-standard RSA is non-commutative. SRA, on the other hand, is commutative.

SRA works like this: we need a large prime number $P$. This seed prime is shared by all players. To generate encryption keys from it, let $\phi = P - 1$. Each player needs to find another prime $E$, such that $\phi$ and $E$ are coprime. $E$ is that player's encryption key. The decryption key is derived from $\phi$ and $E$ as the modulo-inverse $D$ such that $E * D \equiv 1 \pmod{\phi}$.

To encrypt a number $N$, we raise it to $E$ modulo $P$. To decrypt an encrypted number $N'$, we raise it to $D$ modulo $P$.

Then if player 1 encrypts a payload with $E_1$ and player 2 encrypts again using $E_2$, the message can be decrypted by applying $D_1$ and $D_2$ in any order. Remember, this is key to the card shuffling algorithm.
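Here is a worked example with numbers small enough to check by hand. Everything in it is an assumption chosen for illustration: the prime 23, the keys 3 and 7, and the naive helpers powMod() and decryptionKey(), which only work for tiny values. It is a sketch of the idea, not a usable implementation.

```typescript
// Tiny numbers so every step can be checked by hand; not secure.
const P = 23;      // shared prime
const PHI = P - 1; // 22

function gcd(a: number, b: number): number {
    while (b) {
        [a, b] = [b, a % b];
    }
    return a;
}

// Naive modular exponentiation: fine for tiny exponents only
function powMod(base: number, e: number, m: number): number {
    let result = 1;
    for (let i = 0; i < e; i++) {
        result = (result * base) % m;
    }
    return result;
}

// Brute-force search for D such that (E * D) % PHI === 1
function decryptionKey(E: number): number {
    if (gcd(E, PHI) !== 1) {
        throw new Error("E must be coprime with PHI");
    }
    for (let D = 1; D < PHI; D++) {
        if ((E * D) % PHI === 1) {
            return D;
        }
    }
    throw new Error("No inverse");
}

const E1 = 3;
const E2 = 7;
const D1 = decryptionKey(E1); // 15, since 3 * 15 = 45 = 2 * 22 + 1
const D2 = decryptionKey(E2); // 19, since 7 * 19 = 133 = 6 * 22 + 1

const N = 5;
const encrypted = powMod(powMod(N, E1, P), E2, P);

// Decrypt in both orders; each recovers the original N
const viaD1First = powMod(powMod(encrypted, D1, P), D2, P);
const viaD2First = powMod(powMod(encrypted, D2, P), D1, P);
```

Both decryption orders work because the exponents multiply: the result is $N^{E_1 E_2 D_1 D_2} \pmod{P}$ either way, and multiplication of exponents is commutative.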

For a simple implementation, we can use arbitrarily large integers (BigInt). Unfortunately, the built-in JavaScript math libraries only work with number values, so we need to implement a bit of math.

BigInt math

First, we need to find the greatest common divisor of two numbers:

function gcd(a: bigint, b: bigint): bigint {
    while (b) {
        [a, b] = [b, a % b];
    }

    return a;
}

We use this to check if two numbers are coprime (their GCD is 1).

Next, we need modulo inverse (find x such that (a * x) % m == 1). One way of doing this is using Euclidean division. We use the same algorithm we used for GCD, but we keep track of the values we find at each step. If the GCD is not 1, there is no modulo inverse. Otherwise we find the modulo inverse by starting with a pair of numbers x = 1, y = 0 and iterating backwards over the values we saved, updating x to be y and y to be x - y * (a / b), where a and b are the values saved at that step and a / b is integer division:

function modInverse(a: bigint, m: bigint) {
    a = ((a % m) + m) % m;

    if (!a || m < 2) {
        throw new Error("Invalid input");
    }

    // Find GCD (and remember numbers at each step)
    const s = [];
    let b = m;
    while (b) {
        [a, b] = [b, a % b];
        s.push({ a, b });
    }

    if (a !== BigInt(1)) {
        throw new Error("No inverse");
    }

    // Find the inverse
    let x = BigInt(1);
    let y = BigInt(0);

    for (let i = s.length - 2; i >= 0; --i) {
        [x, y] = [y, x - y * (s[i].a / s[i].b)];
    }

    return ((y % m) + m) % m;
}

This gives us the modulo inverse. To recap, we use this once we have a large prime $P$ with $\phi = P - 1$ and a large prime $E$ such that $gcd(E, \phi) = 1$ to find our decryption key $D$.

We also need modulo exponentiation for encryption/decryption. Since we are dealing with large numbers, we will implement exponentiation using the ancient Egyptian multiplication algorithm. To raise b to e modulo m, if e is 1, we return b. Otherwise we recursively raise (b * b) % m to e / 2 modulo m. Whenever e is odd, we multiply the recursion result by an additional b:

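function exp(b: bigint, e: bigint, m: bigint): bigint {
    // Anything raised to 0 is 1; without this case e = 0 never terminates
    if (e === BigInt(0)) {
        return BigInt(1) % m;
    }

    // Base case: reduce b in case the caller passed b >= m
    if (e === BigInt(1)) {
        return b % m;
    }

    // Square the base and halve the exponent (integer division)
    let result = exp((b * b) % m, e / BigInt(2), m);

    // An odd exponent leaves one extra factor of b
    if (e % BigInt(2) === BigInt(1)) {
        result *= b;
    }

    return result % m;
}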

This algorithm runs in log e time and keeps the large numbers to a manageable size since we apply modulo m at each step. We have most of the math pieces in place. The only thing missing is a way to generate large primes.
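The same square-and-halve idea can also be written as a loop. The sketch below is a hypothetical alternative formulation, not the toolkit's code; expIterative is a name made up for this example. It consumes the exponent one bit at a time, so the number of iterations is the bit length of e.

```typescript
// Iterative square-and-multiply; a hypothetical alternative to the
// recursive version, shown here only for comparison.
function expIterative(b: bigint, e: bigint, m: bigint): bigint {
    let result = BigInt(1) % m;
    b %= m;

    while (e > BigInt(0)) {
        // If the lowest bit of e is set, fold the current power into the result
        if (e % BigInt(2) === BigInt(1)) {
            result = (result * b) % m;
        }

        // Square the base and drop the lowest bit of e
        b = (b * b) % m;
        e /= BigInt(2);
    }

    return result;
}

// 4^13 = 67108864, and 67108864 % 497 = 445
const small = expIterative(BigInt(4), BigInt(13), BigInt(497));
```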

Generating large primes

One way of generating large primes is through trial and error: we generate a large number, check if it is prime, and repeat if it isn't. We can generate a large number by filling a byte array with random values, then converting it into a BigInt:

function randBigInt(sizeInBytes: number = 128): bigint {
    let buffer = new Uint8Array(sizeInBytes);
    crypto.getRandomValues(buffer);

    // Build a bigint out of the buffer
    let result = BigInt(0);
    buffer.forEach((n) => {
        result = result * BigInt(256) + BigInt(n);
    });

    return result;
}

This gives us a random number of as many bytes as we want (default being 128 bytes, i.e. 1024 bits). Since we are dealing with very large numbers, we can't naively test for primality of $N$ by trying divisions up to $\sqrt{N}$ because this is too expensive. We instead use the probabilistic Miller-Rabin test.

In short, Miller-Rabin works like this: we can write an odd integer $N$ (our prime candidate) as $N = 2^S * D + 1$ where $S$ and $D$ are positive integers and $D$ is odd.

Let's take another integer $A$ coprime with $N$. $N$ is likely to be prime if $A^D \equiv 1 \pmod{N}$ or $A^{2^{R}*D} \equiv -1 \pmod{N}$ for some $0 \le R < S$. If this is not the case, then $N$ is not a prime and $A$ is called a witness of the compositeness of $N$.

This is a probabilistic test, so we can tell whether $N$ is for sure non-prime or likely to be prime. Unfortunately, we can't tell for sure that $N$ is prime. We need to run multiple iterations of this picking different $A$ values until we are satisfied that $N$ is likely enough to be prime.
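
As a quick worked example with numbers small enough to check by hand (the values are chosen purely for illustration), take the prime $N = 13$ and the composite $N = 15$, both tested with $A = 2$:

```latex
\[
\begin{aligned}
N = 13:&\quad N - 1 = 2^2 * 3,\; S = 2,\; D = 3 \\
&\quad 2^{3} \equiv 8 \pmod{13} \\
&\quad 2^{2 * 3} \equiv 8^2 \equiv 64 \equiv -1 \pmod{13} \\
N = 15:&\quad N - 1 = 2^1 * 7,\; S = 1,\; D = 7 \\
&\quad 2^{7} \equiv 128 \equiv 8 \pmod{15}
\end{aligned}
\]
```

For $N = 13$ the second check hits $-1$, so $2$ is not a witness and $13$ passes this round. For $N = 15$ the result is neither $1$ nor $-1$, so $2$ is a witness and $15$ is composite.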

First, we need a helper function that checks $A$ is not a witness of $N$, given $A$, $N$, and $S$ and $D$ such that $N = 2^S * D + 1$.

We compute $U$ as $A^D \pmod{N}$. If $U - 1 = 0$ or $U + 1 = N$, then $A$ is not a witness of $N$. Otherwise, we repeat $S - 1$ times: $U = U^2 \pmod{N}$ and $A$ is not a witness if $U + 1 = N$. At this point, if we haven't confirmed that $A$ is not a witness, we consider $A$ a witness of $N$ thus $N$ is not prime. These are simply the checks described above ($A^D \equiv 1 \pmod{N}$ and $A^{2^{R}*D} \equiv -1 \pmod{N}$) in implementation form.

function isNotWitness(a: bigint, d: bigint, s: bigint, n: bigint): boolean {
    if (a === BigInt(0)) {
        return true;
    }

    // u is a ^ d % n
    let u = exp(a, d, n);

    // a is not a witness if u - 1 = 0 or u + 1 = n
    if (u - BigInt(1) === BigInt(0) || u + BigInt(1) === n) {
        return true;
    }

    // Repeat s - 1 times
    for (let i = BigInt(0); i < s - BigInt(1); i++) {
        // u = u ^ 2 % n
        u = exp(u, BigInt(2), n);

        // a is not a witness if u = n - 1
        if (u + BigInt(1) === n) {
            return true;
        }
    }

    // a is a witness of n
    return false;
}

With this, we can finally implement Miller-Rabin. We first check a few trivial cases (2 and 3 are prime, even numbers are non-prime). We then find $S$ and $D$ such that our number $N = 2^S * D + 1$ (we do this by factoring out powers of 2 from $N - 1$).

We then repeat the test: get a random number $A < N$. If $A$ is a witness of $N$, then $N$ is not prime. If we run this test enough times, we can safely assume the number is prime. According to this, 40 rounds should be good enough for a 1024 bit prime.

function millerRabinTest(candidate: bigint): boolean {
    // Handle some obvious cases
    if (candidate === BigInt(2) || candidate === BigInt(3)) {
        return true;
    }
    if (candidate % BigInt(2) === BigInt(0) || candidate < BigInt(2)) {
        return false;
    }

    // Find s and d
    let d = candidate - BigInt(1);
    let s = BigInt(0);

    while ((d & BigInt(1)) === BigInt(0)) {
        d = d >> BigInt(1);
        s++;
    }

    // Test 40 rounds.
    for (let k = 0; k < 40; k++) {
        let a = randBigInt() % candidate;

        if (!isNotWitness(a, d, s, candidate)) {
            return false;
        }
    }

    return true;
}

Note d and s above are technically only needed in isNotWitness(), but since they are based on our prime candidate, we compute them once and pass them as arguments to isNotWitness() rather than having to recompute them on each call of the function.

We can finally implement our prime generator. We simply generate large numbers and repeat until one passes the Miller-Rabin test, which means it is prime with very high probability:

function randPrime(sizeInBytes: number = 128): bigint {
    let candidate = BigInt(0);

    do {
        candidate = randBigInt(sizeInBytes);
    } while (!millerRabinTest(candidate));

    return candidate;
}

Cryptography

With the low-level math out of the way, we can implement the cryptography API. First, we will define an SRAKeyPair as consisting of the initial large prime $P$ and the derived $E$ and $D$ used for encryption/decryption:

type SRAKeyPair = {
    prime: bigint;
    enc: bigint;
    dec: bigint;
};

We can generate a large prime using randPrime(). From such a prime, we can generate an SRAKeyPair:

function generateKeyPair(largePrime: bigint, size: number = 128): SRAKeyPair {
    const phi = largePrime - BigInt(1);
    let enc = BigInt(0);

    // Trial and error
    for (;;) {
        // Generate a large prime
        enc = randPrime(size);

        // Stop when generated prime and passed in prime - 1 are coprime
        if (gcd(enc, phi) === BigInt(1)) {
            break;
        }
    }

    // enc is our encryption key, now let's find dec as the mod inverse of enc
    let dec = modInverse(enc, phi);

    return {
        prime: largePrime,
        enc: enc,
        dec: dec,
    };
}

If we have an SRAKeyPair, we can encrypt/decrypt numbers using the modulo exponentiation function we defined above (exp()):

function encryptInt(n: bigint, kp: SRAKeyPair) {
    return exp(n, kp.enc, kp.prime);
}

function decryptInt(n: bigint, kp: SRAKeyPair) {
    return exp(n, kp.dec, kp.prime);
}

We can also convert a string into a BigInt and vice-versa. Assuming we only have character codes below 256 (which covers ASCII), we can simply encode the string as a base-256 number where each digit is a character:

function stringToBigInt(str: string): bigint {
    let result = BigInt(0);

    for (const c of str) {
        if (c.charCodeAt(0) > 255) {
            throw Error(`Unexpected char code ${c.charCodeAt(0)} for ${c}`);
        }

        result = result * BigInt(256) + BigInt(c.charCodeAt(0));
    }

    return result;
}

The ASCII assumption is reasonable, since we use this at a protocol level, not as part of the user experience. We can decode such a number back into a string using division and modulo:

function bigIntToString(n: bigint): string {
    let result = "";
    let m = BigInt(0);

    while (n > 0) {
        [n, m] = [n / BigInt(256), n % BigInt(256)];
        result = String.fromCharCode(Number(m)) + result;
    }

    return result;
}

Now that we have these conversions, we can implement string encryption/decryption on top of our encryptInt() and decryptInt() functions:

function encryptString(clearText: string, kp: SRAKeyPair): string {
    return bigIntToString(encryptInt(stringToBigInt(clearText), kp));
}

function decryptString(cypherText: string, kp: SRAKeyPair): string {
    return bigIntToString(decryptInt(stringToBigInt(cypherText), kp));
}

We can encode any JSON-serializable object as a string (and decode strings back to objects):

function encrypt<T>(obj: T, kp: SRAKeyPair): string {
    return encryptString(JSON.stringify(obj), kp);
}

function decrypt<T>(cypherText: string, kp: SRAKeyPair): T {
    return JSON.parse(decryptString(cypherText, kp));
}

And that's it! We start with randPrime() to generate a large prime, then use generateKeyPair() to derive $E$ and $D$ from it. We can then use this SRAKeyPair with encrypt() and decrypt() to encrypt/decrypt objects using the commutative SRA algorithm.

Here is a small example pulling everything together:

// Seed prime used by both players to generate keys
const sharedPrime = randPrime();

const aliceKP = generateKeyPair(sharedPrime);
const bobKP = generateKeyPair(sharedPrime);

const card = "Ace of spades";

// Encrypt with Alice's key first, then Bob's
const aliceEncrypted = encryptString(card, aliceKP);
const aliceAndBobEncrypted = encryptString(aliceEncrypted, bobKP);

// Decrypt with Alice's key first, then Bob's
const bobEncrypted = decryptString(aliceAndBobEncrypted, aliceKP);
const decrypted = decryptString(bobEncrypted, bobKP);

// Prints "Ace of spades"
console.log(decrypted);
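
The order of decryption does not matter because exponents multiply, and multiplication commutes. Here is a toy check with numbers small enough to verify by hand. The values $P = 23$ (so $\phi = 22$), $E_A = 3$, $D_A = 15$, $E_B = 5$, $D_B = 9$ and the message $M = 4$ are made up for illustration; real keys are far larger:

```latex
\[
\begin{aligned}
E_A * D_A &= 45 \equiv 1 \pmod{22}, \qquad E_B * D_B = 45 \equiv 1 \pmod{22} \\
4^{3} &\equiv 18 \pmod{23} && \text{Alice encrypts} \\
18^{5} &\equiv 3 \pmod{23} && \text{Bob encrypts} \\
3^{15} &\equiv 12 \pmod{23} && \text{Alice decrypts first} \\
12^{9} &\equiv 4 \pmod{23} && \text{Bob decrypts, recovering } M
\end{aligned}
\]
```

Decrypting in the other order also ends at $4$, since $3^{9 * 15} = 3^{15 * 9}$.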

Summary

We went over a short overview of the SRA algorithm.
We looked at BigInt implementations for GCD, modulo inverse, and modulo exponentiation.
Then we generated random large numbers by filling a buffer, and tested them for primality using the Miller-Rabin test.
With the math in place, we implemented a key generator for SRA (taking a prime and deriving $E$ and $D$).
We can encrypt/decrypt numbers by simply applying modulo exponentiation.
We can encrypt/decrypt any string by converting it to a BigInt, and more generally any object by stringifying it.

My work-in-progress Mental Poker Toolkit is here. This post covered the cryptography package.

Mental Poker Part 0: An Overview

Sat, 18 Feb 2023 00:00:00 -0800

Mental Poker Part 0: An Overview

I wrote previously about Mental Poker, how one can set up a game in a zero trust environment, and how this could be implemented using Fluid Framework.

Since the previous post, I spent some more time prototyping an implementation with a colleague and did a tech talk about it.

If you haven't read the previous post and are not familiar with Mental Poker, the following won't make much sense. Please start there or by watching the tech talk video.

The implementation consists of a few components:

A Fluid Framework append-only list distributed data structure - Used for tracking turns in a game.
Cryptography - An implementation of SRA (commutative encryption algorithm required by Mental Poker) and digital signing (required for authenticating messages from players).
Game client - A layer that abstracts communication between clients, with an implementation on Fluid Framework.
A state machine - Used to model games (If I make this move I expect the other player to make that move).
Recipes built on top of the state machine, like shuffling a deck of cards using the steps described in my previous blog post.

At the time of writing, the append-only list distributed data structure is ready, available on my GitHub as fluid-ledger and published on npm.

The other components will all eventually end up in the mental-poker-toolkit repo.

Some parts, like cryptography and the game client, I cleaned up and moved from a private hackathon repo. Other parts, like the state machine, require major rework, which I haven't gotten around to yet.

The plan is to provide a quality implementation with good documentation and samples. A major difference between the hackathon proof of concept and the toolkit is that the proof of concept implements a simple discard game for two players, while I'm hoping the toolkit can support games with more than two players.

Discard game

Modeling a game like Poker is non-trivial. That said, a big part of the complexity comes from the rules of the game itself. For a proof of concept of Mental Poker, we didn't want to get in the weeds of Poker rules, but rather showcase the key ideas of how two players can shuffle a deck of cards, agree on what order the cards end up in, while at the same time each being able to maintain some private state (cards in hand). All of this is done over a public channel (Fluid Framework).

The game we modeled was simple: players draw a hand of cards, then take turns discarding by number or suit. If a player can't discard (no matching number or suit), they draw cards until they can discard. The player who discards their whole hand first wins.

This prototype informed the components we had to build.

A new distributed data structure

Fluid Framework does not offer, out of the box, a data structure like the one needed to model a sequence of moves. We ended up using SharedObjectSequence, a data structure that was marked as deprecated and has since been removed from Fluid. In general, the Fluid data structures that support lists are overkill for Mental Poker as they support insertion and deletion of sequences of items at arbitrary positions. For modeling a game, we just need an append-only list: players take turns and each move means appending something to the end of the list.

In fact, having an append-only list ensures that we don't run into issues like a client unexpectedly inserting something in the middle of the list, which doesn't make sense if we're modeling a sequence of moves in a game.

Cryptography

I was also not able to find a package providing commutative encryption. This is a key requirement for the Mental Poker protocol but industry standard cryptography algorithms do not have this property. I ended up implementing the SRA algorithm from scratch, including a bunch of BigInt math. I still strongly believe in the don't roll your own crypto rule, so please do not use my implementation to play Poker for real money.

Besides encryption, we also need digital signatures. When a player joins a game, they generate a public/private key pair and their first action is to post their public key. All subsequent moves from that player are signed with the private key, so other players can ensure the action is taken by the player claiming to take that action, eliminating spoofing. Fortunately we were able to use Crypto.subtle for this (see Crypto Web API).

State machine

Another interesting discovery was the state machine. A high-level game move, like "I'm drawing a card from the top of the pile", translates into a message exchange between the players:

Alice: I'm drawing a card from the top of the pile.
Bob: Here is my key for that card.

Shuffling cards, as described in the previous blog post, includes a longer sequence of steps. We needed a way to express "I do this, then I expect the other player to reply with that". We can use such a state machine to express sequences of multiple moves to implement things like card shuffling.

The proof of concept state machine uses a queue of expected moves from the other player to implement the game mechanics and Mental Poker protocol. For example, for the Discard game, if it is the other player's turn, we expect one of two things to happen: they either discard a card or draw a card.

If they discard a card, then they publish their encryption key for the card which we can use to see the card (again, please refer to the previous Mental Poker post for details on the protocol). Alternately, if they can't discard a card, they need to draw a card, in which case we have to hand over our encryption key for the card on top of the deck.

Recipes

Some of the rules captured in this state machine are specific to each game implemented. Others though are simply steps in the Mental Poker protocol: things like shuffling, drawing cards etc. are all modeled as actions I take and actions I expect the other player to follow up with. I envision expressing such known sequences as recipes, building blocks for games.

As I mentioned before, the proof of concept state machine implementation requires some major rework. It needs to scale from two players to an arbitrary number of players, and needs to support recipes, which it currently doesn't. At the time of writing, this is one of the biggest chunks of pending work, and considering this is a hobby project I work on when time permits, I currently don't have a good sense of when I'll finish this. That said, a bunch of pieces are already in decent shape and public, so I plan to write about them while I continue working on finishing the toolkit.

Mental Poker series

In upcoming blog posts, I plan to cover the various pieces discussed above. The components address different problems, and I find all of them quite interesting. The problem space includes understanding how Fluid Framework distributed data structures work internally, how to generate large prime numbers, and how to model expected sequences of moves in a game among other things.

This post outlines the high level framing of the project. Following posts will dive deep into specific aspects.

In terms of applications, as I mention in the tech talk, the term games is pretty broad - we're not talking only about card games, but things like auctions, lotteries, blind voting etc. All of these can be implemented using Mental Poker as decentralized, zero-trust games.

Notes on Advent of Code 2022

Sat, 07 Jan 2023 00:00:00 -0800

Notes on Advent of Code 2022

I've been having fun solving Advent of Code problems every December for a few years now. Advent of Code is an advent calendar of programming puzzles.

All my solutions are on my GitHub here. First, a quick disclaimer:

Disclaimer on my solutions

I use Python because I find it easiest for this type of coding. I treat solving these as a write-only exercise. I do it for the problem-solving bit, so I don't comment the code & once I find the solution I consider it done - I don't revisit and try to optimize even though sometimes I strongly feel like there is a better solution. I don't even share code between part 1 and part 2 - once part 1 is solved, I copy/paste the solution and change it to solve part 2, so each can be run independently. I also rarely use libraries, and when I do it's some standard ones like re, itertools, or math. The code has no comments and is littered with magic numbers and strange variable names. This is not how I usually code, rather my decadent holiday indulgence. I wasn't expecting to end up writing a blog post discussing my solutions, so I would like to apologize for the code being hard to read.

With that long disclaimer out of the way, let's talk Advent of Code 2022. I figured I'll cover a few problems that seemed interesting to me during this round, before they fade in my memory. The first couple of weeks are usually easy, so I'll start from day 15.

Day 15: Beacon Exclusion Zone

Problem statement is here.

Part 1

Part 1 is pretty easy. We use taxicab geometry and for each sensor, we can find its scan radius by computing the Manhattan distance between its coordinates and the closest beacon it sees. Once we have this, we intersect each (taxicab) circle with the row y=2000000. This gives us a bunch of segments defined by (x0, x1) pairs.

import re

y, segments = 2000000, set()

for line in open('input').readlines():
    m = re.match(r'Sensor at x=(-?\d+), y=(-?\d+).*x=(-?\d+), y=(-?\d+)$', line)
    sx, sy, bx, by = map(int, m.groups())
    radius = abs(sx - bx) + abs(sy - by)

    if abs(sy - y) <= radius:
        segments.add(((sx - (radius - abs(sy - y)),
                     (sx + (radius - abs(sy - y))))))

We need to figure out where these overlap so we don't double-count. For each pair of segments, if they intersect, we replace them with their union, and repeat until no segments intersect anymore. Then we simply sum the length of each segment:

def intersect(s1, s2):
    return s1[1] >= s2[0] and s2[1] >= s1[0]

def union(s1, s2):
    return (min(s1[0], s2[0]), max(s1[1], s2[1]))

done = False
while not done:
    done = True
    for s1 in segments:
        for s2 in segments:
            if s1 == s2:
                continue

            if intersect(s1, s2):
                segments.remove(s1)
                segments.remove(s2)
                segments.add(union(s1, s2))
                done = False
                break

        if not done:
            break

print(sum([s[1] - s[0] for s in segments]))
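
The repeated pairwise merge above restarts after every union. As an alternative sketch (not the approach used in the solution above), the same union can be computed in a single pass after sorting the segments by their left endpoint. The function name merge_segments and the sample values are made up for illustration; segments are inclusive (x0, x1) pairs that merge only when they overlap, matching intersect():

```python
def merge_segments(segments):
    """Merge overlapping inclusive (x0, x1) segments into disjoint ones."""
    merged = []
    for x0, x1 in sorted(segments):
        if merged and x0 <= merged[-1][1]:
            # Overlaps the last merged segment: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], x1))
        else:
            merged.append((x0, x1))
    return merged


example = {(-2, 2), (2, 14), (12, 12), (16, 24), (14, 18)}
print(merge_segments(example))  # [(-2, 24)]
```

Sorting guarantees that a segment can only ever overlap the most recently merged one, which is why a single comparison per segment is enough.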

Part 2

Part 2 is more interesting. We need to scan a quite large area (both x and y between 0 and 4000000). We know that all points except one are covered by at least one sensor. We start from (0, 0) and scan like this: for each point, find the first sensor that sees it (Manhattan distance from sensor <= sensor radius). If no sensor can see it, we found our point. Otherwise, again relying on taxicab geometry, we can tell how many additional points to the right (increasing x) are still in range of this sensor. We move x beyond these ($x = x_{sensor} + radius - |y_{sensor} - y| + 1$). If x goes beyond 4000000, we reset it to 0 and increment y. This is not blazingly fast, but does the job in a reasonable amount of time (around 20 seconds on my machine).

import re

sensors = []

for line in open('input').readlines():
    m = re.match(r'Sensor at x=(-?\d+), y=(-?\d+).*x=(-?\d+), y=(-?\d+)$', line)
    sx, sy, bx, by = map(int, m.groups())
    radius = abs(sx - bx) + abs(sy - by)
    sensors.append((sx, sy, radius))

def in_range(x, y):
    for sensor in sensors:
        if abs(sensor[0] - x) + abs(sensor[1] - y) <= sensor[2]:
            return True, sensor

    return False, None

x, y = 0, 0
while True:
    found, sensor = in_range(x, y)
    if not found:
        break

    x = sensor[0] + sensor[2] - abs(sensor[1] - y) + 1
    if x > 4_000_000:
        x = 0
        y += 1

print(x * 4_000_000 + y)
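
To make the jump concrete, here is a small worked sketch with made-up numbers (the helper name next_x is hypothetical and the sensor values are not from any real input). A sensor at (8, 7) with radius 9 covers, on row y = 10, every x from 2 to 14, so from any covered point on that row we can jump straight to x = 15:

```python
def next_x(sensor, y):
    """First x to the right of the sensor's coverage on row y.

    sensor is (sx, sy, radius); assumes row y is within the sensor's range.
    """
    sx, sy, radius = sensor
    return sx + radius - abs(sy - y) + 1


print(next_x((8, 7, 9), 10))  # 15
```

The further the row is from the sensor, the narrower the covered strip and the shorter the jump, which is why the scan slows down near the edges of each sensor's diamond.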

Day 16: Proboscidea Volcanium

Problem statement is here.

Part 1

Part 1 is again pretty easy: we can model the valves and tunnels as a graph, then use the Floyd-Warshall algorithm to find the distances between each pair of valves:

import re

dist, flows, to_open = {}, {}, set()

for line in open('input').readlines():
    m = re.match(
        r'Valve (\w+) has flow rate=(\d+); tunnels? leads? to valves? (.*)$', line)
    src, flow, *dst = m.groups()
    dst = [d.strip() for d in dst[0].split(',')]
    dist[src] = {d: 1 for d in dst} | {src: 0}
    flows[src] = int(flow)
    if flows[src] > 0:
        to_open.add(src)

for i in dist:
    for j in dist:
        if j not in dist[i]:
            dist[i][j] = 1000

for k in dist:
    for i in dist:
        for j in dist:
            if dist[i][j] > dist[i][k] + dist[k][j]:
                dist[i][j] = dist[i][k] + dist[k][j]
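
As a sanity check of the distance computation, here is the same triple loop on a tiny made-up graph, wrapped in a function so it can run on its own. The function name all_pairs, the valve names, and the tunnels are hypothetical, not from the puzzle input:

```python
def all_pairs(neighbors):
    """Floyd-Warshall over an unweighted graph given as {node: [neighbors]}."""
    inf = float('inf')
    dist = {i: {j: inf for j in neighbors} for i in neighbors}
    for i, adjacent in neighbors.items():
        dist[i][i] = 0
        for j in adjacent:
            dist[i][j] = 1

    # Allow paths through each intermediate node k in turn.
    for k in dist:
        for i in dist:
            for j in dist:
                if dist[i][j] > dist[i][k] + dist[k][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


# AA - BB - CC - DD in a line, tunnels go both ways
graph = {'AA': ['BB'], 'BB': ['AA', 'CC'], 'CC': ['BB', 'DD'], 'DD': ['CC']}
dist = all_pairs(graph)
print(dist['AA']['DD'])  # 3
```

The k loop has to be the outermost one: after processing k, every distance is the shortest path that only uses the intermediate nodes seen so far.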

We can then search for the best solution recursively: we start from AA and keep track of which valves we opened (none for starters). Then at each step, pick one of the unopened valves. If we have enough time to reach it and open it, recurse with updated location and set of opened nodes. We also compute the total pressure released so far at each step and keep track of the highest value we found. This gives us the solution.

best = 0

def search(current='AA', opened=set(), time=30, score=0):
    global best

    score += time * flows[current]

    if score >= best:
        best = score

    for node in to_open - opened:
        if time - dist[current][node] - 1 >= 0:
            search(node, opened | {node}, time -
                   dist[current][node] - 1, score)

search()

print(best)

Part 2

Part 2 is more fun. We now have an elephant to help us, which makes it a bit more complicated. My solution now keeps track of a few more things: which valve am I headed to and how many more minutes I have to get there; which valve is the elephant headed to and how many more minutes until it gets there. We both start at AA with an ETA of 0. Then for each node, if my ETA is 0, I'll be heading that way. If not, the elephant will be heading there. But since we're dealing with two ETAs, we need to figure out which of us will reach their destination first, and recurse to that time.

best = 0

def search(me=('AA', 0), elephant=('AA', 0), opened=set(), time=26, score=0):
    global best

    if score > best:
        best = score

    for node in to_open - opened:
        me_next, elephant_next, score_next = me, elephant, score
        if me[1] == 0:
            me_next = (node, dist[me[0]][node] + 1)
            score_next += (time - dist[me[0]][node] - 1) * flows[node]
        else:
            elephant_next = (node, dist[elephant[0]][node] + 1)
            score_next += (time - dist[elephant[0]][node] - 1) * flows[node]

        dt = min(me_next[1], elephant_next[1])
        me_next = (me_next[0], me_next[1] - dt)
        elephant_next = (elephant_next[0], elephant_next[1] - dt)

        if time - dt >= 0:
            search(me_next, elephant_next, opened |
                   {node}, time - dt, score_next)

search()

print(best)

This works but takes a long time, so I added some caching: since both the elephant and I move around a bunch, we can cache the score for each combination of my destination and ETA, the elephant's destination and ETA, and the time. If at a given minute, both the elephant and I were already in this situation but with a better score, we no longer need to keep searching this branch as we already found a better solution. This prunes enough of the search tree to easily find the answer. Updated search with cache:

best = 0
cache = {}

def search(me=('AA', 0), elephant=('AA', 0), opened=set(), time=26, score=0):
    global best

    if score > best:
        best = score

    key = str(me) + str(elephant) + str(time)
    if key in cache:
        if cache[key] >= score:
            return

    cache[key] = score

    for node in to_open - opened:
        me_next, elephant_next, score_next = me, elephant, score
        if me[1] == 0:
            me_next = (node, dist[me[0]][node] + 1)
            score_next += (time - dist[me[0]][node] - 1) * flows[node]
        else:
            elephant_next = (node, dist[elephant[0]][node] + 1)
            score_next += (time - dist[elephant[0]][node] - 1) * flows[node]

        dt = min(me_next[1], elephant_next[1])
        me_next = (me_next[0], me_next[1] - dt)
        elephant_next = (elephant_next[0], elephant_next[1] - dt)

        if time - dt >= 0:
            search(me_next, elephant_next, opened |
                   {node}, time - dt, score_next)

search()

print(best)

Day 17: Pyroclastic Flow

Problem statement is here.

Part 1

For part 1 we can simply simulate the falling blocks and find the answer. This gives us some of the building blocks needed for part 2.

jets = open('input').read().strip()

rocks = [{(0, 0), (1, 0), (2, 0), (3, 0)}, 
         {(0, 1), (1, 0), (1, 1), (1, 2), (2, 1)},
         {(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)},
         {(0, 0), (0, 1), (0, 2), (0, 3)},
         {(0, 0), (0, 1), (1, 0), (1, 1)}]

grid = set({(i, 0) for i in range(1, 8)})

def intersects(rock, grid):
    for block in rock:
        if block in grid or block[0] <= 0 or block[0] >= 8:
            return True
    return False

def move(rock, dx, dy):
    return {(i + dx, j + dy) for i, j in rock}

rock_i, jet_i = 0, 0
for _ in range(2022):
    top = max(grid, key=lambda pt: pt[1])[1]
    rock = move(rocks[rock_i], 3, top + 4)

    while True:
        new_pos = move(rock, 1 if jets[jet_i] == '>' else -1, 0)
        jet_i += 1
        if jet_i == len(jets):
            jet_i = 0
        if not intersects(new_pos, grid):
            rock = new_pos
        new_pos = move(rock, 0, -1)
        if intersects(new_pos, grid):
            break
        rock = new_pos

    grid |= rock
    rock_i += 1
    if rock_i == len(rocks):
        rock_i = 0

print(max(grid, key=lambda pt: pt[1])[1])

Part 2

Part 2 makes it obvious simulating everything is not an option as we need to look at a thousand billion rocks. The key here is to find a pattern: we are bound to end up simulating the same rock and initial move instruction over and over. If we do and we see the same gain in height between repeats, it means we found our repeating pattern. We know that starting from this position, we have a period of length period in which our tower of rocks grows by growth. We subtract the number of rocks we already simulated from 1000000000000, we divide by period and multiply by growth. We'll call this delta_top.

We are close to the final answer. The only thing left to do is simulate a few more steps: (1000000000000 minus the number of rocks we already simulated) modulo period. Now we get the height of the top of the tower we simulated and add delta_top to it to find the final answer.

def top():
    return max(grid, key=lambda pt: pt[1])[1]

rock_i, jet_i = 0, 0
cache, delta_top = {}, 0
i = 0
while i < 10_000:
    rock = move(rocks[rock_i], 3, top() + 4)

    while True:
        new_pos = move(rock, 1 if jets[jet_i] == '>' else -1, 0)
        jet_i += 1
        if jet_i == len(jets):
            jet_i = 0
        if not intersects(new_pos, grid):
            rock = new_pos
        new_pos = move(rock, 0, -1)
        if intersects(new_pos, grid):
            break
        rock = new_pos

    grid |= rock
    rock_i += 1
    if rock_i == len(rocks):
        rock_i = 0

    i += 1
    
    if not delta_top:
        if (rock_i, jet_i) not in cache:
            cache[(rock_i, jet_i)] = []
        c = cache[(rock_i, jet_i)]
        c.append([i, top()])
        if len(c) > 2 and c[-1][1] - c[-2][1] == c[-2][1] - c[-3][1]:
            period, growth = c[-1][0] - c[-2][0], c[-1][1] - c[-2][1]
            delta_top = (1_000_000_000_000 - i) // period * growth
            i = 10_000 - (1_000_000_000_000 - i) % period

print(top() + delta_top)
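
The bookkeeping at the end is easy to get wrong, so here is the arithmetic on its own, with small made-up numbers (the function name skip_ahead and the values are hypothetical). Once a cycle of period rocks adding growth rows is detected after i rocks, the remaining rocks split into whole cycles that we skip and a remainder that still has to be simulated:

```python
def skip_ahead(total, i, period, growth):
    """Return (height gained by skipped cycles, rocks left to simulate)."""
    remaining = total - i
    delta_top = remaining // period * growth
    left_to_simulate = remaining % period
    return delta_top, left_to_simulate


# A cycle of 35 rocks adding 53 rows, detected after 100 of 1000 rocks:
# 900 rocks remain, which is 25 whole cycles plus 25 rocks
print(skip_ahead(1000, 100, 35, 53))  # (1325, 25)
```

The final height is then the simulated tower's height after those leftover rocks, plus delta_top.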

Day 18: Boiling Boulders

Problem statement is here.

Part 1

Part 1 is trivial so I won't discuss it here.

Part 2

Part 2 is also very easy, but I found a really neat solution worth sharing: since all boulders are within (0, 0, 0) and (20, 20, 20), I look at a grid encompassing everything ((-1, -1, -1) to (22, 22, 22)) and flood fill starting from (-1, -1, -1). We use a queue and at each step we dequeue a triple of coordinates. If already visited or out of bounds, we ignore it and continue. If it is a boulder, it means we reached one of its faces from the outside, so we count one unit of surface area and continue. Otherwise, we mark the coordinates as visited and enqueue all the neighbors. I like how every time we run into a boulder we get exactly the area we are looking for. The full solution is:

cubes = {tuple(map(int, line.strip().split(','))) for line in open('input').readlines()}

visited, queue, area = set(), [(-1, -1, -1)], 0
while queue:
    (x, y, z) = queue.pop(0)

    if (x, y, z) in visited:
        continue

    if not (-1 <= x <= 22 and -1 <= y <= 22 and -1 <= z <= 22):
        continue

    if (x, y, z) in cubes:
        area += 1
        continue

    visited.add((x, y, z))
    queue.append((x - 1, y, z))
    queue.append((x + 1, y, z))
    queue.append((x, y - 1, z))
    queue.append((x, y + 1, z))
    queue.append((x, y, z - 1))
    queue.append((x, y, z + 1))

print(area)
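
To convince ourselves the counting is right, here is a self-contained version of the same flood fill, run on two shapes small enough to check by hand. The function name exterior_area, the lo/hi bounds parameters, and the test shapes are assumptions for illustration; it uses a set and a deque instead of a list, which only affects speed, not the result:

```python
from collections import deque


def exterior_area(cubes, lo=-1, hi=4):
    """Count cube faces reachable from outside, flood filling from (lo, lo, lo)."""
    cubes = set(cubes)
    visited, queue, area = set(), deque([(lo, lo, lo)]), 0
    while queue:
        x, y, z = queue.popleft()
        if (x, y, z) in visited:
            continue
        if not all(lo <= c <= hi for c in (x, y, z)):
            continue
        if (x, y, z) in cubes:
            # Reached a cube from an outside air cell: one exposed face.
            area += 1
            continue
        visited.add((x, y, z))
        for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                           (0, -1, 0), (0, 0, 1), (0, 0, -1)):
            queue.append((x + dx, y + dy, z + dz))
    return area


single = [(1, 1, 1)]
# 3x3x3 block with the center removed: the inner pocket is unreachable
hollow = [(x, y, z) for x in range(3) for y in range(3) for z in range(3)
          if (x, y, z) != (1, 1, 1)]
print(exterior_area(single), exterior_area(hollow))  # 6 54
```

Each air cell is expanded exactly once, so each pair of adjacent air cell and cube contributes exactly one face. The six faces around the hollow center are never counted because the fill cannot get inside.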

Day 19: Not Enough Minerals

Problem statement is here.

I used the same solution for part 1 and part 2: a recursive search where we keep track of the bots and resources we have, and the time. The problem is it takes too long to simulate minute by minute. If we try deciding at each minute whether to build any of the bots we can build or keep collecting resources, then recurse to next minute, we end up with too much combinatorial complexity. My solution instead does something like this: for the current moment in time, for each type of robot, say we want to build that one next - based on costs and available resources, we can calculate how many minutes from now that robot can be built. We can then recurse (jump ahead in time) there updating available resources, since we know other robots won't be built until then.

As an additional optimization, we can keep track of how many geodes we collected at each minute and if our current search has fewer geodes, it means we already found a better solution and it is not worth going down this branch. There's probably smarter caching/pruning we can do but this seems to be good enough.

This tames the combinatorial complexity enough to get a reasonable run time, and going from simulating 24 minutes in part 1 to simulating 32 minutes for fewer blueprints in part 2 doesn't seem to require changing the algorithm. Both parts take around 2 minutes to run. It can probably be optimized further.

import re
import math

def run(bots, costs, resources, time):
    if best[time] > resources[3]:
        return

    best[time] = resources[3]

    if time == 0:
        return

    for bot_type in range(4):
        dt = math.ceil((costs[bot_type][0] - resources[0]) / bots[0])
        if bot_type >= 2:
            if bots[bot_type - 1] == 0:
                continue

            dt = max(dt, math.ceil((costs[bot_type][1] -
                                    resources[bot_type - 1]) / bots[bot_type - 1]))

        dt = max(dt, 0) + 1

        if time < dt:
            continue

        new_resources = [resources[i] + bots[i] * dt for i in range(4)]
        new_resources[0] -= costs[bot_type][0]
        if bot_type >= 2:
            new_resources[bot_type - 1] -= costs[bot_type][1]

        bots[bot_type] += 1
        run(bots, costs, new_resources, time - dt)
        bots[bot_type] -= 1

score = 1
for line in open('input').readlines()[:3]:
    m = re.match(
        r'.*(\d+) ore.*(\d+) ore.*(\d+) ore and (\d+) clay.*(\d+) ore and (\d+) obsidian', line)
    costs = list(map(int, m.groups()))

    costs = [[costs[0]], [costs[1]], [
        costs[2], costs[3]], [costs[4], costs[5]]]

    best = [0] * 33

    run([1, 0, 0, 0], costs, [0] * 4, 32)

    score *= best[0]

print(score)

Day 20: Grove Positioning System

Problem statement is here.

Day 20 was very easy so I won't cover it here.

Day 21: Monkey Math

Problem statement is here.

Part 1

Another easy one. For part 1, we parse the input into an expression tree (with values at leaf nodes and operators at non-leaf nodes) and we recursively evaluate it from the root.

tree = {}
for line in open('input').readlines():
    key, value = line.strip().split(': ')
    value = value.split(' ')
    if len(value) == 1:
        value = int(value[0])
    tree[key] = value


def get(key):
    if isinstance(tree[key], int):
        return tree[key]
    
    v1, v2 = get(tree[key][0]), get(tree[key][2])

    match tree[key][1]:
        case '+': return v1 + v2
        case '-': return v1 - v2
        case '*': return v1 * v2
        case '/': return v1 // v2

print(get('root'))

Part 2

Part 2 effectively makes the root be == and asks us to find the value for the humn node. For this, we can update our recursive evaluation to either compute a value or return None if humn is part of the subtree we're trying to evaluate (so if either left or right subtree evaluates to None, return None). We add another recursive function solve() which takes a node and an expected value (we expect the node to end up equal to the value) then we can recursively solve: evaluate left and right. Depending on which of them returns None, we recurse down that subtree with an updated expected value. For example, if we expect left + right to be 10 and we get 5 and None back, then we recurse down the right subtree, with an expected value of 10 - left.

tree = {}
for line in open('input').readlines():
    key, value = line.strip().split(': ')
    value = value.split(' ')
    if len(value) == 1:
        value = int(value[0])
    tree[key] = value

def get(key):
    if tree[key] == None or isinstance(tree[key], int):
        return tree[key]
    
    v1, v2 = get(tree[key][0]), get(tree[key][2])

    if v1 == None or v2 == None:
        return None

    match tree[key][1]:
        case '+': return v1 + v2
        case '-': return v1 - v2
        case '*': return v1 * v2
        case '/': return v1 // v2


def solve(key, eq):
    if tree[key] == None:
        return eq

    k1, k2 = tree[key][0], tree[key][2]
    v1, v2 = get(k1), get(k2)

    if v1 == None:
        match tree[key][1]:
            case '+': return solve(k1, eq - v2)
            case '-': return solve(k1, eq + v2)
            case '*': return solve(k1, eq // v2)
            case '/': return solve(k1, eq * v2)
    if v2 == None:
        match tree[key][1]:
            case '+': return solve(k2, eq - v1)
            case '-': return solve(k2, v1 - eq)
            case '*': return solve(k2, eq // v1)
            case '/': return solve(k2, v1 // eq)

tree['humn'] = None
tree['root'][1] = '-'

print(solve('root', 0))

Day 22: Monkey Map

Problem statement is here.

Part 1

This one was fun but a bit tedious. Part 1 is very easy: we implement movement with wrap-around, stopping when we hit #.

import re

grid = [line.strip('\n').ljust(150, ' ') for line in open('input').readlines()]
dirs, grid = [m.group() for m in re.finditer(r'(\d+)|L|R', grid[-1])], grid[:-2]
dirs = [int(d) if str.isdecimal(d) else d for d in dirs]

facing = [(1, 0), (0, 1), (-1, 0), (0, -1)]
x, y, d = grid[0].index('.'), 0, 0

def move(x, y, d):
    nx = (x + d[0]) % len(grid[0])
    ny = (y + d[1]) % len(grid)
    match grid[ny][nx]:
        case ' ': 
            nx, ny = move(nx, ny, d)
            return (nx, ny) if grid[ny][nx] != ' ' else (x, y)
        case '#': return (x, y)
        case '.': return (nx, ny)

for step in dirs:
    if isinstance(step, int):
        while step > 0:
            x, y = move(x, y, facing[d])
            step -= 1
    elif step == 'L':
        d = (d - 1) % 4 
    else:
        d = (d + 1) % 4

print(1000 * (y + 1) + 4 * (x + 1) + d)

Part 2

For part 2, we need to figure out how the various facets connect into a cube and map movement from one face to another. Personally, I made a paper cutout of the input shape, folded it, and used that to figure out the transitions.

The algorithm is pretty easy if the mappings are right. While on the same facet, we simply move in the direction we are supposed to move. We can encode a facet as a pair of (region_x, region_y) coordinates where region_x, region_y = x // 50, y // 50. Of course, some pairs of coordinates are not part of any facet of the cube (e.g. (0, 0)) but that doesn't matter. Using this encoding, we can tell when a movement gets us outside the current region. When that happens, a helper function works out where we end up and what the new orientation is.
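As a quick illustration of the region encoding, here is a tiny sketch. The `region` helper is a name I made up for this example; the solution computes the same pair inline.

```python
size = 50

def region(x, y):
    """Map grid coordinates to (region_x, region_y)."""
    return x // size, y // size

print(region(50, 0))    # (1, 0)
print(region(99, 49))   # (1, 0), still the same facet
print(region(100, 49))  # (2, 0), one step right crosses into the next facet
print(region(50, -1))   # (1, -1), stepping off the top edge also changes region
```

The last line relies on Python's floor division rounding towards negative infinity, so stepping off the top or left edge of the grid still shows up as a region change.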

import re

grid = [line.strip('\n').ljust(150, ' ') for line in open('input').readlines()]
dirs, grid = [m.group() for m in re.finditer(r'(\d+)|L|R', grid[-1])], grid[:-2]
dirs = [int(d) if str.isdecimal(d) else d for d in dirs]

size = 50
facing = [(1, 0), (0, 1), (-1, 0), (0, -1)]

connections = {
    (1, 0): [(2, 0, 0), (1, 1, 1), (0, 2, 0), (0, 3, 0)],
    (2, 0): [(1, 2, 2), (1, 1, 2), (1, 0, 2), (0, 3, 3)],
    (1, 1): [(2, 0, 3), (1, 2, 1), (0, 2, 1), (1, 0, 3)],
    (0, 2): [(1, 2, 0), (0, 3, 1), (1, 0, 0), (1, 1, 0)],
    (1, 2): [(2, 0, 2), (0, 3, 2), (0, 2, 2), (1, 1, 3)],
    (0, 3): [(1, 2, 3), (2, 0, 1), (1, 0, 1), (0, 2, 3)],
}

x, y, d = grid[0].index('.'), 0, 0

def move(x, y, d):
    nx = x + facing[d][0]
    ny = y + facing[d][1]
    nd = d

    if (x // size, y // size) != (nx // size, ny // size):
        nx, ny, nd = switch_region(x, y, d)

    match grid[ny][nx]:
        case '#': return (x, y, d)
        case '.': return (nx, ny, nd)

def switch_region(x, y, d):
    nrx, nry, nd = connections[(x // size, y // size)][d]
    nx, ny = nrx * size, nry * size
    rx, ry = x % size, y % size

    if (d, nd) in [(0, 0), (1, 3), (2, 2), (3, 1)]:
        return nx + size - rx - 1, ny + ry, nd
    if (d, nd) in [(0, 2), (1, 1), (2, 0), (3, 3)]:
        return nx + rx, ny + size - ry - 1, nd
    if (d, nd) in [(0, 1), (1, 0), (2, 3), (3, 2)]:
        return nx + size - ry - 1, ny + size - rx - 1, nd
    if (d, nd) in [(0, 3), (1, 2), (2, 1), (3, 0)]:
        return nx + ry, ny + rx, nd

for step in dirs:
    if isinstance(step, int):
        while step > 0:
            x, y, d = move(x, y, d)
            step -= 1
    elif step == 'L':
        d = (d - 1) % 4 
    else:
        d = (d + 1) % 4

print(1000 * (y + 1) + 4 * (x + 1) + d)

Day 23: Unstable Diffusion

Problem statement is here.

This is a cellular automaton. In general, when implementing cellular automata, the trick is to not change things in place, but rather use a new copy for each generation. I represented the elves as a set of (x, y) coordinates. We can use set intersection to see if an elf has other elves nearby or whether two elves would end up moving to the same spot. I won't go into more detail as this was another pretty easy problem. The code is on my GitHub.
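For a flavor of the set-based approach, here is a minimal sketch of a single round. This is not the code from my GitHub, just an illustration written for this post: the names are made up, it reflects the puzzle rules as I read them, and rotating the direction order between rounds is left to the caller.

```python
from collections import Counter

# The 8 cells surrounding a position
NEIGHBOURS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

# For each direction: the three cells to check; the middle one is the move itself
NORTH = [(-1, -1), (0, -1), (1, -1)]
SOUTH = [(-1, 1), (0, 1), (1, 1)]
WEST = [(-1, -1), (-1, 0), (-1, 1)]
EAST = [(1, -1), (1, 0), (1, 1)]

def around(x, y, offsets):
    return {(x + dx, y + dy) for dx, dy in offsets}

def step(elves, directions):
    """Run one round and return a new set, leaving `elves` untouched."""
    proposals = {}
    for x, y in elves:
        if not around(x, y, NEIGHBOURS) & elves:
            continue  # nobody nearby, stay put
        for cells in directions:
            if not around(x, y, cells) & elves:
                dx, dy = cells[1]
                proposals[(x, y)] = (x + dx, y + dy)
                break
    # An elf only moves if nobody else proposed the same destination
    counts = Counter(proposals.values())
    return {proposals[e] if e in proposals and counts[proposals[e]] == 1 else e
            for e in elves}

elves = {(2, 1), (3, 1), (2, 2), (2, 4), (3, 4)}
after = step(elves, [NORTH, SOUTH, WEST, EAST])
print(sorted(after))  # [(2, 0), (2, 2), (2, 4), (3, 0), (3, 3)]
```

In this small hand-checked example, the elves at (2, 2) and (2, 4) both propose (2, 3), so neither of them moves.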

Day 24: Blizzard Basin

Problem statement is here.

Part 1

I liked this one. For both part 1 and part 2, this becomes easy to solve with a couple of interesting observations.

First, the blizzards move in a repeating pattern, so we can map which squares are occupied at a given point in time, and we know the occupancy repeats every lcm(height, width) steps, where height and width are the height and width of the valley. We can compute this many generations and store the occupancy map in a lookup.

import math

blizzards = []
lines = [line.strip() for line in open('input').readlines()]
for y, line in enumerate(lines):
    for x, c in enumerate(line):
        if c in '<^>v':
            blizzards.append((x, y, c))

maxx, maxy = len(lines[0]) - 1, len(lines) - 1
move = {'<': (-1, 0), '^': (0, -1), '>': (1, 0), 'v': (0, 1)}

def step(blizzards):
    new = []
    for b in blizzards:
        x, y = b[0] + move[b[2]][0], b[1] + move[b[2]][1]
        if x == 0: x = maxx - 1
        if x == maxx: x = 1
        if y == 0: y = maxy - 1
        if y == maxy: y = 1
        new.append((x, y, b[2]))
    return new

def occupancy(blizzards):
    return {(x, y) for x, y, c in blizzards}

steps, lcm = {}, math.lcm(maxx - 1, maxy - 1)
for i in range(lcm):
    steps[i] = {(x, y) for x, y, _ in blizzards}
    blizzards = step(blizzards)
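Because each blizzard just cycles along its row or column, its position at any step can also be computed directly with modular arithmetic. This sketch is an alternative to the precomputed lookup, not what the code above does; `position_at` and its parameters are names I made up for illustration, with `width` and `height` being the interior dimensions (`maxx - 1` and `maxy - 1` above).

```python
def position_at(x, y, c, t, width, height):
    """Position at step t of a blizzard starting at (x, y) and heading in direction c.
    Interior cells run from 1 to width horizontally and 1 to height vertically."""
    dx, dy = {'<': (-1, 0), '^': (0, -1), '>': (1, 0), 'v': (0, 1)}[c]
    return (1 + (x - 1 + dx * t) % width, 1 + (y - 1 + dy * t) % height)

print(position_at(1, 2, '>', 1, 5, 3))  # (2, 2)
print(position_at(5, 2, '>', 1, 5, 3))  # (1, 2), wrapped around the right wall
print(position_at(1, 2, '<', 1, 5, 3))  # (5, 2), wrapped around the left wall
print(position_at(3, 1, '^', 3, 5, 3))  # (3, 1), back where it started after 3 steps
```

A horizontal blizzard repeats every `width` steps and a vertical one every `height` steps, which is why the whole valley repeats every lcm of the two.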

Next, we can do a breadth-first search to find the shortest path from one side to the other. Since a possible move is waiting in place, it's pretty hard to find bounds for a depth-first search. On the other hand, at every step the elves can occupy one of at most height * width positions. Of course, most of these will be occupied by blizzards. So for a BFS, we start from the initial position and time (step 0) and use a queue. We pop the first move and enqueue all possible moves from this position (taking into account valley bounds and blizzard occupancy) for the next step. As long as we ensure not to enqueue duplicates, the queue stays small. Since this is BFS, as soon as the position we dequeue is our destination, we know this is the earliest we can get there.

def solve():
    queue = [(1, 0, 0)]
    while True:
        x, y, step = queue.pop(0)
        
        for x, y in [(x + m[0], y + m[1]) for m in move.values()] + [(x, y)]:
            if (x, y) == (maxx - 1, maxy):
                return step + 1

            if (x, y) != (1, 0):
                if x <= 0 or x >= maxx or y <= 0 or y >= maxy:
                    continue

            if (x, y) in steps[(step + 1) % lcm]:
                continue

            if (x, y, step + 1) not in queue:
                queue.append((x, y, step + 1))

print(solve())
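The `not in queue` check scans a list, which is fine at this input size. A variant that advances a whole set of reachable positions one step at a time gets deduplication for free. The sketch below is an illustration under my own assumptions rather than the solution above: `earliest_arrival` and the `is_free(x, y, t)` callback are hypothetical names, and the callback stands in for the bounds and blizzard checks.

```python
def earliest_arrival(start, goal, is_free, first_step=0):
    """Return the first step at which `goal` is reachable from `start`.
    `is_free(x, y, t)` says whether cell (x, y) can be occupied at step t.
    Loops forever if the goal is unreachable, like the list-based version."""
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1), (0, 0)]  # (0, 0) is waiting
    frontier, t = {start}, first_step
    while goal not in frontier:
        t += 1
        frontier = {(x + dx, y + dy)
                    for x, y in frontier
                    for dx, dy in moves
                    if is_free(x + dx, y + dy, t)}
    return t

# Toy example: an empty 3x3 room where cell (1, 1) is blocked on odd steps
def is_free(x, y, t):
    return 0 <= x < 3 and 0 <= y < 3 and not ((x, y) == (1, 1) and t % 2 == 1)

print(earliest_arrival((0, 0), (2, 2), is_free))  # 4
```

Since the frontier is a set, each position appears at most once per step, which matches the observation that there are at most height * width candidates at any time.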

Part 2

The extra trips are no problem since this is very fast. The only changes I had to make from part 1 to part 2 were modifying solve() to parameterize start, destination, and initial point in time, then calling it 3 times, once for each trip:

def solve(src, dest, step):
    queue = [(src[0], src[1], step)]
    while True:
        x, y, step = queue.pop(0)
        
        for x, y in [(x + m[0], y + m[1]) for m in move.values()] + [(x, y)]:
            if (x, y) == (dest[0], dest[1]):
                return step + 1

            if (x, y) != (src[0], src[1]):
                if x <= 0 or x >= maxx or y <= 0 or y >= maxy:
                    continue

            if (x, y) in steps[(step + 1) % lcm]:
                continue

            if (x, y, step + 1) not in queue:
                queue.append((x, y, step + 1))

trip1 = solve((1, 0), (maxx - 1, maxy), 0)
trip2 = solve((maxx - 1, maxy), (1, 0), trip1)
trip3 = solve((1, 0), (maxx - 1, maxy), trip2)
print(trip3)

Day 25: Full of Hot Air

Problem statement is here.

Another easy one that I won't discuss in detail, we just need to implement conversion from decimal to SNAFU and back:

def to_dec(n):
    digits = {'0': 0, '1': 1, '2': 2, '-': -1, '=': -2}
    return sum([5 ** i * digits[d] for i, d in enumerate(n[::-1])])

def to_snafu(n):
    s = ''
    while n:
        s = ['0', '1', '2', '=', '-'][n % 5] + s
        n = n // 5 + (1 if s[0] in '-=' else 0)

    return s

print(to_snafu(sum([to_dec(line.strip()) for line in open('input').readlines()])))
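To see the carry in to_snafu() at work, here is a small worked example. The two functions are repeated from above so the snippet runs on its own; the sample values were checked by hand.

```python
# Same two functions as above, repeated so this snippet is self-contained
def to_dec(n):
    digits = {'0': 0, '1': 1, '2': 2, '-': -1, '=': -2}
    return sum([5 ** i * digits[d] for i, d in enumerate(n[::-1])])

def to_snafu(n):
    s = ''
    while n:
        # Remainders 3 and 4 are written as = (-2) and - (-1), carrying 1
        s = ['0', '1', '2', '=', '-'][n % 5] + s
        n = n // 5 + (1 if s[0] in '-=' else 0)
    return s

print(to_dec('1=-0-2'))           # 1747
print(to_snafu(2022))             # 1=11-2
print(to_snafu(to_dec('2=-01')))  # 2=-01
```

For 2022, the remainders come out as 2, 4, 1, 1, 3, 1; the 4 and the 3 become - and = and each carries 1 into the next digit, giving 1=11-2. Note that to_snafu(0) returns an empty string, which doesn't matter for this puzzle.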

In Advent of Code tradition, day 25 has only 1 part.

This was another very fun set of problems and I am looking forward to Advent of Code 2023.

Computability Part 9: LISP

Thu, 01 Dec 2022 00:00:00 -0800

Computability Part 9: LISP

In the previous post, we covered lambda calculus, a computational model underpinning functional programming. In this blog post, we'll continue down the functional programming road and cover one of the oldest programming languages still in use: LISP.

LISP was originally specified in 1958 by John McCarthy and the paper describing the language was published in 1960¹. It became very popular in AI research and flavors of it are still in use today.

LISP has a quite unique syntax and execution model.

S-expression

If we are going to talk about LISP, we need to start with symbolic expressions. Symbolic expressions, or S-expressions, are defined as:

An S-expression is either

an atom, or
an expression of the form (x . y)

where x and y are S-expressions.

This very simple definition is very powerful: it allows us to represent any binary tree. Let's start with a very simple universe where the only atom is (), representing a null value. With this atom and the above definition, while we can't (easily) represent data, we can capture the shape of a binary tree. For example, the tree consisting of a root node and two leaf nodes:

can be represented as (() . ()).

The tree consisting of a root, a left leaf node, and a right node with two child leaf nodes

would be (() . (() . ())).

If we expand the definition of atom to include numbers and basic arithmetic (+, -, *, /), we can represent arithmetic expressions as S-expressions. 2 + 3

can be represented as (+ . (2 . (3 . ()))).

2 * (3 + 5)

can be represented as (* . (2 . ((+ . (3 . (5 . ()))) . ()))).

Note the S-expression definition only allows for values (atoms) at leaf nodes of the tree. An S-expression is either a leaf node containing a value or a non-leaf node with 2 S-expression children. That means we can't represent 2 + 3 as

but the representation we just saw is equivalent.

Representing data

S-expressions can be used to represent data. Consider a simple list 1, 2, 3, 4, 5. Much like we saw in the previous post when we looked at representing lists as lambda expressions, we can represent lists using S-expressions using a head and a tail (recursively):

can be viewed as

or (1 . (2 . (3 . (4 . (5 . ()))))).

We can also represent an associative array: instead of a value, we can represent a key-value pair as an S-expression ((key . value)), so we can represent the associative array { 1: 2, 2: 3, 3: 5 } as ((1 . 2) . ((2 . 3) . ((3 . 5) . ()))).

Historically, a non-atom S-expression in LISP is called a cons cell (from construction). Instead of head and tail, LISP uses car and cdr (standing for contents of the address register and contents of the decrement register, which are artifacts of the computer architecture on which the first flavors of LISP were implemented).
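To make cons cells a bit more tangible, here is a minimal Python sketch that models them as 2-tuples, with the empty tuple playing the role of (). This is only an illustration of the definition and is separate from the interpreter in this post; `NIL`, `from_list`, and `dotted` are names I made up.

```python
NIL = ()  # the () atom

def cons(car, cdr):
    return (car, cdr)

def car(cell):
    return cell[0]

def cdr(cell):
    return cell[1]

def from_list(items):
    """Build the nested (head . tail) structure from a Python list."""
    result = NIL
    for item in reversed(items):
        result = cons(item, result)
    return result

def dotted(sexpr):
    """Render an S-expression in dotted notation."""
    if sexpr == NIL:
        return '()'
    if isinstance(sexpr, tuple):
        return f'({dotted(car(sexpr))} . {dotted(cdr(sexpr))})'
    return str(sexpr)

numbers = from_list([1, 2, 3, 4, 5])
print(dotted(numbers))    # (1 . (2 . (3 . (4 . (5 . ())))))
print(car(cdr(numbers)))  # 2
```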

We just saw how we can represent trees, lists, and associative arrays using S-expressions. But S-expressions aren't limited to representing data: we can also use them to represent code.

Representing code

We looked at what 2 + 3 would look like as an S-expression. In fact, we can represent any function call as an S-expression, where the left node of the root S-expression is the function to be called and the right subtree contains the arguments.

2 + 3 is equivalent to the function add(2, 3). So we can represent the function call add(2, 3) as the S-expression (add . (2 . (3 . ()))).

Note we can have any number of arguments as we grow the right subtrees: sum(2, 3, 4, 5) can be represented as (sum . (2 . (3 . (4 . (5 . ()))))). If we want to pass the result of another function as an argument, say sum(2, sum(3, 4), 5), we can represent this as (sum . (2 . ((sum . (3 . (4 . ()))) . (5 . ())))).

We saw in the previous post that we can represent pretty much anything using functions. An if expression is a function if(condition, true-branch, false-branch). We can combine this with recursion to generate loops. So we have all the building blocks for a Turing-complete system.

It turns out we can represent both data and code as S-expressions. Before moving on to look at some implementation details, let's introduce some syntactic sugar.

Syntactic sugar

Writing S-expressions like this can become tedious, so let's introduce some syntactic sugar. Instead of (1 . (2 . (3 . (4 . (5 . ()))))), we can write (1 2 3 4 5). We omit some of the parentheses, the concatenation symbol ., and the final (). By default, we concatenate on the right subtree. If we need to go down the left subtree, we add parentheses. So instead of representing the associative array { 1: 2, 2: 3, 3: 5 } as ((1 . 2) . ((2 . 3) . ((3 . 5) . ()))), we can more succinctly represent it as ((1 2) (2 3) (3 5)), without losing any meaning.

Similarly, (add . (2 . (3 . ()))) becomes (add 2 3) and (sum . (2 . ((sum . (3 . (4 . ()))) . (5 . ())))) becomes (sum 2 (sum 3 4) 5).
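The translation from the nested form to the shorthand is mechanical. Here is a minimal sketch, assuming S-expressions are modeled as nested Python 2-tuples with the empty tuple standing in for (); the `sugar` function is a hypothetical helper for illustration and not part of the interpreter in this post.

```python
def sugar(sexpr):
    """Render nested (head, tail) tuples using the shorthand list notation.
    Assumes every chain of pairs ends in the empty atom ()."""
    if not isinstance(sexpr, tuple):
        return str(sexpr)
    items = []
    while sexpr != ():
        head, sexpr = sexpr  # walk down the right subtree
        items.append(sugar(head))
    return '(' + ' '.join(items) + ')'

# (sum . (2 . ((sum . (3 . (4 . ()))) . (5 . ()))))
call = ('sum', (2, (('sum', (3, (4, ()))), (5, ()))))

print(sugar(call))               # (sum 2 (sum 3 4) 5)
print(sugar((1, (2, (3, ())))))  # (1 2 3)
```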

In our implementation, we will represent S-expressions as lists which can contain any number of elements. This is a more succinct representation and will make our code easier to understand.

Implementation

We can now look at implementing a small LISP. We take an input string, we parse it into an S-expression, then we evaluate the S-expression and print the result.

First, the parser: we will take a string as input, split it into tokens, then parse the tokens into an S-expression.

We will transform an input string into a list of tokens by matching it with either (, ), or a string of alphanumeric characters. We'll use a regular expression for this, then extract the matched values (using match.group()) into a list:

import re

def lex(line):
    return [match.group() for match in re.finditer(r'\(|\)|\w+', line)]

We can now transform an input like '(add 1 (add 2 3))' into the list of tokens ['(', 'add', '1', '(', 'add', '2', '3', ')', ')'] by calling lex() on it.

We need to transform this list of tokens into an S-expression. First, we need a couple of helper functions. An atom can be either a number or a symbol. We'll create one from a token using an atom() function:

def atom(value):
    try:
        return int(value)
    except ValueError:
        return value

The other helper function will yield while the head of our token list is different than ), then pop the ) token. We'll use this while parsing to iterate over the tokens after a ( and until we find the matching ):

def pop_rpar(tokens):
    while tokens[0] != ')':
        yield
    tokens.pop(0)

Parsing into an S-expression is now very simple:

If we find a (, we recursively parse the following tokens until we reach the matching ).
If we find a ), we raise an exception - this is an unmatched ).
Otherwise we have an atom - we return the result of calling atom() on it.

def parse(tokens):
    match token := tokens.pop(0):
        case '(':
            return [parse(tokens) for _ in pop_rpar(tokens)]
        case ')':
            raise Exception('Unexpected )')
        case _:
            return atom(token)

That's it. If we parse the input string '(add 1 (add 2 3))' using our functions - parse(lex('(add 1 (add 2 3))')) - we will get back ['add', 1, ['add', 2, 3]].

We can now take text as input and convert it into the internal representation we discussed.

The next step is to evaluate such an S-expression and return a result. We need two pieces for this: an environment which stores built-in functions and user-defined variables, and an evaluation function which takes an S-expression and processes it using the environment.

We'll start with a simple environment with built-in support for equality, arithmetic operations and list operations:

env = {
    # Equality
    'eq': lambda arg1, arg2: arg1 == arg2,

    # Arithmetic
    'add': lambda arg1, arg2: arg1 + arg2,
    'sub': lambda arg1, arg2: arg1 - arg2,
    'mul': lambda arg1, arg2: arg1 * arg2,
    'div': lambda arg1, arg2: arg1 / arg2,

    # Lists
    'cons': lambda car, cdr: [car] + cdr,
    'car': lambda list: list[0],
    'cdr': lambda list: list[1:],
}

Our evaluation function has special-case handling for variable definitions, quotations, and if-expressions, and is otherwise pretty straightforward:

def eval(sexpr):
    # If null or number atom, return it
    if sexpr == [] or isinstance(sexpr, int):
        return sexpr

    # If string atom, look it up in environment
    if isinstance(sexpr, str):
        return env[sexpr]

    match sexpr[0]:
        case 'def':
            env[sexpr[1]] = eval(sexpr[2])
        case 'quote':
            return sexpr[1]
        case 'if':
            return eval(sexpr[2]) if eval(sexpr[1]) else eval(sexpr[3])
        case call:
            return env[call](*[eval(arg) for arg in sexpr[1:]])

Our evaluation works like this:

If we have an atom representing the empty list, we return it.
If we have an atom that is a number, we'll return its value.
If we have an atom that is a string (a symbol), we'll look it up in the environment and return what we find there.
Otherwise we don't have an atom, rather an S-expression.
- If the first symbol is def, we add a definition to the environment.
- If the first symbol is quote, we return the second symbol unevaluated.
- If the first symbol is if, we evaluate the second symbol and if it is truthy, we evaluate the third symbol, otherwise the fourth symbol.
- If the first symbol doesn't denote a definition, a quotation, or an if expression, it is a function call: we grab the function from the environment, recursively evaluate all arguments, and pass them to the function.

We're taking a bit of a shortcut here and relying on Python's notion of truthy-ness (e.g. 0 or an empty list [] is non-truthy). If needed, we can enhance our implementation with Boolean support.

We can now implement a simple read-eval-print loop (REPL):

while line := input('> '):
    try:
        print(eval(parse(lex(line))))
    except Exception as e:
        print(f'{type(e).__name__}: {e}')

We can try a few simple commands (shown below with the corresponding output):

> (def a 40)
None
> (def b 2)
None
> (add a b)
42
> (if a 1 0)
1
> (add 2 (add 3 4))
9
> (def list (cons 1 (cons 2 (cons 3 ()))))
None
> (car list)
1
> (cdr list)
[2, 3]

Custom functions

We can extend the environment with additional functions as needed. These represent the built-in functions of our LISP interpreter. One capability we are still missing is the ability to define custom functions at runtime. Let's extend our interpreter to support that.

A function can take any number of arguments, which should become defined in the environment while the function is executing but which don't exist outside the function. For example, if we define an addition function as add(x, y), we should be able to refer to the x and y arguments inside the body of the function but not outside of it. x and y only exist within the scope of the function.

We can add scoping to our interpreter by extending our eval to take an environment as an argument instead of always referencing our env. Then when we create a new scope, we create a new environment to use.

For function definition, we will use the following syntax: (deffun function_name (arguments...) (body...)). deffun denotes a function definition. The second argument is the function name. The third is a list of parameters and the fourth is the body of the function, which is going to be evaluated in an environment where its arguments are defined.

We need a function factory:

def make_function(params, body, env):
    return lambda *args: eval(body, env | dict(zip(params, args)))

This takes the parameters, body, and environment and returns a lambda which expects a list of arguments. Calling the lambda will invoke eval on the body. Note we extend the environment with a dictionary mapping parameters to arguments.
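The scoping here comes entirely from the dictionary union operator (available in Python 3.9 and later), which builds a new dictionary instead of modifying the original. A small standalone illustration, with made-up values:

```python
env = {'x': 1, 'add': lambda a, b: a + b}

# What the lambda does on each call: build a new dict with the parameters bound
local = env | dict(zip(['x', 'y'], [40, 2]))

print(local['x'], local['y'])        # 40 2, parameters shadow outer names
print(local['add'](local['x'], local['y']))  # 42, built-ins are still visible
print(env['x'])                      # 1, the outer environment is untouched
print('y' in env)                    # False
```

One consequence of copying the environment on each call is that anything defined inside a function body lands in the copy and is not visible once the call returns.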

Let's update eval to use a parameterized environment and support the new deffun function definition capability:

def eval(sexpr, env=env):
    # If number atom, return value
    if isinstance(sexpr, int):
        return sexpr

    # If string atom, look it up in environment
    if isinstance(sexpr, str):
        return env[sexpr]

    if sexpr == []:
        return []

    match sexpr[0]:
        case 'def':
            env[sexpr[1]] = eval(sexpr[2], env)
        case 'deffun':
            env[sexpr[1]] = make_function(sexpr[2], sexpr[3], env)
        case 'quote':
            return sexpr[1]
        case 'if':
            return eval(sexpr[2], env) if eval(sexpr[1], env) else eval(sexpr[3], env)
        case call:
            return env[call](*[eval(arg, env) for arg in sexpr[1:]])

Besides plumbing env through each eval call, we just added a deffun case where we use our function factory.

We can run our REPL again and try out the new capability:

> (deffun myadd (x y) (add x y))
None
> (myadd 2 3)
5

Here is a Fibonacci implementation, using deffun and recursion:

> (deffun fib (n) (if (eq n 0) 0 (if (eq n 1) 1 (add (fib (sub n 1)) (fib (sub n 2))))))
None
> (fib 8)
21

If n is 0, return 0 else if n is 1, return 1, else recursively call fib for n - 1 and n - 2 and add the results.

We won't provide a proof of Turing-completeness but it should be obvious that the capabilities we implemented so far are sufficient to emulate, for example, a cyclic tag system like we did in the previous post with lambdas.

Conclusions

The full implementation of our mini-LISP is here.

Peter Norvig wrote a much more detailed article describing a LISP implementation here.

LISP is a very interesting language as it uses the same representation for both data and code (for better or worse). Turns out binary trees (or trees if we use our syntactic sugar) are enough to represent both.

As we just saw, a core LISP runtime is fairly easy to implement and many of the more advanced features can be bootstrapped within the language itself.

Languages in the LISP family are called LISP dialects. Even though the language is many decades old, modern dialects are alive and thriving. For example, Racket and Clojure are LISP dialects.

Summary

In this post we looked at LISP:

S-expressions which describe binary trees.
Representing data as S-expressions.
Representing code as S-expressions.
A simple LISP implementation including a lexer, parser, environment, evaluation function, and a REPL.
Extending this with custom function definitions.

¹ Original paper: http://www-formal.stanford.edu/jmc/recursive.pdf.

Computability Part 8: Lambda Calculus

Fri, 14 Oct 2022 00:00:00 -0700

Computability Part 8: Lambda Calculus

In the previous posts, we dug deeper into one particular model of computation, starting with Turing Machines in part 2, to the von Neumann computer architecture in part 6, to some of the implementation practicalities of machines - physical or virtual - in part 7.

We'll switch gears and cover another computational model this time around: lambda calculus. Lambda calculus was developed by Alonzo Church around the same time Alan Turing was proposing the Turing machine as a universal model for computation. The two models turn out to be equivalent, which is the basis of the Church-Turing thesis¹ - anything a Turing machine can compute can also be computed by lambda calculus.

Formally:

Lambda calculus consists of lambda terms and reductions applied to lambda terms.

The lambda terms are built with the following rules, where $\Lambda$ is the set of all possible lambda terms:

Variables, like $x$, are lambda terms. $x \in \Lambda$.

Abstractions, $(\lambda x.M)$. This is a function definition where $M$ is a lambda term and $x$ becomes bound in the expression. For $x \in \Lambda$ and $M \in \Lambda$, $(\lambda x.M) \in \Lambda$.

Applications, $(M \space N)$. This applies the function $M$ to the argument $N$, where $M$ and $N$ are lambda terms. For $M \in \Lambda$ and $N \in \Lambda$, $(M \space N) \in \Lambda$.

If a term $y$ appears in $M$ but is not bound, then $y$ is free in $M$, e.g. for $\lambda x.y \space x$, $x$ is bound and $y$ is free. The reductions are:

$\alpha$-equivalence: bound variables in an expression can be renamed to avoid collisions: $(\lambda x.M[x]) \rightarrow (\lambda y.M[y])$.

$\beta$-reduction: bound variables in the body of an abstraction are replaced with the argument expression: $(\lambda x.t)s \rightarrow t[x := s]$.

$\eta$-reduction: if $x$ is a variable that does not appear free in the lambda term $M$, then $\lambda x.(M x) \rightarrow M$. This can also be understood in terms of function equivalence: if two functions give the same result for all arguments, then the functions are equivalent.

Let's look at a few simple examples in Python:

lambda x: x

This is the identity function expressed as a lambda abstraction. In this case, x (the lambda parameter), becomes bound in the body of the lambda.

$\alpha$-equivalence:

lambda y: y

This is the same identity function, we're just using y instead of x to name the parameter.

For function application, we can apply the identity function to any other lambda term and get back that lambda term:

(lambda x: x)(lambda y: y)

This applies the identity function lambda x: x to the argument lambda y: y, which will give us back lambda y: y.

Church encoding

Based on the above definition, lambda calculus consists exclusively of lambda terms - while (lambda x: x)(10) is valid Python code, applying an identity lambda to the number 10, lambda calculus does not have a number 10. Enter Church encoding: Alonzo Church came up with a way to encode logic values and numbers as lambda terms.

Logic

Let's start with Boolean logic: TRUE is defined as $T := (\lambda x.\lambda y.x)$, FALSE is defined as $F := (\lambda x.\lambda y.y)$.

TRUE = lambda x: lambda y: x
FALSE = lambda x: lambda y: y

Note with this definition, if we apply a first argument to TRUE, and a second argument to the returned lambda, we always get back the first argument. For FALSE, we always get back the second argument.

We can define IF as $IF := (\lambda x.x)$. This is the same as the identity function.

IF = lambda x: x

This works since we defined TRUE to always return the first argument and FALSE to always return the second argument. So when we call IF(c)(x)(y), if c is TRUE, we get back x (the if-branch), otherwise we get back y (the else-branch).

We can try this out (though again this is outside of lambda calculus, we are introducing numbers for clarity):

IF(TRUE)(1)(2)  # This returns 1
IF(FALSE)(1)(2) # This returns 2

Now that we can express if-then-else, we can easily express other logic operators. Negation is $\lambda x.(x \space F \space T)$.

NOT = lambda x: x(FALSE)(TRUE)

If x is TRUE, we get back the first argument, FALSE; if x is FALSE, we get back the second argument, TRUE.

x AND y can be expressed as if x then y else FALSE, or: $\lambda x.\lambda y.(x \space y \space F)$. x OR y can be expressed as if x then TRUE else y, or $\lambda x.\lambda y.(x \space T \space y)$.

AND = lambda x: lambda y: x(y)(FALSE)
OR = lambda x: lambda y: x(TRUE)(y)

Here are a few examples:

print(AND(TRUE)(TRUE) == TRUE)  # prints True
print(AND(TRUE)(FALSE) == TRUE) # prints False
print(OR(TRUE)(FALSE) == TRUE)  # prints True
print(NOT(FALSE) == TRUE)       # prints True

Using only lambda terms, we were able to implement Boolean logic! But Church encoding goes further - we can also represent natural numbers and arithmetic as lambda terms.

Arithmetic

Alonzo Church encoded numbers as applications of a function $f$ to a term $x$.

0 means applying $f$ 0 times to the term: $0 := \lambda f.\lambda x.x$.
1 means applying $f$ once to the term: $1 := \lambda f.\lambda x.f x$.
2 means applying $f$ twice: $2 := \lambda f.\lambda x.f (f x)$.

In general, the number n is represented by n applications of f: $n := \lambda f.\lambda x.f (f (\ldots (f x) \ldots))$ or $n := \lambda f.\lambda x. f^n(x)$.

In Python:

ZERO = lambda f: lambda x: x
ONE = lambda f: lambda x: f(x)
TWO = lambda f: lambda x: f(f(x))
...

Note ZERO is the same as FALSE. With this definition of numbers, we can define the successor function SUCC as a function that takes a number n (represented with our Church encoding), the function f, the term x, and applies f one more time. $SUCC := \lambda n.\lambda f.\lambda x.f (n f x)$.

SUCC = lambda n: lambda f: lambda x: f(n(f)(x))

We can define addition as $PLUS := \lambda m.\lambda n.m \space SUCC \space n$. Since we define a number as repeatedly applying a function, we express m + n as applying m times the successor function SUCC to n.

PLUS = lambda m: lambda n: m(SUCC)(n)

We can similarly define multiplication using the PLUS function: m * n means applying "add n" m times, starting from ZERO:

MUL = lambda m: lambda n: m(PLUS(n))(ZERO)

We'll stop here with arithmetic, but this should hopefully give you a sense of the expressive power of lambda calculus.

Combinators

Some well-known lambda terms are called combinators:

$I$ is the identity combinator $I := \lambda x.x$.
$K$ is the constant combinator $K := \lambda x.\lambda y.x$. When applied to an argument $x$, it returns a constant function $K_x$ which returns $x$ when applied to any argument.
$S$ is the substitution combinator $S := \lambda x.\lambda y.\lambda z.x z (y z)$. $S$ takes 3 arguments, $x$, $y$, and $z$: it applies $x$ to $z$, then applies the result to the value obtained by applying $y$ to $z$.

In Python:

I = lambda x: x
K = lambda x: lambda y: x
S = lambda x: lambda y: lambda z: x(z)(y(z))

It turns out these 3 combinators can together express any lambda term. The SKI combinator calculus is one of the simplest programming languages, since it can express anything expressible in lambda calculus, which we know is Turing-complete.
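As a small illustration of combinators expressing other terms, $I$ itself can be recovered from the other two: $S K K x$ reduces to $K x (K x)$, which is $x$. A quick check in Python (the name SKK is ours):

```python
I = lambda x: x
K = lambda x: lambda y: x
S = lambda x: lambda y: lambda z: x(z)(y(z))

# S K K x reduces to K x (K x), which is x
SKK = S(K)(K)

print(SKK(42) == I(42))  # prints True
```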

The Y combinator

Another interesting combinator is the $Y$ combinator. In lambda calculus, there is no way for a function to reference itself: within the body of a lambda like lambda x: ... we can refer to the bound term x, but we can't reference the lambda itself. The implication is that we can't define, using this syntax, self-referential functions. We can only pass functions as arguments. How can we then implement recursion? With the $Y$ combinator, of course.

Let's take an example: we can recursively define factorial as:

def fact(n):
    return 1 if n == 0 else n * fact(n - 1)

This works, but note we reference fact() within its body. In lambda calculus we can't do that.

The $Y$ combinator is defined as $Y := \lambda f.(\lambda x.f (x x))(\lambda x.f (x x))$.

Y = lambda f: (lambda x: f(x(x)))(lambda x: f(lambda z: x(x)(z)))

Note the Python implementation is slightly different than the mathematical definition. This has to do with the way in which Python evaluates arguments. We won't go into the details here, but consider this a Python implementation detail irrelevant to the lambda calculus discussion².
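For a quick feel of the issue without the full explanation, here is a sketch of what happens if we translate the mathematical definition literally. Python evaluates arguments eagerly, so x(x) keeps expanding before f is ever called. Y_NAIVE and step are hypothetical names used only for this demonstration:

```python
# Direct translation of the mathematical definition
Y_NAIVE = lambda f: (lambda x: f(x(x)))(lambda x: f(x(x)))

# Hypothetical step function, only here so we have something to pass in
step = lambda f: lambda n: 1 if n == 0 else n * f(n - 1)

try:
    Y_NAIVE(step)
    outcome = 'returned'
except RecursionError:
    # x(x) is evaluated before f gets called, so the expansion never stops
    outcome = 'RecursionError'

print(outcome)  # prints RecursionError
```

Wrapping the inner x(x) in lambda z: x(x)(z) delays that evaluation until the recursive call is actually made.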

Here is a lambda version of factorial:

FACT = lambda f: lambda n: 1 if n == 0 else n * f(n - 1)

With this definition, we pass the function to call as an argument (f). We can fully express this in lambda calculus (using Church numerals, arithmetic and logic), but we'll keep the example simple. We can then use the $Y$ combinator like this:

print(Y(FACT)(5))  # prints 120

This should give you an intuitive understanding of how the $Y$ combinator works: we pass it our function and argument, and it enables the recursion mechanism.

We can similarly implement Fibonacci as:

FIB = lambda f: lambda n: 1 if n <= 2 else f(n - 1) + f(n - 2)

print(Y(FIB)(10))  # prints 55

The powerful $Y$ combinator can be used to define recursive functions in programming languages that don't natively support recursion.

Lists

Let's also look at how we can express lists in lambda calculus. Let's start with pairs. We can define a pair as $PAIR := \lambda x.\lambda y.\lambda f. f x y$. We can extract the first element of a pair with $FIRST := \lambda p. p \space T$ and the second one with $SECOND := \lambda p.p \space F$.

PAIR = lambda x: lambda y: lambda f: f(x)(y)
FIRST = lambda p: p(TRUE)
SECOND = lambda p: p(FALSE)

print(FIRST(PAIR(10)(20)))  # prints 10
print(SECOND(PAIR(10)(20))) # prints 20

We can define a NULL value as $NULL := \lambda x.T$ and a test for NULL as $ISNULL := \lambda p.p (\lambda x.\lambda y.F)$.

NULL = lambda x: TRUE
ISNULL = lambda p: p(lambda x: lambda y: FALSE)

We can now define a linked list as either NULL (an empty list) or a pair consisting of a head element and a tail list.

We can get the head of the list using FIRST and the tail using SECOND. Given list $L$, we can prepend an element $x$ by forming the pair $(x, L)$.

HEAD = FIRST
TAIL = SECOND
PREPEND = lambda x: lambda xs: PAIR(x)(xs)

We can build a list by prepending elements to NULL, and traverse it using HEAD and TAIL:

# Build the list [10, 20, 30]
L = PREPEND(10)(PREPEND(20)(PREPEND(30)(NULL)))

print(HEAD(TAIL(L))) # prints 20
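To inspect a whole list at once, we can walk it with ISNULL, HEAD, and TAIL and collect the elements into a regular Python list. The helper to_python_list is our own addition for debugging; it steps outside pure lambda terms by asking the Church Boolean to pick between Python's True and False:

```python
TRUE = lambda x: lambda y: x
FALSE = lambda x: lambda y: y
PAIR = lambda x: lambda y: lambda f: f(x)(y)
HEAD = lambda p: p(TRUE)
TAIL = lambda p: p(FALSE)
NULL = lambda x: TRUE
ISNULL = lambda p: p(lambda x: lambda y: FALSE)
PREPEND = lambda x: lambda xs: PAIR(x)(xs)

def to_python_list(xs):
    # Hypothetical helper: convert a lambda-encoded list to a Python list
    items = []
    # ISNULL returns a Church Boolean; applying it to (True)(False)
    # selects the matching Python bool
    while not ISNULL(xs)(True)(False):
        items.append(HEAD(xs))
        xs = TAIL(xs)
    return items

numbers = PREPEND(10)(PREPEND(20)(PREPEND(30)(NULL)))

print(to_python_list(numbers))  # prints [10, 20, 30]
```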

Appending is more interesting: if our list is represented as a pair of head and tail, we need to traverse the list until we reach the end. This sounds a lot like a recursive function: appending x to xs entails returning the pair PAIR(x, NULL) if xs is NULL, else the pair PAIR(HEAD(xs), APPEND(TAIL(xs), x)). Fortunately, we just looked at the $Y$ combinator which allows us to express this.

Here is a simplified, readable implementation, using Python tuples:

_append = lambda f: lambda xs: lambda x: \
    (x, None) if not xs else (xs[0], f(xs[1])(x))

append = Y(_append)

print(append(append(append(None)(10))(20))(30))

# This will print (10, (20, (30, None)))

We can express the same using the lambdas we defined above (NULL, ISNULL, PAIR, HEAD, TAIL):

_APPEND = lambda f: lambda xs: lambda x: \
    ISNULL(xs) (lambda _: PAIR(x)(NULL)) (lambda _: PAIR(HEAD(xs))(f(TAIL(xs))(x))) (TRUE)

APPEND = Y(_APPEND)

L = APPEND(APPEND(APPEND(NULL)(10))(20))(30)

print(HEAD(L))       # prints 10
print(HEAD(TAIL(L))) # prints 20

We covered logic, arithmetic, combinators, pairs, and lists, all expressed as lambda terms. Let's also sketch a proof of Turing completeness, like we did in previous posts.

A sketch of Turing completeness

We're calling this a sketch, as lambda notation is not easy to read. We will instead look at an implementation using more Python syntax than just lambdas, but we will only use constructs which we know can be expressed in lambda calculus.

As usual, we will emulate another system which we know to be Turing-complete. In part 3 we looked at tag systems. We talked about cyclic tag systems, which can emulate m-tag systems, which are Turing-complete. As a reminder, a cyclic tag system is implemented as a set of binary strings (strings containing only 0s and 1s) which are production rules, and we process a binary input string by popping the head of the string and, if it is equal to 1, appending the current production rule to the string. We cycle through the production rules at each step. This is the code we used in the previous post:

def cyclic_tag_system(productions, string):
    # Keeps track of current production
    i = 0

    # Repeat until the string is empty
    while string:
        string = string[1:] + (productions[i] if string[0] == '1' else '')

        # Update current production
        i = i + 1
        if i == len(productions):
            i = 0

        yield string

We used the productions 11, 01, and 00 and the input 1:

productions = ['11', '01', '00']

string = '1'

print(string)
for string in cyclic_tag_system(productions, string):
    print(string)

Let's sketch an alternative implementation using the constructs we covered in this post.

First, we can describe our production rules as lists of Boolean values. We know how to represent Boolean values (TRUE and FALSE), and how to build a list using PAIR. Our productions can be represented as:

p1 = (True, (True, None))   # PAIR(TRUE)(PAIR(TRUE)(NULL))
p2 = (False, (True, None))  # PAIR(FALSE)(PAIR(TRUE)(NULL))
p3 = (False, (False, None)) # PAIR(FALSE)(PAIR(FALSE)(NULL))

productions = (p1, (p2, (p3, None)))

We can cycle through the list by processing the head, then appending it to the tail of the list. Here are simpler implementations of our list processing functions over Python tuples (though we know how to do these using only lambda terms):

def head(p):
    return p[0]

def tail(p):
    return p[1]

def append(xs, x):
    return (x, None) if not xs else (head(xs), append(tail(xs), x))

# If we want to cycle through our productions, we can do:
# productions = append(tail(productions), head(productions))

We'll also need a function to concatenate two lists. We can easily build this on top of append():

def concat(xs, ys):
    return xs if not ys else concat(append(xs, head(ys)), tail(ys))

While we still have ys, we append the head of ys to xs, then recurse with the tail of ys.
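A quick check of concat() on two of our productions, repeating the helper definitions so the snippet runs on its own:

```python
def head(p):
    return p[0]

def tail(p):
    return p[1]

def append(xs, x):
    return (x, None) if not xs else (head(xs), append(tail(xs), x))

def concat(xs, ys):
    return xs if not ys else concat(append(xs, head(ys)), tail(ys))

p1 = (True, (True, None))   # 11
p2 = (False, (True, None))  # 01

print(concat(p1, p2))    # prints (True, (True, (False, (True, None))))
print(concat(None, p2))  # prints (False, (True, None))
```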

We process our input string as follows: if it is empty, we are done. If not, if the head is 1, we concatenate our current production to the end of the string, and recurse, cycling productions:

def cyclic_tag_system(productions, input):
    return None if not input else \
        cyclic_tag_system(
            # Cycle productions
            append(tail(productions), head(productions)),
            # If head is True, concatenate head production. Pop head input either way.
            concat(tail(input), head(productions)) if head(input) else tail(input))

Let's throw in a print() and run this on the same input as our original example:

def cyclic_tag_system(productions, input):
    print(input)
    return None if not input else \
        cyclic_tag_system(
            # Cycle productions
            append(tail(productions), head(productions)),
            # If head is True, concatenate head production. Pop head input either way.
            concat(tail(input), head(productions)) if head(input) else tail(input))

# The input is equivalent to the string '1'
cyclic_tag_system(productions, (True, None))

This should produce output very similar to our original cyclic_tag_system(), but using lists of Booleans instead of strings of 0s and 1s.
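To compare the two versions directly, we can render each Boolean list back into a string of 0s and 1s. This is only a sketch: to_bits and trace are hypothetical helpers, and trace performs the same recursion as cyclic_tag_system() but collects the states instead of printing them:

```python
def head(p):
    return p[0]

def tail(p):
    return p[1]

def append(xs, x):
    return (x, None) if not xs else (head(xs), append(tail(xs), x))

def concat(xs, ys):
    return xs if not ys else concat(append(xs, head(ys)), tail(ys))

def to_bits(xs):
    # Render a nested-tuple list of Booleans as a string of 0s and 1s
    return '' if not xs else ('1' if head(xs) else '0') + to_bits(tail(xs))

def trace(productions, state):
    # Same recursion as cyclic_tag_system(), collecting states as strings
    if not state:
        return ['']
    next_state = concat(tail(state), head(productions)) \
        if head(state) else tail(state)
    return [to_bits(state)] + trace(
        append(tail(productions), head(productions)), next_state)

p1 = (True, (True, None))
p2 = (False, (True, None))
p3 = (False, (False, None))
productions = (p1, (p2, (p3, None)))

states = trace(productions, (True, None))
print(states[:4])  # prints ['1', '11', '101', '0100']
```

The first few states match what the string-based version prints for the productions 11, 01, 00 and the input 1.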

We emulated a cyclic tag system in lambda calculus - well, we didn't write all the code as lambda terms, but everything is expressed as one-liner functions that use only if-then-else expressions, lists (pair, head, tail), and recursion (for which we have the $Y$ combinator).

Lambda calculus has been extremely influential in computer science - it is the root of functional programming. LISP, one of the earliest programming languages, is heavily influenced by lambda calculus. Many ideas, like anonymous functions, also known as lambdas, are now broadly available in most modern programming languages (Python even uses the keyword lambda for these, as we saw in this post).

Summary

In this post we covered lambda calculus:

Lambda terms, including variables, abstractions, and applications.
Reductions: $\alpha$-equivalence, $\beta$-reduction, and $\eta$-reduction.
Church encoding for Boolean logic and arithmetic using lambda terms.
Combinators: the $S$, $K$, and $I$ combinators which are sufficient to encode all lambda terms, and the $Y$ combinator which enables recursion.
Pairs and lists (defined using pairs), including an append operation.
Emulating a cyclic tag system in lambda calculus.

¹ See this Wikipedia article.
² This blog post goes into the details if you are curious.

Computability Part 7: Machine Implementation Practicalities

Fri, 02 Sep 2022 00:00:00 -0700

Computability Part 7: Machine Implementation Practicalities

In the previous post we covered the von Neumann architecture and even built a small VM implementing the different components. Such a naïve implementation does make for a very inefficient machine though. In this post, we'll dive a bit deeper into machine architectures (virtual and physical) and discuss some of the implementation details. We'll talk about processing: register and stack-based; we'll talk about memory: word size, byte and word addressing; finally, we'll talk about I/O: port and memory mapped. Note these are all machines that conform to the von Neumann architecture, with the same high-level components. We're just double clicking to the next level of implementation details.

Register machines

The VM we implemented in our previous post simply operated directly over the memory. This works for a toy example, but moving data from memory to the CPU and back is costly. That's why modern CPUs employ multiple layers of caching (we won't cover these in this post), and rely on a set of registers to perform operations.

Registers can store a number of bits (the word size, more on it below) and operations are performed using registers. For example, to add two numbers, the machine would load one number into register R0, the second number into register R1, add the values stored in registers R0 and R1, then finally save the result back to memory:

mov r0 @ # Move the value from memory address 1 to r0
mov r1 @ # Move the value from memory address 2 to r1
add r0 r1 # Add the values storing the result in r0
mov @ r0 # Move the value from r0 to memory address 3

Some registers are used for general computation. These are called general-purpose registers. Other registers have specialized purposes. For example, the program counter, which keeps track of the instruction to be executed, is usually implemented as an IP (instruction pointer) or PC (program counter) register.

The original 8088 Intel processor had 14 registers. Modern Intel processors have significantly more registers¹, though many of them are special-purpose. ARM processors have 17 registers², 13 of which are general purpose.

Let's emulate a simple CPU with 4 general purpose registers and a program counter register to get the feel of it. We will only implement mov (move) and add instructions for this example. Our implementation will check the 16th bit of an argument to determine whether it refers to a register (if 0) or to a memory location (if 1).

class CPU:
    def __init__(self, memory):
        self.memory = memory
        self.registers = [0, 0, 0, 0, 0] # r0, r1, r2, r3, pc

    def run(self):
        while self.registers[4] < len(self.memory):
            instr, arg1, arg2 = self.memory[
                self.registers[4]:self.registers[4] + 3]
            self.process(instr, arg1, arg2)
            self.registers[4] += 3

    def get_at(self, arg):
        # 16th bit tells us whether this refers to a register or memory
        if arg & (1 << 15): # Memory address
            return self.memory[arg ^ (1 << 15)]
        else: # Register
            return self.registers[arg]

    def set_at(self, arg, value):
        # 16th bit tells us whether this refers to a register or memory
        if arg & (1 << 15): # Memory address
            self.memory[arg ^ (1 << 15)] = value
        else: # Register
            self.registers[arg] = value

    def process(self, instr, arg1, arg2):
        match instr:
            case 0: # mov
                self.set_at(arg1, self.get_at(arg2))
            case 1: # add
                self.set_at(arg1, self.get_at(arg1) + self.get_at(arg2))

Here is how it would run a small program that adds two numbers and stores the result:

program = [
    0, 0, 15 | (1 << 15), # mov r0 @15
    0, 1, 16 | (1 << 15), # mov r1 @16
    1, 0, 1,              # add r0 r1
    0, 17 | (1 << 15), 0, # mov @17 r0
    0, 4, 18 | (1 << 15), # mov pc @18 - this ends execution
    40,                   # this is @15
    2,                    # this is @16
    0,                    # this is @17
    10000                 # this is @18
]

## Load program into memory
memory = [0] * 10000
memory = program + memory[len(program):]

print(memory[17]) # Should print 0

CPU(memory).run()

print(memory[17]) # Should print 42

We're doing a bunch of stuff by hand, like loading the program into memory and not using an assembler to implement the program. That's because we're only focusing on the register-based processing. You can update the assembler in the previous post to target this VM as an exercise.
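As a small step in that direction, here is a sketch of helpers that hide the operand encoding. The names MEM_FLAG, PC, mem(), and describe() are our own; only the convention itself (16th bit set means memory address, register index 4 is the program counter) comes from the VM above:

```python
MEM_FLAG = 1 << 15  # 16th bit marks a memory operand
PC = 4              # index of the program counter in the register list
MOV, ADD = 0, 1     # opcodes used by the CPU above

def mem(address):
    # Encode a memory operand
    return address | MEM_FLAG

def describe(arg):
    # Decode an operand back into assembly-like notation
    if arg & MEM_FLAG:
        return f'@{arg ^ MEM_FLAG}'
    return 'pc' if arg == PC else f'r{arg}'

program = [
    MOV, 0, mem(15),   # mov r0 @15
    MOV, 1, mem(16),   # mov r1 @16
    ADD, 0, 1,         # add r0 r1
    MOV, mem(17), 0,   # mov @17 r0
    MOV, PC, mem(18),  # mov pc @18
    40, 2, 0, 10000,   # data at @15..@18
]

print(program[2] == (15 | (1 << 15)))               # prints True
print(describe(program[2]), describe(program[13]))  # prints @15 pc
```

The resulting list is identical to the hand-encoded program, so it can be loaded into memory the same way.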

Stack machines

An alternative to registers is to use a stack for storage. While hardware stack machines are not unheard of, register machines easily outperform them so most CPUs you interact with are register-based. That said, stack machines are a popular choice for virtual machines - they are easier to implement and port to different systems and the stack keeps the data being processed close together which helps with performance when running the VM on a physical machine. A few examples: JVM (the Java virtual machine), the CLR (the .NET virtual machine), CPython's VM (the VM for the reference Python implementation) are all stack-based.

The example we used above of adding two numbers would look like this on a stack machine: push the first number onto the stack, push the second number onto the stack, add the numbers (which would pop the two numbers from the stack and replace them with their sum), then pop the value from the stack and store it in memory.

push @ # Push a value from memory address 1
push @ # Push a value from memory address 2
add # Add the top two values
pop @ # Pop the top of the stack and store at memory address 3

Another advantage of stack machines is that, in general, instructions tend to be shorter. As you can see above, for most instructions that move data around, we don't need to specify both a source and a destination since the stack is implied.

Let's emulate a simple stack VM with only push, add, and pop instructions, plus a jmp (jump) instruction so we can use the same mechanism to terminate:

class CPU:
    def __init__(self, memory):
        self.memory = memory
        self.stack, self.pc = [], 0

    def run(self):
        while self.pc < len(self.memory):
            instr, arg = self.memory[self.pc:self.pc + 2]
            self.process(instr, arg)
            self.pc += 2

    def process(self, instr, arg):
        match instr:
            case 0: # push
                self.stack.append(self.memory[arg])
            case 1: # pop
                self.memory[arg] = self.stack.pop()
            case 2: # jmp
                self.pc = self.stack.pop()
            case 3: # add
                self.stack.append(self.stack.pop() + self.stack.pop())

Here is how it would run a small program that adds two numbers and stores the result:

program = [
    0, 12, # push @12
    0, 13, # push @13
    3, 0,  # add
    1, 14, # pop @14
    0, 15, # push @15
    2, 0,  # jmp
    40,    # this is @12
    2,     # this is @13
    0,     # this is @14
    10000, # this is @15
]

## Load program into memory
memory = [0] * 10000
memory = program + memory[len(program):]

print(memory[14]) # Should print 0

CPU(memory).run()

print(memory[14]) # Should print 42

Contrast this implementation with the register-based one: the stack VM needs only 1 argument for the instructions we implemented, and the program is slightly shorter.

So far we focused on how data is processed. Let's also look at the different ways of referencing data.

Word size

We've been using Python for our toy implementations. Python supports arbitrarily large integers, so a list of numbers in Python (the way we implemented our memory) doesn't imply much in terms of bits and bytes. Bits and bytes do become important for physical machines and serious VMs implemented in languages closer to the metal.

First, let's talk about word size. A word is the fixed-size unit of computation for a CPU. Its size is measured in bits. For example, a 16-bit processor has a word size of 16 bits.

Applied to registers, this would mean that a machine register can hold at most 16 bits (a value between 0 and 65535). Operations within the value range are blazingly fast, as they run natively. If we need to process larger values, we need to do extra work to chunk the values into words and process these in turn. For example we can split a 32-bit value into two 16-bit values, process them separately, then concatenate the result. This obviously impacts performance. The point being that we are not necessarily limited to the word size, but processing larger values becomes much costlier.
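Here is a sketch of that extra work for addition: we add two 32-bit values using only 16-bit pieces, carrying from the low word into the high word. The function name and constants are our own, and Python integers stand in for machine words:

```python
WORD_BITS = 16
WORD_MASK = (1 << WORD_BITS) - 1

def add32_with_16bit_words(a, b):
    # Split each 32-bit value into a low and a high 16-bit word
    a_lo, a_hi = a & WORD_MASK, (a >> WORD_BITS) & WORD_MASK
    b_lo, b_hi = b & WORD_MASK, (b >> WORD_BITS) & WORD_MASK

    # Add the low words; anything above 16 bits is the carry
    lo = a_lo + b_lo
    carry = lo >> WORD_BITS
    lo &= WORD_MASK

    # Add the high words plus the carry, wrapping around at 16 bits
    hi = (a_hi + b_hi + carry) & WORD_MASK

    # Concatenate the two words into the 32-bit result
    return (hi << WORD_BITS) | lo

print(add32_with_16bit_words(65535, 1))       # prints 65536
print(add32_with_16bit_words(70000, 80000))   # prints 150000
```

One native addition turned into two additions plus carry handling, which is where the extra cost comes from.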

Applied to memory addresses, the word size determines how pointers are represented and what range of addresses they can reach. For example, if the word size is 16 bits, then a pointer can point to any one of 65536 distinct memory locations.

An architecture can use the same word size for both registers and pointers, or different word sizes for different concerns. Commonly, a single word size is used (and, potentially, fractions or multiples of it for special concerns), which is why it's common to refer to a processor as a 32-bit processor, a 64-bit processor, etc.

Byte and word addressing

An implication of word size applied to memory addressing is how the machine accesses memory. Some architectures allow byte addressing, which means a pointer points to a specific byte in memory, while others support only word addressing, which means a pointer points to a word in memory.

This is another important decision when designing a computer. If we want to be able to address individual bytes, a 16-bit pointer can refer to any of 65536 bytes. That is 64 KB. If our memory is larger than that, a pointer won't be able to address higher locations.

On the other hand, if we make our memory word-addressable, for our 16-bit example, a pointer can refer to any of 65536 16-bit words. 16 bits are 2 bytes, so our memory's upper limit is 131072 bytes (65536 x 2), which is 128 KB. We can now refer to higher memory addresses, but we can't address individual bytes as before - address 0 is no longer the byte at 0, it is the whole 2-byte word (address 1 refers to the next 2 bytes, and so on).

This difference becomes even more dramatic for higher word sizes. A 32-bit pointer can address 4294967296 bytes (up to 4 GB of memory). Alternatively, with word addressing (4-byte words), the same pointer can cover 16 GB.
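The arithmetic behind these limits is simple enough to check: the number of distinct pointer values times the size of each addressable unit. The function name below is ours:

```python
def addressable_bytes(pointer_bits, bytes_per_unit):
    # Distinct pointer values times the size of each addressable unit
    return (2 ** pointer_bits) * bytes_per_unit

print(addressable_bytes(16, 1))  # byte addressing: prints 65536
print(addressable_bytes(16, 2))  # 2-byte words:    prints 131072
print(addressable_bytes(32, 1))  # byte addressing: prints 4294967296
print(addressable_bytes(32, 4))  # 4-byte words:    prints 17179869184
```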

On the flip side, word addressing is less efficient when the unit of processing is smaller. Let's take text editing as an example. Say we want to update a one-byte character, such as the UTF-8 encoded character a. If we can refer to it directly, we can load, process, and update its memory location using a pointer. If, on the other hand, this character is part of a larger word, we would have to load the whole word, extract the character we care about (masking the bits we don't need to process), apply the update to the whole word, and write this word back to memory.
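A sketch of that read-modify-write sequence on a word-addressable memory with 16-bit words. We assume, purely for illustration, that the lower-numbered byte lives in the low 8 bits of the word; write_byte is a hypothetical helper:

```python
WORD_MASK = 0xFFFF

def write_byte(memory, byte_address, value):
    # Find the word holding the byte, and the byte's position within it
    word_index, offset = divmod(byte_address, 2)
    shift = 8 * offset  # assumption: byte 0 is the low 8 bits of the word

    word = memory[word_index]               # load the whole word
    word &= ~(0xFF << shift) & WORD_MASK    # mask out the byte we replace
    word |= (value & 0xFF) << shift         # merge in the new byte
    memory[word_index] = word               # write the whole word back

# One word holding 'a' (0x61) in the low byte and 'b' (0x62) in the high byte
memory = [0x6261]
write_byte(memory, 0, ord('A'))

print(hex(memory[0]))  # prints 0x6241
```

With byte addressing, the same update would be a single store to the byte's address.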

So depending on the scenario, byte or word addressing might make things faster or slower. Byte addressing is great for text processing - document authoring, HTML, writing code etc. Word addressing unlocks larger memory sizes and is great for crunching numbers - math, graphics etc.

Another important design decision is how to handle I/O.

Port-mapped I/O

One way to connect I/O to the system is through specific CPU instructions. For example, the CPU might have an inp instruction used to consume input and an out instruction used to send output. Programs can use these instructions to perform I/O. This is called port-mapped I/O, as I/O is achieved by connecting devices to the CPU via dedicated ports.

For example, let's extend our stack machine with an out instruction (also connecting an output to it):

class CPU:
    def __init__(self, memory, out):
        self.memory, self.out = memory, out
        self.stack, self.pc = [], 0

    def run(self):
        while self.pc < len(self.memory):
            instr, arg = self.memory[self.pc:self.pc + 2]
            self.process(instr, arg)
            self.pc += 2

    def process(self, instr, arg):
        match instr:
            case 0: # push
                self.stack.append(self.memory[arg])
            case 1: # pop
                self.memory[arg] = self.stack.pop()
            case 2: # jmp
                self.pc = self.stack.pop()
            case 3: # add
                self.stack.append(self.stack.pop() + self.stack.pop())
            case 4: # out
                self.out(self.stack.pop())

Here is a program that prints Hello:

program = [
    0, 24, # push @24
    0, 25, # push @25
    0, 26, # push @26
    0, 27, # push @27
    0, 28, # push @28
    4, 0,  # out
    4, 0,  # out
    4, 0,  # out
    4, 0,  # out
    4, 0,  # out
    0, 29, # push @29
    2, 0,  # jmp
    111,   # this is @24
    108,   # this is @25
    108,   # this is @26
    101,   # this is @27
    72,    # this is @28
    10000, # this is @29
]

## Load program into memory
memory = [0] * 10000
memory = program + memory[len(program):]

def out(val):
    print(chr(val), end='')

CPU(memory, out).run()

Memory-mapped I/O

An alternative to port-mapped I/O is memory-mapped I/O. In this case, a certain address range of memory is used for I/O operations. That is, from the CPU's perspective, memory and I/O are addressed identically. But depending on the address range, data might reside in memory or it might actually come from/go to an I/O device.

Let's enhance our memory implementation (which so far was just an array) to support mapped I/O. In this case, any value written to address 1000 will instead be printed on screen:

class MappedMemory:
    def __init__(self, program):
        # MappedMemory wraps a list
        self.memory = [0] * 10000
        self.memory = program + self.memory[len(program):]

    def __len__(self):
        # Use underlying list's __len__
        return self.memory.__len__()

    def __getitem__(self, key):
        # Index in wrapped list
        return self.memory[key]

    def __setitem__(self, key, value):
        # If key is 1000, print
        if key == 1000:
            print(chr(value), end='')
        # Otherwise set in underlying list
        else:
            self.memory[key] = value

And here is the corresponding program that prints Hello (using the stack CPU without the out instruction and connected output):

program = [
    0, 24, # push @24
    0, 25, # push @25
    0, 26, # push @26
    0, 27, # push @27
    0, 28, # push @28
    1, 1000, # pop @1000
    1, 1000, # pop @1000
    1, 1000, # pop @1000
    1, 1000, # pop @1000
    1, 1000, # pop @1000
    0, 29,   # push @29
    2, 0,    # jmp
    111,     # this is @24
    108,     # this is @25
    108,     # this is @26
    101,     # this is @27
    72,      # this is @28
    10000,   # this is @29
]

## Load program into memory
memory = MappedMemory(program)

CPU(memory).run()

Note in this program we repeatedly set the value at address 1000 which is mapped to our output device (print()).
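The same idea generalizes to several devices. As a sketch, the hard-coded check for address 1000 can be replaced with a table mapping addresses to handlers. DeviceMappedMemory and its devices parameter are hypothetical names, not part of the implementation above:

```python
class DeviceMappedMemory:
    def __init__(self, program, devices=None, size=10000):
        self.memory = program + [0] * (size - len(program))
        # Maps an address to a function called with each value written there
        self.devices = devices or {}

    def __len__(self):
        return len(self.memory)

    def __getitem__(self, key):
        return self.memory[key]

    def __setitem__(self, key, value):
        handler = self.devices.get(key) if isinstance(key, int) else None
        if handler:
            # Mapped address: the value goes to the device
            handler(value)
        else:
            # Regular address: the value goes to the underlying list
            self.memory[key] = value

written = []
memory = DeviceMappedMemory([0, 0], devices={1000: written.append})

memory[1000] = 72  # goes to the device
memory[5] = 9      # goes to regular memory

print(written, memory[5], memory[1000])  # prints [72] 9 0
```

From the CPU's point of view nothing changes: it still just reads and writes addresses.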

Summary

In this post we discussed some of the implementation details of machines and virtual machines:

Register machines, which are high-performance designs for physical machines.
Stack machines, which are simple, portable, and great alternatives for VMs.
Word size, as the unit of data used by the machine for different purposes.
Word-addressable memory, which is great for computation intensive scenarios.
Byte-addressable memory, which is best for text-processing scenarios.
Port-mapped I/O, where special CPU instructions are used for input and output.
Memory-mapped I/O, where reserved address ranges of memory are used for I/O and the CPU can access I/O just like it does memory.

Bonus

A few years back I implemented a toy VM with 7 registers, 16 op codes, 128 KB of memory, and port-mapped I/O in 121 lines of C++. It comes with an assembler, examples, and, of course, a Brainfuck interpreter. Linking it here for reference: Pixie.

¹ See this SO question.
² See the ARM documentation.

Computability Part 6: Von Neumann Architecture

Sun, 31 Jul 2022 00:00:00 -0700

Computability Part 6: Von Neumann Architecture

During the previous posts, we covered Turing machines, tag systems, and cellular automata. All of these are equivalent in terms of what they can compute, but some are more practical than others. In this post, we'll look at the von Neumann architecture of physical computers and implement an extremely inefficient machine, write a few programs targeting it, then prove it is Turing complete.

John von Neumann was a famous mathematician and physicist. Contemporary with Alan Turing, he was aware of Turing's work on Turing machines and computability. At the same time, von Neumann was involved in the Manhattan Project which required lots of computation provided by some early computers. Thus he got involved in computer design. Unlike a Turing machine, a physical computer can't have an infinite tape and while data is processed based on input and states, this needs to be more ergonomic than Yurii Rogozhin's 4-state 6-symbol machine we described in Part 2.

Von Neumann described a computer architecture as consisting of the following components¹:

A central arithmetic component (CA) handling calculation.

A central control component (CC) driving which calculations should be performed.

Memory (M) for storage.

Input (I) and output (O) components to get data into the system and to communicate results outside of the system, from/to a recording medium (R).

Here is a diagram of this architecture:

Before von Neumann, computers were single-purpose devices - the programming was hardwired. One of the major innovations, which might not be apparent, is the introduction of a central control component and the ability of the memory to store not only data but also the program itself. This makes devices based on this architecture able to be reprogrammed to perform different tasks.

We can now load an arbitrary program into memory. The program will use the instructions which our central arithmetic understands to perform computations. The central control can read this program and have the central arithmetic perform the required operations. During execution, data is also read from/written to memory.

Programs (and data) are loaded into memory through the input component and results are sent through the output component.

While over the following decades this architecture got tweaked and tuned, it's pretty obvious it is the ancestor of all modern computers: computers still have CPUs, which include control and arithmetic, and memory.

Let's create a virtual machine based on this architecture.

Implementation

We will create a very simple machine based on this architecture in Python. In subsequent posts, we will look at other designs, but we're starting with a direct translation of this architecture.

Input

The interface to our input component is a function that, when called, returns an integer. This is all our machine needs to get data.

We will implement this over a text file. Our input component will buffer this file into a list and expose a read_one() function that will return one integer (as returned by ord()) for each character from the buffer.

def inp(file):
    buffer = list(open(file).read())
    return lambda: ord(buffer.pop(0))

Output

The interface to our output component is a function that takes an integer as an argument. This is all our machine needs to output one memory cell.

We will implement this using print() and actually convert the given integer to a character. This is just to provide a convenient way for us to look at output like Hello world!.

def out(value):
    print(chr(value), end='')

Memory

Our memory will consist of a list of 10000 integers. We will zero-initialize the list, then load a program from a file to memory, starting at address 0. We expect the program to consist of a series of integers separated by a space or a newline character. We'll use this encoding to make it easier for us to peek at the code targeting our von Neumann machine.

def memory(file):
    memory = [0] * 10000
    for i, value in enumerate(' '.join(open(file).readlines()).split()):
        memory[i] = int(value)
    # Hand the loaded memory back to the caller
    return memory

10000 is chosen arbitrarily; at this point we're not worrying about word size, page alignment etc. We simply have room to store 10000 integers in our memory, which will include both code and data.

CPU

We'll package the control and arithmetic components into a CPU class. We'll initialize this class with memory, input, and output components.

class CPU:
    def __init__(self, memory, inp, out):
        self.memory, self.inp, self.out = memory, inp, out

Central control

Our control unit will maintain a program counter (PC), an index into the memory pointing to the next instruction to execute. The machine runs by reading 3 integers from memory (at PC, PC + 1 and PC + 2), and passing these to the arithmetic unit for processing. The program counter is then incremented by 3. This repeats until PC goes outside the bounds of the memory, at which point the machine halts (alternately we could have provided some HALT instruction).

def run(self):
    self.pc = 0
    while self.pc < len(self.memory):
        instr, m1, m2 = self.memory[self.pc:self.pc + 3]
        self.process(instr, m1, m2)
        self.pc += 3

We will implement process() next.

Central arithmetic

Our arithmetic unit will process triples of (instruction, m1, m2). It will support 8 instructions:

AT will set the value at memory address 1 to be the value at the memory address specified by the value at memory address 2 (in short, m[m1] = m[m[m2]]).
SET will set the value at the memory address specified by the value at memory address 1 to be the value at memory address 2 (in short, m[m[m1]] = m[m2]).
ADD will update the value at memory address 1 by adding the value at memory address 2 to it (in short, m[m1] += m[m2]).
NOT will update the value at memory address 1 to be 0 if the value at memory address 2 is different than 0, or 1 if the value at memory address 2 is 0 (in short, m[m1] = !m[m2]).
EQ will compare the values at memory address 1 and memory address 2 and update the value at memory address 1 to be 1 if they are equal, 0 otherwise (in short, m[m1] = m[m1] == m[m2]).
JZ will perform a conditional jump - if the value at memory address 1 is 0, it will update the program counter to point to memory address 2 (in short, if !m[m1] then PC = m2).
INP will read one integer from the input and store it at memory address 1 + an offset value specified at memory address 2 (in short, m[m1 + m[m2]] = inp()).
OUT will write the value at memory address 1 + an offset value specified at memory address 2 to the output (in short, out(m[m1 + m[m2]])).

Since the instructions are also read from memory, which is a list of integers, we will encode them as integers: AT = 0, SET = 1, ... OUT = 7.

def process(self, instr, m1, m2):
    match instr:
        case 0: # AT
            self.memory[m1] = self.memory[self.memory[m2]]
        case 1: # SET
            self.memory[self.memory[m1]] = self.memory[m2]
        case 2: # ADD
            self.memory[m1] += self.memory[m2]
        case 3: # NOT
            self.memory[m1] = +(not self.memory[m2])
        case 4: # EQ
            self.memory[m1] = +(self.memory[m1] == self.memory[m2])
        case 5: # JZ
            if not self.memory[m1]:
                # Set PC to m2 - 3 since run() will increment PC by 3
                self.pc = m2 - 3
        case 6: # INP
            self.memory[m1 + self.memory[m2]] = self.inp()
        case 7: # OUT
            self.out(self.memory[m1 + self.memory[m2]])
        case _: 
            raise Exception("Unknown instruction")
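To make encoded programs easier to inspect, we can map the integers back to instruction names. This is a minimal sketch and not part of the VM: the disassemble() helper and the MNEMONICS list are assumptions introduced here for illustration. It treats every triple as an instruction, so it should only be given the code part of a program, not its data cells.

```python
# Mnemonics in op code order: AT = 0, SET = 1, ... OUT = 7
MNEMONICS = ['AT', 'SET', 'ADD', 'NOT', 'EQ', 'JZ', 'INP', 'OUT']

def disassemble(cells):
    # Hypothetical helper: render consecutive triples as readable text
    lines = []
    for pc in range(0, len(cells) - 2, 3):
        instr, m1, m2 = cells[pc:pc + 3]
        lines.append(f'{pc}: {MNEMONICS[instr]} {m1} {m2}')
    return lines

listing = disassemble([7, 21, 9999, 5, 9999, 10000])
print('\n'.join(listing))
```

This prints one line per instruction, prefixed with the address at which the instruction starts.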

Von Neumann VM

Putting it all together, we'll take two input arguments: the first one (argv[1]) will represent the code input file containing the program, the second one (argv[2]) will be the file containing additional input to be consumed by the inp() function:

import sys

vn = CPU(memory(sys.argv[1]), inp(sys.argv[2]), out)
vn.run()

Here is our von Neumann virtual machine in one listing:

def inp(file):
    buffer = list(open(file).read())
    return lambda: ord(buffer.pop(0))

def out(value):
    print(chr(value), end='')

def memory(file):
    memory = [0] * 10000
    for i, value in enumerate(' '.join(open(file).readlines()).split()):
        memory[i] = int(value)
    return memory

class CPU:
    def __init__(self, memory, inp, out):
        self.memory, self.inp, self.out = memory, inp, out

    def run(self):
        self.pc = 0
        while self.pc < len(self.memory):
            instr, m1, m2 = self.memory[self.pc:self.pc + 3]
            self.process(instr, m1, m2)
            self.pc += 3

    def process(self, instr, m1, m2):
        match instr:
            case 0: # AT
                self.memory[m1] = self.memory[self.memory[m2]]
            case 1: # SET
                self.memory[self.memory[m1]] = self.memory[m2]
            case 2: # ADD
                self.memory[m1] += self.memory[m2]
            case 3: # NOT
                self.memory[m1] = +(not self.memory[m2])
            case 4: # EQ
                self.memory[m1] = +(self.memory[m1] == self.memory[m2])
            case 5: # JZ
                if not self.memory[m1]:
                    # Set PC to m2 - 3 since run() will increment PC by 3
                    self.pc = m2 - 3
            case 6: # INP
                self.memory[m1 + self.memory[m2]] = self.inp()
            case 7: # OUT
                self.out(self.memory[m1 + self.memory[m2]])
            case _: 
                raise Exception("Unknown instruction")

import sys

vn = CPU(memory(sys.argv[1]), inp(sys.argv[2]), out)
vn.run()

We can save this as vn.py.

Let's create a Hello world! program targeting this machine. We will use the OUT instruction to output each character of Hello and a new line (\n). We'll first tell the VM to output the values at memory address 21 to 26:

7 21 9999
7 22 9999
7 23 9999
7 24 9999
7 25 9999
7 26 9999

We are referencing addresses 21 to 26 plus the offset 0 (the value at memory 9999, since our memory is initialized with zeros).

We want to halt after this, so we need to jump our program counter to 10000. We will do this by using our JZ instruction, saying if the memory value at index 9999 is 0, jump to 10000:

5 9999 10000

Now we get to memory address 21, so we will set the values of memory 21 to 26 to the values of the characters in Hello (as returned by ord()) plus a 10 for \n:

72 101 108 108 111 10

Here is the full listing which we can save as hello.vn:

7 21 9999
7 22 9999
7 23 9999
7 24 9999
7 25 9999
7 26 9999
5 9999 10000
72 101 108 108 111 10
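The data address 21 follows from the layout: six OUT instructions and one JZ instruction take 7 * 3 = 21 memory cells, so the data starts right after them. As a rough sketch of that arithmetic, here is a hypothetical generator (hello_program() is not part of the post's code) that produces a program of the same shape for any short string:

```python
def hello_program(text):
    # Hypothetical generator for programs shaped like hello.vn
    n = len(text)
    data_start = 3 * n + 3  # n OUT instructions plus one JZ, 3 cells each
    cells = []
    for i in range(n):
        # OUT the cell at data_start + i, with offset m[9999] (which is 0)
        cells += [7, data_start + i, 9999]
    # JZ: m[9999] is 0, so jump to 10000, which halts the machine
    cells += [5, 9999, 10000]
    # Data section: one cell per character
    cells += [ord(c) for c in text]
    return cells

program = hello_program('Hello\n')
print(' '.join(map(str, program)))
```

For 'Hello\n' this reproduces the listing above on a single line. The sketch assumes the program is short enough that address 9999 is never overwritten.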

We can then use our VM to run the program like this:

touch input
python3 vn.py hello.vn input

We're also creating a blank input file since Hello world! isn't going to read anything via inp().

Running this should print Hello. Our program is pretty hard to write or read, since we're programming with integers. Let's make our life a bit easier.

Assembler

We will implement an assembler for our VM. An assembly language is a low-level language closely matching the architecture it targets (in our case, our very simple von Neumann machine).

Our assembler will take 2 arguments - an input file and an output file - and automatically translate the input (assembly language) into instructions for our VM.

We will add the following features:

Comments - Lines starting with # will be ignored.
Instructions - We will express our instructions as at, set, add, not, eq, jz, inp, out to represent the instructions 0, 1, ... 7.
Labels - We will tag a location in the code by a string ending in :, for example HERE:. We will then be able to refer to the location using the identifier preceded by :, like :HERE. We will also allow adding an offset to a reference: :HERE+2 is 2 past the HERE label.
ORD macro - To make implementing Hello world! easier, we will provide the ORD() macro which will return the integer representation of the character passed to it, for example ORD(H) will return 72.

Using this assembly language, we can rewrite Hello world! as:

## Print 6 characters starting from DATA
out :DATA 9999
out :DATA+1 9999
out :DATA+2 9999
out :DATA+3 9999
out :DATA+4 9999
out :DATA+5 9999

## End program
jz 9999 10000

## Data section
DATA: ORD(H) ORD(e) ORD(l) ORD(l) ORD(o) 10

First, we'll read the input file and convert it into a list of tokens. We will ignore lines starting with # (so we can add comments to our assembly file).

import sys

if len(sys.argv) != 3:
    print("Usage: asm.py <input> <output>")
    exit()

## Read all lines into a list
lines = open(sys.argv[1]).readlines()
## Filter out blank lines and lines starting with '#'
lines = list(filter(lambda line: line and line[0] != '#', lines))
## Join all lines and split into tokens
tokens = ' '.join(lines).split()

The labels themselves aren't part of the program, but rather mark locations in the program, so in the next step we will pluck these out from the list of tokens but retain the index they are referencing:

## pluck labels and remember position
labels, i = {}, 0
while i < len(tokens):
    # If not a label, advance
    if tokens[i][-1] != ':':
        i += 1
        continue

    # Store location and pluck label
    labels[tokens[i][:-1]] = i
    tokens.pop(i)
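To see what this step produces, here is the same loop wrapped in a function and applied to a tiny token list. The pluck_labels() wrapper and the sample input are made up for illustration:

```python
def pluck_labels(tokens):
    # Same logic as the loop above, wrapped in a function for illustration
    tokens, labels, i = list(tokens), {}, 0
    while i < len(tokens):
        # If not a label, advance
        if tokens[i][-1] != ':':
            i += 1
            continue
        # Store location and pluck label
        labels[tokens[i][:-1]] = i
        tokens.pop(i)
    return tokens, labels

source = 'jz 9999 :END DATA: 72 10 END:'
tokens, labels = pluck_labels(source.split())
print(tokens)  # ['jz', '9999', ':END', '72', '10']
print(labels)  # {'DATA': 3, 'END': 5}
```

Because each label is removed as soon as it is recorded, the stored index is the position of the token that followed the label. A label at the very end of the file (like END here) points one past the last token.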

Now we will process all tokens and handle the following cases:

If token starts with :, it is a label reference, so replace it with the actual location (as stored during the previous step).
If the token is an op code, replace it with the integer value of the op code.
If the token is an ORD() macro, replace the character passed to ORD() with its value.

## Op code list (constant)
OP_CODES = ['at', 'set', 'add', 'not', 'eq', 'jz', 'inp', 'out']

for i, token in enumerate(tokens):
    # replace label references with actual position
    if token[0] == ':':
        if '+' in token:
            base, offset = token.split('+')
            tokens[i] = labels[base[1:]] + int(offset)
        else:
            tokens[i] = labels[token[1:]]

    # replace op codes with values
    if token in OP_CODES:
        tokens[i] = OP_CODES.index(token)

    # replace ORD macro
    if token[:4] == 'ORD(':
        tokens[i] = ord(token[4:-1])
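The three cases can also be read as a single function from token to value. This is only a sketch of the same rules: translate() is a hypothetical helper, and the labels dictionary below is a made-up example rather than the output of a real assembly pass.

```python
OP_CODES = ['at', 'set', 'add', 'not', 'eq', 'jz', 'inp', 'out']

def translate(token, labels):
    # Hypothetical helper: one token in, one integer (or unchanged token) out
    if token[0] == ':':
        # Label reference, with an optional +offset
        base, _, offset = token[1:].partition('+')
        return labels[base] + int(offset or 0)
    if token in OP_CODES:
        return OP_CODES.index(token)
    if token.startswith('ORD(') and token.endswith(')'):
        return ord(token[4:-1])
    # Anything else (plain numbers) passes through untouched
    return token

labels = {'DATA': 21}
result = [translate(t, labels) for t in 'out :DATA+1 9999 ORD(H)'.split()]
print(result)  # [7, 22, '9999', 72]
```

Plain numbers stay as strings, which is fine because every token is converted with str() when the output file is written.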

Finally, we write all tokens to the output file:

open(sys.argv[2], "w").write(
    ' '.join([str(token) for token in tokens]))

Here is the full source code of our assembler (asm.py):

import sys

if len(sys.argv) != 3:
    print("Usage: asm.py <input> <output>")
    exit()

## Read all lines into a list
lines = open(sys.argv[1]).readlines()
## Filter out blank lines and lines starting with '#'
lines = list(filter(lambda line: line and line[0] != '#', lines))
## Join all lines and split into tokens
tokens = ' '.join(lines).split()

## pluck labels and remember position
labels, i = {}, 0
while i < len(tokens):
    # If not a label, advance
    if tokens[i][-1] != ':':
        i += 1
        continue

    # Store location and pluck label
    labels[tokens[i][:-1]] = i
    tokens.pop(i)

## Op code list (constant)
OP_CODES = ['at', 'set', 'add', 'not', 'eq', 'jz', 'inp', 'out']

for i, token in enumerate(tokens):
    # replace label references with actual position
    if token[0] == ':':
        if '+' in token:
            base, offset = token.split('+')
            tokens[i] = labels[base[1:]] + int(offset)
        else:
            tokens[i] = labels[token[1:]]

    # replace op codes with values
    if token in OP_CODES:
        tokens[i] = OP_CODES.index(token)

    # replace ORD macro
    if token[:4] == 'ORD(':
        tokens[i] = ord(token[4:-1])

open(sys.argv[2], "w").write(
    ' '.join([str(token) for token in tokens]))

We can now save our assembly Hello world! (listed above) to a file, let's call it hello.asm and use the assembler to convert it to a program our VM can execute:

python3 asm.py hello.asm hello.vn

The resulting hello.vn should have the same content as our hand-crafted Hello world!, minus the newlines (the assembler doesn't output newlines). The content of the assembled file hello.vn is:

7 21 9999 7 22 9999 7 23 9999 7 24 9999 7 25 9999 7 26 9999 5 9999 10000 72 101 108 108 111 10

We can run this using:

python3 vn.py hello.vn input

We are again using an empty input file since we don't need input. As a convention, we use the .asm extension for assembly files and .vn for assembled files targeting the VM.

Variables and loops

Let's rewrite our program: instead of outputting :DATA, then :DATA+1, then :DATA+2... we should be able to output :DATA + :I where :I goes from 0 to 5.

We can easily create a variable by labeling a memory location that holds its initial value:

I: 0

Then we can use :I to refer to it. We will use a COUNTER variable to count down from 6 to 0, and an offset variable I:

## Variables
I: 0
COUNTER: 6

We also need a few constant values: 0, 1 (by which we increment I during each iteration), and -1 (to decrement COUNTER during each iteration). And, of course, our DATA, where we store the Hello string:

## Constants
CONST: 0 1 -1

## Data
DATA: ORD(H) ORD(e) ORD(l) ORD(l) ORD(o) 10

Now let's look at how we can implement a loop using JZ:

## Beginning of loop
LOOP: 
## Output the character at DATA + I
out :DATA :I
## Decrement COUNTER, increment I
add :COUNTER :CONST+2
add :I :CONST+1
## If COUNTER is 0, we're done
jz :COUNTER 10000
## If not, jump to the start of the loop
jz :CONST :LOOP

At each iteration, our loop will output the character value at DATA plus the offset specified in I (initially 0). Then we add -1 to our COUNTER and add 1 to I. Since our VM uses memory addresses for all operations, we stored 1 and -1 in memory at CONST+1 and CONST+2 respectively.

If COUNTER is 0, we're done, so we jump to 10000. If not, we repeat the loop (jump to LOOP if CONST is 0, but CONST is always 0).
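For comparison, here is roughly the same control flow written in Python, with each line annotated with the assembly instruction it corresponds to. This is only an illustration of the loop's logic (hello_loop() is a made-up name), not something the assembler produces:

```python
DATA = [ord(c) for c in 'Hello\n']

def hello_loop():
    # Python equivalent of the assembly loop, for comparison only
    i, counter, output = 0, 6, []
    while True:
        output.append(chr(DATA[i]))  # out :DATA :I
        counter += -1                # add :COUNTER :CONST+2
        i += 1                       # add :I :CONST+1
        if counter == 0:             # jz :COUNTER 10000
            break
        # jz :CONST :LOOP (CONST is always 0, so this always jumps)
    return ''.join(output)

print(hello_loop(), end='')
```

Since our VM only has a conditional jump, the unconditional jump back to LOOP is expressed as a jump on a value known to be 0.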

Here is the full listing of this program:

## Beginning of loop
LOOP: 
## Output the character at DATA + I
out :DATA :I
## Decrement COUNTER, increment I
add :COUNTER :CONST+2
add :I :CONST+1
## If COUNTER is 0, we're done
jz :COUNTER 10000
## If not, jump to the start of the loop
jz :CONST :LOOP

## Constants
CONST: 0 1 -1

## Data
DATA: ORD(H) ORD(e) ORD(l) ORD(l) ORD(o) 10

## Variables
I: 0
COUNTER: 6

We can save this as hello2.asm, then assemble and run it:

python3 asm.py hello2.asm hello2.vn
python3 vn.py hello2.vn input

Notes

A few notes: data is mixed with code in all our programs, which follows from the von Neumann architecture, in which the memory of the system stores both code and data. This is true of most general-purpose computers today, and enables some interesting behavior like self-modifying code. This could be intentional, or we could accidentally, due to a bug, interpret data as code or code as data. Modern systems employ various additional protections to prevent this type of accidental usage.

Because our particular VM starts execution from memory location 0, we have to place our constants and variables (data) after the instructions in the program. Executable files on modern systems similarly contain code and data segments, albeit with more complex layout and rules.

Turing-completeness

Let's prove our simple von Neumann VM is Turing-complete, meaning capable of universal computation. As we saw throughout this series of blog posts, the best way to prove this is to emulate another known Turing-complete system.

We will prove this by implementing a Brainfuck interpreter. We covered Brainfuck during the second post in the series, under Esoteric Turing machines. To recap: Brainfuck (BF) uses a byte array (tape), a data pointer (index in the array), and 8 symbols: >, <, +, -, ., ,, [, ]. The symbols are interpreted as:

>: Increment the data pointer (move head right).
<: Decrement the data pointer (move head left).
+: Increment array value at data pointer.
-: Decrement array value at data pointer.
.: Output value at data pointer.
,: Read 1 byte of input and store at data pointer.
[: If the byte at data pointer is 0, jump right to the matching ], else continue with the next instruction.
]: If the byte at data pointer is not 0, jump left to the matching [, else continue with the next instruction.
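As a reference point for these semantics, here is a compact Brainfuck interpreter in Python. It is a sketch under a few assumptions that the description above leaves open: a 30000-cell tape, cells that wrap around at 256, and a , that stores 0 once the input runs out. It also precomputes the matching brackets, which is a different technique from the scanning approach used by the assembly implementation below.

```python
def run_bf(program, input_text=''):
    # Precompute matching bracket positions
    jumps, stack = {}, []
    for pos, symbol in enumerate(program):
        if symbol == '[':
            stack.append(pos)
        elif symbol == ']':
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start

    tape, ptr, pc, output = [0] * 30000, 0, 0, []
    pending = list(input_text)
    while pc < len(program):
        symbol = program[pc]
        if symbol == '>':
            ptr += 1
        elif symbol == '<':
            ptr -= 1
        elif symbol == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif symbol == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif symbol == '.':
            output.append(chr(tape[ptr]))
        elif symbol == ',':
            tape[ptr] = ord(pending.pop(0)) if pending else 0
        elif symbol == '[' and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif symbol == ']' and tape[ptr] != 0:
            pc = jumps[pc]  # go back to the start of the loop
        pc += 1
    return ''.join(output)

print(run_bf('++++++++[>++++++++<-]>+.'))  # 8 * 8 + 1 = 65, prints A
```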

We will use our assembly language to implement a program which reads a BF program from input, then executes it. Effectively, we'll use our very simple virtual machine to emulate another very simple virtual machine!

I won't cover the details of the implementation, since it is quite cumbersome due to the simplicity of our VM and assembly language. I will just provide a short summary of what is going on:

We'll start by reading the BF program from input, until we encounter a newline (\n).
We will use a CODE_PTR code pointer variable to point to the current BF instruction and a DATA_PTR data pointer variable to point to the BF array.
We'll overlay the BF array over the VM memory, starting at address 5000 (middle of our memory).
We will then handle each possible input (>, <, etc.).
Most of the instructions are easy to implement; the most complex are [ and ], which require keeping track of unbalanced brackets so we properly jump from [ to the matching ] and vice versa.
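The bracket handling is easier to follow in a high-level language first. The sketch below shows the counting idea in Python: start with a counter of 1, adjust it for every bracket encountered while scanning, and stop when it reaches 0. The function names are made up for illustration, and both functions assume the brackets in the program are balanced.

```python
def find_matching_close(program, open_pos):
    # program[open_pos] is '['; scan right, counting unbalanced '['
    depth, pos = 1, open_pos + 1
    while depth > 0:
        if program[pos] == '[':
            depth += 1
        elif program[pos] == ']':
            depth -= 1
        pos += 1
    return pos - 1

def find_matching_open(program, close_pos):
    # program[close_pos] is ']'; scan left, counting unbalanced ']'
    depth, pos = 1, close_pos - 1
    while depth > 0:
        if program[pos] == ']':
            depth += 1
        elif program[pos] == '[':
            depth -= 1
        pos -= 1
    return pos + 1

code = '+[>[-]<-]'
print(find_matching_close(code, 1), find_matching_open(code, 8))  # 8 1
```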

Here is the full Brainfuck interpreter implemented in our assembly language:

## Read Brainfuck program until a \n is encountered
START:
## Read one integer at PROG + offset I
inp :PROG :I
## Increment I by 1
add :I :CONST+1
## Zero out DONE_READING (!1)
not :DONE_READING :CONST+1
## DONE_READING = 10
add :DONE_READING :CONST+3
## Load the last integer we read in TEMP
at :TEMP :END
## Increment END to keep track of program end
add :END :CONST+1
## Check if the last integer we read was 10 (\n)
eq :DONE_READING :TEMP
## If it wasn't a newline, jump to start and read another value
jz :DONE_READING :START 

## Start running program
BF_RUN:
at :TEMP :CODE_PTR
add :CODE_PTR :CONST+1

## Check if we're on a > instruction
not :TEMP2 :CONST+1
add :TEMP2 :BF
eq :TEMP2 :TEMP
not :TEMP2 :TEMP2
jz :TEMP2 :RIGHT

## Check if we're on a < instruction
not :TEMP2 :CONST+1
add :TEMP2 :BF+1
eq :TEMP2 :TEMP
not :TEMP2 :TEMP2
jz :TEMP2 :LEFT

## Check if we're on a + instruction
not :TEMP2 :CONST+1
add :TEMP2 :BF+2
eq :TEMP2 :TEMP
not :TEMP2 :TEMP2
jz :TEMP2 :INC

## Check if we're on a - instruction
not :TEMP2 :CONST+1
add :TEMP2 :BF+3
eq :TEMP2 :TEMP
not :TEMP2 :TEMP2
jz :TEMP2 :DEC

## Check if we're on a . instruction
not :TEMP2 :CONST+1
add :TEMP2 :BF+4
eq :TEMP2 :TEMP
not :TEMP2 :TEMP2
jz :TEMP2 :OUT

## Check if we're on a , instruction
not :TEMP2 :CONST+1
add :TEMP2 :BF+5
eq :TEMP2 :TEMP
not :TEMP2 :TEMP2
jz :TEMP2 :IN

## Check if we're on a [ instruction
not :TEMP2 :CONST+1
add :TEMP2 :BF+6
eq :TEMP2 :TEMP
not :TEMP2 :TEMP2
jz :TEMP2 :FORWARD

## Check if we're on a ] instruction
not :TEMP2 :CONST+1
add :TEMP2 :BF+7
eq :TEMP2 :TEMP
not :TEMP2 :TEMP2
jz :TEMP2 :BACKWARD

## No matching BF instruction so we're done
jz :CONST 10000

RIGHT:
## > - increment data pointer
add :DATA_PTR :CONST+1
jz :CONST :BF_RUN

LEFT:
## < - decrement data pointer
add :DATA_PTR :CONST+2
jz :CONST :BF_RUN

INC:
## + - increment cell
at :TEMP :DATA_PTR
add :TEMP :CONST+1
set :DATA_PTR :TEMP
jz :CONST :BF_RUN

DEC:
## - - decrement cell
at :TEMP :DATA_PTR
add :TEMP :CONST+2
set :DATA_PTR :TEMP
jz :CONST :BF_RUN

OUT:
## . - output cell
at :TEMP :DATA_PTR
out :TEMP :CONST
jz :CONST :BF_RUN

IN:
## , - store input in cell
inp :TEMP :CONST
set :DATA_PTR :TEMP
jz :CONST :BF_RUN

FORWARD:
## [
at :TEMP :DATA_PTR    
not :TEMP :TEMP
## If value in cell is not 0, continue
jz :TEMP :BF_RUN
## Find matching ]
## Set TEMP to 1, counting unbalanced [
not :TEMP :TEMP
add :TEMP :CONST+1
SCAN_FORWARD:
at :TEMP2 :CODE_PTR
eq :TEMP2 :BF+6
not :TEMP2 :TEMP2
## Jump if found a [
jz :TEMP2 :FORWARD_LPAR
at :TEMP2 :CODE_PTR
eq :TEMP2 :BF+7
not :TEMP2 :TEMP2
## Jump if found a ]
jz :TEMP2 :FORWARD_RPAR
## Keep scanning
add :CODE_PTR :CONST+1
jz :CONST :SCAN_FORWARD
## Increment counter when finding a [
FORWARD_LPAR:
add :TEMP :CONST+1
add :CODE_PTR :CONST+1
jz :CONST :SCAN_FORWARD
## Decrement counter when finding a ]
FORWARD_RPAR:
add :TEMP :CONST+2
## If counter is 0, we're done
jz :TEMP :BF_RUN
## Else keep scanning
add :CODE_PTR :CONST+1
jz :CONST :SCAN_FORWARD

BACKWARD:
## ]
at :TEMP :DATA_PTR    
## If value in cell is 0, continue
jz :TEMP :BF_RUN
## Find matching [
## Set TEMP to 1, counting unbalanced ]
not :TEMP :TEMP
add :TEMP :CONST+1
## Move code pointer back 2
add :CODE_PTR :CONST+2
add :CODE_PTR :CONST+2
SCAN_BACKWARD:
at :TEMP2 :CODE_PTR
eq :TEMP2 :BF+6
not :TEMP2 :TEMP2
## Jump if found a [
jz :TEMP2 :BACKWARD_LPAR
at :TEMP2 :CODE_PTR
eq :TEMP2 :BF+7
not :TEMP2 :TEMP2
## Jump if found a ]
jz :TEMP2 :BACKWARD_RPAR
## Keep scanning
add :CODE_PTR :CONST+2
jz :CONST :SCAN_BACKWARD
## Decrement counter when finding a [
BACKWARD_LPAR:
add :TEMP :CONST+2
## If counter is 0, we're done
jz :TEMP :BF_RUN
## Else keep scanning
add :CODE_PTR :CONST+2
jz :CONST :SCAN_BACKWARD
## Increment counter when finding a ]
BACKWARD_RPAR:
add :TEMP :CONST+1
add :CODE_PTR :CONST+2
jz :CONST :SCAN_BACKWARD

CONST: 0 1 -1 10 
BF: ORD(>) ORD(<) ORD(+) ORD(-) ORD(.) ORD(,) ORD([) ORD(])
I: 0
TEMP: 0
TEMP2: 0
END: :PROG
DONE_READING: 0
CODE_PTR: :PROG
DATA_PTR: 5000

## We'll load the BF program here
PROG:

We can save this program as bf.asm. We will also create a Brainfuck program to run - Hello world:

++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.

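We will save this as hello.bf, making sure the file ends with a newline, since our interpreter reads the program until it encounters \n. Now we can assemble our BF interpreter and run it using our VM: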

python3 asm.py bf.asm bf.vn
python3 vn.py bf.vn hello.bf

This should output Hello world!.

Since Brainfuck is Turing-complete and our VM can emulate a Brainfuck interpreter, our VM is also Turing-complete.

Summary

We talked about the von Neumann architecture and looked at a simple VM built using this architecture.
We created an assembler targeting this VM, to make it easier to write code that runs on the VM.
We looked at a couple of versions of Hello world, and saw how we can use variables and loops.
Finally, we implemented a Brainfuck interpreter that runs on the VM, proving our von Neumann machine is Turing-complete.

For convenience, the code we covered in this post is online here:

vn.py - virtual machine.
asm.py - assembler.
hello.asm - simple Hello world.
hello2.asm - Hello world using a loop.
bf.asm - Brainfuck interpreter.
hello.bf - Hello world in Brainfuck.

First Draft of a Report on the EDVAC. ↩

Computability Part 5: Elementary Cellular Automata

Wed, 06 Jul 2022 00:00:00 -0700

Computability Part 5: Elementary Cellular Automata

In the previous post we talked about Conway's Game of Life as a well-known cellular automaton. In this post we will cover even simpler automata - the elementary cellular automata. Stephen Wolfram covers them extensively in his book, A New Kind of Science.

To recap, we defined a cellular automaton as a discrete n-dimensional lattice of cells, a set of states (for each cell), a notion of neighborhood for each cell, and a transition function mapping the neighborhood of each cell to a new cell state.

An elementary cellular automaton is 1-dimensional - an array of cells. A cell can be either on or off (just like in Conway's Game of Life). The neighborhood of a cell, meaning the cells that we take into account when we determine the state of the cell in the next generation, consists of the cell itself and its left and right neighbors.

For example, we can define an elementary cellular automaton with the following rules:

[ on,  on,  on] -> off
[ on,  on, off] -> off
[ on, off,  on] -> off
[ on, off, off] ->  on
[off,  on,  on] -> off
[off,  on, off] ->  on
[off, off,  on] ->  on
[off, off, off] -> off

If we start with a single on cell and produce 10 generations, we get (using # to mean on):

         #
        ###
       #   #
      ### ###
     #       #
    ###     ###
   #   #   #   #
  ### ### ### ###
 #               #
###             ###

Rule encoding

The elementary cellular automata can easily be enumerated exhaustively: the neighborhood of a cell can be in only one of 8 states, as we saw above: [on, on, on], [on, on, off], ... [off, off, off]. The transition function maps each of these possible states to either on or off. If we think of the on/off as a bit, we need 8 bits to represent the transition function.

[ on,  on,  on] -> off
[ on,  on, off] -> off
[ on, off,  on] -> off
[ on, off, off] ->  on
[off,  on,  on] -> off
[off,  on, off] ->  on
[off, off,  on] ->  on
[off, off, off] -> off

can be represented as the binary number 00010110, which, in decimal, is 22 (where [off, off, off] is the least significant bit). We can represent numbers from 0 to 255 in 8 bits, so there are exactly 256 elementary cellular automata. This encoding is referred to as a Rule, as in transition rule. The elementary cellular automaton in our above example is called Rule 22.
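The encoding can be checked with a few lines of Python. In this sketch, rule_number() is a hypothetical helper that takes the 8 outputs of a transition table, ordered from [off, off, off] (least significant bit) up to [on, on, on], and returns the rule number:

```python
def rule_number(outputs):
    # outputs[n] is the next state for neighborhood n, where
    # n = left * 4 + cell * 2 + right (so index 0 is [off, off, off])
    return sum(1 << n for n, on in enumerate(outputs) if on)

# Rule table from above, listed from [off, off, off] up to [on, on, on]
table = [False, True, True, False, True, False, False, False]
print(rule_number(table), format(rule_number(table), '08b'))  # 22 00010110
```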

Elementary cellular automata behavior

A common way to plot the evolution of an elementary cellular automaton over multiple generations is to render each generation below the previous one, like our above example using # for on. A more condensed version with 1 pixel per cell of running rule 22 for 301 generations looks like this:

At this level, we can clearly see patterns emerging in the automaton. We get an even more interesting view if, instead of starting with just a single on cell, we start with a random state - an array of random on and off cells. Here is rule 22 starting with 301 random cells and running for 301 generations:

We can also easily see some of the automata are complements of other automata: if we swap the roles of on and off (in both the neighborhoods and the resulting states), we end up with a complementary version. Rule 22's complement is Rule 151:

We can also reflect a rule by swapping the transitions for [on, off, off] with [off, off, on] and [on, on, off] with [off, on, on]. This doesn't work for rule 22, since its reflection is still 22, but, for example, rules 3 and 17 are reflections of each other.

Rule 3:

[ on,  on,  on] -> off
[ on,  on, off] -> off
[ on, off,  on] -> off
[ on, off, off] -> off
[off,  on,  on] -> off
[off,  on, off] -> off
[off, off,  on] ->  on
[off, off, off] ->  on

Renders as:

Rule 17:

[ on,  on,  on] -> off
[ on,  on, off] -> off
[ on, off,  on] -> off
[ on, off, off] ->  on
[off,  on,  on] -> off
[off,  on, off] -> off
[off, off,  on] -> off
[off, off, off] ->  on

Renders as:

That means that, even though there are 256 possible automata, from a behavioral perspective, some are complements or reflections of others and thus exhibit the same behavior. In fact, there are only 88 uniquely behaving automata, all others being complements and/or reflections of these.
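We can verify these numbers with a short script. This is a sketch: the complement(), reflect() and canonical() helpers are made up for illustration. Here complement swaps on and off in both the neighborhoods and the results, and reflect mirrors the left and right neighbors. Each rule is then mapped to the smallest rule number among its equivalents, and we count the distinct results.

```python
def complement(rule):
    # Swap on and off everywhere: the new output for neighborhood n is
    # the inverse of the old output for the inverted neighborhood 7 - n
    return sum((1 - (rule >> (7 - n) & 1)) << n for n in range(8))

def reflect(rule):
    # Mirror each neighborhood: swap the left and right bits of n
    def mirror(n):
        return (n & 1) << 2 | (n & 2) | (n >> 2)
    return sum((rule >> mirror(n) & 1) << n for n in range(8))

def canonical(rule):
    # Smallest rule number among a rule and its equivalents
    return min(rule, complement(rule), reflect(rule),
               complement(reflect(rule)))

print(complement(22), reflect(22), reflect(3))   # 151 22 17
print(len({canonical(r) for r in range(256)}))   # 88
```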

Implementation

Let's look at a Python implementation. We will represent the state of an automaton as a list of Boolean cells. We can encode the state of a neighborhood as a 3 bit number: left neighbor * 4 + cell * 2 + right neighbor. Given a list of cells and the index of a cell, we have:

def neighbors(cells, i):
    return (cells[i - 1] if i > 0 else False) * 4 + \
        cells[i] * 2 + \
        (cells[i + 1] if i < len(cells) - 1 else False)

If we run off the ends of the list, we assume the state of that cell is off. In Python, False becomes 0 and True becomes 1 if we do arithmetic with them, so this function will return a number between 0 and 7.

We can derive the transitions from the rule number by taking a rule number and expanding it into a dictionary that maps each value from 0 to 7 to the corresponding bit in the rule number value:

def transition(rule):
    return {i: rule & (1 << i) != 0 for i in range(8)}

This might be a bit hard to understand, so let's work through an example. Let's take Rule 22. The binary representation of Rule 22 is 00010110. We're iterating over the range 0...7 (i) and for each of these values, we shift 1 exactly i bits left. Then we check if the bitwise AND of the rule and this shifted bit is different than 0.

For i == 0: 00010110 & (1 << 0), which is 00010110 & 00000001, we get False, so transitions[0] = False.

For i == 1: 00010110 & (1 << 1), which is 00010110 & 00000010, we get True, so transitions[1] = True.

...

For i == 7: 00010110 & (1 << 7), which is 00010110 & 10000000, we get False, so transitions[7] = False.

Remember the keys of the dictionary are neighborhood states.
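To tie this back to the rule table, the following snippet repeats the transition() function so it can run on its own, then lists the neighborhoods (as 3-bit strings: left, cell, right) that map to on for Rule 22:

```python
def transition(rule):
    return {i: rule & (1 << i) != 0 for i in range(8)}

t = transition(22)
# Neighborhood states, written as left/cell/right bits, that turn a cell on
on_neighborhoods = [format(n, '03b') for n in range(8) if t[n]]
print(on_neighborhoods)  # ['001', '010', '100']
```

These are exactly the three rows of the Rule 22 table that map to on: [off, off, on], [off, on, off] and [on, off, off].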

Now we just need a function that takes a rule, an initial state, and the number of steps we want to run. The function will start with the initial state, then at each step, update the list of cells using the transition function:

def run(rule, initial_state, steps):
    t, cells = transition(rule), initial_state

    for _ in range(steps):
        yield cells
        cells = [t[neighbors(cells, i)] for i in range(len(cells))]

We talked about two ways to look at cellular automata: starting with a single on cell, or starting with a random initial state.

Let's implement an initial_state function which takes a cell count as input and returns a list of cells, all of which are off except the middle one:

def initial_state(cell_count):
    result = [False] * cell_count
    result[cell_count // 2] = True
    return result

We'll also want a random_initial_state which takes a cell count and returns a random cell list. We'll take advantage of the fact that Python supports arbitrarily large integers natively, so we'll just generate a random number with cell_count bits, then derive the cell list from that (if a bit is 1, the corresponding cell is on):

import random 

def random_initial_state(cell_count):
    seed = random.randint(0, 2 ** cell_count - 1)
    return [seed & (1 << i) != 0 for i in range(cell_count)]

Here is all the code in one listing:

def neighbors(cells, i):
    return (cells[i - 1] if i > 0 else False) * 4 + \
        cells[i] * 2 + \
        (cells[i + 1] if i < len(cells) - 1 else False)

def transition(rule):
    return {i: rule & (1 << i) != 0 for i in range(8)}

def run(rule, initial_state, steps):
    t, cells = transition(rule), initial_state

    for _ in range(steps):
        yield cells
        cells = [t[neighbors(cells, i)] for i in range(len(cells))]

def initial_state(cell_count):
    result = [False] * cell_count
    result[cell_count // 2] = True
    return result

import random 

def random_initial_state(cell_count):
    seed = random.randint(0, 2 ** cell_count - 1)
    return [seed & (1 << i) != 0 for i in range(cell_count)]

Here is how we can use this to print the first 30 steps of Rule 22:

for state in run(22, initial_state(61), 30):
    print(''.join(['#' if e else ' ' for e in state]))

Wolfram classification

Wolfram analyzed the behavior of cellular automata and classified them into 4 classes (called Wolfram classes). These go beyond elementary cellular automata to cover other cellular automata like, for example, ones where the next generation of a cell is determined not only by the cell and the two cells next to it, but by a larger neighborhood that also includes the next-next cells. In this post we'll stick to elementary cellular automata though.

Class 1

Class 1 automata converge quickly to a uniform state. For example rule 0 becomes all off in one generation:

Its complement, rule 255, becomes all on in one generation:

Class 2

Class 2 automata converge quickly to a repetitive state. For example rule 4:

Class 3

Class 3 automata appear to remain in a random state, without converging. Rule 22, which we started with above, exhibits this type of behavior:

Class 4

The most interesting class of cellular automata, class 4, has a quite remarkable behavior: areas of cells end up in a static or repetitive state, while some cells end up forming structures that interact with each other. Rule 110 is the best-known elementary cellular automaton that exhibits this behavior:

Turing completeness

The fact that Rule 110 has areas of cells that are static or repetitive while some other cells form structures should remind you of the Conway's Game of Life spaceships we discussed in the previous post. In the previous post, we saw that the Game of Life is Turing complete, and how a Turing machine was implemented using spaceships as signals processed by other patterns.

It turns out Rule 110 is also Turing complete. Stephen Wolfram conjectured this in 1985, and the conjecture was proved in 2004 by Matthew Cook¹. Cook uses Rule 110 gliders (interacting structures) to emulate a cyclic tag system. We saw in Computability Part 3: Tag Systems that cyclic tag systems can emulate tag systems, and an m-tag system with $m \gt 1$ is Turing complete.

So even an elementary cellular automaton is capable of universal computation. And while this all might seem very abstract, cellular automata are so simple they show up in nature:

See Universality in Elementary Cellular Automata. ↩

Computability Part 4: Conway's Game of Life

Sat, 11 Jun 2022 00:00:00 -0700

Computability Part 4: Conway's Game of Life

Formal definition:

A cellular automaton consists of a discrete n-dimensional lattice of cells, a set of states (for each cell), a notion of neighborhood for each cell, and a transition function mapping the neighborhood of each cell to a new cell state.

The system evolves over time, where at each step, the transition function is applied over the lattice to determine the states of the next generation of cells.

Conway's Game of Life is a cellular automaton on a 2D plane with the following rules:

Any live cell with fewer than two live neighbors dies.

Any live cell with two or three live neighbors lives on to the next generation.

Any live cell with more than three live neighbors dies.

Any dead cell with exactly three live neighbors becomes a live cell.

In other words, a live cell stays alive during the next iteration if it has 2 or 3 live neighbors. A dead cell becomes live if it has exactly 3 live neighbors.

In the case of Conway's Game of Life, the lattice is a 2D grid, we have 2 states (on or off), the neighborhood of a cell consists of all adjacent cells (including corners), and the transition function is the one described above. Mathematician John Conway proposed the Game of Life in 1970.

The reason we started with Conway's Game of Life for discussing cellular automata is that this simple game with simple rules exhibits some very interesting behavior that has been classified for many years by people toying with the simulation.

First, we have still lives, patterns that don't change while stepping through the simulation. These patterns are stable: no cells die, no cells become live.

Next, we have oscillators, patterns that repeat with a certain periodicity:

In the above example, the last (bottom right) pattern has period 5 and is called Octagon 2. The other 3 patterns all have period 2.

More interestingly, we have spaceships - these are patterns that repeat but translate through space:

The above example shows a couple of small spaceships, the tiny 5-cell glider and the lightweight spaceship or LWSS. There are many more spaceship patterns, some of them quite large (hundreds or even thousands of cells).
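The three categories so far (still lifes, oscillators, spaceships) differ only in how a pattern repeats: with period 1, with a longer period in place, or with a displacement. A minimal sketch of that idea follows, assuming an unbounded grid stored as a set of live `(row, col)` cells. The names `life_step`, `normalize`, and `classify` are hypothetical helpers introduced here, and the block and blinker are standard small patterns used for illustration.

```python
from collections import Counter


def life_step(cells):
    """Advance a set of live (row, col) cells by one generation."""
    # Count, for every position, how many live cells it is adjacent to
    counts = Counter(
        (r + dr, c + dc)
        for r, c in cells
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in cells)}


def normalize(cells):
    """Return the pattern's shape shifted to the origin, plus its offset."""
    min_r = min(r for r, _ in cells)
    min_c = min(c for _, c in cells)
    shape = frozenset((r - min_r, c - min_c) for r, c in cells)
    return shape, (min_r, min_c)


def classify(cells, max_steps=100):
    """Classify a pattern as a still life, oscillator, or spaceship.

    Returns (kind, period, displacement), or None if the pattern dies out
    or does not repeat within max_steps generations.
    """
    shape0, (r0, c0) = normalize(cells)
    current = set(cells)
    for period in range(1, max_steps + 1):
        current = life_step(current)
        if not current:
            return None
        shape, (r, c) = normalize(current)
        if shape == shape0:
            moved = (r - r0, c - c0)
            if moved != (0, 0):
                return 'spaceship', period, moved
            kind = 'still life' if period == 1 else 'oscillator'
            return kind, period, moved
    return None


block = {(0, 0), (0, 1), (1, 0), (1, 1)}
blinker = {(1, 0), (1, 1), (1, 2)}
glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}

for name, pattern in [('block', block), ('blinker', blinker), ('glider', glider)]:
    print(name, classify(pattern))
```

This should report the block as a still life, the blinker as an oscillator with period 2, and the glider as a spaceship that moves one cell diagonally every 4 generations.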

Most simulations tend to eventually stabilize into a combination of oscillators and still lives. Patterns that start from a small seed of a handful of cells and take a long time (in terms of iterations) to stabilize are called Methuselahs. Here is an example, nicknamed Acorn:

Conway conjectured that for any initial configuration, there is an upper limit of how many live cells can ever exist. This was proved wrong by the discovery of glider guns. A glider gun generates gliders every few iterations. The gliders continue moving away from the gun, so as the simulation runs, the number of live cells keeps growing.

One of the most popular glider guns is the Gosper glider gun, named after mathematician and programmer Bill Gosper:

There are many other interesting patterns and constructions in the Game of Life discovered throughout the years. A few examples:

Eaters are still life or oscillator patterns that can interact and, over a number of iterations, absorb other patterns like spaceships, and return to their original state.
Reflectors are still life or oscillator patterns that can change the direction of incoming spaceships, and return to their original state.
Puffers are patterns that move like spaceships but leave behind a trail of patterns in their wake (unlike spaceships that cleanly translate).

There are many others, and combinations of them which give rise to interesting systems like circuits and logic gates based on spaceships and strategically placed still lives and oscillators.

Implementation

Let's look at a Python implementation for the Game of Life. We will use a wrap-around space, so we'll consider cells on the last column to be neighbors with cells on the first column and similarly cells on the last row to be neighbors with cells on the first row.

def make_matrix(width, height):
    return [[False] * width for _ in range(height)]

def neighbors(m, i, j):
    last_j = j + 1 if j + 1 < len(m[0]) else 0
    last_i = i + 1 if i + 1 < len(m) else 0

    return (m[i - 1][j - 1] + m[i - 1][j] + m[i - 1][last_j] +
        m[i][j - 1] + m[i][last_j] +
        m[last_i][j - 1] + m[last_i][j] + m[last_i][last_j])

def step(m1):
    m2 = make_matrix(len(m1[0]), len(m1))

    for i in range(len(m1)):
        for j in range(len(m1[0])):
            n = neighbors(m1, i, j)
            if n == 3:
                m2[i][j] = True
            elif n == 2 and m1[i][j]:
                m2[i][j] = True
    return m2

To run a simulation, we also need a function to print the game state and some initial conditions:

def print_matrix(m):
    for line in m:
        print(str.join('', ['#' if c else ' ' for c in line]))

m = make_matrix(10, 10)

m[0][1] = True
m[1][2] = True
m[2][0] = True
m[2][1] = True
m[2][2] = True

for _ in range(100):
    print_matrix(m)
    m = step(m)

Another very simple to implement system with powerful computational capabilities.

Turing completeness

It turns out the Game of Life is Turing complete, meaning it is also capable of universal computation. Gliders are key to this. In general, if the behavior of cells were either repetitive (still life or oscillators cycle through 1 or more patterns) or chaotic, it would be hard to perform any computation. But gliders move and can interact with each other, thus enabling some non-chaotic processes.

We briefly discussed above how Game of Life patterns can be combined to form circuits that can process signals (in the form of spaceships) like logic gates and memory storage. Paul Rendell implemented a universal Turing machine in the Game of Life. His website (http://rendell-attic.org/gol/tm.htm) covers the details, which we won't go into due to the complexity. Suffice to say the patterns emerging in the Game of Life can be combined to build such a device. Paul also wrote a book about it¹.

We again encountered a system capable of computing anything computable, based only on a matrix of cells and a couple of rules (live cells with 2 or 3 neighbors stay alive, dead cells with exactly 3 neighbors become live).

The website https://conwaylife.com/ includes a lot of details on Conway's Game of Life, various patterns discovered, and a forum where people discuss their exploration of the system.

In the next post, we'll look at even simpler cellular automata: elementary cellular automata where cells have 2 possible states and 2 neighbors.

See Turing Machine Universality of the Game of Life. ↩

Computability Part 3: Tag Systems

Fri, 20 May 2022 00:00:00 -0700

Computability Part 3: Tag Systems

In the previous post we talked about universal Turing machines and looked at some very small machines that are still capable of computing anything that can be computed (the Turing-completeness property). In this post, we'll look at another model for computation: tag systems.

A tag system operates on a string of symbols by reading the symbol from the head of the string, deleting a constant number of symbols from the head of the string, and appending one or more symbols to the tail of the string based on the symbol read from the head.

Formally:

A tag system is a triplet $\langle m, A, P \rangle$.

$m$ is a positive integer, called the deletion number, which specifies how many symbols are deleted from the head during each iteration.

$A$ is a finite alphabet of symbols, including a special halting symbol.

$P$ is a set of production rules which map each symbol in $A$ to a string of symbols or words from $A$ (to be appended to the end of the string).

Tag systems were specified by Emil Leon Post in 1943, 7 years after Turing Machines. We usually refer to tag systems as m-tag systems where $m$ is the deletion number from the definition above.

At each step, $x$ is read from the head of the string, $m$ symbols are deleted, and $P(x)$ is appended to the end of the string. The tag system halts when $x$ is the halting symbol.

An alternative definition that doesn't require a halting symbol considers as halting all words that are shorter than $m$. In this case, the tag system halts when the string shrinks sufficiently. Yet another alternative considers as halting the empty string. In this case, the tag system halts when the string becomes empty.

Let's look at a Python implementation for a tag system:

def tag_system(m, productions, string):
    # Repeat until the string is empty or we see the halting symbol
    while string and string[0] in productions:
        string = string[m:] + productions[string[0]]

        yield string

As an example, let's take the tag system with $m = 2, A = \lbrace a, b, H \rbrace$, and the production rules

Symbol	Word
a	aab
b	H

Starting with the string aa, the steps are:

aa              // Erase 2 symbols from head, a -> aab
  aab           // Erase 2 symbols from head, a -> aab
    baab        // Erase 2 symbols from head, b -> H
      abH       // Erase 2 symbols from head, a -> aab
        Haab    // Halt

Using our tag_system() function implemented above:

productions = {
    'a': 'aab',
    'b': 'H',
}

string = 'aa'

print(string)
for string in tag_system(2, productions, string):
    print(string)
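The alternative halting definition mentioned earlier (no halting symbol, stop once the string has fewer than $m$ symbols) can be sketched as a small variant. The name `tag_system_until_short` and the example rules are assumptions for illustration and are not the rules from the example above: they are a commonly cited 2-tag system whose runs on a string of a's follow a Collatz-style sequence.

```python
def tag_system_until_short(m, productions, string):
    """Run an m-tag system with no halting symbol.

    Assumption: halting means the string has become shorter than m.
    """
    while len(string) >= m:
        string = string[m:] + productions[string[0]]
        yield string


# Illustrative rules: starting from n a's, the strings made only of a's
# have lengths 3, 5, 8, 4, 2, 1 for the run below
collatz = {'a': 'bc', 'b': 'a', 'c': 'aaa'}

for s in tag_system_until_short(2, collatz, 'aaa'):
    print(s)
```

The run ends with the single-symbol string a, which is shorter than $m = 2$, so the system halts without ever reading a dedicated halting symbol.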

Tag systems are simple, even simpler than Turing machines. Remember we defined a Turing machine as a 7-tuple while tag systems are represented by triplets. Turing machines have states, and depending on the state, a machine takes different actions. Tag systems technically have a single state: when a symbol is read from the head of the string, the same thing will always happen: $m$ symbols are deleted from the head and the corresponding production rule determines what word to append to the tail of the string. Even so, tag systems are Turing-complete.

Turing completeness

For $m \gt 1$, m-tag systems are Turing complete. For any Turing machine, there is an m-tag system that can emulate that Turing machine. John Cocke and Marvin Minsky showed in 1964 how a 2-tag system can emulate a universal Turing machine¹. That means that such a super simple system is also capable of universal computation!

But it gets even simpler.

Cyclic tag systems

A cyclic tag system is a modification of tag systems where:

$m = 1$: only one symbol is deleted from the head of the string.
The alphabet consists of only 0 and 1.
Instead of production rules, we use a finite list of words (on the alphabet consisting of only 0 and 1) called productions.

Instead of production rules, we cycle through the list of productions. We start from the head of the list of productions. At each step, if the symbol at the head of the string is 1, we append the production to the end of the string. If the symbol at the head of the string is 0, we don't append anything. We then move to the next production in the list for the next step. Once we exhaust the list of productions, we loop around to the head (this inspired the cyclic name).

Here is a Python implementation for a cyclic tag system:

def cyclic_tag_system(productions, string):
    # Keeps track of current production
    i = 0

    # Repeat until the string is empty
    while string:
        string = string[1:] + (productions[i] if string[0] == '1' else '')

        # Update current production
        i = i + 1
        if i == len(productions):
            i = 0

        yield string

For example, we will use the productions 11, 01, 00. With an initial string 1, the steps are:

1               // Append production 11
 11             // Append production 01
  101           // Append production 00
   0100         // Current production 11 (won't append since head is 0)
    100         // Append production 01
     0001       // Current production 00 (won't append since head is 0)
      001       // Current production 11 (won't append since head is 0)
       01       // Current production 01 (won't append since head is 0)
        1       // Append production 00
         00     // Current production 11 (won't append since head is 0)
          0     // Current production 01 (won't append since head is 0)
                // Halts

Using our Python implementation:

productions = ['11', '01', '00']

string = '1'

print(string)
for string in cyclic_tag_system(productions, string):
    print(string)

Cyclic tag systems are simpler than tag systems since $m$ is fixed to 1, the alphabet is fixed to 0 and 1, and productions are represented as a cyclic list rather than a map of symbols to words. Even so, a cyclic tag system can emulate any m-tag system.

Emulating tag systems with cyclic tag systems

An m-tag system with $n$ symbols $\lbrace a_1, a_2, ... a_n \rbrace$ and their corresponding production rules $\lbrace P_1, P_2, ... P_n \rbrace$ can be translated to a cyclic tag system with $m * n$ productions where the first $n$ productions $\lbrace P'_1, P'_2, ... P'_n \rbrace$ are encodings of their respective $P$-productions in the m-tag system and the rest are empty strings.

Productions in the m-tag system are words over the alphabet $A$. We encode each symbol in $A$ as a binary string of length $n$, with a 1 in the $k$-th position for $a_k$. For example, for $n = 3$ and the alphabet $A = \lbrace a_1, a_2, a_3 \rbrace$, we encode $a_1$ as 100, $a_2$ as 010, $a_3$ as 001. Since a production $P_k$ is a sequence of symbols, we can similarly translate it into an encoded representation $P'_k$ using symbols 0 and 1.

Our first example was the 2-tag system with the alphabet $A = \lbrace a, b, H \rbrace$, and the production rules

Symbol	Word
a	aab
b	H
H	H

Here we added the production rule H -> H for completeness, so we have exactly $n$ production rules.

Translating this into a cyclic tag system, $a, b, H$ become 100, 010, and 001 respectively. The production rules translate as:

a -> aab becomes 100100010

b -> H becomes 001

H -> H becomes 001

The full list of productions for the cyclic tag system is 100100010, 001, 001, -, -, - where - is the empty string.
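The translation described above is mechanical, so it can be sketched in a few lines. The helper names `encode_symbol`, `encode_word`, and `to_cyclic` are hypothetical, and the sketch assumes the alphabet is given as an ordered list with exactly one production rule per symbol.

```python
def encode_symbol(alphabet, symbol):
    """One-hot encode a symbol: a 1 in the symbol's position, 0 elsewhere."""
    k = alphabet.index(symbol)
    return '0' * k + '1' + '0' * (len(alphabet) - k - 1)


def encode_word(alphabet, word):
    """Encode a word by concatenating the encodings of its symbols."""
    return ''.join(encode_symbol(alphabet, s) for s in word)


def to_cyclic(m, alphabet, rules):
    """Build the m * n cyclic tag productions for an m-tag system."""
    encoded = [encode_word(alphabet, rules[s]) for s in alphabet]
    # The remaining (m - 1) * n productions are empty strings
    return encoded + [''] * ((m - 1) * len(alphabet))


alphabet = ['a', 'b', 'H']
rules = {'a': 'aab', 'b': 'H', 'H': 'H'}

print(to_cyclic(2, alphabet, rules))
print(encode_word(alphabet, 'aa'))
```

For the 2-tag system above, this produces the same six productions, and `encode_word` gives the encoded form of any starting string.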

The initial string aa becomes 100100, so our emulation is:

100100                      // * Emulated production rule a -> aab
 00100100100010             // P = 001 (but head is 0)
  0100100100010             // P = 001 (but head is 0)
   100100100010             // P = empty string
    00100100010             // P = empty string, head is 0
     0100100010             // P = empty string, head is 0
      100100010             // * Emulated production rule a -> aab
       00100010100100010    // P = 001 (but head is 0)
        0100010100100010    // P = 001 (but head is 0)
         100010100100010    // P = empty string
          00010100100010    // P = empty string, head is 0
           0010100100010    // P = empty string, head is 0
            010100100010    // P = 100100010 (but head is 0)
             10100100010    // * Emulated production rule b -> H
              0100100010001 // P = 001 (but head is 0)
               100100010001 // P = empty string
                ...

Using our Python implementation:

productions = ['100100010', '001', '001', '', '', '']

string = '100100'

print(string)
for string in cyclic_tag_system(productions, string):
    print(string)

Note in this case the cyclic tag system won't halt when the emulated m-tag system halts, since that would be an emulated halt. But we can stop it by checking whether the first 3 symbols represent the encoding of H. We do this every sixth step, since we have a 2-tag system with 3 symbols, which means we emulate 1 step of the tag system with 6 steps of the cyclic tag system.

productions = ['100100010', '001', '001', '', '', '']

i, string = 0, '100100'

print(string)
for string in cyclic_tag_system(productions, string):
    print(string)

    i = (i + 1) % 6

    # Break if halting symbol is at the head of the string
    if i == 0 and string[:3] == '001':
        break

Or, an updated example that prints every sixth step and translates from the cyclic tag system encoding to the original symbols:

productions = ['100100010', '001', '001', '', '', '']

symbols = {
    '100': 'a',
    '010': 'b',
    '001': 'H',
}

def translate(s):
    return ''.join([symbols[s[i:i + 3]] for i in range(0, len(s), 3)])

i, string = 0, '100100'

print(f'{string} ({translate(string)})')
for string in cyclic_tag_system(productions, string):
    i = (i + 1) % 6
    if i == 0:
        print(f'{string} ({translate(string)})')
        if string[:3] == '001':
            break

Running this code should be the emulated equivalent of our first example in this post.

Since m-tag systems (with $m \gt 1$) are Turing-complete and cyclic tag systems can emulate any m-tag system, it follows that cyclic tag systems are also Turing complete. We can compute anything that is computable with the alphabet 0, 1, and a list of words over this alphabet!

In the next post, we will continue our exploration of simple systems capable of universal computation with cellular automata.

See Universality of Tag Systems With P = 2. ↩

Computability Part 2: Turing Machines

Sun, 03 Apr 2022 00:00:00 -0700

Computability Part 2: Turing Machines

In the previous post, we looked at a history of what would become computer science. In this post, we'll focus on Turing machines and Turing completeness.

The informal definition we gave to a Turing machine in the previous post is:

An abstract computer consisting of an infinite tape of cells, a head that can read from a cell, write to a cell, and move left or right over the tape, and a set of rules which direct the head based on the read symbol and the current state of the machine.

Formally:

A Turing machine is a 7-tuple $M = \langle Q, q_0, F, \Gamma, b, \Sigma, \delta \rangle$.

$Q \ne \varnothing$ is a finite set of states. These are all the states the machine can be in.

$q_0 \in Q$ is the initial state. This is the state the machine starts in.

$F \subseteq Q$ is the set of final states. When the machine reaches one of the final states, it halts - it stops execution.

$\Gamma \ne \varnothing$ is a finite set of tape symbols. These are all the symbols that can appear on the tape.

$b \in \Gamma$ is the blank symbol, one of the possible tape symbols. The only symbol allowed to occur on the tape infinitely often at any step.

$\Sigma \subseteq \Gamma \setminus \lbrace b \rbrace$ is the set of input symbols allowed to appear in the initial tape contents (not written by the machine during execution). These symbols can be the whole alphabet (except the blank symbol), or a subset of the alphabet.

$\delta: (Q \setminus F) \times \Gamma \to Q \times \Gamma \times \lbrace L, R \rbrace$ is a function called the transition function. This function takes as input the current machine state and the symbol on the tape. It outputs the new machine state, the symbol to overwrite the current tape symbol, and the head movement (either left or right). Note the function domain excludes the final states - once the machine reaches a state in $F$, it halts so no more transitions happen.

Alternately, the transition function can be defined as a partial function $\delta: Q \times \Gamma \hookrightarrow Q \times \Gamma \times \lbrace L, R \rbrace$, where the machine halts if the function is undefined for the given combination of machine state and tape symbol. In some compact Turing machines (like we'll see below), $F$ is empty. There is no final state; rather, we halt when encountering a certain combination of machine state and tape symbol for which no transition is defined.
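To make the partial-function formulation concrete, here is a minimal sketch in which $\delta$ is a Python dictionary and a missing key means halt. Everything in it is an assumption made for illustration: the function name `run`, the `'_'` blank symbol, the sparse dictionary tape, the `max_steps` safety limit, and the tiny bit-flipping machine.

```python
from collections import defaultdict


def run(delta, state, tape_symbols, blank='_', max_steps=1000):
    """Run a Turing machine whose transition function is a partial function.

    delta maps (state, symbol) to (new state, symbol to write, 'L' or 'R').
    The machine halts on the first (state, symbol) pair missing from delta.
    """
    # Sparse tape: any cell that was never written reads as the blank symbol
    tape = defaultdict(lambda: blank, enumerate(tape_symbols))
    head = 0

    for _ in range(max_steps):
        key = (state, tape[head])
        if key not in delta:
            break  # No transition defined: halt
        state, tape[head], move = delta[key]
        head += 1 if move == 'R' else -1

    cells = ''.join(tape[i] for i in range(min(tape), max(tape) + 1))
    return state, cells.strip(blank)


# Hypothetical example machine: flip every bit, then halt on the first blank
flip = {
    ('q0', '0'): ('q0', '1', 'R'),
    ('q0', '1'): ('q0', '0', 'R'),
}

print(run(flip, 'q0', '1011'))
```

Here $F$ is empty: the machine stops only because no transition is defined for state `q0` reading a blank.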

Note this definition allows for some very uninteresting machines: a machine that only has an initial and a final state ($Q = \lbrace q_0, f \rbrace$) and, for any input symbol in $\Gamma$, the transition function moves the machine into the final state. This is a Turing machine, but it can't really compute much. Something more is needed.

Universal Turing machines

A universal Turing machine is a Turing machine that can simulate another, arbitrary, Turing machine on arbitrary input. That is, it can read the description of a Turing machine and that machine's input as its own input, then simulate the execution of that machine.

With this definition, a universal Turing machine can compute anything any other Turing machine can compute (anything that is computable).

Marvin Minsky discovered a universal Turing machine that requires only 7 states and 4 symbols. Yurii Rogozhin discovered a machine with only 4 states and 6 symbols. Let's call the states $Q = \lbrace A, B, C, D \rbrace$ and the symbols $\Gamma = \lbrace 0, 1, 2, 3, 4, 5 \rbrace$.

(4, 6) Turing Machine

	A	B	C	D
0	3,L,A	4,R,B	0,R,C	4,R,D
1	2,R,A	2,L,C	3,R,D	5,L,B
2	1,L,A	3,R,B	1,R,C	3,R,D
3	4,R,A	2,L,B	HALT	HALT
4	3,L,A	0,L,B	5,R,A	5,L,B
5	4,R,D	1,R,B	0,R,A	1,R,D

The above table describes the transition function of the Turing machine. For example, if the machine is in state A and the read tape symbol is 5, we can look up the A column and 5 row to find the transition 4,R,D. This means print 4 on the tape (overwriting the current symbol), move the head right (R), machine is now in state D.

We're using the partial transition function definition, so instead of defining one or more explicit final states ($F$), we don't define a transition when the tape symbol is 3 and the machine is in state C or state D.

Implementation

Let's look at a Python implementation of Turing machines. First, let's implement the tape we will be using. Theoretically this is an infinite tape. To simulate this in software, we will use a list and whenever we move the head left or right beyond the list, we extend the list with an additional blank symbol:

class Tape:
    def __init__(self, tape, head = 0):
        # Initial tape should have at least one symbol
        assert(len(tape) >= 1)
        # Tape head should be a valid index
        assert(0 <= head < len(tape))

        self.tape = tape
        self.head = head

    def read(self):
        return self.tape[self.head]

    def write(self, symbol):
        self.tape[self.head] = symbol

    def move_left(self):
        # If attempting to move left out of bounds, extend tape left
        if self.head == 0:
            self.tape.insert(0, 0)
        else:
            self.head -= 1

    def move_right(self):
        self.head += 1
        # If attempting to move right out of bounds, extend tape right
        if self.head == len(self.tape):
            self.tape.append(0)

We'll implement a machine that takes a tape, a transition table, and an initial state, and runs until it halts:

def machine(tape, transitions, state):
    while True:
        symbol = tape.read()

        # If no transition is defined for the current state and symbol, halt
        if not transitions[state][symbol]:
            break

        new_symbol, direction, new_state = transitions[state][symbol]

        tape.write(new_symbol)
        tape.move_left() if direction == 'L' else tape.move_right()
        state = new_state

To stitch this together, we need a transition table and initial tape state. We'll use the Rogozhin (4, 6) machine:

# Machine states
A, B, C, D = 'A', 'B', 'C', 'D'

# Left and right
L, R = 'L', 'R'

# Rogozhin 4-state, 6-symbol Turing machine
transition = {
    A: [(3, L, A), (2, R, A), (1, L, A), (4, R, A), (3, L, A), (4, R, D)],
    B: [(4, R, B), (2, L, C), (3, R, B), (2, L, B), (0, L, B), (1, R, B)],
    C: [(0, R, C), (3, R, D), (1, R, C), None, (5, R, A), (0, R, A)],
    D: [(4, R, D), (5, L, B), (3, R, D), None, (5, L, B), (1, R, D)],
}

This machine is a universal Turing machine, meaning it can simulate any other Turing machine, thus is capable of universal computation (can compute anything that is computable).

Turing-completeness

A Turing-complete system is any system capable of simulating any Turing machine.

Turing-completeness is a way of expressing the computational power of a given system. A Turing-complete system is capable of universal computation. The small Rogozhin (4, 6) machine, since it is a universal Turing machine, is Turing-complete.

More so, the fact that we can simulate this machine in the Python programming language proves that the Python language itself is Turing-complete.

Esoteric Turing-complete systems

If we weaken some of the constraints for Turing machines, there are even smaller weak universal Turing machines. For example, if we allow the tape to contain an infinitely repeated sequence of symbols, or we don't require the machine to ever halt.

The smallest weakly universal Turing machine is a Turing machine consisting of 2 states and 3 symbols. Let's call the states $Q = \lbrace A, B \rbrace$ and the symbols $\Gamma = \lbrace 0, 1, 2 \rbrace$.

(2, 3) Turing Machine

	A	B
0	1,R,B	2,L,A
1	2,L,A	2,R,B
2	1,L,A	0,R,A

Stephen Wolfram in A New Kind of Science (a book we'll get back to in a future post) described a 2-state 5-symbol universal Turing machine and conjectured the 2-state 3-symbol machine is also universal. The universality of the 2-state 3-symbol machine was proved in 2007.
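As a rough sketch of how the (2, 3) table reads as a transition function, the snippet below runs the machine for a fixed number of steps. Starting in state A on an all-0 tape is an assumption made for illustration, and the step limit is needed because this machine has no halting transition. `RULES` and `run_2_3` are hypothetical names.

```python
from collections import defaultdict

# (state, symbol read) -> (symbol to write, head move, next state)
RULES = {
    ('A', 0): (1, 'R', 'B'), ('B', 0): (2, 'L', 'A'),
    ('A', 1): (2, 'L', 'A'), ('B', 1): (2, 'R', 'B'),
    ('A', 2): (1, 'L', 'A'), ('B', 2): (0, 'R', 'A'),
}


def run_2_3(steps, state='A'):
    """Run the machine for a fixed number of steps on an all-0 tape."""
    # Sparse tape: cells that were never visited read as 0
    tape, head = defaultdict(int), 0
    for _ in range(steps):
        write, move, state = RULES[(state, tape[head])]
        tape[head] = write
        head += 1 if move == 'R' else -1
    return dict(tape), head, state


for n in range(7):
    print(n, run_2_3(n))
```

Each printed line shows the visited tape cells (keyed by position), the head position, and the machine state after that many steps.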

In terms of Turing-complete programming languages, a somewhat famous esoteric programming language is Brainfuck. Brainfuck uses a byte array (tape), a data pointer (index in the array), and 8 symbols: >, <, +, -, ., ,, [, ]. The symbols are interpreted as:

>: Increment the data pointer (move head right).
<: Decrement the data pointer (move head left).
+: Increment array value at data pointer.
-: Decrement array value at data pointer.
.: Output value at data pointer.
,: Read 1 byte of input and store at data pointer.
[: If the byte at data pointer is 0, jump right to the matching ], else continue with the next instruction.
]: If the byte at data pointer is not 0, jump left to the matching [, else continue with the next instruction.

This simple language is very much modeled after a Turing machine. Here is Hello World! in Brainfuck:

++++++++[>++++[>++>+++>+++>+<<<<-]>+>+>->>+[<]<-]>>.>
---.+++++++..+++.>>.<-.<.+++.------.--------.>>+.>++.

Since the language definition is so simple, it is very easy to write a Brainfuck interpreter:

import sys

def bf(program):
    # Data array, data pointer, and code pointer
    data, dp, cp = [0], 0, 0

    while cp < len(program):
        match program[cp]:
            case '<':
                dp -= 1
            case '>':
                dp += 1
                if dp == len(data):
                    data.append(0)
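            case '+':
                # Cells are bytes, so wrap around on overflow
                data[dp] = (data[dp] + 1) % 256
            case '-':
                # Wrap around on underflow as well
                data[dp] = (data[dp] - 1) % 256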
            case '.':
                print(chr(data[dp]), end='')
            case ',':
                data[dp] = ord(sys.stdin.read(1))
            case '[':
                if data[dp] == 0:
                    opened = 1
                    while opened:
                        cp += 1
                        if program[cp] == ']':
                            opened -= 1
                        elif program[cp] == '[':
                            opened += 1
            case ']':
                if data[dp] != 0:
                    opened = 1
                    while opened:
                        cp -= 1
                        if program[cp] == '[':
                            opened -= 1
                        elif program[cp] == ']':
                            opened += 1
        cp += 1
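The interpreter above finds a matching bracket by scanning the program every time it jumps. One possible refinement, sketched here as an assumption rather than as part of the original interpreter, is to precompute all matches once with a stack. `match_brackets` is a hypothetical helper name.

```python
def match_brackets(program):
    """Map each bracket's index to the index of its matching bracket."""
    jumps, stack = {}, []
    for i, c in enumerate(program):
        if c == '[':
            stack.append(i)
        elif c == ']':
            if not stack:
                raise SyntaxError(f'unmatched ] at position {i}')
            j = stack.pop()
            # Record the pair in both directions
            jumps[i], jumps[j] = j, i
    if stack:
        raise SyntaxError(f'unmatched [ at position {stack[-1]}')
    return jumps


print(match_brackets('+[>[-]<-]'))
```

With such a table, the `[` and `]` cases could set the code pointer to `jumps[cp]` instead of scanning, and unbalanced programs would be rejected before execution starts.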

Also note that any programming language that can implement a Brainfuck interpreter is Turing-complete (since Brainfuck is Turing-complete).

There are also some surprising proofs of unintentional Turing-completeness. For example, C++ template metaprogramming was proved to be Turing-complete (not the C++ language itself, which is obviously Turing-complete, just the template part alone). Magic: The Gathering is also Turing-complete. Turing-completeness comes in many forms. In the next posts, we'll look at some other models of universal computation: tag systems and cellular automata.

Computability Part 1: A Short History

Sat, 12 Feb 2022 00:00:00 -0800

Computability Part 1: A Short History

An algorithm (/ˈælɡərɪðəm/) is a finite sequence of well-defined instructions, typically used to solve a class of specific problems or to perform a computation.

Ancient computers

The first computer (we know of) is the Antikythera mechanism. It was found in 1901 in a shipwreck. The device was built sometime between 150 BC and 100 BC and uses gears to predict astronomical positions of the Sun, Moon, and planets through the zodiac.

Image from Wikimedia Commons user Marsyas, CC BY 2.5

This millennia old device is a hand-powered analog computer. Humanity has been looking at automating computation for quite some time.

The father of computer science

Skipping forward many centuries, the famous Gottfried Wilhelm Leibniz (1646-1714) designed the first device that could perform the 4 arithmetic operations and used an internal memory. He also invented the binary system, and his Algebra of Thought is a precursor to Boolean Algebra. Leibniz is famous as a mathematician (inventing calculus independently of Isaac Newton), but some also call him the father of computer science.

After creating his arithmetic machine, Leibniz dreamt of a machine that could manipulate symbols in order to decide the truth value of mathematical statements.

The Difference Engine and the Analytical Engine

Over a century later, Charles Babbage (1791-1871) invents the Difference Engine, a mechanical calculator that can tabulate polynomial functions. Babbage created a small version of this, the Difference Engine 0, in 1822. Work on a larger version, which was supposed to enable larger calculations, was funded by the British government. Unfortunately, this did not materialize due to the manufacturing limitations of the time. It took 20 years and large amounts of money until the project was abandoned. The Difference Engine 1 was never completed.

During this time, Babbage started thinking about a general-purpose computer, the Analytical Engine. The Analytical Engine would include an arithmetic logic unit, control flow, and memory - components of modern electronic computers. The programming language resembled modern day assembly languages and would have been fed to the computer through punch cards. This machine was never built.

Even though the physical Analytical Engine did not materialize, several programs were created for it, both by Babbage and Ada Lovelace (1815-1852). Ada published the first algorithm for the Analytical Engine, used to compute Bernoulli numbers, and is regarded as the first programmer.

The foundational crisis of mathematics

At the beginning of the 20th century, mathematicians were looking for a proper foundation for mathematics: a set of axioms from which all theorems could be derived.

David Hilbert (1862-1943) put forward 23 problems in 1900, which heavily influenced the direction of mathematics research in the 20th century. Some of the problems have since been solved, others, like the famous Riemann hypothesis (problem 8), are still unresolved.

The 2nd problem, directly tying into the foundational crisis, was to prove that the axioms of arithmetic are consistent (meaning no contradictions can arise as theorems are derived from the axioms).

Alfred North Whitehead (1861-1947) and Bertrand Russell (1872-1970) start working on the Principia Mathematica. 3 volumes are published in 1910, 1912, and 1913. Starting with a minimum set of primitive notions, axioms, and inference rules, they deduce theorems pertaining to logic, arithmetic, set theory and so on. Famously, the proof that 1+1=2 appears on page 379 of volume 1.

Kurt Gödel (1906-1978) proves, with his incompleteness theorem (1930), that a formal system powerful enough to describe arithmetic cannot be both consistent and complete. In other words, starting from a set of axioms, if these are consistent (no contradictions can be derived), they cannot be complete (there will be true statements that cannot be derived from these axioms).

Building upon this work, in 1933, Gödel develops general recursive functions as a model of computability (more on this later).

Entscheidungsproblem and models of computability

David Hilbert proposes another challenge in 1928: the decision problem. The problem asks for an algorithm that takes a statement as an input and decides whether the statement is provable within the considered set of axioms. Note that Gödel's incompleteness theorem shows that some true statements cannot be proved from a consistent set of axioms. That doesn't mean there isn't an algorithm that can decide whether a statement is provable or not. Hilbert believed such an algorithm exists.

Alonzo Church (1903-1995) develops lambda calculus as a model of computation that uses function abstraction, application, and variable binding and substitution. Church's Theorem (1936) provides a negative answer to the decision problem, based on lambda calculus. He shows there is no computable function that can decide whether two lambda expressions are equivalent.

Around the same time, Alan Turing (1912-1954) develops another model of computation: the Turing machine. This is an abstract computer consisting of an infinite tape of cells, a head that can read from a cell, write to a cell, and move left or right over the tape, and a set of rules which direct the head based on the read symbol and the current state of the machine. Turing also provides a negative answer to the decision problem in the same year as Church (1936), based on Turing machines: he shows that there is no general method to decide whether any given Turing machine halts or not (the halting problem).
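
To make the tape, head, and rules concrete, here is a minimal sketch of a Turing machine simulator in Python. The names runTuringMachine and invertBits, the rule-table format, and the maxSteps guard are all assumptions made for this illustration, not part of any historical formulation.

```python
def runTuringMachine(rules, tape, state="start", blank="_", maxSteps=1000):
    # The tape is a dict from cell index to symbol, so it can grow in
    # both directions; cells never written read as the blank symbol.
    cells = dict(enumerate(tape))
    head = 0
    for _ in range(maxSteps):  # guard: in general we cannot know if it halts
        if state == "halt":
            break
        symbol = cells.get(head, blank)
        # Each rule maps (state, symbol read) to (symbol to write, move, next state).
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

# Hypothetical example machine: flip every bit, halt at the first blank.
invertBits = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}

result = runTuringMachine(invertBits, "1011")
print(result)
```

The maxSteps cap is there precisely because of the halting problem: the simulator has no general way to tell in advance whether a given rule table will ever reach the halt state.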

Universal computability and Turing completeness

These are remarkable results: we now have proof that some problems are incomputable. More than that, we know that a Turing machine can compute anything that is computable.

Turing showed that lambda calculus and Turing machines can simulate each other. That means each can compute anything the other can compute, so the two systems have the same computational power. The Church-Turing thesis goes one step further and conjectures that anything effectively computable can be computed by a Turing machine.

In general, if a system can be used to simulate a Turing machine, this makes it Turing complete, meaning capable of computing anything that is computable.

Gödel's general recursive functions are also shown to be an equivalent model of computation (these are the functions that Turing machines can compute).

We have 3 quite different approaches to universal computability: general recursive functions, lambda calculus, and Turing machines. These turn out to all be equivalent in terms of what is possible to compute.

Turing machines, with their simple definition, are easy to simulate, thus making Turing completeness the preferred way of proving that a system is capable of universal computation.

Timsort

Thu, 30 Dec 2021 00:00:00 -0800

Timsort

I wrote before about the inherent complexity of the real world and how software that behaves well in the real world must necessarily take on some complexity (Time and Complexity). A lot of the software engineering best practices try to reduce or eliminate the accidental complexity of large systems (making things more complicated than they should be). But we don't live in a perfect world, so modeling it using software requires some inherent complexity in the software, to reflect reality. One of the algorithms which perfectly illustrates this is the Timsort sorting algorithm.

Timsort is an algorithm developed by Tim Peters in 2002 to replace Python's previous sorting algorithm. It has since been adopted in Java's OpenJDK, the V8 JavaScript engine, and the Swift and Rust languages. This is a testament to how well Timsort performs.

Timsort is a stable sorting algorithm, which means it never changes the relative order of equal elements. This is an important property in certain situations: it does not matter when sorting plain numbers, but it becomes important when sorting objects with custom comparisons.
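
A quick way to see stability in action is Python's built-in sorted, which is documented to be stable. The task records below are made up for this example.

```python
# (name, priority) records; two pairs of records share the same priority.
tasks = [("write", 2), ("test", 1), ("review", 2), ("deploy", 1)]

# Sort by priority only. Because the sort is stable, records with equal
# priority keep the relative order they had in the input.
byPriority = sorted(tasks, key=lambda task: task[1])
print(byPriority)
```

"test" stays ahead of "deploy" and "write" stays ahead of "review", exactly as they appeared in the input.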

But in 2002 we already had plenty of well known sorting algorithms which were quite efficient. How did Timsort manage to outperform these?

Merging Runs

The key insight of Timsort is that in the real world, many lists of elements that require sorting contain subsequences of elements that are already sorted. These are called runs and tend to appear naturally. For example, in the list [5, 2, 3, 4, 9, 1, 6, 8, 10, 7] we have two runs: [2, 3, 4, 9] and [1, 6, 8, 10].

If we know runs will show up more often than not in our input, how can we best leverage this to our advantage, and avoid extraneous comparisons and data movement?

Timsort starts by finding the minimum accepted run length for a given input. This doesn't have anything to do with the content of the input, rather it is a function of the size of the input. More on this later.

Then we do a single pass over the array and identify consecutive runs. If the next minimum accepted run length elements are not already sorted (they don't form a run), we sort them using insertion sort (so they do end up as a run). We push these runs on a stack, then merge pairs of them until we end up with a single run, which is our sorted list.

A Simple Implementation

Let's start with a simple sketch implementation. We'll use Python since it is expressive and it makes it easier to focus on the algorithm rather than syntax around it.

MIN_MERGE = 4

def sort(arr): 
    lo, hi = 0, len(arr) 
    stack = []
    nRemaining = hi
    minRun = MIN_MERGE

    while nRemaining > 0:
        runLen = min(nRemaining, minRun)
        insertionSort(arr, lo, lo + runLen)
        stack.append((lo, runLen))

        lo += runLen
        nRemaining -= runLen

    while len(stack) > 1:
        base2, len2 = stack.pop()
        base1, len1 = stack.pop()
        merge(arr, base1, base2, base2 + len2)
        stack.append((base1, len1 + len2))

First, we initialize a few variables:

lo, hi = 0, len(arr) 
stack = []
nRemaining = hi
minRun = MIN_MERGE

MIN_MERGE represents the minimum number of elements we want to merge, and is a constant. We'll talk more about this once we look at some optimizations later on.

lo and hi represent the range in the array we will operate on. Note ranges are always half-open (arr[lo] included, arr[hi] excluded, potentially out of bounds). stack is the run stack, nRemaining is the number of elements we still need to process. minRun is the minimum run length. For this first iteration, we'll just use MIN_MERGE.

Next, we traverse the array and come up with our runs:

while nRemaining > 0:
    runLen = min(nRemaining, minRun)
    insertionSort(arr, lo, lo + runLen)
    stack.append((lo, runLen))

    lo += runLen
    nRemaining -= runLen

Our run in this case will be the minimum between minRun and the remaining elements of the array (so for the final run, we don't go out of bounds). We sort the run using insertionSort, then we push the run start index and length onto the stack. We advance lo by the length of the run and we similarly decrement nRemaining, the number of elements still to be processed.

Next, we merge the runs:

while len(stack) > 1:
    base2, len2 = stack.pop()
    base1, len1 = stack.pop()
    merge(arr, base1, base2, base2 + len2)
    stack.append((base1, len1 + len2))

We pop 2 runs from the top of the stack, merge them, and push the new run back onto the stack. With this basic implementation, a stack is technically not really needed, but I'm trying to preserve the general shape of the optimized solution.

We called a couple of helper functions: insertionSort and merge. Here is insertionSort:

def insertionSort(arr, lo, hi): 
    i = lo + 1

    while i < hi:
        elem = arr[i]
        j = i - 1

        # Check the bound first so we never compare against arr[lo - 1].
        while j >= lo and elem < arr[j]:
            j -= 1

        arr.pop(i)
        arr.insert(j + 1, elem)

        i += 1

Insertion sort traverses the array from the lower bound + 1 to the higher bound and maintains the invariant that all elements preceding i are sorted. So for any element arr[i], we find a spot j in the range [lo, i) where this element should fit. We then insert it there and shift the remaining elements in [j + 1, i) one spot to the right. Note this algorithm is quite inefficient on large data sets, but performs well on small inputs.

Our merge algorithm is:

def merge(arr, lo, mid, hi):
    t = arr[lo:mid]
    i, j, k = lo, mid, 0

    while k < mid - lo and j < hi:
        if t[k] <= arr[j]:  # on ties take from the first run, which keeps the sort stable
            arr[i] = t[k]
            k += 1
        else:
            arr[i] = arr[j]
            j += 1
        i += 1

    if k < mid - lo:
        arr[i:hi] = t[k:mid - lo]

We are merging the consecutive (sorted) ranges [lo, mid) and [mid, hi). One way to do this (which our implementation uses), is to copy [lo, mid) to a temporary buffer t. We then traverse the [mid, hi) range with j and the buffer with k. We pick the smallest of t[k] and arr[j] to insert at arr[i] (incrementing the corresponding index), then we increment i. At some point, either j or k reaches the end. If j makes it to the end first, it means we still have some elements in t we need to copy over. If k makes it to the end first, we don't need to do anything: the remaining elements in [j, hi) are where they are supposed to be.

We now have a full implementation of a very simple Timsort. If we run it on the [5, 2, 3, 4, 9, 1, 6, 8, 10, 7] input, the following steps take place:

We pick up the first run, [5, 2, 3, 4] and sort it using insertionSort. This becomes [2, 3, 4, 5]. We push its start index and length on the stack ((0, 4)).
We next take [9, 1, 6, 8], sort it to [1, 6, 8, 9], and push (4, 4) on the stack.
Finally, we only have [10, 7]. We sort this short run to [7, 10] and push (8, 2) on the stack.

Note all our sorting happens in-place, so by now the whole input became [2, 3, 4, 5, 1, 6, 8, 9, 7, 10]. We then proceed to merge runs from the top of the stack:

First, we merge [1, 6, 8, 9] with [7, 10], which yields [1, 6, 7, 8, 9, 10]. We pop the two runs from the stack and push (4, 6), the index and length of this new run.
Next, we merge [2, 3, 4, 5] with [1, 6, 7, 8, 9, 10], and update the stack accordingly. At this point, we only have 1 run on the stack ([0, 10)). We are done.
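
The steps above can be replayed with a short trace. This is a sketch rather than the implementation from this post: traceSimpleSort is a hypothetical helper, and it uses Python's built-in sorted on slices as a stand-in for both insertionSort and merge, so only the run bookkeeping is modeled.

```python
MIN_MERGE = 4

def traceSimpleSort(arr):
    stack, states = [], []
    lo, nRemaining = 0, len(arr)

    # Phase 1: cut the input into sorted runs of at most MIN_MERGE elements.
    while nRemaining > 0:
        runLen = min(nRemaining, MIN_MERGE)
        arr[lo:lo + runLen] = sorted(arr[lo:lo + runLen])  # stand-in for insertionSort
        stack.append((lo, runLen))
        lo += runLen
        nRemaining -= runLen
    states.append((list(stack), list(arr)))

    # Phase 2: merge the top two runs until a single run remains.
    while len(stack) > 1:
        base2, len2 = stack.pop()
        base1, len1 = stack.pop()
        arr[base1:base2 + len2] = sorted(arr[base1:base2 + len2])  # stand-in for merge
        stack.append((base1, len1 + len2))
        states.append((list(stack), list(arr)))
    return states

states = traceSimpleSort([5, 2, 3, 4, 9, 1, 6, 8, 10, 7])
for runStack, snapshot in states:
    print(runStack, snapshot)
```

Each printed line shows the run stack as (start index, length) pairs next to the array contents at that point: first after all runs are pushed, then after each merge.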

Some Optimizations

So far, we haven't relied that much on the fact that our input might be naturally partially sorted. Instead of simply calling insertionSort on minRun elements, we can actually check whether elements are already ordered. If they are, we don't need to do anything with them. Even better, if the run of elements is longer than minRun, we keep going.

Elements might also come naturally sorted in descending order, while we are sorting in ascending order. No problem: we can take a range of elements coming in descending order and reverse it to produce a run in ascending order. Let's call this function countRunAndMakeAscending:

def countRunAndMakeAscending(arr, lo, hi):
    runHi = lo + 1
    if runHi == hi:
        return 1

    if arr[lo] > arr[runHi]: # Descending run
        while runHi < hi and arr[runHi] < arr[runHi - 1]:
            runHi += 1
        reverseRange(arr, lo, runHi)
    else: # Ascending run
        while runHi < hi and arr[runHi] >= arr[runHi - 1]:
            runHi += 1

    return runHi - lo

We return the length of the run starting from lo, going to at most hi - 1. If we have a natural descending run, we reverse the range before returning. Here is reverseRange:

def reverseRange(arr, lo, hi):
    hi -= 1
    while lo < hi:
        arr[lo], arr[hi] = arr[hi], arr[lo]
        lo += 1
        hi -= 1
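
One detail worth noting in countRunAndMakeAscending: the descending check uses a strict < while the ascending check uses >=. Here is a sketch of why, assuming we sort (key, label) records by key only. Reversing a range that contains equal keys would swap their relative order and break stability. The records and their labels are made up for this illustration.

```python
# Keys are non-increasing but not strictly decreasing: two records share key 2.
records = [(3, "a"), (2, "b"), (2, "c"), (1, "d")]

# Reversing the whole range puts "c" ahead of "b": equal keys swap order.
unstable = records[::-1]

# A stable sort by key keeps "b" ahead of "c".
stable = sorted(records, key=lambda record: record[0])

print(unstable)
print(stable)
```

With the strict comparison, the descending run in this input stops after (3, "a") and (2, "b"), so the two records with key 2 are never reversed relative to each other.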

We can't get rid of sorting though: in the worst case we might get very small runs, and we still need a run of at least minRun size. If the result of countRunAndMakeAscending is smaller than minRun, we will force a few more elements into the run and sort it. Our new implementation looks like this:

def sort(arr): 
    lo, hi = 0, len(arr) 
    stack = []
    nRemaining = hi
    minRun = MIN_MERGE

    while nRemaining > 0:
        runLen = countRunAndMakeAscending(arr, lo, hi)

        if runLen < minRun:
            force = min(nRemaining, minRun)
            insertionSort(arr, lo, lo + force)
            runLen = force

        stack.append((lo, runLen))

        lo += runLen
        nRemaining -= runLen

    while len(stack) > 1:
        base2, len2 = stack.pop()
        base1, len1 = stack.pop()
        merge(arr, base1, base2, base2 + len2)
        stack.append((base1, len1 + len2))

Highlighting the changed part:

runLen = countRunAndMakeAscending(arr, lo, hi)

if runLen < minRun:
    force = min(nRemaining, minRun)
    insertionSort(arr, lo, lo + force)
    runLen = force

Instead of simply taking the next minRun elements, we try to find a run. If the run we find is smaller than minRun, we force it to be minRun by insertion-sorting more elements into it. If, on the other hand, it is larger than or equal to minRun, we don't have to do any sorting.

It gets better: now we know after calling countRunAndMakeAscending that the range [lo, lo + runLen) is already sorted. We can hint this to our sorting function and have it start sorting only from lo + runLen. We can update insertionSort to take a hint of where to start from:

def insertionSort(arr, lo, hi, start): 
    if start == lo:
        start += 1

    while start < hi:
        elem = arr[start]
        j = start - 1

        # Check the bound first so we never compare against arr[lo - 1].
        while j >= lo and elem < arr[j]:
            j -= 1

        arr.pop(start)
        arr.insert(j + 1, elem)

        start += 1

This version is very similar to our previous one. Instead of using a local i variable to iterate over the range [lo + 1, hi), we just use start. If start is lo, we increment it before the loop (just like we used to initialize i to lo + 1).

We can now pass this hint in from our main function:

while nRemaining > 0:
    runLen = countRunAndMakeAscending(arr, lo, hi)

    if runLen < minRun:
        force = min(nRemaining, minRun)
        insertionSort(arr, lo, lo + force, lo + runLen)
        runLen = force

At this point, we're starting to get a lot of value from naturally sorted runs: we either don't do any sorting, or just sort at most minRun - runLen elements into the range.

A further optimization for sorting: we can replace insertion sort with binary sort. Binary sort works much like insertion sort, but instead of checking where element i fits into [lo, i) by comparing it with i - 1, then i - 2 and so on, it relies on the fact that [lo, i) is already sorted and performs a binary search to find the right spot. Here is an implementation, which also takes a start hint:

def binarySort(arr, lo, hi, start):
    if start == lo:
        start += 1

    while start < hi:
        pivot = arr[start]
        left, right = lo, start

        while left < right:
            mid = (left + right) // 2

            if pivot < arr[mid]:
                right = mid
            else:
                left = mid + 1

        arr.pop(start)
        arr.insert(left, pivot)

        start += 1
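
For comparison, Python's standard library exposes the same kind of binary search through the bisect module. The sketch below is an alternative formulation rather than the code above: binarySortWithBisect is a hypothetical name, and it leans on bisect.bisect_right, which returns the insertion point after any equal elements, matching the left = mid + 1 branch in binarySort.

```python
import bisect

def binarySortWithBisect(arr, lo, hi, start):
    # arr[lo:start] is assumed to be sorted already.
    if start == lo:
        start += 1

    while start < hi:
        pivot = arr[start]
        # Search only the sorted prefix [lo, start) for the insertion point.
        pos = bisect.bisect_right(arr, pivot, lo, start)
        arr.pop(start)
        arr.insert(pos, pivot)
        start += 1

data = [9, 2, 3, 4, 1, 7, 5, 8]
# Sort only the range [1, 6); data[1:4] == [2, 3, 4] is already a run.
binarySortWithBisect(data, 1, 6, 4)
print(data)
```

Elements outside the [lo, hi) range (here the leading 9 and the trailing 5 and 8) are left untouched.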

Our main function now looks like this:

def sort(arr): 
    lo, hi = 0, len(arr) 
    stack = []
    nRemaining = hi
    minRun = MIN_MERGE

    while nRemaining > 0:
        runLen = countRunAndMakeAscending(arr, lo, hi)

        if runLen < minRun:
            force = min(nRemaining, minRun)
            binarySort(arr, lo, lo + force, lo + runLen)
            runLen = force

        stack.append((lo, runLen))

        lo += runLen
        nRemaining -= runLen

    while len(stack) > 1:
        base2, len2 = stack.pop()
        base1, len1 = stack.pop()
        merge(arr, base1, base2, base2 + len2)
        stack.append((base1, len1 + len2))

Balanced Merges

Another key optimization of Timsort is trying as much as possible to merge runs of balanced sizes. The closer the sizes, the better the average performance, in terms of both the additional space required and the number of operations.

So far we just pushed everything onto a stack, then merged the top 2 elements of the stack until we ended up with a single run. We actually want to do something a bit different: we want our stack to maintain a couple of invariants:

stack[i - 1][1] > stack[i][1] + stack[i + 1][1] - the length of a run needs to be larger than the sum of the lengths of the following two runs.
stack[i][1] > stack[i + 1][1] - the length of a run needs to be larger than the length of the following run.
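
A small checker makes the two invariants concrete. This is a sketch: invariantsHold is a hypothetical helper that looks only at run lengths, listed from the bottom of the stack to the top.

```python
def invariantsHold(lengths):
    for i in range(len(lengths) - 1):
        # Each run must be longer than the run that follows it.
        if lengths[i] <= lengths[i + 1]:
            return False
        # Each run must be longer than the next two runs combined.
        if i + 2 < len(lengths) and lengths[i] <= lengths[i + 1] + lengths[i + 2]:
            return False
    return True

print(invariantsHold([128, 64, 32]))  # 128 > 64 + 32 and 64 > 32
print(invariantsHold([64, 32, 32]))   # 64 is not greater than 32 + 32
```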

When pushing a new index and run length tuple onto the stack, we check if the invariant still holds. If it doesn't, we merge stack[i] with the smallest of stack[i - 1], stack[i + 1] and recheck. We continue merging until the invariants are re-established. Let's call this function mergeCollapse:

def mergeCollapse(arr, stack):
    while len(stack) > 1:
        n = len(stack) - 2
        if (n > 0 and stack[n - 1][1] <= stack[n][1] + stack[n + 1][1]) or \
           (n > 1 and stack[n - 2][1] <= stack[n][1] + stack[n - 1][1]):
            if stack[n - 1][1] < stack[n + 1][1]:
                n -= 1
        elif n < 0 or stack[n][1] > stack[n + 1][1]:
            break

        mergeAt(arr, stack, n)

We start from the top of the stack - 2. If n > 0 and the invariant doesn't hold for stack[n - 1], stack[n], and stack[n + 1], or if n > 1 and the invariant doesn't hold for stack[n - 2], stack[n - 1], and stack[n], we need to merge. We decide whether we want to merge stack[n] with stack[n + 1] or stack[n - 1] with stack[n] depending on which one is smallest (if stack[n - 1] is smaller, then we decrement n to trigger the merge at n - 1).

If the invariant holds, we check for the other invariant: stack[n][1] > stack[n + 1][1]. If this second invariant holds, we're done and we can break out of the loop (we do the same if we ran out of elements). If not, we trigger a merge by calling mergeAt and repeat until we either merge everything or the invariant is reestablished.

We start by checking only the top few elements of the stack, since we expect the rest of the stack to hold the invariants. We only call this function when we push a new run on the stack, in which case we need to ensure we merge as needed.

Let's take a look at mergeAt. This function simply merges the runs at positions i and i + 1 on the stack:

def mergeAt(arr, stack, i):
    assert i == len(stack) - 2 or i == len(stack) - 3

    base1, len1 = stack[i]
    base2, len2 = stack[i + 1]

    stack[i] = (base1, len1 + len2)

    if i == len(stack) - 3:
        stack[i + 1] = stack[i + 2]
    stack.pop()

    merge(arr, base1, base2, base2 + len2)

Remember we only ever merge either the second from top and top runs or the third from top and second from top runs. So i should be either len(stack) - 2 or len(stack) - 3. We get the first element and run length for the two runs and update the stack: stack[i] starts at the same position but will now have the length of both unmerged runs. If we are merging stack[-3] with stack[-2], we need to copy stack[-1] (top of the stack) to stack[-2] (second to top). Finally, we pop the top of the stack. At this point, the stack is updated. We call merge on the two runs to update arr too.

We can now maintain a healthy balance for merges. Remember, the whole reason for this is to aim to always merge runs similar in size.

Of course, once we are done pushing everything on the stack, we still need to force merging to finish our sort. We'll do this with mergeForceCollapse:

def mergeForceCollapse(arr, stack):
    while len(stack) > 1:
        n = len(stack) - 2
        if n > 0 and stack[n - 1][1] < stack[n + 1][1]:
            n -= 1

        mergeAt(arr, stack, n)

This function again merges the second from the top run with the smallest of third from the top or top. It continues until all runs are merged into one. Our updated sort looks like this:

def sort(arr): 
    lo, hi = 0, len(arr) 
    stack = []
    nRemaining = hi
    minRun = MIN_MERGE

    while nRemaining > 0:
        runLen = countRunAndMakeAscending(arr, lo, hi)

        if runLen < minRun:
            force = min(nRemaining, minRun)
            binarySort(arr, lo, lo + force, lo + runLen)
            runLen = force

        stack.append((lo, runLen))
        mergeCollapse(arr, stack)

        lo += runLen
        nRemaining -= runLen

    mergeForceCollapse(arr, stack)

Instead of pushing everything onto the stack and merging everything at the end, we now call mergeCollapse after each push to keep the runs balanced. At the end, we call mergeForceCollapse to force-merge the stack.

Run Lengths

We used a constant minimum run length so far, but mentioned earlier that it is in fact determined as a function of the size of the input. We will determine this with minRunLength:

def minRunLength(n): 
    r = 0
    while n >= MIN_MERGE: 
        r |= n & 1
        n >>= 1
    return n + r

This function takes the length of the input and does the following:

If n is smaller than MIN_MERGE, returns n - the input size is too small to use complicated optimizations on.
If n is a power of 2, the algorithm will return MIN_MERGE / 2. Note: MIN_MERGE is also a power of 2. In our initial sketch we set it to 4, but in practice this is usually 32 or 64.
Otherwise return a number k between MIN_MERGE / 2 and MIN_MERGE so that n / k is close to but strictly less than a power of 2.

It does this by shifting n one bit to the right until it is less than MIN_MERGE. If any of the bits shifted out is 1, n is not a power of 2. In that case, we set r to 1 and return the shifted n plus 1.
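
A few concrete values help here. The sketch below repeats the function with MIN_MERGE passed in as a parameter (minMerge is a name introduced only for this example) and prints the input size, the resulting minimum run length, and how many runs of that length the input would split into.

```python
def minRunLength(n, minMerge):
    r = 0  # becomes 1 if any bit shifted out is 1
    while n >= minMerge:
        r |= n & 1
        n >>= 1
    return n + r

for n in (48, 2048, 2112, 1000):
    minRun = minRunLength(n, 64)
    runs = -(-n // minRun)  # ceiling division: number of runs of that length
    print(n, minRun, runs)
```

For 2112 elements this yields a minimum run length of 33, which divides the input into exactly 64 runs, and for 1000 elements it yields 63, which gives 16 runs with the last one shorter than the rest.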

The reason we do all of this work is to again strive to keep merges balanced. If we get an input like 2048 and our MIN_MERGE is 64, we get back 32. That means that, if we don't have any great runs in our input, we end up with 64 runs, each of length 32. We saw in the previous section how we balance the stack. Consider we're pushing these runs onto the stack:

We push the run (0, 32) on the stack (first 32 elements).
We push the run (32, 32) on the stack (next 32 elements).
This triggers a merge since the run (0, 32) is not greater than the run (32, 32). The stack becomes (0, 64).
We push the run (64, 32) on the stack (next 32 elements).
We push the run (96, 32) on the stack (next 32 elements).
This again triggers a merge, since the length of the run (0, 64) (64) is not greater than the length of the next two runs, both of which are 32. The run (64, 32) gets merged with the smaller run, (96, 32). The stack becomes [(0, 64), (64, 64)].
The second invariant no longer holds: the first run is not longer than the next one. Another merge is triggered and the stack becomes [(0, 128)].

This goes on in the same fashion, and all merges end up being perfectly balanced. This works great for powers of 2.

Now let's consider another case: what if the input size is 2112? If we still used 32 as our minimum run length, we would get 66 runs of length 32. The first 64 will trigger perfectly balanced merges as before, but then we end up with the stack [(0, 2048), (2048, 32), (2080, 32)]. This collapses to [(0, 2048), (2048, 64)], triggering a completely unbalanced merge (2048 on one side and 64 on the other).
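
To see the difference, we can replay the stack logic on run lengths alone. This is a sketch under simplifying assumptions: simulateMerges is a hypothetical helper, no array is actually sorted, and every run is assumed to come out at exactly the minimum length. It records the sizes of the two sides of each merge, following the same decisions as mergeCollapse and mergeForceCollapse.

```python
def simulateMerges(runLengths):
    """Replay the run stack on lengths only; return (left, right) sizes of each merge."""
    stack, merges = [], []

    def mergeAt(n):
        merges.append((stack[n], stack[n + 1]))
        stack[n:n + 2] = [stack[n] + stack[n + 1]]

    for length in runLengths:
        stack.append(length)
        # Same decisions as mergeCollapse, applied to lengths.
        while len(stack) > 1:
            n = len(stack) - 2
            if (n > 0 and stack[n - 1] <= stack[n] + stack[n + 1]) or \
               (n > 1 and stack[n - 2] <= stack[n] + stack[n - 1]):
                if stack[n - 1] < stack[n + 1]:
                    n -= 1
            elif stack[n] > stack[n + 1]:
                break
            mergeAt(n)

    # Same decisions as mergeForceCollapse.
    while len(stack) > 1:
        n = len(stack) - 2
        if n > 0 and stack[n - 1] < stack[n + 1]:
            n -= 1
        mergeAt(n)
    return merges

unbalanced = simulateMerges([32] * 66)  # 2112 elements, minimum run length 32
balanced = simulateMerges([33] * 64)    # 2112 elements, minimum run length 33

print(unbalanced[-1])
print(all(left == right for left, right in balanced))
```

With 66 runs of 32, the last merge pairs 2048 elements with 64. With 64 runs of 33, every merge has two sides of equal size.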

To keep things balanced, if our input size is not a power of 2, we pick a minimum run length such that the number of runs (the input size divided by the minimum run length) is a power of 2 or, failing that, close to but strictly less than one. Let's update our MIN_MERGE to be 32, and update our sort to call minRunLength instead of automatically setting minRun to MIN_MERGE. We'll throw in another quick optimization: if the whole input is smaller than MIN_MERGE, don't even bother with the whole thing: find a starting run then binary sort the rest, without any merging.

MIN_MERGE = 32

def sort(arr):
    lo, hi = 0, len(arr)
    stack = []
    nRemaining = hi
    if nRemaining < 2:
        return  # nothing to sort; also avoids indexing into an empty input
    if nRemaining < MIN_MERGE:
        initRunLen = countRunAndMakeAscending(arr, lo, hi)
        binarySort(arr, lo, hi, lo + initRunLen)
        return
    minRun = minRunLength(len(arr))
    while nRemaining > 0:
        runLen = countRunAndMakeAscending(arr, lo, hi)
        if runLen < minRun:
            force = min(nRemaining, minRun)
            binarySort(arr, lo, lo + force, lo + runLen)
            runLen = force
        stack.append((lo, runLen))
        mergeCollapse(arr, stack)
        lo += runLen
        nRemaining -= runLen
    mergeForceCollapse(arr, stack)

Optimized Merging

We can optimize merging further. Our initial implementation of merge simply copied the first run into a buffer, then performed the merge. We can do better than that.

What if the second run is smaller? Maybe we'd prefer always merging the smaller run into the larger one. Let's look at an optimized version of merge. First, we'll replace merge with two functions, mergeLo and mergeHi. mergeLo will copy elements from the first run into the temporary buffer, while mergeHi will copy elements from the second run. Our original merge becomes mergeLo, and we can add a mergeHi:

def mergeHi(arr, lo, mid, hi):
    t = arr[mid:hi]
    i, j, k = hi - 1, mid - 1, hi - mid - 1
    while k >= 0 and j >= lo:
        if t[k] >= arr[j]:  # on ties take from the second run first (we fill from the end), which keeps the sort stable
            arr[i] = t[k]
            k -= 1
        else:
            arr[i] = arr[j]
            j -= 1
        i -= 1

    if k >= 0:
        arr[lo:i + 1] = t[0:k + 1]

This is very similar to merge, except it copies the second (mid to hi) run into a temporary buffer and traverses the runs and the buffer from end to start.

When we trigger the merge, another optimization we can do is check elements from the first run and see if they are smaller than the first element in the second run. While they are smaller, we can simply ignore them when merging - they are already in position. We do this by taking the first element of the second run and seeing where it would fit in the first run.

Similarly, elements from the end of the second run which are greater than the last element in the first run are already in place. We don't need to touch them. We take the last element of the first run and check where it would fit in the second run.

We can use binary search for this. Note that we need two versions in order to maintain the stable property of the sort: a searchLeft, which returns the first index where a new element should be inserted, and a searchRight, which returns the last index. For example, if we have a run like [1, 2, 5, 5, 5, 5, 7, 8] and we are looking for where to insert another 5, it really depends where it comes from. If it comes from the run before this one, we need the left-most spot (before the first 5 in the run). On the other hand, if it comes from the run after this one, we need to place it after the last 5. That ensures that the relative order of elements is preserved. Here is an implementation for searchLeft and searchRight:

def searchLeft(key, arr, base, len):
    left, right = base, base + len
    while left < right:
        mid = left + (right - left) // 2
        if key > arr[mid]:
            left = mid + 1
        else:
            right = mid

    return left - base

def searchRight(key, arr, base, len):
    left, right = base, base + len

    while left < right:
        mid = left + (right - left) // 2
        if key < arr[mid]:
            right = mid
        else:
            left = mid + 1

    return left - base

Both functions return the offset from base where key should be inserted.

We can now update our mergeAt function with the new capabilities:

def mergeAt(arr, stack, i):
    base1, len1 = stack[i]
    base2, len2 = stack[i + 1]

    stack[i] = (base1, len1 + len2)
    if i == len(stack) - 3:
        stack[i + 1] = stack[i + 2]
    stack.pop()

    k = searchRight(arr[base2], arr, base1, len1)
    base1 += k
    len1 -= k
    if len1 == 0:
        return

    len2 = searchLeft(arr[base1 + len1 - 1], arr, base2, len2)
    if len2 == 0:
        return

    if len1 <= len2:  # mergeLo buffers the first run, so use it when the first run is the smaller one
        mergeLo(arr, base1, base2, base2 + len2)
    else:
        mergeHi(arr, base1, base2, base2 + len2)

The first part stays the same: we get base1, len1, base2, and len2 and update the stack. Next, instead of merging right away, we first search for where the first element of the second run would go into the first run. We know the elements in [base1, base1 + k) won't move, so we can remove them from the merge by moving base1 to the right k elements (we also need to update len1). Similarly, we search for where the last element of the first run (arr[base1 + len1 - 1]) would fit into the second run. We know all elements beyond that are already in place, so we update len2 to be this offset.

In case either of the searches exhausts a run, we simply return. Otherwise, depending on which run is longer, we call mergeLo or mergeHi.
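
Here is a small worked example of the trimming step. It is a sketch that uses bisect.bisect_right and bisect.bisect_left from Python's standard library as stand-ins for searchRight and searchLeft; the two runs are made up.

```python
import bisect

run1 = [1, 2, 3, 7, 8]
run2 = [4, 5, 6, 9, 10]

# Elements of run1 that are <= run2[0] are already in their final position.
k = bisect.bisect_right(run1, run2[0])

# Elements of run2 that are >= run1[-1] are already in their final position.
len2 = bisect.bisect_left(run2, run1[-1])

print(run1[k:], run2[:len2])
```

Only [7, 8] and [4, 5, 6] take part in the actual merge, so 5 of the 10 elements are never touched.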

Galloping

But wait, there's more! Binary search always performs about log2(len + 1) comparisons, where len is the length of the array we are searching in, regardless of where our element belongs. Galloping attempts to find the spot faster.

Galloping starts by comparing the element we are searching for in array A with A[0], A[1], A[3], A[7], ..., A[2^i - 1]. With these comparisons, we end up finding a range between some A[2^(k - 1) - 1] and A[2^k - 1] that contains the spot for the element we are searching for. We then run a binary search only within that interval.
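
As a simplified illustration, here is a sketch of galloping from the start of a sorted list, without the hint parameter used by the implementation in this post. gallopSearchLeft is a hypothetical name, and bisect.bisect_left from the standard library handles the final binary search.

```python
import bisect

def gallopSearchLeft(key, a):
    """Leftmost insertion index of key in the sorted list a."""
    n = len(a)
    prev = -1   # last probed index known to hold a value < key
    probe = 0   # probe indices are 0, 1, 3, 7, ... (one less than a power of 2)

    while probe < n and a[probe] < key:
        prev = probe
        probe = 2 * probe + 1

    # The answer lies in (prev, probe]; finish with a binary search there.
    return bisect.bisect_left(a, key, prev + 1, min(probe, n))

evens = list(range(0, 200, 2))
print(gallopSearchLeft(7, evens))
```

For a key that belongs near the front, only a handful of probes are needed before the binary search runs over a very small window, which is where galloping pays off.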

There are some tradeoffs here: on large datasets or purely random data, binary search performs better. But on inputs which contain natural runs, galloping tends to find things faster. Galloping also performs better when we expect to find the interval early on. Let's look at an implementation of gallopLeft as an alternative to searchLeft:

def gallopLeft(key, arr, base, len, hint):
    lastOfs, ofs = 0, 1

    if key > arr[base + hint]:
        maxOfs = len - hint
        while ofs < maxOfs and key > arr[base + hint + ofs]:
            lastOfs = ofs
            ofs = (ofs << 1) + 1

        if ofs > maxOfs:
            ofs = maxOfs

        lastOfs += hint
        ofs += hint
    else: # key <= arr[base + hint]
        maxOfs = hint + 1
        while ofs < maxOfs and key <= arr[base + hint - ofs]:
            lastOfs = ofs
            ofs = (ofs << 1) + 1

        if ofs > maxOfs:
            ofs = maxOfs

        lastOfs, ofs = hint - ofs, hint - lastOfs

    # arr[base + lastOfs] < key <= arr[base + ofs]
    lastOfs += 1
    while lastOfs < ofs:
        mid = lastOfs + (ofs - lastOfs) // 2
        if key > arr[base + mid]:
            lastOfs = mid + 1
        else:
            ofs = mid
    return ofs

We start by initializing 2 offsets: lastOfs and ofs to represent the offsets between which we expect to find our key. Note the function also takes a hint, so callers can provide a tentative starting place.

Let's go over the parts of this function:

if key > arr[base + hint]:
    maxOfs = len - hint
    while ofs < maxOfs and key > arr[base + hint + ofs]:
        lastOfs = ofs
        ofs = (ofs << 1) + 1

    if ofs > maxOfs:
        ofs = maxOfs

    lastOfs += hint
    ofs += hint

We first find the two offsets. If the key we are searching for is greater than (right of) our starting element (arr[base + hint]), then our maximum possible offset is len - hint. While ofs hasn't reached maxOfs and the key is still larger than arr[base + hint + ofs], we keep updating ofs to be the next power of 2 minus 1. We keep track of the previous offset in lastOfs. Once we're done, we add hint to both offsets (we do that because we add hint to all indices in our loop, but not to ofs since we keep it a power of 2 minus 1). If key > arr[base + hint] is not true, in other words, our key is left of our starting element:

else: # key <= arr[base + hint]
    maxOfs = hint + 1
    while ofs < maxOfs and key <= arr[base + hint - ofs]:
        lastOfs = ofs
        ofs = (ofs << 1) + 1

    if ofs > maxOfs:
        ofs = maxOfs

    lastOfs, ofs = hint - ofs, hint - lastOfs

In this case, our maximum possible offset is hint + 1. We gallop again, but now we are looking at elements left of our starting point, arr[base + hint - ofs] where ofs keeps increasing. Once we find the range, we update our offsets: lastOfs becomes hint - ofs and ofs becomes hint - lastOfs. The hint - part is again because that is what we actually used as indices. The swap is because we were moving left, and we need lastOfs to be the one on the left, ofs the one on the right.

We have now identified the range within which we'll find our key, between arr[base + lastOfs] and arr[base + ofs]. The last part of the function is just a binary search within this interval.

The gallopRight function is very similar to gallopLeft:

def gallopRight(key, arr, base, len, hint):
    ofs, lastOfs = 1, 0

    if key < arr[base + hint]:
        maxOfs = hint + 1
        while ofs < maxOfs and key < arr[base + hint - ofs]:
            lastOfs = ofs
            ofs = (ofs << 1) + 1

        if ofs > maxOfs:
            ofs = maxOfs
        lastOfs, ofs = hint - ofs, hint - lastOfs
    else:
        maxOfs = len - hint
        while ofs < maxOfs and key >= arr[base + hint + ofs]:
            lastOfs = ofs
            ofs = (ofs << 1) + 1

        if ofs > maxOfs:
            ofs = maxOfs

        lastOfs += hint
        ofs += hint

    lastOfs += 1
    while lastOfs < ofs:
        mid = lastOfs + ((ofs - lastOfs) // 2)
        if key < arr[base + mid]:
            ofs = mid
        else:
            lastOfs = mid + 1
    return ofs

We won't cover this in detail: the difference here is that, like with searchRight, we want to find the rightmost index where key belongs instead of the leftmost one, so the algorithm changes accordingly.
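The leftmost versus rightmost distinction only matters when the key has equal elements in the run. As an illustration, Python's standard bisect module draws the same distinction (it uses plain binary search rather than galloping, so it only shows which positions are meant, not how the functions above find them):

```python
from bisect import bisect_left, bisect_right

run = [10, 20, 20, 20, 30]

# Leftmost position where 20 could be inserted: before all equal elements.
leftmost = bisect_left(run, 20)

# Rightmost position where 20 could be inserted: after all equal elements.
rightmost = bisect_right(run, 20)

print(leftmost, rightmost)  # prints "1 4"
```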

The very neat thing about galloping is that its use isn't limited to when we set up the merge. We can also gallop while merging. Let's go over the mergeLo example, since mergeHi is a mirror of this.

In mergeLo, we first copy all elements from the first run to a buffer, then we iterate over the array and at each position we copy either an element from the buffer or one from the second run, depending on which one is smaller. While we do this, we can keep track of how many times the buffer or the second run won. If one of these wins consistently, we can assume it will keep winning for a while longer.

For example, if we merge [5, 6, 7, 8, 9] with [0, 1, 2, 3, 4], we initialize the buffer with [5, 6, 7, 8, 9], but for the next 5 comparisons, the second run wins (0 < 5, 1 < 5 ...). Now imagine much longer runs. Instead of comparing all elements one by one, we switch to a galloping mode:

We find the last spot where the next element of the second run would fit into the buffer, and immediately copy the preceding elements of the buffer into the array. For example, if our buffer is [12, 13, 14, 15, 17] and the element we are considering from the second run is [16], we know we can copy [12, 13, 14, 15] into the array. Similarly, we find the first spot the next element in the buffer would fit into the remaining second run, and copy elements before that from the second run to their position. The galloping mode aims to reduce the number of comparisons and bulk copy data when possible (using a memcpy equivalent where available). While galloping, we still keep track of how many elements we were able to skip comparing individually. If this falls below the galloping threshold, we switch back to regular mode. Here is an updated mergeLo implementation:
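The buffer example from the paragraph above can be sketched in a few lines. Here bisect_right from the standard library stands in for gallopRight (an assumption made to keep the snippet self-contained), and the variable names are only illustrative:

```python
from bisect import bisect_right

buffer = [12, 13, 14, 15, 17]  # remaining elements of the first run
next_from_second_run = 16

# Every buffer element before this position is <= 16, so the whole
# prefix can be copied to the output with a single slice.
count = bisect_right(buffer, next_from_second_run)

merged = buffer[:count] + [next_from_second_run]
remaining = buffer[count:]

print(count, merged, remaining)  # prints "4 [12, 13, 14, 15, 16] [17]"
```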

MIN_GALLOP = 7
minGallop = MIN_GALLOP

def mergeLo(arr, lo, mid, hi):
    t = arr[lo:mid]
    i, j, k = lo, mid, 0
    global minGallop
    done = False

    while not done:
        count1, count2 = 0, 0
        while (count1 | count2) < minGallop:
            if t[k] < arr[j]:
                arr[i] = t[k]
                count1 += 1
                count2 = 0
                k += 1
            else:
                arr[i] = arr[j]
                count1 = 0
                count2 += 1
                j += 1
            i += 1

            if k == mid - lo or j == hi:
                done = True
                break

        if done:
            break

        while count1 >= MIN_GALLOP or count2 >= MIN_GALLOP:
            count1 = gallopRight(arr[j], t, k, mid - lo - k, 0)
            if count1 != 0:
                arr[i:i + count1] = t[k:k + count1]
                i += count1
                k += count1
                if k == mid - lo:
                    done = True
                    break

            arr[i] = arr[j]
            i += 1
            j += 1
            if j == hi:
                done = True
                break

            count2 = gallopLeft(t[k], arr, j, hi - j, 0)
            if count2 != 0:
                arr[i:i + count2] = arr[j:j + count2]
                i += count2
                j += count2
                if j == hi:
                    done = True
                    break

            arr[i] = t[k]
            i += 1
            k += 1
            if k == mid - lo:
                done = True
                break

            minGallop -= 1

        if minGallop < 0:
            minGallop = 0
        minGallop += 2

    if k < mid - lo:
        arr[i:hi] = t[k:mid - lo]

We introduced a new MIN_GALLOP constant, which is the threshold after which we want to start galloping. We also maintain a minGallop variable across merges.

We have a couple of nested while loops, but the idea is pretty straightforward. The first nested while does the normal merge but now keeps track of how many times in a row we picked an element from the buffer (count1) or from the second run (count2):

count1, count2 = 0, 0
while (count1 | count2) < minGallop:
    if t[k] < arr[j]:
        arr[i] = t[k]
        count1 += 1
        count2 = 0
        k += 1
    else:
        arr[i] = arr[j]
        count1 = 0
        count2 += 1
        j += 1
    i += 1

    if k == mid - lo or j == hi:
        done = True
        break

if done:
    break

Whenever we increment one counter, we set the other to 0, so at any point, at most one of them is different than 0. We can exit the while loop in two ways: either one of the counters reaches the gallop threshold, or we run out of elements in one of the arrays.

If we ran out of elements we are done, so we break out of the outer loop. Otherwise we are in gallop mode:

while count1 >= MIN_GALLOP or count2 >= MIN_GALLOP:
    count1 = gallopRight(arr[j], t, k, mid - lo - k, 0)
    if count1 != 0:
        arr[i:i + count1] = t[k:k + count1]
        i += count1
        k += count1
        if k == mid - lo:
            done = True
            break

    arr[i] = arr[j]
    i += 1
    j += 1
    if j == hi:
        done = True
        break

    count2 = gallopLeft(t[k], arr, j, hi - j, 0)
    if count2 != 0:
        arr[i:i + count2] = arr[j:j + count2]
        i += count2
        j += count2
        if j == hi:
            done = True
            break

    arr[i] = t[k]
    i += 1
    k += 1
    if k == mid - lo:
        done = True
        break

    minGallop -= 1

if minGallop < 0:
    minGallop = 0
minGallop += 2

We first try to find where the next element in the second run would fit into the buffer. That becomes our count1. If we get an offset greater than 0, we can bulk copy the previous elements from the buffer ([k, k + count1)) to the range [i, i + count1) and increment both k and i by count1. Once we're done, we know for sure we need to copy the next element from the second run (arr[j]), so we do that.

We then do the opposite: gallop left to find where the next element from the buffer would fit into the second run. That becomes our count2 and if it is greater than 0, we bulk copy elements from the second run. Once we're done, we again know that the next element to copy is at t[k], so we do that.

This loop repeats while either count1 or count2 is at least MIN_GALLOP. While galloping keeps working, we also update minGallop to favor future galloping: each time we iterate, we decrement minGallop. Once we're out of the loop, if it is due to both count1 and count2 being smaller than MIN_GALLOP, we adjust minGallop again: first, if it became negative, we make it 0; then we add 2 to penalize galloping because our last iteration didn't meet MIN_GALLOP. As a reminder, minGallop is used as the threshold in the first loop. These tweaks to minGallop aim to optimize, depending on the data, when to enter gallop mode and when to keep merging in normal mode.

The minGallop state should be maintained across multiple merges, and only reset when we start a new sort - so we would set minGallop = MIN_GALLOP in our main sort function, but otherwise rely on the same value we are updating in minGallop for subsequent calls of mergeLo and mergeHi. We made minGallop a global to keep the code (relatively) simpler. To avoid globals, we should either put all functions in a class and have minGallop be a member, or pass it as an argument to every function that needs it.
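As a rough sketch of the class-based alternative, the threshold and its two adjustments could live in a small object created once per sort. GallopState, reward and penalize are hypothetical names used only for illustration; they do not appear in the code above:

```python
MIN_GALLOP = 7


class GallopState:
    """Holds the adaptive galloping threshold for a single sort."""

    def __init__(self):
        # Reset at the start of every sort.
        self.min_gallop = MIN_GALLOP

    def reward(self):
        # Mirrors `minGallop -= 1` at the end of a galloping iteration.
        self.min_gallop -= 1

    def penalize(self):
        # Mirrors the clamp-then-add-2 adjustment when leaving gallop mode.
        if self.min_gallop < 0:
            self.min_gallop = 0
        self.min_gallop += 2
```

Under this design mergeLo and mergeHi would take the state object as a parameter (or be methods on the same class) instead of declaring a global.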

Finally, we copy the remaining elements in the buffer, if any:

if k < mid - lo:
    arr[i:hi] = t[k:mid - lo]

We also have the mirrored mergeHi version:

def mergeHi(arr, lo, mid, hi):
    t = arr[mid:hi]
    i, j, k = hi - 1, mid - 1, hi - mid - 1
    global minGallop
    done = False

    while not done:
        count1, count2 = 0, 0
        while (count1 | count2) < minGallop:
            if t[k] > arr[j]:
                arr[i] = t[k]
                count1 += 1
                count2 = 0
                k -= 1
            else:
                arr[i] = arr[j]
                count1 = 0
                count2 += 1
                j -= 1
            i -= 1

            if k == -1 or j == lo - 1:
                done = True
                break

        if done:
            break

        while count1 >= MIN_GALLOP or count2 >= MIN_GALLOP:
            count1 = j - lo + 1 - gallopRight(t[k], arr, lo, j - lo + 1, j - lo)
            if count1 != 0:
                arr[i - count1 + 1:i + 1] = arr[j - count1 + 1:j + 1]
                i -= count1
                j -= count1
                if j == lo - 1:
                    done = True
                    break

            arr[i] = t[k]
            i -= 1
            k -= 1

            if k == -1:
                done = True
                break

            count2 = k + 1 - gallopLeft(arr[j], t, 0, k + 1, k)
            if count2 != 0:
                arr[i - count2 + 1:i + 1] = t[k - count2 + 1:k + 1]
                i -= count2
                k -= count2
                if k == -1:
                    done = True
                    break

            arr[i] = arr[j]
            i -= 1
            j -= 1
            if j == lo - 1:
                done = True
                break

            minGallop -= 1

        if minGallop < 0:
            minGallop = 0
        minGallop += 2

    if k >= 0:
        arr[lo:i + 1] = t[0:k + 1]

This is very similar to the previous one, so I won't break it into pieces. Just note that, since we start from the end of the range and go backwards, we use closed ranges: i, j, and k always point to the last element of the range, not the one past the last.

Summary

This is a very efficient sorting algorithm which relies on observed properties of datasets in the real world. Quick recap:

Depending on the size of the input, we determine a good size for runs, so we can get balanced merges.
We traverse the array and identify runs. If the run is descending, we reverse it. If we don't get enough elements in a run to hopefully get balanced merges, we extend the run by adding more elements and sorting them using binary sort.
We push runs on a stack which maintains a couple of invariants to, again, keep merges balanced: the second to top run of the stack must be longer than the top run and the third to top run must be longer than the sum of the second and top runs.
If an invariant is violated, we start merging until we reestablish it. We merge the second from the top run with the shorter of the third from top and the top run (again aiming for balanced overall merging). Merges always merge consecutive runs.
Merge is optimized such that we first identify elements at the beginning of the first run and the end of the second run which are already in place, and we skip them.
Next, depending on which of the runs is larger, we merge either from left or from right.
Merge happens in two modes: we compare and merge normally, until we see one of the two runs we're merging consistently gets picked. Once we pass a certain threshold, we switch to galloping mode.
Galloping aims to provide better performance than binary search on smaller datasets, where we expect to find the position we're searching for earlier rather than later in the search. Galloping tries to find a k such that the position we are looking for is between A[2^(k - 1) - 1] and A[2^k - 1], then performs a binary search in that interval.
Merging in galloping mode tries to find a range of elements in the run that tends to win. This range can be bulk-copied into the merged portion of the array more efficiently, skipping extra comparisons.
If galloping becomes less effective, merge switches back to normal mode.
Another heuristic keeps track of how well galloping mode performs and either encourages or discourages entering galloping mode again. This is persisted across multiple merges in a single sort.
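The two stack invariants from the recap are easy to state in code. This is a minimal sketch that checks only the top three runs, exactly as described above; invariants_hold is a hypothetical helper and the stack is assumed to be a plain list of run lengths with the top at the end:

```python
def invariants_hold(run_lengths):
    """Check the run-stack invariants; run_lengths[-1] is the top of the stack."""
    n = len(run_lengths)
    # The second-to-top run must be longer than the top run.
    if n >= 2 and run_lengths[-2] <= run_lengths[-1]:
        return False
    # The third-to-top run must be longer than the sum of the two above it.
    if n >= 3 and run_lengths[-3] <= run_lengths[-2] + run_lengths[-1]:
        return False
    return True
```

For example, a stack of runs with lengths [100, 60, 30] satisfies both invariants, while [80, 60, 30] does not (80 is not longer than 60 + 30), so merging would start.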

Thoughts

Is this sorting algorithm beautiful? Maybe not from a purely syntactical/readability perspective. Compare it with the recursive quicksort implementation in Haskell:

quicksort :: (Ord a) => [a] -> [a]  
quicksort [] = []  
quicksort (x:xs) =   
    let smallerSorted = quicksort [a | a <- xs, a <= x]  
        biggerSorted = quicksort [a | a <- xs, a > x]  
    in  smallerSorted ++ [x] ++ biggerSorted

Timsort is not a succinct algorithm. There are special cases, optimizations for left to right and right to left cases, galloping, which tries to beat binary search in some situations, multi-mode merges and so on.

That said, everything in it has one purpose: sort real world data efficiently. I find it beautiful for the amount of research that went into it, the major insight that real world data is usually partially sorted, and for how it adapts to various patterns in the data to improve efficiency.

Most real world software looks more like Timsort than the Haskell quicksort above. And while there is, unfortunately, way too much accidental complexity in the world of software, there is a limit to how much we can simplify before we can no longer model reality, or operate efficiently. And, ultimately, that is what matters.

References

The final version of the code in this blog post is in this GitHub gist (be advised: implementation might be buggy).

Tim Peters has a very detailed explanation of the algorithm and all optimizations in the Python codebase as listsort.txt. I do recommend reading this as it talks about all the research and benchmarks that went into developing Timsort.

The C implementation of Timsort in the Python codebase is listobject.c.

The Python implementation relies on a lot of Python runtime constructs, so it might be harder to read. My implementation is derived from the OpenJDK implementation which I found very readable. That one is here on GitHub.

Mental Poker

Sat, 11 Dec 2021 00:00:00 -0800

Mental Poker

For the past year or so, I've been on the Fluid Framework team. I won't go deeply into the details of the framework, rather I'll quote a few paragraphs from the Overview page:

What is Fluid Framework?

Fluid Framework is a collection of client libraries for distributing and synchronizing shared state. These libraries allow multiple clients to simultaneously create and operate on shared data structures using coding patterns similar to those used to work with local data.

Why Fluid?

Because building low-latency, collaborative experiences is hard!

Fluid Framework offers:

Client-centric application model with data persistence requiring no custom server code.

Distributed data structures with familiar programming patterns.

Very low latency.

Applications built with Fluid Framework require zero custom code on the server to enable sophisticated data sync scenarios such as real-time typing across text editors. Client developers can focus on customer experiences while letting Fluid do the work of keeping data in sync.

How Fluid works

Fluid was designed to deliver collaborative experiences with blazing performance. To achieve this goal, the team kept the server logic as simple and lightweight as possible. This approach helped ensure virtually instant syncing across clients with very low server costs.

To keep the server simple, each Fluid client is responsible for its own state. While previous systems keep a source of truth on the server, the Fluid service is responsible for taking in data operations, sequencing the operations, and returning the sequenced operations to the clients. Each client is able to use that sequence to independently and accurately produce the current state regardless of the order it receives operations.

The following is a typical flow.

Client code changes data locally.

Fluid runtime sends that change to the Fluid service.

Fluid service sequences that operation and broadcasts it to all clients.

Fluid runtime incorporates that operation into local data and raises a valueChanged event.

Client code handles that event (updates view, runs business logic).

When using Fluid Framework, you model your data using a set of distributed data structures which can internally merge changes from multiple clients.

During various hackathons, the team built various applications using this data model. Of course, one of the first applications of any new technology is games. This got me thinking about how we could model a game on top of the framework.

There are some interesting constraints: games like chess or go don't have any hidden information, but most games do require some hidden information. Card games are especially interesting: each player holds some cards that only they can see, some cards are face up on the table (everyone can see them), while the rest of the deck is face down on the table (nobody sees what order the cards are in).

With Fluid Framework, data is replicated across all clients. Assuming we're playing a game of high stakes poker, we can't trust any other client not to cheat. So a naïve solution of sending the whole game state (cards each player holds in their hand) to all clients and trusting clients not to peek won't work. We should assume that even if the game code only shows a client their own cards, the client can cheat and use a debugger to see what other players are holding in their hands.

We can trust the server, but there is very little the server can do for us - while it can tell us which client changed state (distributed data structure changes sequenced by the server include client ID), the server itself cannot maintain private state. So, for example, we can't tell the server to shuffle a deck of cards without telling us what order the cards end up in - all shared state is replicated across all clients.

In this zero-trust environment, where we assume other clients can cheat and all shared state can be accessed by all clients, can we model a card game? Surprisingly, the answer is yes.

Mental Poker

Turns out this exact problem has been studied for quite some time, starting with the original 1981 paper by Ron Rivest, Adi Shamir, and Leonard Adleman (inventors of the RSA algorithm among other things).

Once there were two "mental chess" experts who had become tired of their pastime. "Let's play 'mental poker,' for variety" suggested one. "Sure" said the other. "Just let me deal!"

Mental poker requires a commutative encryption function. If we encrypt $A$ using $Key_1$ and then encrypt the result using $Key_2$, we should be able to decrypt the result back to $A$ regardless of the order of decryption (first with $Key_1$ and then with $Key_2$, or vice-versa).
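One well-known function with this property is modular exponentiation over a shared prime: encrypting is raising to a secret exponent, and exponents commute. The sketch below is a toy illustration under that assumption. The prime is far too small for real use, and make_key, encrypt and decrypt are hypothetical helper names:

```python
import math
import secrets

P = 2**61 - 1  # a Mersenne prime, shared by everyone; toy-sized


def make_key():
    """Pick a secret exponent that is invertible modulo P - 1."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def encrypt(m, key):
    return pow(m, key, P)


def decrypt(c, key):
    # The inverse exponent undoes the encryption (Fermat's little theorem).
    return pow(c, pow(key, -1, P - 1), P)


key_1, key_2 = make_key(), make_key()
a = 42
c = encrypt(encrypt(a, key_1), key_2)

# Decryption works in either order.
assert decrypt(decrypt(c, key_1), key_2) == a
assert decrypt(decrypt(c, key_2), key_1) == a
```

The three-argument pow with a negative exponent requires Python 3.8 or later.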

Here is how Alice and Bob play a game of mental poker:

Alice takes a deck of cards (an array), shuffles the deck, generates a secret key $K_A$, and encrypts each card with $K_A$.
Alice hands the shuffled and encrypted deck to Bob. At this point, Bob doesn't know what order the cards are in (since Alice encrypted the cards in the shuffled deck).
Bob takes the deck, shuffles it, generates a secret key $K_B$, and encrypts each card with $K_B$.
Bob hands the deck to Alice. At this point, neither Alice nor Bob knows what order the cards are in. Alice got the deck back reshuffled and re-encrypted by Bob, so she no longer knows where each card ended up. Bob reshuffled an encrypted deck, so he also doesn't know where each card is.
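The four steps above can be sketched as follows, again assuming modular exponentiation over a toy prime as the commutative encryption. Cards are encoded as the integers 2 to 53 (an arbitrary choice that avoids 0 and 1, which exponentiation leaves unchanged), and shuffle_and_encrypt is a hypothetical helper:

```python
import math
import random
import secrets

P = 2**61 - 1  # toy prime, assumed to be agreed on by both players


def make_key():
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def shuffle_and_encrypt(deck, key):
    """Encrypt every card with key, then shuffle the encrypted deck."""
    deck = [pow(card, key, P) for card in deck]
    random.SystemRandom().shuffle(deck)
    return deck


deck = list(range(2, 54))  # 52 cards

k_a = make_key()                                   # Alice's secret key
after_alice = shuffle_and_encrypt(deck, k_a)       # handed to Bob

k_b = make_key()                                   # Bob's secret key
after_bob = shuffle_and_encrypt(after_alice, k_b)  # handed back to Alice
```

After the second pass every value in after_bob is a card encrypted under both keys, in an order that neither player chose alone.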

At this point the cards are shuffled. In order to play, Alice and Bob also need the capability to look at individual cards. In order to enable this, the following steps must happen:

Alice decrypts the shuffled deck with her secret key $K_A$. At this point she still doesn't know where each card is, as cards are still encrypted with $K_B$.
Alice generates a new set of secret keys, one for each card in the deck. Assuming a 52-card deck, she generates $K_{A_1} ... K_{A_{52}}$ and encrypts each card in the deck with one of the keys.
Alice hands the deck of cards to Bob. At this point, each card is encrypted by Bob's key, $K_B$, and one of Alice's keys, $K_{A_i}$.
Bob decrypts the cards using his key $K_B$. He still doesn't know where each card is, as now the cards are encrypted with Alice's keys.
Bob generates another set of secret keys, $K_{B_1} ... K_{B_{52}}$, and encrypts each card in the deck.
Now each card in the deck is encrypted with a unique key that only Alice knows and a unique key only Bob knows.

If Alice wants to look at a card, she asks Bob for his key for that card. For example, if Alice draws the first card, encrypted with $K_{A_1}$ and $K_{B_1}$, she asks Bob for $K_{B_1}$. If Bob sends her $K_{B_1}$, she now has both keys to decrypt the card and look at it. Bob still can't decrypt it because he doesn't have $K_{A_1}$.

This way, as long as both Alice and Bob agree that one of them is supposed to see a card, they exchange keys as needed to enable this.
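Here is a sketch of the per-card key stage and of Alice looking at one card. It makes the same assumptions as before (modular exponentiation over a toy prime, hypothetical helper names), omits the shuffling for brevity, and uses 0-based list indices where the text numbers cards from 1:

```python
import math
import secrets

P = 2**61 - 1  # toy prime shared by both players


def make_key():
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k


def enc(m, k):
    return pow(m, k, P)


def dec(c, k):
    return pow(c, pow(k, -1, P - 1), P)


cards = list(range(2, 54))
k_a, k_b = make_key(), make_key()
deck = [enc(enc(c, k_a), k_b) for c in cards]  # state after the shuffle stage

# Alice removes K_A and adds one fresh key per card.
alice_keys = [make_key() for _ in deck]
deck = [enc(dec(c, k_a), k) for c, k in zip(deck, alice_keys)]

# Bob removes K_B and adds one fresh key per card.
bob_keys = [make_key() for _ in deck]
deck = [enc(dec(c, k_b), k) for c, k in zip(deck, bob_keys)]

# Alice draws the first card: Bob publishes his key for that card only.
published = bob_keys[0]
alice_sees = dec(dec(deck[0], published), alice_keys[0])
```

Bob cannot do the same for this card, because he never learns alice_keys[0], and publishing one key reveals nothing about the other 51 cards.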

At the end of the game, players reveal all keys to validate that no cheating happened.

This approach can be extended to any number of players, each player maintaining their own set of secret keys.

Modeling a Game

We can model a game using two data structures: one to keep track of the cards, one to keep track of the moves in the game.

We can model a deck of cards using a distributed data structure that holds the set of cards. Each client generates secret keys and initially keeps them private (not part of the shared state). The deck of cards can be shuffled and encrypted as described above, with each client updating the shared set of cards.

We can model the gameplay using an append-only list of moves. For example, if Alice draws the first card, the move can be modeled as DRAW 1. If Bob agrees Alice should see the card, Bob can publish his secret key $K_{B_1}$ as a PUBLISH move. Alice can now use her $K_{A_1}$ and the published $K_{B_1}$ to decrypt the first card of the deck (stored in the other data structure). DRAW, PUBLISH, and other actions are part of the game semantics, which can be implemented and interpreted by clients.

Note the deck of cards stays in place during the game. Drawing a card means simply that all clients agree Alice should get the keys to the card at index 1 and that the next card to be drawn is at index 2. Discarding a card simply means Bob said he discards the card at index 5. Depending on whether discarding is face up or face down, Bob can publish $K_{B_5}$ or keep it private until the end of the game. All these actions are part of the game move list, and clients can construct the game state based on these, without having to mutate the deck itself.

In terms of trust, we can say that, at any point, if a client can prove the game is invalid (another client misbehaved), the game is cancelled. If a player acts out of turn, or performs an action that they shouldn't, the game is invalid. At the end of the game, the append-only list should contain the full record of moves. With all keys available, clients can replay and validate no cheating happened (for example Bob claiming a card decrypted to an Ace, when in fact the card was a 2). Clients can keep a local copy of the list of moves, and confirm no other client rewrote history by tweaking the content of the list.

Establishing turn order can also be modeled through the append-only action list: each player can start by adding a SIT AT TABLE action. The framework will sequence these actions in some order, which will become the turn order. For example, if both Alice and Bob concurrently SIT AT TABLE, the action list will contain both actions in some order. Alice and Bob will take turns in that order.
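To make the move list concrete, here is a small sketch of how clients might represent it and derive state from it. The Move shape, the payload layout (card index followed by key) and the helper names are all assumptions for illustration; a plain Python list stands in for the distributed append-only data structure:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Move:
    player: str
    action: str          # "SIT AT TABLE", "DRAW", "PUBLISH", ...
    payload: tuple = ()  # assumed layout: (card_index,) or (card_index, key)


def turn_order(log):
    """Players take turns in the order their SIT AT TABLE moves were sequenced."""
    return [m.player for m in log if m.action == "SIT AT TABLE"]


def published_keys(log, card_index):
    """Map each player to the key they published for the given card."""
    return {m.player: m.payload[1] for m in log
            if m.action == "PUBLISH" and m.payload[0] == card_index}


log = [
    Move("Alice", "SIT AT TABLE"),
    Move("Bob", "SIT AT TABLE"),
    Move("Alice", "DRAW", (1,)),
    Move("Bob", "PUBLISH", (1, 12345)),  # 12345 stands in for Bob's key for card 1
]

print(turn_order(log))         # prints "['Alice', 'Bob']"
print(published_keys(log, 1))  # prints "{'Bob': 12345}"
```

Because every client sees the same sequenced list, each one can run functions like these locally and arrive at the same game state.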

Game semantics can be implemented as actions clients interpret. This is outside the scope of this article.

Resources

As I mentioned, this problem has been studied for many decades. A Toolbox for Mental Card Games by Christian Schindelhauer describes many other techniques for playing cards in a zero-trust environment.

There is also an open-source C++ library implementing the toolbox: LibTMCG.

The https://secret.cards website seems to implement a card game using mental poker techniques.

Wikipedia also has a good page on mental poker.

Notes on Software Lifecycle

Sat, 27 Nov 2021 00:00:00 -0800

Notes on Software Lifecycle

I spent a lot of time lately looking at how our team can improve our product's reliability and capability to respond to production incidents. This got me thinking about the lifecycle of a contemporary software project. The model I've been using for this looks like the following:

At a high level, the cycle starts with engineers writing code. The code gets merged and at some cadence, a new build is prepared for release. This usually includes looking at a combination of testing and telemetry signals for engineers to sign off on deploying the build. In case tests fail or telemetry shows some anomalies, the deployment is abandoned. If all looks good, the build gets deployed. Once the build is exposed to a larger audience, more telemetry signals come in.

Most software these days uses some form of controlled exposure. For example, services might be deployed first in a dev environment, then in a pre-production environment, then to production in one region, then to all regions. Client software is similarly deployed to different rings, for example new Office builds get deployed first to the Office organization, then to all of Microsoft, then to customers who opted into the Insider program, then to the whole world (I'm very much oversimplifying things here, as release management for Office is way more complex, but you get the idea). Telemetry signals from a ring feed back into the build promotion process to give confidence that a build can be exposed to a larger audience in the next ring.

Of course, sometimes things go wrong. We identify issues in the product, either from telemetry signals or, worse, from user reports. These become live site incidents. On-call engineers react to these and try to mitigate as fast as possible. After the fire is put out, a good practice is to run a postmortem to understand how the issue happened and see how it can be prevented in the future. The learnings usually translate into repair items, which get added to the engineering backlog.

We can split this lifecycle into two parts: a proactive part and a reactive part, which roughly map to the top and bottom halves of the diagram.

Proactive

The proactive part deals with what we can do to prevent issues from making it to production.

There are several things that could allow issues to slip through the cracks.

Code

On the coding part, a feature might be missing tests to uncover regressions, it might not be instrumented well enough to get good signals, or it might not be put under a feature gate. Feature gates are service-controlled flags that can be turned on/off to disable a feature. These are extremely valuable for quickly mitigating production issues.
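As a minimal sketch of what a feature gate looks like in code, the snippet below checks a flag before taking the new code path. The flag name, the is_enabled helper and the in-memory dictionary are hypothetical; in practice the flag values would be fetched from a service rather than passed in directly:

```python
def is_enabled(gate_name, flags, default=False):
    """Look up a gate; unknown gates fall back to the default (off)."""
    return flags.get(gate_name, default)


def render_toolbar(flags):
    # The new code path only runs while its gate is on.
    if is_enabled("new_toolbar", flags):
        return "new toolbar"
    return "classic toolbar"


print(render_toolbar({"new_toolbar": True}))   # prints "new toolbar"
print(render_toolbar({"new_toolbar": False}))  # prints "classic toolbar"
```

Defaulting unknown gates to off means a missing or unreachable flag falls back to the old behavior.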

All of the above are addressed through education and engineering culture: more junior engineers on the team might not even be aware of all the requirements a feature should satisfy before it is ready (see my Shipping a Feature post).

A good practice is to have a feature checklist, a list of things engineers need to consider before submitting a pull request. This includes things like test coverage, telemetry, feature gates, performance, accessibility (for UI) etc.

Everyone writing code should know where this checklist is, and code reviewers should keep it in mind while evaluating changes.

Signoff for Build Promotion

Two main issues would allow a regression to get past the build validation process: either there is a gap in validation, or signals are missed. This, of course, assumes that the code has tests and is properly instrumented in the coding stage. Here, the person or persons validating a build either miss running some validation (automatic or manual tests) or miss looking at a telemetry signal that would tell them something is wrong.

Both of these issues can be addressed with automation.

Have a go/no-go dashboard that aggregates all relevant signals (like test run results, telemetry metrics).

Of course, putting together such a dashboard and ensuring all code has the right test automation and instrumentation is not easy.

Telemetry Signals

Telemetry could have gaps: issues could manifest themselves without us receiving a signal. If this happens, we need to learn from these incidents, understand where the gaps are, and eliminate them. More about this on the reactive part.

Reactive

The reactive part deals with how we can mitigate issues as quickly as possible if they make it to production.

Incidents

The entry point into the reactive cycle is an incident. An incident alerts the on-call engineer and starts the mitigation process. The sooner an incident is created, the sooner it can be addressed.

Issues here come from alerting. An alerting system runs automated queries over incoming telemetry signals and looks for some anomalies or thresholds. Things can go wrong in multiple ways:

We can collect a lot of telemetry but not have the right queries to notice sudden spikes, or drops, or other anomalies in the telemetry stream.
We could be overly cautious and generate too many alerts, most of them false positives, which makes it hard for on-call to figure out when an alert is real.
Alerts might be very generic and not contain enough information for on-call to easily mitigate.

Alerts should be continuously fine-tuned to be accurate and actionable, with as few false positives as possible.

Telemetry signals, even if correct, can be impacted by multiple things outside of our control. For example, usage might rise or drop sharply during weekends (depending on whether we're talking about a game or a productivity app) or holidays. This makes it even harder to develop accurate alerts.
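One common way to account for weekly patterns is to compare a signal against a baseline built from the same day of the week. The sketch below shows that idea only; the function name, the 50% tolerance and the sample numbers are made up for illustration and are not taken from any particular alerting system:

```python
from statistics import mean


def is_anomalous(current, same_weekday_history, tolerance=0.5):
    """Flag a value that deviates from the same-weekday average by more
    than the given fraction of that average."""
    baseline = mean(same_weekday_history)
    return abs(current - baseline) > tolerance * baseline


# Session counts from the previous three Saturdays (invented numbers).
previous_saturdays = [1000, 1100, 900]

print(is_anomalous(950, previous_saturdays))  # prints "False"
print(is_anomalous(400, previous_saturdays))  # prints "True"
```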

The worst case is when issues get reported by customers before we see any alerts: this signifies a big gap and a postmortem should identify the follow up work (see postmortems below).

Mitigation

Several things can make mitigation harder. The on-call engineer might not know how to handle certain types of incidents.

It's a good idea to have a troubleshooting guide (TSG) for each type of alert, where an area expert details the steps to mitigate an issue.

Another common issue is there is no easy mitigation. This goes back to our coding section: code should be behind feature gates, so mitigation is as easy as flipping a switch.

Yet another common issue, which we covered in the previous section when we discussed alerts, is not having enough information to easily pinpoint the actual issue. The on-call engineer sees an incident and knows something is wrong, but not enough information is available for a quick mitigation. Alerts should contain enough information to be useful.

Postmortems

Postmortems are an invaluable tool for learning from incidents. Postmortems are reviews of incidents once mitigated, root caused, and understood, where the team gets together to discuss what happened and take steps to prevent the same type of issue from happening in the future. A postmortem is not about blaming, it is about answering the following question:

What can we do to ensure this doesn't happen in the future?

A postmortem that doesn't answer this question is not that useful. A good postmortem identifies one or more work items that can be handed to engineering to implement additional guardrails so the same issue doesn't recur.

Repair Items

Finally, identifying repair items is not enough. A long backlog of repair items that nobody gets around to implementing won't make things any easier.

Engineers should treat repair items with priority.

Repair items are some of the most critical work items: we've seen incidents in production, we know the scope of the impact, and we know the work needed to prevent them in the future.

Summary

In this post we looked at a model of software lifecycle, consisting of a proactive part: engineers writing code, a signoff process to promote a build, and signals to increase the audience for a build; and a reactive part: a live site incident, which on-call engineers mitigate, the team postmortems, and comes up with a set of repair items.

We also looked at some of the common issues across these various parts of the lifecycle, and some best practices.

This was a very high-level overview - each of these steps has a lot of depth, from writing safe code, to release management, to telemetry models, site reliability engineering, and so on. All of these are critical parts of shipping software.

Machine Learning on Azure - Part 3

Fri, 24 Sep 2021 00:00:00 -0700

Machine Learning on Azure - Part 3

This is an excerpt from chapter 7 of my book, Data Engineering on Azure, which deals with machine learning workloads. This is part 3 in a 3 part series. In this post, we'll run the model we created in part 1 on the Azure Machine Learning (AML) infrastructure we set up in part 2.

Running ML in the cloud

We use the Python Azure Machine Learning SDK for this, so the first step is to install it using the Python package manager (pip). First, make sure pip is up-to-date. (If there is a newer pip version, you should see a message printed to the console suggesting you upgrade when you run a pip command.) You can update pip by running python -m pip install --upgrade pip as an administrator. Once pip is up-to-date, install the Azure Machine Learning SDK with the command in the following listing:

pip install azureml-sdk

Let's now write a Python script to publish our original ML model to the cloud, with all the required configuration. We'll call this pipeline.py.

from azureml.core import Workspace, Datastore, Dataset, Model
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core.compute import AmlCompute
from azureml.core.conda_dependencies import CondaDependencies 
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps.python_script_step import PythonScriptStep
import os  

tenant_id = ''
subscription_id = ''
service_principal_id = ''
resource_group  = 'aml-rg'
workspace_name  = 'aml'

## Auth 
auth = ServicePrincipalAuthentication(
    tenant_id, 
    service_principal_id, 
    os.environ.get('SP_PASSWORD'))  

## Workspace 
workspace = Workspace( 
    subscription_id = subscription_id,
    resource_group = resource_group,
    workspace_name = workspace_name,
    auth=auth)

## Datastore 
datastore = Datastore.get(workspace, 'MLData')

## Compute target 
compute_target = AmlCompute(workspace, 'd1compute')

## Input 
model_input = Dataset.File.from_files( 
    [(datastore, '/models/highspenders/input.csv')]).as_mount()

## Python package configuration  
conda_deps = CondaDependencies.create(
    pip_packages=['pandas', 'scikit-learn', 'azureml-core', 'azureml-dataprep'])

run_config = RunConfiguration(conda_dependencies=conda_deps)

## Train step 
trainStep = PythonScriptStep( 
    script_name='highspenders.py',
    arguments=['--input', model_input],
    inputs=[model_input],
    runconfig=run_config,
    compute_target=compute_target)  

## Submit pipeline 
pipeline = Pipeline(workspace=workspace, steps=[trainStep])

published_pipeline = pipeline.publish(
    name='HighSpenders',
    description='High spenders model',
    continue_on_step_failure=False)

open('highspenders.id', 'w').write(published_pipeline.id)

We'll break down this script and discuss each part. First, we have the required imports and the additional parameters we need.

from azureml.core import Workspace, Datastore, Dataset, Model
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core.compute import AmlCompute
from azureml.core.conda_dependencies import CondaDependencies 
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps.python_script_step import PythonScriptStep
import os  

tenant_id = ''
subscription_id = ''
service_principal_id = ''
resource_group  = 'aml-rg'
workspace_name  = 'aml'

We import a set of packages from the azureml-sdk. We need the tenant ID, subscription ID, and service principal ID we will use to connect to the Azure Machine Learning service. We created the service principal in part 2. We stored it in the $sp variable. In case you closed that PowerShell session and no longer have the $sp variable, you can simply rerun the scripts we covered in part 2 to create a new service principal and grant it the required permissions.

You can get the service principal ID from $sp.appId in PowerShell. Similarly, you can get the tenant ID from $sp.tenant. The subscription ID is the GUID of your Azure subscription.

Use these to initialize the tenant_id, subscription_id, and service_principal_id in the script above.

Next, we connect to the workspace using the service principal and get the data store (MLData) and compute target (d1compute) needed by our model. The following listing shows the steps.

## Auth 
auth = ServicePrincipalAuthentication(
    tenant_id, 
    service_principal_id, 
    os.environ.get('SP_PASSWORD'))  

## Workspace 
workspace = Workspace( 
    subscription_id = subscription_id,
    resource_group = resource_group,
    workspace_name = workspace_name,
    auth=auth)

## Datastore 
datastore = Datastore.get(workspace, 'MLData')

## Compute target 
compute_target = AmlCompute(workspace, 'd1compute')

Here we define a service principal authentication as auth and use the environment variable SP_PASSWORD to retrieve the service principal secret. We set this variable in part 2, after we created the principal.
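Note that os.environ.get returns None when the variable is not set, so a missing SP_PASSWORD would likely only surface later as an authentication failure. The following is a minimal sketch of failing fast instead. The read_secret helper and its environ parameter are hypothetical names introduced for this example; they are not part of the Azure Machine Learning SDK.

```python
import os


def read_secret(name="SP_PASSWORD", environ=None):
    """Return the secret stored in the given environment variable.

    Raises a RuntimeError with a readable message when the variable is
    missing or empty, instead of passing None on to the authentication call.
    """
    environ = os.environ if environ is None else environ
    secret = environ.get(name)
    if not secret:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "set it before running pipeline.py")
    return secret
```

In pipeline.py, the third argument to ServicePrincipalAuthentication could then be read_secret() rather than os.environ.get('SP_PASSWORD').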

We connect to the Azure Machine Learning workspace with the given subscription ID, resource group, name, and auth. We then retrieve the datastore (MLData) and compute target (d1compute) from the workspace.

We need these to set up our deployment: the data store is where we have our input, while the compute target is where the model trains. The following listing shows how we can specify the model input.

## Input 
model_input = Dataset.File.from_files( 
    [(datastore, '/models/highspenders/input.csv')]).as_mount()

The from_files() method takes a list of files. Each element of the list is a tuple consisting of a data store and a path. The as_mount() method ensures the file is mounted and made available to the compute that trains the model.

Azure Machine Learning datasets reference a data source location, along with a copy of its metadata. This allows models to seamlessly access data during training.

Next, we'll specify the Python packages required by our model, from which we can initialize a run configuration. If you remember from part 1, we used pandas and sklearn. We'll also need the azureml-core and azureml-dataprep packages required by the runtime. The next listing shows how to create the run configuration.

## Python package configuration  
conda_deps = CondaDependencies.create(
    pip_packages=['pandas', 'scikit-learn', 'azureml-core', 'azureml-dataprep'])

run_config = RunConfiguration(conda_dependencies=conda_deps)

Conda stands for Anaconda, a Python and R open source distribution of common data science packages. Anaconda simplifies package management and dependencies and is commonly used in data science projects because it provides a stable environment for this type of workload. Azure Machine Learning also uses it under the hood.

Next, let's create a step for training our model. In our case, this is a PythonScriptStep, a step that executes Python code. We'll provide the name of the script (from our previous section), the command-line arguments, the inputs, run configuration, and compute target. The following listing shows the details.

## Train step 
trainStep = PythonScriptStep( 
    script_name='highspenders.py',
    arguments=['--input', model_input],
    inputs=[model_input],
    runconfig=run_config,
    compute_target=compute_target)

We specify the script to upload/run with script_name. This is our highspenders.py model we created in part 1. We set the arguments we want passed to the script as arguments. Here, model_input resolves at runtime to the path where the data is mounted on the node running the script. We set the inputs, run configuration, and compute target to run on as inputs, runconfig, and compute_target.
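To make the argument passing concrete, here is a small sketch of what the script sees on the receiving end. It reuses the argparse setup from highspenders.py; the parse_args wrapper and the mount path are hypothetical, standing in for whatever path model_input resolves to on the compute node.

```python
import argparse


def parse_args(argv):
    # Same parser definition as in highspenders.py from part 1.
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', type=str, dest='model_input')
    return parser.parse_args(argv)


# Hypothetical mount path, used only to illustrate the mechanism.
args = parse_args(['--input', '/mnt/hypothetical/input.csv'])
print(args.model_input)
```

The script itself does not need to know anything about data stores or mounting: it receives a plain file path.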

We can chain multiple steps together, but we only need one in our case. One or more steps form a ML pipeline.

An Azure Machine Learning pipeline simplifies building ML workflows including data preparation, training, validation, scoring, and deployment.

Pipelines are an important concept in Azure Machine Learning. These capture all the information needed to run a ML workflow. The following listing shows how we can create and submit a pipeline to our workspace.

## Submit pipeline 
pipeline = Pipeline(workspace=workspace, steps=[trainStep])

published_pipeline = pipeline.publish(
    name='HighSpenders',
    description='High spenders model',
    continue_on_step_failure=False)

open('highspenders.id', 'w').write(published_pipeline.id)

We create a pipeline with a single step, trainStep, in our workspace, and then publish it. We'll save the GUID of the published pipeline into the highspenders.id file so we can refer to it later.
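If you later want to read the ID back from Python rather than from PowerShell, a small sketch using pathlib could look like the following. The two helper functions are hypothetical names for this example, and the all-zero GUID in the usage note is a placeholder, not a real pipeline ID.

```python
from pathlib import Path


def save_pipeline_id(pipeline_id, path='highspenders.id'):
    # write_text opens, writes, and closes the file in one call.
    Path(path).write_text(pipeline_id)


def load_pipeline_id(path='highspenders.id'):
    # strip() guards against a trailing newline added by an editor.
    return Path(path).read_text().strip()
```

For example, save_pipeline_id('00000000-0000-0000-0000-000000000000') followed by load_pipeline_id() returns the same string.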

This covers the whole pipeline.py script. Our pipeline automation is almost complete. But before calling this script to create the pipeline, let's make one small addition to our high spender model. While we could do all of the previous steps without touching our original model code, we add the final step to the model code itself. Remember that once the model is trained, we save it to disk as outputs/highspender.pkl.

For this step, we'll make one Azure Machine Learning-specific addition: taking the trained model and storing it in the workspace. Add the lines in the following listing to the highspenders.py model we created in part 1 (not to pipeline.py, which we just covered).

## Register model 
from azureml.core import Model 
from azureml.core.run import Run  

run = Run.get_context()
workspace = run.experiment.workspace
model = Model.register(
    workspace=workspace,
    model_name='highspender',
    model_path=model_path)

Note the call to Run.get_context() and how we use this to retrieve the workspace. In pipeline.py, we provided the subscription ID, resource group, and workspace name. That is how we can get a workspace from outside Azure Machine Learning. In this case, though, the code runs in Azure Machine Learning as part of our pipeline. This gives us additional context that we can use to retrieve the workspace at runtime. Every run of a pipeline in Azure Machine Learning is called an experiment.

Azure Machine Learning experiments represent one execution of a pipeline. When we rerun a pipeline, we have a new experiment.

We are all set! Let's run the pipeline.py script to publish our pipeline to the workspace. The following listing provides the command for this step.

python pipeline.py

The GUID matters! If we rerun the script, it registers another pipeline with the same name but a different GUID. Azure Machine Learning does not update pipelines in place. We can disable pipelines so they don't clutter the workspace, but we cannot update them. Let's kick off the pipeline using Azure CLI as the next listing shows.

$pipelineId = Get-Content -Path highspenders.id

az ml run submit-pipeline `
--pipeline-id $pipelineId `
--workspace-name aml `
--resource-group aml-rg

We read the pipeline ID from the highspenders.id file produced in the previous step into the $pipelineId variable. We then submit a new run.

Check the UI at https://ml.azure.com. You should see the pipeline under the Pipelines section, and the run we just kicked off under the Experiments section. Once the model is trained, you'll see the model output under the Models section.

Azure Machine Learning recap

After implementing a model in Python, we started with provisioning a workspace, which is the top-level container for all Azure Machine Learning-related artifacts. Next, we created a compute target, which specifies the type of compute our model runs on. We can define as many compute targets as needed; some models require more resources than others, some require GPUs, etc. Azure provides many types of VM images suited to all these workloads. A main advantage of using compute targets in Azure Machine Learning is that compute is provisioned on demand when we run a pipeline. Once the pipeline finishes, compute gets deprovisioned. This allows us to scale elastically and only pay for what we need.

We then attached a data store. Data stores are an abstraction over existing storage services, and they allow Azure Machine Learning to connect to those services and read the data. The main advantage of using data stores is that these abstract away access control, so our data scientists don't need to worry about authenticating against the storage service.

With the infrastructure in place, we proceeded to set up a pipeline for our model. A pipeline specifies all the requirements and steps our execution needs to take. There are many pipelines in Azure: Azure DevOps Pipelines are focused on DevOps, provisioning resources, and in general, providing automation around Git; Azure Data Factory pipelines are focused on ETL, data movement, and orchestration; Azure Machine Learning Pipelines are meant for ML workflows, where we set up the environment and then execute a set of steps to train, validate, and publish a model.

Our pipeline included a dataset (our input), a compute target, a set of Python package dependencies, a run configuration, and a step to run a Python script. We also enhanced our original model code to publish the model in AML. This takes the result of our training run and makes it available in the workspace. Then we published the pipeline to our Azure Machine Learning workspace and submitted a run, which in Azure Machine Learning is called an experiment.

Next steps

We will stop here with the series of articles. Grab the book to see how we can apply DevOps to our ML scenario. In the book, we go over putting both the model code and pipeline.py in Git, then deploy updates using Azure DevOps Pipelines. We also cover orchestrating ML runs with Azure Data Factory, which includes getting the input data ready, running an Azure Machine Learning experiment, and handling the output.

All of this and more in Data Engineering on Azure.

Machine Learning on Azure - Part 2

Fri, 17 Sep 2021 00:00:00 -0700

Machine Learning on Azure - Part 2

This is an excerpt from chapter 7 of my book, Data Engineering on Azure, which deals with machine learning workloads. This is part 2 in a 3 part series. In this post, we'll explore the Azure Machine Learning (AML) service and set it up in preparation for running our model in the cloud.

In this post, like throughout the book, we'll be using PowerShell Core and Azure CLI to interact with Azure. You can install PowerShell Core from here. Then from the PowerShell Core shell, you can install Azure CLI using the following command:

Invoke-WebRequest -Uri https://aka.ms/installazurecliwindows `
-OutFile .\AzureCLI.msi; Start-Process msiexec.exe `
-Wait -ArgumentList  '/I AzureCLI.msi /quiet'; rm .\AzureCLI.msi

Once Azure CLI is installed, you can use az commands as we'll see throughout this post. First, log into your Azure account:

az login

Introducing Azure Machine Learning

Azure Machine Learning is Microsoft's Azure offering for creating and managing ML solutions in the cloud. An instance of Azure Machine Learning is called a workspace.

A workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create.

In this post, we'll create and configure a workspace, then we'll look at everything needed for taking our high spender model we developed in part 1 from our local machine and running it on Azure.

Creating a workspace

We'll start by using Azure CLI to create a workspace. First, we install the azure-cli-ml extension, then we create a new resource group called aml-rg to host our ML workloads, and finally, we create a workspace in the new resource group. The following listing shows the steps:

az extension add -n azure-cli-ml

az group create `
--location "Central US" `
--name aml-rg  

az ml workspace create `
--workspace-name aml `
--location "Central US" `
--resource-group aml-rg

The first line adds the azure-cli-ml extension. The second line creates the aml-rg resource group in the Central US Azure region. The last command creates a new Azure Machine Learning workspace named aml in the resource group.

The same way Azure Data Explorer (ADX) has a web UI accessible at https://dataexplorer.azure.com/ and Azure Data Factory has a web UI accessible at https://adf.azure.com/, Azure Machine Learning also has a web UI that you can find at https://ml.azure.com/. We will stick to the Azure CLI and the Python SDK to provision resources, but I encourage you to try the web UI. As we create more artifacts in this section, you can use the web UI to see how these are represented there. If you visit the web UI, you will see a navigation bar on the right with three sections: Author, Assets, and Manage. The following figure shows the navigation bar.

The Author section contains Notebooks, Automated ML, and Designer. Here is a quick walkthrough: Notebooks enables users to store Jupyter notebooks and other files directly in the workspace; Automated ML is a codeless solution for implementing ML; and the Designer is a visual drag-and-drop editor for ML. We won't focus on these features because they facilitate model development, while we'll look at the DevOps aspects of ML using our existing Python model as an example. Of course, we could've built our model in Azure Machine Learning directly, but this way, we learn how we can onboard a model that wasn't created specifically to run on Azure Machine Learning.

We will, however, touch on most of the items in the Assets and Manage sections. Assets are some of the concepts Azure Machine Learning deals with, such as Experiments and Models. We'll cover these soon. The Manage section deals with the compute and storage resources for AML. Let's zoom in on these.

Creating an Azure Machine Learning compute target

One of the great features of Azure Machine Learning is that it can automatically scale compute resources to train models. Remember, compute in the cloud refers to CPU and RAM resources. A virtual machine (VM) in the cloud provides CPU and RAM, but it incurs costs as long as it runs. This is especially relevant for ML workloads, which might need a lot of resources during training and training might not happen continuously.

For example, maybe our high spender model needs to be trained every month to predict next month's marketing campaign targets. It would be wasteful to keep a VM running all the time if we only need it one day of the month. Of course, we could manually turn it on or off, but Azure Machine Learning gives us an even better option - compute targets.

A compute target specifies a compute resource on which we want to run our ML. This includes the maximum number of nodes and the VM size.

As a reminder, Azure has a set of defined VM sizes, each with different performance characteristics and associated costs.¹ A compute target specifies which VM type and how many instances we'll need, but it won't provision the resources until we run a model and request this target. Once the model run finishes, the resources are deprovisioned. This makes Azure Machine Learning compute elastic: resources are allocated when needed, then freed up automatically. We only pay for what we use, and the service takes care of all the underlying infrastructure.
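To put rough numbers on "we only pay for what we use", here is a back-of-the-envelope sketch comparing a VM left running for a 30-day month with one that runs for a single day of training. The hourly rate is hypothetical, chosen only to make the arithmetic easy to follow; it is not an actual Azure price.

```python
HOURS_PER_DAY = 24


def monthly_cost(hourly_rate, hours_used):
    """Cost of a VM billed only for the hours it runs."""
    return hourly_rate * hours_used


# Hypothetical rate in dollars per hour, not a real Azure price.
rate = 0.10

always_on = monthly_cost(rate, 30 * HOURS_PER_DAY)  # VM left running all month
on_demand = monthly_cost(rate, 1 * HOURS_PER_DAY)   # one day of training

print(f"always on: ${always_on:.2f}, on demand: ${on_demand:.2f}")
```

Whatever the actual rate, the ratio between the two is the same: 30 days of billing versus one.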

Let's specify a compute target for our example. We'll request, at most, one node, use the economical STANDARD_D1_V2 VM size (1 CPU, 3.5 GiB memory), and name it d1compute. The following listing shows the corresponding Azure CLI command:

az ml computetarget create amlcompute `
--max-nodes 1 `
--name "d1compute" `
--vm-size STANDARD_D1_V2 `
--workspace-name aml `
--resource-group aml-rg

This won't cost us anything until we actually run a ML workload. If you click through the UI to the Compute section and navigate to Compute Clusters, you should see the new definition. Other compute options in AML include:

Compute instances, which are VMs preimaged with common ML tools and libraries.
Inference clusters, where we can package and deploy models on Kubernetes and expose these as REST endpoints.
Attached compute, which enables us to target compute resources like Azure Databricks that are not managed by Azure Machine Learning.

Let's move on to storage now. We'll see how we can make our input available to Azure Machine Learning.

Setting up Azure Machine Learning storage

We'll start by uploading our input.csv file from the previous section to an Azure Data Lake Storage (ADLS) account. Let's first create the account and a filesystem named fs1. In the code samples below, make sure to replace <Your ADLS account> with an actual name. This name needs to be unique across Azure, since it becomes part of the URL used to address the storage account, so we can't hardcode the name in the example.

az group create `
--location "Central US" `
--name adls-rg

az storage account create `
--name <Your ADLS account> `
--resource-group adls-rg `
--enable-hierarchical-namespace true

az storage fs create `
--account-name <Your ADLS account> `
--name fs1

The first command creates a new resource group named adls-rg. The second command provisions an Azure Data Lake Storage Gen2 account in the resource group. The last command creates a filesystem named fs1 in the storage account.

Let's now upload the input.csv file created in part 1 to the filesystem. For this, we'll use the Azure CLI upload command to upload our input file under the models/highspenders/input.csv path. The next listing shows the commands.

az storage fs file upload `
--file-system fs1 `
--path "models/highspenders/input.csv" `
--source input.csv `
--account-name <Your ADLS account>

In practice, we would have various Azure Data Factory pipelines copying datasets to our storage layers. From there, we would need to make these datasets available to Azure Machine Learning. We'll do this by attaching a datastore.

An Azure Machine Learning datastore enables us to connect to an external storage account like Azure's Blob Storage, Data Lake, SQL, Databricks, etc., making it available to our ML models.

First, we need to provision a service principal that Azure Machine Learning can use to authenticate. We will create a new service principal in Azure Active Directory (AAD) and grant it Storage Blob Data Contributor rights on the data lake. This allows the service principal to read and write data in the data lake. The following listing shows the steps.

$sp = az ad sp create-for-rbac | ConvertFrom-Json

$acc = az storage account show `
--name <Your ADLS account> | ConvertFrom-Json

az role assignment create `
--role "Storage Blob Data Contributor" `
--assignee $sp.appId `
--scope $acc.id

The first command creates a principal stored in $sp for role-based access control (RBAC). The second command retrieves the details of the Azure Data Lake Storage account and stores it in $acc. The last command creates a new role assignment, granting read/write access on the storage account to the service principal we just created.

The service principal can now access data in the storage account. The next step is to attach the account to Azure Machine Learning, giving it the service principal ID and secret so it can use these to connect to the account. The following listing shows how to do this.

az ml datastore attach-adls-gen2 `
--account-name <Your ADLS account> `
--client-id $sp.appId `
--client-secret $sp.password `
--tenant-id $sp.tenant `
--file-system fs1 `
--name MLData `
--workspace-name aml `
--resource-group aml-rg

This attaches an Azure Data Lake Storage Gen2 datastore to Azure Machine Learning, using a service principal to authenticate. We need to supply the data lake account, the service principal ID, secret, and tenant, the filesystem we want to attach, and the name we want to give it in Azure Machine Learning (MLData in our case).

Now if you navigate to the Storage section in the UI, you should see the newly created MLData datastore. In fact, you should see a couple more datastores that are created by default and used within the workspace. In practice, we need to connect to external storage, and data stores are the way to do that.

Our workspace is now configured with both a compute target and an attached data store. Let's grant our service principal Contributor rights to the Azure Machine Learning workspace too, so we can use it for deployment. Note that in a production environment, we would have separate service principals for better security, so that if one of the principals gets compromised, it has access to fewer resources. We'll reuse our $sp service principal, though, to keep things brief. The following listing shows how to grant the rights.

$aml = az ml workspace show `
--workspace-name aml `
--resource-group aml-rg `
| ConvertFrom-Json  

az role assignment create `
--role "Contributor" `
--assignee $sp.appId `
--scope $aml.id

The first command gets the details of an Azure Machine Learning workspace and stores them in the $aml variable. The second command creates a role assignment, granting the Contributor role to the service principal $sp on the workspace.

We'll also store the service principal's password in an environment variable so that we can read it without having to embed it into the code. The following listing shows how to set an environment variable in a PowerShell session. This won't get persisted across sessions, so make a note of $sp.password.

$env:SP_PASSWORD = $sp.password

The name password is a bit misleading. This is an autogenerated client secret that was created when we ran az ad sp create-for-rbac (which stands for Azure Active Directory service principal create for role-based access control). We are all set. The next step is to publish our Python code and run it in the cloud. We will do that in part 3.

¹ For more on VM sizes and costs, see https://docs.microsoft.com/en-us/azure/virtual-machines/sizes.

Machine Learning on Azure - Part 1

Fri, 10 Sep 2021 00:00:00 -0700

Machine Learning on Azure - Part 1

This is an excerpt from chapter 7 of my book, Data Engineering on Azure, which deals with machine learning workloads. This will be a series of 3 posts. In this post, we'll create a simple ML model in Python. In the next post, we'll go over Azure Machine Learning. In the final post, we'll run this model in Azure Machine Learning. Let's start with the simple ML model.

Training a machine learning model

This model predicts whether a user is likely to be a high spender, based on the number of sessions and page views on our website. A session is a website visit in which the user views one or more pages. Let's assume that the amount of money a user spends on our products is correlated to the number of sessions and page views. We'll consider a user a high spender if they spend $30 or more.

The following table shows our input data: the user's ID, the number of sessions, the number of page views, the amount of dollars spent, and whether we consider the user a high spender.

User ID	Sessions	Page views	Total spent	High spender
1	10	45	100	Yes
2	5	10	30	Yes
3	1	5	10	No
4	2	2	0	No
5	9	33	95	Yes
6	7	5	5	No
7	19	31	95	Yes
8	1	20	0	No
9	2	17	0	No
10	8	25	40	Yes

The following listing shows the input CSV file corresponding to the table, input.csv, which we'll use for training.

UserId,Sessions,PageViews,TotalSpend,HighSpender
1,10,45,100,Yes
2,5,10,30,Yes
3,1,5,10,No
4,2,2,0,No
5,9,33,95,Yes
6,7,5,5,No
7,19,31,95,Yes
8,1,20,0,No
9,2,17,0,No
10,8,25,40,Yes

You need to create this file on your machine as input.csv. We are working with simple input and a simple model because our focus is taking a model and putting it into production, not in building the model itself. There are plenty of great references covering model development and ML if you are interested in the topic.
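As a quick sanity check on the file, the sketch below parses the same rows with Python's built-in csv module and confirms that the HighSpender column agrees with the $30 threshold. The data is inlined so the example is self-contained; the label helper and the constant name are introduced here for illustration only.

```python
import csv
import io

# The contents of input.csv, inlined so the sketch is self-contained.
INPUT_CSV = """UserId,Sessions,PageViews,TotalSpend,HighSpender
1,10,45,100,Yes
2,5,10,30,Yes
3,1,5,10,No
4,2,2,0,No
5,9,33,95,Yes
6,7,5,5,No
7,19,31,95,Yes
8,1,20,0,No
9,2,17,0,No
10,8,25,40,Yes
"""

HIGH_SPENDER_THRESHOLD = 30  # dollars, from the definition of a high spender


def label(total_spend):
    return "Yes" if total_spend >= HIGH_SPENDER_THRESHOLD else "No"


rows = list(csv.DictReader(io.StringIO(INPUT_CSV)))

# Collect the IDs of any rows whose label disagrees with the threshold rule.
mismatches = [r["UserId"] for r in rows
              if label(int(r["TotalSpend"])) != r["HighSpender"]]
print(mismatches)
```

For the ten rows above this prints an empty list, meaning every label is consistent with the rule.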

Assuming you already have Python on your machine, let's start by installing the two packages we need for our model: pandas and scikit-learn (also known as sklearn). The following listing shows the command to install these packages using the Python package manager, pip. If you don't have Python, you can install it from https://www.python.org/downloads/.

pip install pandas scikit-learn

Now that we have our input file and packages, let's look at the high spender model itself. Don't worry if you haven't implemented a ML model before; our model has only a few lines of code and is very basic. We'll walk through the steps that should give you at least a high-level understanding.

Training a model using Scikit-learn

Our model takes an --input argument, representing the input CSV. It reads this file into a Pandas DataFrame.

A DataFrame is a fancy table data structure offered by the Pandas library. It provides various useful ways to slice and dice the data.

We'll split the data into the features used to train the model (X) and what we are trying to predict (y). In our case, we will take the Sessions and PageViews columns from the input as X, and the HighSpender column as y. This model doesn't care about user ID and the exact amount spent, so we can ignore those columns.

We will split our input data so that we take 80% of it to train the model and use the remaining 20% to test our model. For the 20%, we will use the model to predict whether the user is a high spender and see how our prediction compares with the actual data. This is a common practice for measuring model prediction accuracy.
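The idea behind the 80/20 split can be sketched with the standard library alone. This is not how scikit-learn implements it, and the split_train_test helper below is a hypothetical name; with the same seed it will not necessarily hold out the same rows as scikit-learn's train_test_split.

```python
import random


def split_train_test(rows, test_size=0.2, seed=1):
    """Shuffle rows and hold out a test_size fraction for testing."""
    shuffled = list(rows)
    # A seeded generator makes the shuffle repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


user_ids = list(range(1, 11))
train, test = split_train_test(user_ids)
print(len(train), len(test))
```

With our ten users, eight end up in the training set and two are held out for testing.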

We will use KNeighborsClassifier from scikit-learn. This implements a well-known classification algorithm, the k-nearest neighbors vote. We use a classification algorithm because we want to classify our users into high spenders and non-high spenders. We won't cover the details of the algorithm here, but the good news is that this is fully encapsulated in the scikit-learn library so we can create it with one line of code and train it with a second line of code. We will use the training data to train the model, then try to predict on the test data and print the predictions. The figure shows these steps:

Steps for training our model:

Extract the features and the target value from the input.
Split the dataset into train and test data.
Train the model on the training data.
Use the model to predict on the test data, comparing predictions with actual data.
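To give a feel for what the classifier does, here is a simplified pure-Python sketch of the k-nearest neighbors vote: find the k training points closest to the query (Euclidean distance) and pick the most common label among them. It uses k=5, the scikit-learn default for KNeighborsClassifier, but it is an illustration and not the library's implementation. Holding out users 9 and 10 as test data is an arbitrary choice for this example; it is not the split train_test_split produces.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=5):
    """Classify query by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    # On a tie, most_common returns the label that was counted first.
    return votes.most_common(1)[0][0]


# (Sessions, PageViews) and HighSpender for users 1 through 8 from the table.
points = [(10, 45), (5, 10), (1, 5), (2, 2),
          (9, 33), (7, 5), (19, 31), (1, 20)]
labels = ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"]

print(knn_predict(points, labels, (2, 17)))  # user 9
print(knn_predict(points, labels, (8, 25)))  # user 10
```

For these two held-out users the vote yields No and Yes, which matches the HighSpender column in the table.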

Finally, we will save the model on disk as outputs/highspender.pkl. The idea is that once we have a trained model, another system picks it up and uses it to predict new data. For example, as users visit our website, we can use the model to predict who is likely to be a high spender and maybe offer them a discount. Or maybe we want to encourage non-high spenders to spend more time on the website, hoping it converts them into high spenders. Either way, some other service has to load this model and feed it never-before-seen data, and the model will predict if the user is likely to be a high spender or not.

High spender model implementation

Training a model might sound like a lot, but it is only 25 lines of Python code as the following listing shows:

import argparse 
from joblib import dump 
import os 
import pandas as pd 
from sklearn.neighbors import KNeighborsClassifier  
from sklearn.model_selection import train_test_split  

parser = argparse.ArgumentParser()
parser.add_argument('--input', type=str, dest='model_input')

args = parser.parse_args() 
model_input = args.model_input
df = pd.read_csv(model_input)

X = df[["Sessions", "PageViews"]]
y = df["HighSpender"]

X_train, X_test, y_train, y_test = train_test_split(X, y,
  test_size=0.2, random_state=1)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

score = knn.predict(X_test)

predictions = X_test.copy(deep=True)
predictions["Prediction"] = score
predictions["Actual"] = y_test

print(predictions)

if not os.path.isdir('outputs'): os.mkdir('outputs')  

model_path = os.path.join('outputs', 'highspender.pkl')
dump(knn, model_path)

We'll save this Python script as highspenders.py.

Let's break it down and explain each step. First, we import all the libraries we need. Next, we set up command line argument parsing to expect an --input argument:

parser = argparse.ArgumentParser()
parser.add_argument('--input', type=str, dest='model_input')

We then grab the input file path from the command line argument and load the file into a Pandas DataFrame:

args = parser.parse_args() 
model_input = args.model_input
df = pd.read_csv(model_input)

Then we define the model inputs as the Sessions and PageViews columns and the output (prediction) as the HighSpender column:

X = df[["Sessions", "PageViews"]]
y = df["HighSpender"]

Next, we split the input data into training data and data reserved for testing using a 0.2 ratio:

X_train, X_test, y_train, y_test = train_test_split(X, y,
    test_size=0.2, random_state=1)

We select the KNeighborsClassifier with default settings:

knn = KNeighborsClassifier()

Then we train the model on the training data:

knn.fit(X_train, y_train)

Next, we use the trained model to predict on the test data (the score variable holds the predicted HighSpender values):

score = knn.predict(X_test)

We then format the output, copying it into a new DataFrame and adding Prediction and Actual columns:

predictions = X_test.copy(deep=True)
predictions["Prediction"] = score
predictions["Actual"] = y_test

Finally, we print the predictions to the console, ensure we have an outputs/ directory, and save the model as outputs/highspender.pkl:

print(predictions)

if not os.path.isdir('outputs'): os.mkdir('outputs')  

model_path = os.path.join('outputs', 'highspender.pkl')
dump(knn, model_path)

Let's run the script and check the output. The following listing shows the console command for running the model:

python highspenders.py --input input.csv

You should see the test predictions and actual data printed to the console. You should also now see the outputs/highspender.pkl model file. Strictly speaking, we don't need the prediction and printing part, but it should help if we want to play with the model.
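
To put a number on how the predictions compare with the actual data, one option is accuracy: the fraction of test rows where the two agree. Here is a minimal sketch in plain Python. The accuracy helper and the sample values are made up for illustration; with the script above, you would pass the Prediction and Actual columns of the predictions DataFrame instead.

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where the prediction matches the actual value."""
    predicted, actual = list(predicted), list(actual)
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("cannot compute accuracy of an empty sequence")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(predicted)


# Hypothetical values standing in for the Prediction and Actual columns.
sample_predicted = [True, False, False, True, False]
sample_actual = [True, False, True, True, False]
print(accuracy(sample_predicted, sample_actual))  # 4 of the 5 rows match
```

scikit-learn also ships an accuracy_score function in sklearn.metrics that computes the same ratio, so in a real project we would likely use that rather than a hand-rolled helper.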

Here, we're using a small input size. In general, a larger input dataset tends to yield better accuracy. But again, our focus is taking this Python script and running it in the cloud. The good news is that our approach to DevOps (or MLOps) scales to more complex models and larger inputs. In the next post, we'll look at Azure Machine Learning, the Azure PaaS (platform as a service) offering for running ML in the cloud. We'll then connect the dots and get this model running in Azure Machine Learning in part 3.

Shipping a Feature

Thu, 12 Aug 2021 00:00:00 -0700

I recently did a small presentation for my team talking about what it takes to ship a feature to customers in a product like Office. This includes much more than getting the functionality implemented and tested. In fact, figuring out the design and implementing it accounts for about 20% of the work (from prototype to implementation, including tests). The remaining 80% is taking the feature from functional to shippable. In this post, I'll go over some of these non-functional aspects of shipping code to customers.

Most of the aspects I will talk about rely heavily on telemetry. This is a consequence of today's connected world. Software is shipped over the internet every month or every week (or every day), as opposed to every year or every other year. Connectivity also enables us to get signals on how the software is performing and adjust accordingly.

Reliability

Reliability can make or break a feature. It is the bedrock of quality. If the feature is unreliable and causes crashes or data corruption, nobody will use it. The most egregious of reliability issues is data loss - for example, if a user writes a paragraph and the application crashes before this paragraph gets saved anywhere, forcing the user to re-type it causes a huge loss of trust.

In terms of software engineering, beyond comprehensive testing, we need to have telemetry in place to report errors. There are several ways to quantify reliability. For example, a couple of common measures are the ICE and ACE scores.

ICE stands for Ideal Customer Experience. This measures the reliability of a scenario including all errors encountered during a user's session. 100% means no session encountered any errors.

ACE stands for Adjusted Customer Experience. This is similar to ICE, except it only measures unexpected errors. In a complex system, some errors are expected, and the software can recover from them without impacting the user experience. For example, a background call fails but succeeds on retry.

Both measures are important - one measures the overall reliability of the system, the other measures the user experience impact.

100% ICE and ACE scores are ideal, but hard to achieve in a complex system. That said, shipping a feature should include setting a target ICE and ACE score (for example 99%), making sure telemetry is there to provide the signal, and working towards that goal as the feature gets exposed to more and more users (more on exposure below).
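
To make the two scores concrete, here is a small sketch of how they could be computed from session telemetry. The Session shape and the exact formulas are my assumptions for illustration (ICE as the percentage of sessions with no errors at all, ACE as the percentage of sessions with no unexpected errors); a real telemetry pipeline will define these more carefully.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """One user session and the errors logged during it (hypothetical telemetry shape)."""
    session_id: str
    # One entry per logged error: True if the error was expected
    # (recovered without user impact), False if it was unexpected.
    errors_expected: list = field(default_factory=list)


def ice_score(sessions):
    """Percentage of sessions that logged no errors at all."""
    clean = sum(1 for s in sessions if not s.errors_expected)
    return 100.0 * clean / len(sessions)


def ace_score(sessions):
    """Percentage of sessions that logged no unexpected errors."""
    clean = sum(1 for s in sessions if all(s.errors_expected))
    return 100.0 * clean / len(sessions)


sessions = [
    Session("a"),                  # no errors
    Session("b", [True]),          # one expected error, e.g. a retry that succeeded
    Session("c", [False]),         # one unexpected error
    Session("d", [True, False]),   # a mix of both
]
print(ice_score(sessions))  # only session "a" is error-free
print(ace_score(sessions))  # sessions "a" and "b" have no unexpected errors
```

By construction, ACE is always at least as high as ICE, since every session without errors is also a session without unexpected errors.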

Performance

Performance is another fundamental aspect of any feature. Of course, the first rule of performance tuning is to measure. Shipping a feature includes instrumentation to measure various scenarios like time to load, time to render, and so on. Customers run the code on different types of hardware, and it is important to understand what the actual user experience is, not just on average but also for the slowest sessions (for example, at the 95th percentile).

Like reliability, we should set some goals, look at telemetry, and keep improving. It is important to note that this is not a one-time thing: as software gets updated and functionality is added, performance can easily degrade. Performance numbers should show up on a dashboard and should be reviewed periodically.
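
As an illustration of why the 95th percentile matters, here is a minimal sketch that computes percentiles over a batch of load-time samples using the nearest-rank method. The percentile helper and the sample numbers are invented for this example; production telemetry systems typically compute percentiles for us.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical time-to-load measurements, in milliseconds.
load_times_ms = [120, 95, 110, 105, 130, 100, 115, 98, 102, 125,
                 108, 112, 118, 99, 101, 107, 111, 104, 900, 2500]

print(percentile(load_times_ms, 50))  # the typical experience
print(percentile(load_times_ms, 95))  # the experience of the slowest sessions
```

In this made-up sample, the median is close to 100 ms while the 95th percentile is 900 ms, which is the kind of gap an average alone would hide.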

Security

Microsoft created the Security Development Lifecycle for building better, safer software internally. In 2008, this was shared with the rest of the world and it continues evolving. I won't cover all security best practices, as that would take far more than a blog post, but I will touch on a few points:

Threat modeling - non-trivial features should have a threat model - a data flow diagram which captures the various entities in the system and the trust boundaries. The Microsoft Threat Modeling Tool can analyze a threat model and list out potential attacks and suggested mitigations.
Adhering to standards and best practices - for example always connecting over HTTPS instead of HTTP, using cryptographically strong algorithms and so on.
Leveraging available tools to ensure a feature is secure. For example, running static code analysis to identify potential attack vectors and fuzzing inputs.

The bar for security should be very high, and a feature should never roll out to customers before ensuring it is secure.

Privacy

Customer privacy is extremely important. If our feature handles any customer data (data about the customer) or customer-owned data (data generated by the customer), we need to be very careful how we process it, store it, and who can access it. I talked extensively about data handling in my previous blog post, Changing Data Classification Through Processing.

Compliance

Compliance ensures we don't open ourselves up to liability. An example of this is third party notices for open-source libraries - many open-source licenses require you to mention you are using the library. Another example is GDPR compliance - EU citizens can request to view all the data a company has on them, or request to be forgotten (have their data removed from the system).

Compliance also means our software adheres to various standards some customers might require. An example of this is data sovereignty: countries like Germany and China require data to be stored in data centers within their country. Another example is ensuring our software meets the requirements for certain certifications like SOC2, without which certain organizations wouldn't be able to use it.

Our feature needs to be compliant before shipping it to customers.

Accessibility

The Microsoft mission statement reads as follows:

Our mission is to empower every person and every organization on the planet to achieve more.

Making our software accessible to people with disabilities is very important. Here are a few aspects we need to consider when shipping features users interact with:

Contrast ratio - the contrast between foreground colors and background colors needs to be good enough so everything is legible for people with vision impairment.
High contrast - operating systems provide high-contrast modes for the visually impaired. Features need to be tested in high-contrast mode to ensure they are still usable.
Screen reader support - UI elements need to be annotated so screen reader software can describe them correctly.
Keyboarding - some users navigate exclusively using keyboard shortcuts, so a feature needs to implement common keyboard shortcuts.
Focus - a feature needs to properly handle focus switching. This includes tab order, making sure focus doesn't get trapped on a single control or subset of the controls etc.
Touch - on touchscreen devices without a keyboard, controls usually need larger touch targets. We need to make sure our feature works well on touchscreens too.

World readiness

World readiness means our feature can ship in markets across the world, and usually entails two different aspects: globalization and localization.

Globalization means our feature works in different cultures, with different (OS-level) culture settings, without requiring additional, culture-specific changes. An example is the date format. In the US, the date format is mm-dd-yyyy, while in Europe it is dd-mm-yyyy. Japan uses yyyy-mm-dd. When displaying dates, we should format them using the current culture's format, not assume and hardcode any specific format.

Another important aspect of globalization is to ensure the symbols used as icons make sense for everyone. For example, using a starry wizard hat to represent a setup wizard works in Western cultures but might not make sense all around the world.

Localization deals with translating UI strings into all the languages in which the feature ships. From a developer perspective, UI layout is important here: in some languages, words are on average longer than in English, so text might overflow its boundaries when translated. For example, Add job in English becomes Auftrag hinzufügen in German.

We also have right-to-left languages like Arabic and Hebrew. We need to ensure our layout works as expected in such languages.

Feature gating

In a mature product like Office, making changes feels very much like rebuilding an airplane in flight: we can add and modify features, but breaking anything is catastrophic. One best practice of shipping features in such conditions is using feature gates. A feature gate is a toggle that determines whether a code path should be exercised or not. This allows us to release a build containing a half-developed feature. If it is not quite ready to see the light of day, the feature gate is closed, and the code never runs in production.

Once the feature is ready, we can toggle the feature gate and start running the code. In case we notice any issues, we can flip the feature gate back off and mitigate customer impact while the issue gets resolved.

Once the feature is mature and has been exercised by customers for a while, we can go back and clean up the feature gate.
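
In code, a feature gate can be as simple as a named toggle checked before the new code path runs. The sketch below uses an in-memory store and made-up names (FeatureGates, new_comment_button, render_toolbar); a real product would typically read gate state from a configuration service so it can be flipped without shipping a new build.

```python
class FeatureGates:
    """In-memory gate store (a stand-in for a real configuration service)."""

    def __init__(self, open_gates=None):
        self._open = set(open_gates or [])

    def is_open(self, gate):
        return gate in self._open

    def open(self, gate):
        self._open.add(gate)

    def close(self, gate):
        self._open.discard(gate)


def render_toolbar(gates):
    """Return the toolbar buttons, including the new one only if its gate is open."""
    buttons = ["Bold", "Italic"]
    if gates.is_open("new_comment_button"):
        # New code path: ships in the build, but only runs when the gate is open.
        buttons.append("Comment")
    return buttons


gates = FeatureGates()
print(render_toolbar(gates))       # gate closed: old behavior

gates.open("new_comment_button")
print(render_toolbar(gates))       # gate open: new behavior

gates.close("new_comment_button")  # issue found: flip the gate back off
print(render_toolbar(gates))
```

Cleaning up the gate later means deleting the is_open check and keeping only the new code path.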

Flighting

Feature gates are a simple on/off switch. Flighting is a more mature form of progressive exposure. We usually have a set of rings through which we release. For example, we could have a dogfood ring, for the team to try out the code themselves, before shipping further; we could have a beta tester ring, for customers who sign up to get preview features; finally, we could have a production ring, containing all other customers.

We should be able to progress a feature through these different rings, and within the rings, only expose a percentage of customers to the feature.

Flighting infrastructure needs to exist for this. From a feature development perspective, we need a roll out plan defining what telemetry signals we want to see to be comfortable increasing exposure of our feature.
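
One common way to expose a feature to only a percentage of customers within a ring is to hash each user into a stable bucket from 0 to 99 and compare the bucket against the ring's exposure percentage. The sketch below assumes that approach; the ring names come from the example above, while the percentages, the feature name, and the function names are hypothetical.

```python
import hashlib

# Hypothetical rollout state: percentage of each ring exposed to the feature.
ROLLOUT = {"dogfood": 100, "beta": 50, "production": 5}


def bucket(feature, user_id):
    """Map a user to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_exposed(feature, user_id, ring, rollout=ROLLOUT):
    """True if this user, in this ring, should see the feature."""
    return bucket(feature, user_id) < rollout.get(ring, 0)


users = [f"user-{i}" for i in range(1000)]
for ring in ("dogfood", "beta", "production"):
    exposed = sum(is_exposed("new_comment_button", u, ring) for u in users)
    print(ring, exposed, "of", len(users), "users exposed")
```

Because the bucket depends only on the feature name and the user ID, a user who sees the feature at 5% exposure keeps seeing it as we increase exposure to 10%, 50%, and so on. Including the feature name in the hash keeps the same users from always being the first to get every new feature.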

Experimentation

Finally, we need a way to measure whether a feature is successful or not and determine whether iterating on it improves things or makes them worse. First, we need to define what success means - what metrics are most important for our product. Once we have these, we can run an A/B test when introducing a new feature or iterating on an existing feature. We have a control group seeing the old behavior and a treatment group seeing the new behavior. We can then look at the key metrics we defined and see how they look for both control and treatment. Did the new code move the needle?
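
At its simplest, reading out an A/B test means comparing the key metric between the two groups. The sketch below computes the relative lift of the treatment group over the control group; the relative_lift helper and the sample numbers are invented for illustration, and a real experimentation platform would also check whether the difference is statistically significant before calling it a win.

```python
from statistics import mean


def relative_lift(control, treatment):
    """Relative change of the treatment mean over the control mean."""
    control_mean = mean(control)
    if control_mean == 0:
        raise ValueError("control mean is zero, relative lift is undefined")
    return (mean(treatment) - control_mean) / control_mean


# Hypothetical per-user values of a key metric (e.g. documents shared per week).
control = [10, 12, 8, 10]
treatment = [11, 13, 9, 11]

print(f"{relative_lift(control, treatment):+.1%}")
```

A positive lift suggests the new behavior moved the needle in the right direction; a negative one suggests the iteration made things worse.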

Summary

Shipping a feature takes a lot of work beyond the initial functional implementation. In this post we looked at some of these various aspects:

Reliability
Performance
Security
Privacy
Compliance
Accessibility
World readiness
Feature gating
Flighting
Experimentation

All of these need to be taken into account when we ship code to our customers.

Data Engineering on Azure RTM

Mon, 21 Jun 2021 00:00:00 -0700

My book, Data Engineering on Azure, which I announced in this blog post, is going to print soon. As I did with my previous book, Programming with Types, I'm writing another RTM post to talk about a few aspects of the process.

Title evolution

When I pitched the book to Manning, I used the title Production Data Engineering with Azure. The title was supposed to capture that this is a book about the practical aspects of data engineering, with examples on the Azure platform. In fact, here is how I described the book's topic in the proposal:

The same way Software Engineering brings engineering rigor to software development, Data Engineering aims to bring rigor to working with data in a reliable way.

This book is about implementing the various aspects of a big data platform - data ingestion, running analytics and ML, distributing data - in a real-world production system. The focus is on operational aspects like DevOps, monitoring, scale, and compliance. Examples will be provided using Azure services.

There is a big gap between what it takes, for example, to implement an ML model in Python and what it takes to run in a production environment, on a regular basis, with robust guardrails in place. The book focuses on the latter, which makes it different than other data platform books.

The Manning team has a lot of experience putting together books (and selling them). We iterated on the title quite a few times, trying to best capture the essence of the book. Once we started the project, we changed the name from Production Data Engineering with Azure to Practical Data Engineering on Azure.

Before launching the book as a Manning Early Access Preview (MEAP), we changed the name again, this time to Azure Data Engineering: the Practical part of the title made it a bit too long and not very clear.

As the manuscript was wrapping up, we took another look at the title: Azure Data Engineering implies the book is Azure-specific. While all the examples provided are built in the Azure cloud, my hope is the patterns and ideas discussed apply to any big data platform, in any cloud. We iterated on the title again, to emphasize the data engineering part, and ended up with Data Engineering on Azure. This is the final title of the book.

Articles and excerpts

Before starting the project, I wrote a few articles on the topic. The first one was Notes on Data Engineering. Soon after, my team launched the Data Science @ Microsoft Medium publication, where I contributed several articles:

How we built self-serve data environment tools with Azure.
Azure Data Explorer at the Azure business scale.
Running machine learning at scale.
Common data engineering challenges and their solution - which is a retake on that first article (Notes on Data Engineering).
Partnering for data quality.
Partnering for metadata management.
Data distribution.

Most of the ideas from these articles show up in the book. While the articles talk about the specific challenges my team encountered and the solutions we came up with to solve them, the book covers patterns - the general types of problems you would encounter while building a big data platform, and solutions you could apply. The articles helped me to clarify (for myself) the topics I wanted to cover in the book, and refine the proposed solutions.

Once the manuscript was well underway, I wrote a blog post on Data Quality Testing Patterns to clarify my thoughts as I was working on chapter 9 (Data quality), but otherwise switched my focus from articles to getting the book done. At this point, I started publishing excerpts from the book. So far I wrote about Changing data classification through processing and Ingesting data, with more to come.

The speed of the cloud

Innovation in cloud computing moves at breakneck speed. The technology changes so fast, it is hard to pin things down in written form. For setting up various Azure services, I wanted to rely on command line scripts instead of the Azure Portal UI - walking readers through a series of screenshots is tedious, and the UI changes all the time. I used Azure CLI instead. That said, many of the extensions I used throughout the book are currently experimental, which means they might change at a future time. I also found a couple of bugs, which I reported to the teams maintaining the Azure CLI extensions.

Another example of the speed of innovation is Azure Purview. When I started working on our data platform, there was no Azure Purview and my team had to develop a home-grown solution to address our data inventory needs. We then got to use a preview, in-development version of Azure Purview before it was publicly announced (one of the perks of working at Microsoft). Chapter 8 of my book covers metadata management, with the reference implementation on Azure Purview. That meant I wasn't able to start on this chapter until Azure Purview was officially announced, even though I knew what I wanted to write about. Things lined up pretty well: I finished chapter 7 and had to skip to chapter 9, but as I was working on that, Azure Purview went into public preview.

This was a very interesting experience, very different than my previous book. Writing my first book, I didn't feel like there were so many moving parts to get a handle on and the speed with which things changed wasn't overwhelming. Even so, I'm confident the patterns I cover in the book will remain the same for quite some time, regardless of the technologies used to implement them. So even as new services launch and the ways we interact with the cloud evolve, the key takeaways should stay relevant.

Check out my book here: Data Engineering on Azure.

Ingesting Data

Fri, 12 Mar 2021 00:00:00 -0800

This is an excerpt from chapter 2 of my book, Data Engineering on Azure, which deals with storage. In this article we'll look at a few aspects of data ingestion: frequency and load type, and how we can handle corrupted data. We'll use Azure Data Explorer as the storage solution, but keep in mind that the same concepts apply regardless of the data fabric used. Code samples are omitted from this article, though available in the book. Let's start by looking at the frequency with which we ingest data.

Ingestion frequency

Frequency defines how often we ingest a given dataset. This can range from continuous ingestion, for streaming data, to yearly ingestion - a dataset which we only need to ingest once a year. For example, our website team produces web telemetry which we can, if we want to, ingest in real time. If our analytics scenarios include some real time or near real time processing, we can bring the data into our data platform as it is being generated.

The following figure shows this streaming ingestion setup.

As users visit website pages, each visit is sent as an event to an Azure Event Hub. Azure Data Explorer ingests data into the PageViews table in real time.

Azure Event Hub is a service that can receive and process millions of events per second. An event contains some data payload sent by the client to the Event Hub. A high-traffic website could treat each page view as an event and pass the user ID, URL, and timestamp to an Event Hub. From there, data can be routed to various other services. In our case, it can be ingested in Azure Data Explorer in real time.

Another option, if we don't have any real time requirements, is to ingest the data on some regular cadence, for example every midnight we load the logs for the day.

The following figure shows this alternative setup.

Logs get copied from the website Azure Data Explorer cluster to our Azure Data Explorer cluster using an Azure Data Factory. Copy happens on a daily basis.

In this case, the website team stores its logs into a dedicated Azure Data Explorer cluster. Their cluster only stores data for the past 30 days since it is used just to measure the website performance and debug issues. Since we want to keep data for longer for analytics, we want to copy it to our cluster and preserve it there.

Azure Data Factory is the Azure ETL service, which enables serverless data integration and transformation. We can use a Data Factory to coordinate when and where data gets moved. In our case, we copy the logs of the previous day every night and append them to our PageViews table.

Let's take another example: the sales data from our Payments team. We use this data to measure revenue and other business metrics. Since not all transactions are settled, it doesn't make sense to ingest this data daily. Our Payments team curates this data and officially publishes the financials for the previous month on the first day of each month. This is an example of a monthly dataset, one we would ingest once it becomes available, on the 1st of each month.

The following figure shows this ingestion.

Sales data gets copied from the Payments team's Azure SQL to our Azure Data Explorer cluster on a monthly cadence.

This is very similar to our previous Azure Data Factory ingestion of page view logs, the difference being the data source - in this case we ingest data from Azure SQL, and the ingestion cadence - monthly instead of daily.

Let's define the cadence of when a dataset is ready for ingestion as its grain.

The grain of a dataset specifies the frequency at which new data is ready for consumption. This can be continuous for streaming data, hourly, daily, weekly, and so on.

We would ingest a dataset with a weekly grain on a weekly cadence. The grain is usually defined by the upstream team producing the dataset. Partial data might be available earlier, but the upstream team can usually tell us when the dataset is complete and ready to be ingested.

While some data, like the logs in our example, can be ready in real time or at a daily grain, there are datasets that get updated once a year. For example, businesses use fiscal years for financial reporting, budgeting and so on. These datasets only change year over year.

Another ingestion parameter is the type of data load.

Load type

Outside of streaming data, where data gets ingested as it is produced, we have two options for updating a dataset in our system. We can perform a full load or an incremental load.

A full load means we fully refresh the dataset, discarding our current version and replacing it with a new version of the data.

For example, our Customer Success team has the list of active customer issues. As these issues get resolved and new issues appear, we perform a full load whenever we ingest the active issues into our system.

The usual pattern is to ingest the updated data into a staging table, then swap it with the destination table, as shown in the following figure.

Queries are running against the ActiveIssues table. We ingest the data into the ActiveIssuesStaging table. Queries are still running against the old ActiveIssues table. We swap the two tables. Queries already started before the swap will run against the old table, queries started after the swap will run against the new table. Finally, we can drop the old table.

Most storage solutions offer some transactional guarantees on renames to support scenarios like this. This means if someone is running a query against the ActiveIssues table, there is no chance of the query failing due to the table not being found or of the query getting rows from both the old and the new table. Queries running in parallel with a rename are guaranteed to either hit the old or the new table.
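
The staging-and-swap pattern can be sketched with any store that supports transactional renames. The example below uses SQLite from the Python standard library purely as a stand-in (it is not how Azure Data Explorer is driven); the table names come from the example above, while the columns, the sample rows, and the full_load helper are hypothetical.

```python
import sqlite3


def full_load(conn, rows):
    """Fully refresh ActiveIssues by loading into a staging table and swapping."""
    # 1. Ingest the new version of the data into the staging table.
    conn.execute("DROP TABLE IF EXISTS ActiveIssuesStaging")
    conn.execute("CREATE TABLE ActiveIssuesStaging (IssueId INTEGER, Title TEXT)")
    conn.executemany("INSERT INTO ActiveIssuesStaging VALUES (?, ?)", rows)

    # 2. Swap the tables inside one transaction, so readers see either the
    #    old table or the new one, never a missing or half-renamed table.
    conn.execute("BEGIN")
    try:
        conn.execute("ALTER TABLE ActiveIssues RENAME TO ActiveIssuesOld")
        conn.execute("ALTER TABLE ActiveIssuesStaging RENAME TO ActiveIssues")
        conn.execute("COMMIT")
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        raise

    # 3. Drop the old version once the swap has succeeded.
    conn.execute("DROP TABLE ActiveIssuesOld")


# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT above fully controls the swap transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE ActiveIssues (IssueId INTEGER, Title TEXT)")
conn.execute("INSERT INTO ActiveIssues VALUES (1, 'Login fails')")

full_load(conn, [(2, "Slow checkout"), (3, "Broken link")])
print(conn.execute("SELECT IssueId, Title FROM ActiveIssues ORDER BY IssueId").fetchall())
```

After the load, ActiveIssues contains only the newly ingested rows, and neither the staging table nor the old table is left behind.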

The other type of data load is incremental.

An incremental load means we append data to the dataset. We start with the current version and enhance it with additional data.

Let's take as an example a PageViews table. Since the Website team only keeps logs around for 30 days and we want to maintain a longer record when we ingest the data into our system, we can't fully refresh the PageViews table. Instead, every night we take the page view logs of the previous day and we append them to the table.

One challenge of incremental loads is to figure out exactly what data is missing (that we need to append), and what data we already have. We don't want to re-append data we already have, as that would create duplicates.

There are a couple of ways we can go about determining the delta between upstream and our storage. The simplest one is contractual: the upstream team guarantees that data will be ready at a certain time or date. For example, the Payments team promises that the sales data for the previous month will be ready on the 1st, by noon. In that case, on July 1st we will load all sales data with a timestamp within June and append it to the existing sales data we have in our system. In this case, the delta is June sales.

Another way to determine the delta is to keep track, on our side, of the last row we ingested and only ingest upstream data that comes after this row. This marker is also known as a watermark. Whatever is under the watermark is data we already have in our system. Upstream can have data above the watermark, which we need to ingest.

Depending on the dataset, keeping track of the watermark can be very simple or very complex. In the simplest case, if the data has a column where values always increase, we can simply see what the latest value is in our dataset and ask upstream for data with values greater than our latest.

We can then ask for page views with a timestamp greater than the watermark when we append data in our system.

Other examples of ever-increasing values are auto-incrementing columns, like the ones we can define in SQL.
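
The simple watermark case can be sketched in a few lines. Here plain Python lists stand in for the destination table and the upstream source, and the incremental_load helper, the Timestamp field name, and the sample rows are all made up for illustration.

```python
from datetime import datetime


def incremental_load(destination, upstream):
    """Append to destination only the upstream rows above the watermark.

    Returns the number of rows appended.
    """
    # The watermark is the latest timestamp we already have in our system.
    watermark = max((row["Timestamp"] for row in destination), default=datetime.min)
    # The delta is everything upstream has above the watermark.
    delta = sorted(
        (row for row in upstream if row["Timestamp"] > watermark),
        key=lambda row: row["Timestamp"],
    )
    destination.extend(delta)
    return len(delta)


page_views = [
    {"Url": "/home", "Timestamp": datetime(2020, 6, 28, 9, 0)},
    {"Url": "/pricing", "Timestamp": datetime(2020, 6, 28, 17, 30)},
]
upstream_logs = page_views + [
    {"Url": "/home", "Timestamp": datetime(2020, 6, 29, 8, 15)},
    {"Url": "/docs", "Timestamp": datetime(2020, 6, 29, 11, 45)},
]

print(incremental_load(page_views, upstream_logs))  # appends only the new rows
print(incremental_load(page_views, upstream_logs))  # nothing above the watermark now
```

Running the load twice does not create duplicates, because the second run finds nothing above the watermark. Note that this only works if values always increase: rows arriving late with an older timestamp would be missed.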

Things get more complicated if there is no easy ordering of the data from which we can determine our watermark. In that case, the upstream system needs to keep track of what data it already gave us, and hand us a watermark object. When we hand back the object, upstream can determine what is the delta we need. Fortunately, this scenario is less common in the big data world. We usually have simpler ways to determine delta, like timestamps and auto-incrementing IDs.

What happens though when a data issue makes its way into the system? We got the sales data from our Payments team on July 1st, but the next day we get notified that there was an issue: somehow a batch of transactions was missing. They fixed the dataset upstream, but we already loaded the erroneous data into our platform. Let's talk about restatements and reloads.

Restatements and reloads

In a big data system, it is inevitable that at some point, some data gets corrupted, or is incomplete. The owners of the data fix the problem, then issue a restatement.

A restatement of a dataset is a revision and re-release of a dataset after one or more issues were identified and fixed.

Once data is restated, we need to reload it into our data platform. This is obviously much simpler if we perform a full load for the dataset. In that case, we simply discard the corrupted data we previously loaded and replace it with the restated data.

Things get more complicated if we load this dataset incrementally. In that case, we need to drop only the corrupted slice of the data and reload that from upstream. Let's see how we can do this in Azure Data Explorer.

Azure Data Explorer stores data in extents. An extent is a shard of the data, a piece of a table which contains some of its rows. Extents are immutable - once written, they are never modified. Whenever we ingest data, one or more extents are created. Periodically, Azure Data Explorer merges extents to improve query performance. This is handled by the engine in the background.

The following figure shows how extents are created during ingestion, then merged by Azure Data Explorer.

Extents are created during ingestion, then merged by Azure Data Explorer to improve query performance

While we can't modify an extent, we can drop it. Dropping an extent removes all data stored within. Extents support tagging, which enables us to attach metadata to them. A best practice is to add the drop-by tag to extents on creation. This tag has special meaning for Azure Data Explorer: it will only merge extents with the same drop-by tag. This will ensure that all data ingested into an extent with a drop-by tag is never grouped with data ingested with another drop-by tag.

The following figure shows how we can use this tag to ensure data doesn't get mixed, then we can drop extents with that tag to remove corrupted data.

We ingested 2 extents with drop-by tag 2020-06-29 and 2 extents with drop-by tag 2020-06-30. They get merged into 1 extent with drop-by tag 2020-06-29 and 1 extent with drop-by tag 2020-06-30. We can ask Azure Data Explorer to drop all extents tagged with 2020-06-29 to remove a part of the data.

The drop-by tag ensures that extents with different values for the tag never get merged together, so we don't risk dropping more data than what we want dropped. The value of the tag is arbitrary and we can use anything, but a good practice is to use an ingestion timestamp. For example, when we load data on 2020-06-29, we use the drop-by:2020-06-29 tag.

If we later learn that the data we loaded was corrupted and upstream restates the data, we can drop the extents containing corrupted data and re-ingest from upstream to repair our dataset.
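
To illustrate the idea (not the actual Azure Data Explorer engine or its commands), here is a toy model of extents in Python. The Extent class and the merge_extents and drop_by_tag helpers are hypothetical; they only mimic the two behaviors described above: merging never mixes drop-by tags, and dropping by tag removes exactly the data ingested under that tag.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """An immutable shard of a table, tagged at ingestion time (toy model)."""
    drop_by: str
    rows: tuple


def merge_extents(extents):
    """Merge extents that share a drop-by tag; never mix different tags."""
    grouped = defaultdict(list)
    for extent in extents:
        grouped[extent.drop_by].extend(extent.rows)
    return [Extent(tag, tuple(rows)) for tag, rows in grouped.items()]


def drop_by_tag(extents, tag):
    """Drop every extent carrying the given drop-by tag."""
    return [extent for extent in extents if extent.drop_by != tag]


table = [
    Extent("2020-06-29", ("row1", "row2")),
    Extent("2020-06-29", ("row3",)),
    Extent("2020-06-30", ("row4",)),
    Extent("2020-06-30", ("row5", "row6")),
]

table = merge_extents(table)              # one extent per tag after merging
table = drop_by_tag(table, "2020-06-29")  # remove the corrupted day
print(table)
```

Because the merge groups strictly by tag, dropping 2020-06-29 cannot take any 2020-06-30 rows with it. After the drop, we would re-ingest the restated data for that day under the same tag.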

Obviously, this process is more complicated than if we were doing a full load of the data every time. In general, if we can afford a full load, we should use that. Maintenance-wise, it is a much simpler approach. Sometimes though, this is impossible - for example if we want to maintain page view logs beyond the 30-day retention period upstream has, we can't keep reloading the data. Other times, full load is just too expensive: we end up moving the same gigabytes of data again and again, with minor differences. For these situations, we have to look at an incremental load and manage the additional complexity.

Summary

We can ingest data continuously (streaming), or at a certain regular cadence like daily, weekly, monthly, or yearly.
When we ingest data, we can perform either a full load, or we can perform an incremental load.
A full load means we fully refresh the dataset, discarding our current version and replacing it with a new version of the data.
An incremental load means we append data to the dataset. We start with the current version and enhance it with additional data.
It is inevitable for some data to get corrupted. Once repaired upstream, we need a way to discard the corrupted data from our system and reload the updated data.

Recommendations

Tue, 29 Dec 2020 00:00:00 -0800

Recommendations

I sometimes get asked for learning resources and areas aspiring software engineers should focus on. This post will cover some of my recommendations. This is purely for software development, design, and engineering. I will share a complementary list of resources on soft skills, systems, leadership, and working within organizations in a future post.

Fundamentals

Data structures and algorithms -- This is CS 101. Understand lists, stacks, queues, heaps, trees, graphs, and algorithms to sort, select, traverse etc. Understand big-O notation and complexity -- this is important in practice, when implementing solutions that deal with real-world data. This is foundational to the field of computer science, so unless you are working on some cutting-edge stuff, there's probably a well-known solution to your problem. You don't need to know all implementations by heart but know what applies and where to look it up when needed, be it an A* search or B-tree.

Understand your compute target -- this can be a physical machine, an operating system, a virtual machine like the JVM or .NET CLR, the browser, or the cloud. Either way you need to know what resources are available, how they are allocated, what are the performance characteristics and so on. Without understanding your compute target you won't be able to leverage it to its full capabilities and run the risk of misusing it.

Concurrent programming -- today, I believe this is inescapable. Concurrency is everywhere, regardless of whether you are building services that talk to each other, a multi-threaded native application, or a Node.js, event loop-based application. Having a good mental model of how concurrency works, understanding deadlocks, livelocks, synchronization mechanisms, and consistency models is foundational.

Programming languages

Programming languages are not just how we talk to computers, they are tools for thought. They enable us to express solutions to problems. We model our solutions using the languages, and different languages are best suited to different problems. I'm a firm believer in multi-paradigm and using the best tool for a job. Saying that everything is an object, or everything is a function is an oversimplification. Many modern programming languages support multiple paradigms, so we can write object-oriented code when appropriate, and functional code when appropriate. With that said, I suggest learning:

A system programming language like C++, Rust or Swift if you need to write native, performant code. These languages are close to the machine and will help you understand OS resource management, how code gets executed, and what impacts performance.

A higher-level language for writing tools and services. Something like C#, Java or Go. These are all garbage collected and trade off some performance for productivity.

A dynamic language for quick prototyping and experimentation. Python and Ruby come to mind. While static typing is much safer for production code, I do love thinking in Python.

A purely functional language like Haskell or Idris. This is again to understand different ways of thinking about problems. Even if you don't get to write production code in a purely functional language, you will learn alternative approaches to designing your code which you will be able to apply even when using other languages.

A language with strong support for generics, like C++, Rust, or TypeScript. Generics are a powerful way to reuse and combine code and understanding them will make you a better programmer.

From my personal experience, within the same paradigm, languages are more similar than different. In other words, once you know two, it is significantly easier to understand the third. There is value in learning new programming languages to see how they differ from existing ones and what they bring to the table.

Practice

Write code. Learning needs to be a mix of theory and practice. Here are some practice ideas:

Code katas -- These could be some good first projects when picking up a new language. For example: http://codekata.com/, https://github.com/gamontal/awesome-katas.

Programming puzzles -- These are good ways to practice problem solving, data structures, and algorithms. I really enjoy Advent of Code, which has been running every December since 2015. Facebook Hacker Cup also has some great puzzles.

Contribute to an open-source project, many have supportive communities and paths to get you started.

Work on your own project, be it a website or a game or something else you are passionate about. Scope it to fit the amount of time you can dedicate and use the technologies you want to learn.

Of course, you will do most of the learning on the job. Work on projects that interest you, projects you can learn from, and projects where you can apply what you learned.

Books

This is a list of books on software design and craftsmanship that I highly recommend:

Code Complete - I love this one. This is The Big Book of Software Engineering, and it covers the fundamentals, like writing proper functions, using good naming in the code, testing, debugging etc.

The Pragmatic Programmer - Great book which gives general advice on what it takes to become a good programmer. From basic tips like know thy text editor and use source control to implementation and design advice.

Agile Principles, Patterns and Practices in C# - On writing software in an agile world. Principles to keep in mind (like the Open/Closed Principle, the Single Responsibility Principle etc.), patterns to enable them, and agile practices. Also covers TDD, extreme programming, and other agile methodologies.

Design Patterns - The classic book on design patterns. It is a bit cumbersome but definitely worth reading. Patterns are basically well-known solutions to recurring design problems. Do read my previous post and don't over-index on patterns.

Refactoring - The classical book on refactoring code - re-structuring implementation without changing functionality. Talks about code smells (pieces of code that feel wrong) and how to rearchitect such code to make it right.

Emergent Design - This book talks about how design emerges as code evolves. By respecting a few principles and knowing about design patterns, you don't have to over-design and future-proof, rather keep refactoring and extending as new requirements come in.

There are a few other books which I really enjoyed, though they are a bit different than the above. I recommend these more for the aesthetics and insights (more on that below): Programming Pearls, From Mathematics to Generic Programming, Beautiful Code.

Other considerations

Develop a sense of aesthetics - This comes with practice. Know what good code looks like, what makes it beautiful. Don't just get code working, try to make it beautiful.

Understand your problem space -- Whatever you are working on, knowing the business domain will help inform your software design. Understand why you are doing what you are doing. Is there a better way to solve the same business problem? Do you know what will likely come up in 6 months from now or a year?

Security -- Software security is critical in today's connected world. You should understand security best practices, which hashing algorithms to use, how to properly store secrets and passwords, how trust gets established, attack vectors, how to create a threat model and so on.

AI - AI is permeating more areas of software. It is also being commoditized through libraries like scikit-learn and services like Azure Cognitive Services. While I won't quite yet put it under fundamentals, I believe using AI will soon be a must-know, much like concurrent programming. Having a good understanding of the types of problems AI can help with, when and how to apply it is very valuable.

Keep learning -- The way we build software keeps evolving. Try to keep up to date with recent developments and trends. This is one of the reasons software engineering is such an exciting field: there's always something new, there's always more to learn.

Notes on Design Patterns

Thu, 10 Dec 2020 00:00:00 -0800

Notes on Design Patterns

Patterns mean I have run out of language. --- Rich Hickey.

Many junior developers want to improve their software design skills by studying design patterns. I was there too, of course. I believe there is a big misconception of what design patterns are, and I believe we are, indeed, over-indexing on them when we are thinking of software design.

It is very easy to over-design things, and if we blindly apply design patterns, we end up with code like the FizzBuzz Enterprise Edition, which in real life translates into incomprehensible software that burns hundreds of developer-hours for even tiny changes.

A good design by any other name...

So what are design patterns? A common definition is:

A software design pattern is a general, reusable solution to a commonly occurring problem within a given context in software design.

A lot of design patterns criticism centers around how, using some non-object oriented language, you can express design patterns from the Gang of Four succinctly within the language syntax, without having to code extra scaffolding. This is why I believe there is a misconception around what design patterns really are. My take on design patterns is that they provide good solutions to software design problems, for dimensions like encapsulation, decoupling etc. These patterns are not their representation in any particular language.

In some instances, a language can express the same idea more succinctly. That doesn't mean the pattern is useless. We are still modeling complex domains with code, and we still need to account for all aspects of good design so we don't end up with a jumbled mess. In other words, you can write bad code in any programming language.

This is a recurring topic in my book Programming with Types, where I show alternative implementations to the strategy pattern, the decorator pattern, and the visitor pattern. The first two have more succinct functional implementations, while the last one can be better encapsulated using a discriminated union type. Regardless of how we express the design, we are still solving the same problem. Which is why I believe over-indexing on learning design patterns as code recipes is a mistake.
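As a small illustration of the point about the strategy pattern, here is a sketch of how it can collapse to passing a function. This example is not taken from the book; the pricing functions and checkout are hypothetical names made up for this sketch.

```python
from typing import Callable

# A strategy is just a function from an amount to the price charged.
PricingStrategy = Callable[[float], float]


def regular_price(amount: float) -> float:
    return amount


def half_price_sale(amount: float) -> float:
    return amount * 0.5


def checkout(amount: float, strategy: PricingStrategy) -> float:
    # The caller picks the algorithm; checkout doesn't know which one it got.
    return strategy(amount)


full = checkout(100.0, regular_price)
discounted = checkout(100.0, half_price_sale)
```

There is no strategy interface or class hierarchy here, but the design concern is unchanged: checkout is decoupled from the pricing algorithm.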

Smelling software rot

As a young developer wanting to learn design patterns, I stumbled upon the Agile Principles, Patterns, and Practices in C# book, probably because it contained patterns in the title. But this book is a real gem on good design, and I highly recommend it.

Chapter 7 talks about design smells. Design smells are a sign that the software is rotting and might need refactoring. Here are a few examples from the book:

Rigidity - because of poorly-structured dependencies, a small change in one part of the code causes a cascade of subsequent changes in other parts of the code.
Viscosity - when modifying the code, it is easier to add a hack rather than to implement a design-preserving change. In other words, it is easier to do the wrong thing than it is to do the right thing.
Needless Repetition - copy/pasted code, often with minor modifications, which makes minor updates very difficult to apply across the code base.

There are a bunch more in the book and plenty more documented online.

Developing a nose for design is, in my opinion, a lot more important than knowing patterns. And while you can read about common smells, nothing beats experience (much like reading descriptions of actual smells vs. using your nose).

When I write code, sketching out a solution to some problem, I refactor it several times before I submit a pull request. It is an iterative process - I try something out, notice something off with the design, refactor to improve and simplify.

Instead of bringing a set of prefabricated solutions and checking to see which one fits the problem best, we can focus on refactoring smells away. This avoids the FizzBuzz Enterprise Edition problem. The Enterprise Edition of FizzBuzz is full of patterns! But it is needlessly complex. Accidental complexity is one of the worst smells. I mentioned accidental complexity in the Time and Complexity post and I will probably write more about it since I find this a fascinating topic.

Smells tell us how a design is bad, but what makes a good design?

SOLID principles and beyond

There is a small set of design principles, known as the SOLID design principles which make for good code:

The single-responsibility principle - a class (or program unit) should be responsible for one thing, and have to change only when the requirements of that thing change.
The open/closed principle - code should be open for extension, closed for modification.
The Liskov substitution principle - replacing an instance of some type with an instance of a subtype should maintain program correctness.
The interface segregation principle - single-responsibility principle applied to interfaces.
The dependency-inversion principle - code should depend on abstractions, not concrete implementations.

While a lot of the literature is centered around object-oriented programming, these principles transcend OOP. For example, an interface can be an interface definition, a function signature, a set of APIs exposed by a module etc. Similarly, subtyping does not imply inheritance - check out my Variance post which covers this in more depth.

These SOLID principles are the subject of chapters 8 through 12 of the Agile Principles, Patterns, and Practices in C# book.

Knowing these principles and having a nose for design smells, you can derive any design pattern from first principles. In some cases it's easier if you are aware of the pattern - you don't have to spend time solving an already-solved problem. But it doesn't work the other way around: You can't start with a set of patterns without understanding the underlying principles and without being able to tell when a design smells.

Beyond these principles, exploring well-crafted software helps us develop a sense of good design. In the real-world, most codebases have good and bad parts. I won't go into the details of why in this article, but keep this in mind when working on your project. Which parts are designed well? Why? Which parts could benefit from a refactoring? Why?

Learning design patterns is secondary to understanding good and bad design.

Changing Data Classification Through Processing

Fri, 27 Nov 2020 00:00:00 -0800

Changing Data Classification Through Processing

This is an excerpt from a draft of chapter 10 of my book, Azure Data Engineering, which deals with compliance. In this article we'll look at a few techniques to transform sensitive user data into less sensitive data. In the book, this includes code samples for implementation on Azure Data Explorer, which are omitted from this article. Let's start with a couple of definitions.

User Data

User data is data that can be directly tied to the user. This class of data is important since this is what GDPR covers. User data also comes in a couple of subcategories. One of them is End User Identifiable Information or EUII.

End user identifiable information is information that can directly identify the user. Examples of EUII are name, email address, IP address, or location information.

As data moves through our systems, we generate various IDs. Any data that we have tied to such an ID, for example order history for account ID, is called End User Pseudonymous Information or EUPI.

End user pseudonymous information are IDs used by our systems which, in conjunction with additional information (for example a mapping table) can be used to identify the user.

Note the important distinction: When we look at EUII, for example name and address, we can immediately identify the user. When we look at EUPI, for example account ID, we need an additional mapping to tell us which user the account ID belongs to.

In general, we have one primary ID (or several) which directly identifies the user, which we consider EUII, and multiple other IDs which indirectly identify the user through their connection to the primary ID. These other IDs are EUPI.

Changing Classification Through Processing

In general, we restrict the number of people who can access sensitive data. In most cases, access is on a need-to-know basis, where data scientists and engineers allowed to process this data have done some compliance training and understand the risks and liabilities.

In some scenarios we want to process data so that it becomes less sensitive. A good example is when we want to open it up for more data scientists to look at. In this article we will look at a few techniques for achieving this.

Let's start by defining two datasets, as shown in the following figure:

The first dataset, User profiles, contains user accounts, including names, credit cards, and billing addresses. We omitted actual billing addresses to keep things short. This dataset also contains a User ID column which associates an identification number with each user. This is the primary ID in our system, since we can use it to link back to a user's profile information.

The second dataset, User telemetry, contains telemetry data we collect from our users. It contains the user ID, timestamp and product feature the user engaged with.

Let's see some techniques we can use on the User telemetry table to change its classification to something less sensitive.

Aggregation

The first technique is aggregation: we can take user identifiable information from multiple users, aggregate it, and get rid of the end user identifiable part. For example, if we collect telemetry from our users that captures which features of our product they are using, we can aggregate it, so we know how much each feature is being used, but not who used what.

The following figure shows how aggregation transforms user identifiable information into data that can't be tied back to individual users.

Before processing, we could see exactly what set of features an individual user was using, which has privacy implications. After aggregation, we can no longer tell that. We still have valuable data -- we know which features of our product are the most used, which ones are not that important to our customers etc. We can store this data for analytics and ML purposes without having to worry about end user privacy.

Data aggregation is the processing of data into a summarized format. This can be used to transform the data so it can no longer be tied to individual users.
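Here is a minimal sketch of this kind of aggregation over the telemetry table. The rows and the aggregate_feature_usage function are made up for illustration; the book's samples target Azure Data Explorer and are not reproduced here.

```python
from collections import Counter

# Hypothetical telemetry rows: (user ID, timestamp, feature).
telemetry = [
    (10000, "2020-06-29T10:00:00", "search"),
    (10000, "2020-06-29T10:05:00", "export"),
    (10001, "2020-06-29T11:00:00", "search"),
    (10002, "2020-06-29T12:30:00", "search"),
]


def aggregate_feature_usage(rows):
    """Count usage per feature, discarding user IDs and timestamps."""
    return Counter(feature for _user_id, _timestamp, feature in rows)


usage = aggregate_feature_usage(telemetry)
```

The result only tells us that search was used three times and export once. Nothing in it points back to an individual user.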

But maybe we want to know more: we want to see when each feature is used or how our customers use different features in conjunction. Now simply counting feature usage is not enough. We can use a different technique for that: anonymization.

Anonymization

We can use anonymization to unlink the data from the end user. Once the data is no longer connected to a user identifier, we can't tell which user it came from. Going back to our telemetry example, if we want to know when features are used, but don't care who uses them, we can get rid of the user identifier. The following figure shows how we can anonymize data by dropping the user identifiers.

Maybe this is not enough either. What if we still need to see which features are used together by a user, but we don't really care who the user is? We can still anonymize by replacing user identifiers (which can be tracked back to the user) with randomly generated IDs. The following figure shows how we can anonymize the data by replacing each user ID with a randomly generated GUID.

Note that we are intentionally not persisting a mapping between user IDs and corresponding random IDs. We generate the random IDs once and intentionally forget the association.

Once this happens, we can no longer tie the data back to the user, so it is no longer user identifiable. We can still tie together datasets by the random ID, but there is no way to associate the random ID with a user.

Anonymization is the process of removing user identifiable information from the data or replacing it with data that cannot be linked back to a user.
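The replacement with random IDs could be sketched as follows. This assumes the whole dataset is processed in one pass; the anonymize function and the sample rows are hypothetical. The mapping lives only inside the function call and is never persisted.

```python
import uuid

# Hypothetical telemetry rows: (user ID, timestamp, feature).
telemetry = [
    (10000, "2020-06-29T10:00:00", "search"),
    (10000, "2020-06-29T10:05:00", "export"),
    (10001, "2020-06-29T11:00:00", "search"),
]


def anonymize(rows):
    """Replace each user ID with a random GUID, then forget the mapping."""
    mapping = {}  # exists only for the duration of this call
    result = []
    for user_id, timestamp, feature in rows:
        if user_id not in mapping:
            mapping[user_id] = str(uuid.uuid4())
        result.append((mapping[user_id], timestamp, feature))
    return result  # the mapping is intentionally not returned or stored


anonymized = anonymize(telemetry)
```

Rows from the same user still share an ID within one run, so we can see which features are used together, but running the function again produces entirely different IDs.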

If we get new telemetry from our users, we won't be able to generate the same GUID when anonymizing. Each time we run the query we will get different random IDs corresponding to our users. In some cases, this might not be enough. Or we might need to maintain that ability to link back to our original user IDs but restrict who can make that association. For these cases, we can use pseudonymization.

Pseudonymization

In this case, we have scenarios for which we still need to know who the data belongs to, but this is not needed for all scenarios. For example, we might want to keep track of which user used which features so we can notify them of updates to that feature. But for other analytics, it is irrelevant who the user is. For the first case, we have a small set of people who can view this association. For analytics, we have a large team of people looking at the data, but from their perspective, it is anonymous.

We can achieve this by pseudonymizing the data. The difference between pseudonymization and anonymization is that pseudonymization gives us a way to reconstruct the relationship.

When we looked at anonymizing data, we swapped out the user ID with a randomly generated ID. Unless we explicitly stored which user ID got assigned which random ID, we can no longer recover the link.

For pseudonymization, we replace random IDs with something more deterministic. This can be either a hash of the user ID, or an encryption of the user ID.

As a reminder, hashing is a one-way function. Given the result of a hash, you cannot un-hash it to get the original value. Encryption is different -- an encrypted value can be decrypted if we know the encryption key.

Pseudonymization is the process of replacing user identifiable information with a pseudonym. The data can be linked back to a user given some additional information.

Let's look at both approaches.

Pseudonymizing by hashing

If we hash the user IDs and provide a dataset with just hashes, the only way to tie this pseudonymous data back to actual users would be to take all the user IDs in our system and hash them to see where we find a match.

If we restrict the access to the user IDs, then someone who can only query the pseudonymized table can still see all the connections within the dataset (which features are used by which user), but instead of seeing a user ID, they see a pseudonymous identifier. The following figure shows the transformation.

Note that if we only have this dataset consisting of Pseudonymous ID, Timestamp, and Feature, we can't produce a user ID. On the other hand, if we have a user ID, we can always hash it and link it to the pseudonymized data.

We can use this technique in cases when the data scientists processing the pseudonymized data don't have access to the unprocessed, end user identifiable data. This way, they get a dataset that is, for all intents and purposes, just like the original, except there is no mention of user IDs.

This doesn't work if the user IDs are also visible, since it is easy to hash them again and produce the pseudonymous IDs. One option is to keep the hashing algorithm secret and add a salt. In cryptography, a salt is some additional data mixed in to make it harder to recreate the connection; here we keep it secret. For example, we can XOR the user ID with some number (our salt) before hashing.

Now, as long as the salt is kept secret, someone can't get from user ID to the pseudonymous ID even if they know which hashing algorithm is used for pseudonymization.
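Both variants could be sketched like this. The function names are hypothetical, and the choice of SHA-256 is an assumption. For the salted version, this sketch swaps the XOR example from the text for a keyed hash (HMAC) from the standard library, which is a common way of mixing a secret value into a hash.

```python
import hashlib
import hmac


def pseudonymize(user_id: int) -> str:
    """Plain hash: anyone who has the user ID can recompute this."""
    return hashlib.sha256(str(user_id).encode("utf-8")).hexdigest()


def pseudonymize_salted(user_id: int, salt: bytes) -> str:
    """Keyed hash: recomputing it also requires the secret salt."""
    return hmac.new(salt, str(user_id).encode("utf-8"), hashlib.sha256).hexdigest()


plain = pseudonymize(10000)
salted = pseudonymize_salted(10000, b"hypothetical-secret-salt")
```

Both functions are deterministic, so new telemetry for the same user maps to the same pseudonymous ID, which is what lets us join datasets on it.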

Let's now look at the alternative to hashing: encryption.

Pseudonymizing by encrypting

If we encrypt the user IDs and provide a dataset with encrypted values, the only way to tie this back to actual users would be to decrypt. As long as the encryption key is secure and only available on a need-to-know basis, people that don't need to know can't recover the association.

This is similar to the hashing technique we just saw, except it is a two-way transformation. Even without having access to a user ID to hash, we can produce a user ID by decrypting an encrypted pseudonymized ID. Figure 6 shows what this would look like.

We will use encryption instead of hashing if we have a scenario in which we don't have the original dataset available, but we need a way to recover it. In this case, we can rely on the two-way transformation provided by encryption and restore the original dataset by decrypting the pseudonymized dataset.
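To show the two-way property, here is a toy sketch that uses only the standard library. It assumes user IDs fit in 64 bits and builds a small Feistel network with HMAC-SHA256 as the round function. This is an illustration of a reversible, keyed transformation, not a vetted cipher; a real system should rely on a reviewed cryptography library. All names here are hypothetical.

```python
import hashlib
import hmac

ROUNDS = 4
MASK32 = 0xFFFFFFFF


def _round(key: bytes, round_no: int, half: int) -> int:
    """Keyed round function: 32 bits derived from HMAC-SHA256."""
    data = bytes([round_no]) + half.to_bytes(4, "big")
    digest = hmac.new(key, data, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big")


def encrypt_id(user_id: int, key: bytes) -> int:
    if not 0 <= user_id < 2**64:
        raise ValueError("this sketch assumes 64-bit unsigned IDs")
    left, right = user_id >> 32, user_id & MASK32
    for r in range(ROUNDS):
        left, right = right, left ^ _round(key, r, right)
    return (left << 32) | right


def decrypt_id(pseudonym: int, key: bytes) -> int:
    left, right = pseudonym >> 32, pseudonym & MASK32
    # Undo the rounds in reverse order.
    for r in reversed(range(ROUNDS)):
        left, right = right ^ _round(key, r, left), left
    return (left << 32) | right


key = b"hypothetical-need-to-know-key"
pseudonym = encrypt_id(10000, key)
recovered = decrypt_id(pseudonym, key)
```

Unlike the hash, we never need the original user ID to restore the link: the pseudonymous ID and the key are enough.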

An alternative to transforming data is masking.

Masking

Masking means hiding parts of the data from whoever accesses it, even if the data is fully available in our system. Think of how social security numbers are reduced to the last 4 digits: ***-**-1234.

Masking sensitive data makes it less sensitive -- obviously, even with bad intent, someone can't do much with just the last 4 digits of a social security number, with just the city and state of a home address, or with the first few digits of a phone number.

Masking the data does require an additional layer in between the raw storage and people querying the data, which determines who gets to see the unmasked, full dataset, and who is restricted to a more limited view of the data. The following figure shows what masking looks like for our User profile table.

Unlike our previous techniques, which transformed the data, this happens in-place. We still have the full credit card number stored, but not everyone querying the table will be able to see it.

Masking leverages an additional layer between the raw data and query issuers to hide sensitive information from non-privileged access.
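A minimal sketch of such a layer follows. The read_profile function and the privileged flag are hypothetical stand-ins for whatever access control the storage engine provides; the stored record is never modified.

```python
def mask_card_number(card_number: str, visible: int = 4) -> str:
    """Replace all but the last `visible` digits with '*', keeping separators."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_hide = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and to_hide > 0:
            masked.append("*")
            to_hide -= 1
        else:
            masked.append(ch)
    return "".join(masked)


def read_profile(profile: dict, privileged: bool) -> dict:
    """Return the full record to privileged callers, a masked copy otherwise."""
    if privileged:
        return dict(profile)
    return {**profile, "credit_card": mask_card_number(profile["credit_card"])}


stored = {"user_id": 10000, "name": "Ava Smith", "credit_card": "1234-5678-9012-3456"}
limited_view = read_profile(stored, privileged=False)
full_view = read_profile(stored, privileged=True)
```

The masking happens at read time, so the same table serves both audiences without keeping a second, transformed copy of the data.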

The good news is many storage solutions and database engines offer such a layer out-of-the-box (see Azure Data Explorer's row level security for example).

Summary

In this article we looked at a few ways in which we can take sensitive data and make it less sensitive:

Aggregating data makes it impossible to connect it back to individual users.
Anonymizing data, while a bit more involved than aggregating, preserves the granularity of user-level data, while removing the identifiable parts.
In some cases, we do have legitimate scenarios in which we want to trace back the data to actual users. In this case, we can use pseudonymization to make the data partially anonymous and only restore the link with the real user ID on a need-to-know basis.
Hashing is a one-way transformation of the data. Given a pseudonymized ID, we can't recover a user ID. We can restore the association by hashing the user ID again and joining on the pseudonymized ID. Adding a secret salt to a hash makes it harder to restore the association (one would need to also know the salt value).
Encryption is a two-way transformation, which requires an additional piece of information: a key. Given a pseudonymized ID, we can recover the user ID if we have the key by decrypting the data.
Masking is another technique for hiding sensitive information. In this case, the data is not transformed, rather an in-between layer can hide sensitive information and only make it available when appropriate.

These are important techniques to know when dealing with sensitive data, since they all allow us to make more data available to more analytical scenarios without compromising on user privacy.

Data Quality Testing Patterns

Fri, 13 Nov 2020 00:00:00 -0800

Data Quality Testing Patterns

The insights generated by a data platform are only as good as the quality of the underlying data. I briefly mentioned data quality in my Notes on Data Engineering blog post and elaborated on the DataCop solution in the Partnering for data quality Medium article I co-authored with my colleagues.

In this post I want to talk about some of the requirements and common patterns of data quality test solutions. I maintain that in the future data quality testing will be commoditized and offered as a service by cloud providers. As of today, it is still something we have to stitch together ourselves or onboard a third-party solution for. Below is a blueprint for such solutions.

Data Fabrics

Any data quality test run is eventually translated into a query executing on the underlying data fabric. In Azure, this can be, for example, Azure SQL, Azure Data Explorer, Data Lake Storage etc. Data quality frameworks need to support multiple data fabrics for several reasons. First, in a big enough enterprise, data will live in multiple systems. Second, different data fabrics are optimized for different workloads, so even if we are very strict about the number of technology choices, sometimes it simply makes sense to leverage different data fabrics for different workloads.

We probably don't want to manage multiple data quality solutions specializing in different data fabrics, since data quality solutions become part of our infrastructure. Regardless of target data fabrics, any data quality solution needs to implement some form of test scheduling, alerting etc.

The best pattern is to support multiple data fabrics through a plug-in model, so support for additional data fabrics can be easily added by implementing a new plug-in, while the core of the system is shared for all data fabrics plugged into the system.
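One way to sketch the plug-in model: the core depends only on a small interface, and each data fabric supplies an implementation. The DataFabric protocol, QualityTestRunner, and SqliteFabric are hypothetical names; SQLite stands in for a real fabric so the example is runnable.

```python
import sqlite3
from typing import Protocol


class DataFabric(Protocol):
    """What the core needs from any fabric: run a query, return a number."""

    def run_scalar(self, query: str) -> int: ...


class SqliteFabric:
    """Plug-in for SQLite, used here as a stand-in for a real data fabric."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn

    def run_scalar(self, query: str) -> int:
        return self._conn.execute(query).fetchone()[0]


class QualityTestRunner:
    """Shared core: scheduling, alerting etc. would live here, once."""

    def __init__(self):
        self._fabrics = {}

    def register(self, name: str, fabric: DataFabric) -> None:
        self._fabrics[name] = fabric

    def run(self, fabric_name: str, query: str) -> bool:
        # In this sketch, a test passes when its query returns a positive number.
        return self._fabrics[fabric_name].run_scalar(query) > 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telemetry (user_id INTEGER, feature TEXT)")
conn.execute("INSERT INTO telemetry VALUES (10000, 'search')")

runner = QualityTestRunner()
runner.register("sqlite", SqliteFabric(conn))
passed = runner.run("sqlite", "SELECT COUNT(*) FROM telemetry")
```

Supporting another fabric means writing one more class with a run_scalar method and registering it; the runner itself doesn't change.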

Types of Tests

The 6 Dimensions of Data Quality article talks about completeness, consistency, conformity, accuracy, integrity, and timeliness. I think of data quality testing like unit testing for code, so I will present a slightly different take on this. Some of the dimensions are harder to test for than others, for example checking consistency across multiple data systems is non-trivial (more equivalent to an integration test). Some of the dimensions have nuances -- for example detecting anomalies in day over day data volume change to identify potential issues vs. just looking at a point in time.

Availability

The simplest type of data test is availability, meaning whether data is available for a certain date. If, for example, we ingest yesterday's telemetry data every night, we expect to have yesterday's telemetry data available in our system tomorrow.

This type of test can be implemented as a query against the underlying data fabric that ensures some data is there. That can mean a query that returns at least one row passes the test. This can be an early smoke test we can run before performing more involved testing.
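As a sketch, assuming a SQL-like fabric (SQLite here, so the example is runnable) and a hypothetical telemetry table, an availability test can be as small as this:

```python
import sqlite3


def is_available(conn, table: str, date: str) -> bool:
    """Pass if at least one row exists for the given date."""
    # The table name comes from trusted test definitions, not user input.
    row = conn.execute(
        f"SELECT 1 FROM {table} WHERE date(timestamp) = ? LIMIT 1", (date,)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telemetry (user_id INTEGER, timestamp TEXT, feature TEXT)")
conn.execute("INSERT INTO telemetry VALUES (10000, '2020-11-12 08:30:00', 'search')")

loaded = is_available(conn, "telemetry", "2020-11-12")
missing = is_available(conn, "telemetry", "2020-11-13")
```

The LIMIT 1 keeps the check cheap: we only need to know that some data exists, not how much.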

Correctness

This type of test ensures data is correct, based on some criteria. For example, ensuring columns that shouldn't be empty are not empty, values are within expected ranges, and so on.

This type of test can be implemented as a query against the underlying data fabric that ensures no out-of-bounds values are returned. Going back to the unit test equivalency, we likely want multiple correctness tests per dataset, one or more per column we want to test.
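A sketch of several correctness tests against one dataset, again using SQLite as a stand-in fabric. The orders table, its columns, and the count_violations helper are made up for illustration. Each test is a condition describing bad rows, and it passes when no row matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, state TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "WA", 19.99), (2, None, 5.00), (3, "OR", -3.50)],
)


def count_violations(conn, table: str, condition: str) -> int:
    """Count rows matching a condition that describes invalid data."""
    # Table and condition come from trusted test definitions, not user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table} WHERE {condition}").fetchone()[0]


# One or more tests per column, much like unit tests per function.
tests = {
    "state is never empty": "state IS NULL OR state = ''",
    "amount is non-negative": "amount < 0",
    "order_id is positive": "order_id <= 0",
}
results = {
    name: count_violations(conn, "orders", condition) == 0
    for name, condition in tests.items()
}
```

With the sample rows above, the first two tests fail (one missing state, one negative amount) and the third passes.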

Completeness

Completeness tests expand on availability tests and validate not only that some data is available, but that all the data is available.

In some cases, this can be an exact query: for example, if we are expecting some data for all 50 US states, we can check that the count of distinct states is 50.

More advanced checks look at historical data and ensure the volume of the data we are checking is within a certain threshold -- for example, our telemetry data volume should be within +/-5% of the data volume we observed yesterday.
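Both kinds of completeness checks reduce to simple comparisons once the counts are retrieved from the data fabric. A sketch, with hypothetical function names and the 5% tolerance from the example above:

```python
def has_all_values(observed, expected_count: int) -> bool:
    """Exact check: the number of distinct values matches what we expect."""
    return len(set(observed)) == expected_count


def within_threshold(today: int, yesterday: int, tolerance: float = 0.05) -> bool:
    """Historical check: today's volume is within +/- tolerance of yesterday's."""
    if yesterday == 0:
        return today == 0
    return abs(today - yesterday) / yesterday <= tolerance


states_ok = has_all_values(["WA", "OR", "WA", "CA"], expected_count=3)
volume_ok = within_threshold(today=1040, yesterday=1000)      # 4% higher
volume_suspect = within_threshold(today=1100, yesterday=1000) # 10% higher
```

In a real test, the inputs would come from queries such as a distinct count of states or a row count per ingestion date.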

Anomaly detection

We won't go too deep into this, but there are some complexities associated with the previous type of historical data checks. For example, website traffic volume, depending on the website, might be very different day over day between weekends and workdays, or during holidays etc.

For these situations we can use more complex AI-based anomaly detection, which tracks volume over time until the system has learned the expected pattern and can identify anomalous data.
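
To make the weekday/weekend problem concrete, here is a deliberately simple statistical stand-in, not the AI-based detection mentioned above: it compares a day's volume only against previous observations for the same weekday and flags values more than a few standard deviations away. The function name, the threshold of 3, and the sample numbers are all assumptions.

```python
from statistics import mean, stdev

def is_anomalous(history, value, max_deviations=3.0):
    """history: volumes observed on the same weekday in previous weeks."""
    if len(history) < 2:
        return False  # not enough data to judge
    center = mean(history)
    spread = stdev(history)
    if spread == 0:
        return value != center
    return abs(value - center) / spread > max_deviations

# Hypothetical traffic: busy Mondays, quiet Sundays.
mondays = [1000, 1020, 980, 1010]
sundays = [300, 310, 290, 305]

quiet_sunday = is_anomalous(sundays, 300)   # normal for a Sunday
quiet_monday = is_anomalous(mondays, 300)   # the same volume is suspicious on a Monday
```

The same volume passes on one day and fails on another, which a fixed day-over-day threshold cannot express.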

Queries, Code or Configuration

The implementation of tests can be done as queries, code, configuration, or a mix.

Implementing all tests as queries means writing each test as a stored procedure (or equivalent), so it is fully implemented on the data fabric it executes against. The test framework just invokes the test and reads the result. The main challenge with this is that there is not a lot of reuse. Since the framework ends up just calling a stored procedure, it is up to the test author to write the test, which is ultimately an arbitrary query.

Implementing all tests as code means wrapping each data quality test into a test method in some programming language. The main advantage of this is we can use an off-the-shelf test framework with all its benefits. Unlike data quality testing, the field of software testing is mature. On the other hand, the main drawback is that authoring tests as code really raises a barrier to entry, making it harder for non-engineers to create tests.

Tests as configuration means defining tests using text configuration files in a format like JSON, YAML, or XML. The framework interprets these and translates them into the final queries to execute against the data fabric. The main advantage of this approach is a fairly data fabric-independent way to specify tests, around which we can build schema validation etc. The disadvantage is increased complexity in the framework, as it must translate configuration to queries.

What works best, from my experience, is a mix of queries and configuration: give enough flexibility in the config schema to author custom queries and let the framework handle some of the common concerns like scheduling.
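
A small sketch of what such a mix could look like: a JSON config where common test types are declared by name and a custom type carries an arbitrary query. The config schema, the test names, and the convention that every query returns a single truthy or falsy value are all assumptions made for this example, with SQLite standing in for the data fabric.

```python
import json
import sqlite3

CONFIG = json.loads("""
{
  "tests": [
    {"name": "telemetry_available", "type": "availability",
     "table": "telemetry", "date_column": "event_date", "date": "2020-10-01"},
    {"name": "no_negative_durations", "type": "custom",
     "query": "SELECT COUNT(*) = 0 FROM telemetry WHERE duration_ms < 0"}
  ]
}
""")

def build_query(test):
    """Translate one test definition into (sql, params)."""
    if test["type"] == "availability":
        sql = (f"SELECT COUNT(*) > 0 FROM {test['table']} "
               f"WHERE {test['date_column']} = ?")
        return sql, (test["date"],)
    if test["type"] == "custom":
        # Escape hatch: the author supplies the full query.
        return test["query"], ()
    raise ValueError(f"unknown test type: {test['type']}")

def run_tests(conn, config):
    """Run every configured test; each query yields one pass/fail value."""
    results = {}
    for test in config["tests"]:
        sql, params = build_query(test)
        results[test["name"]] = bool(conn.execute(sql, params).fetchone()[0])
    return results

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telemetry (event_date TEXT, duration_ms INTEGER)")
conn.execute("INSERT INTO telemetry VALUES ('2020-10-01', 120)")
results = run_tests(conn, CONFIG)
```

The declared types keep simple tests short and portable, while the custom type avoids having to grow the schema for every one-off check.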

Test Execution

Another important concern is when we want to run a given data quality test.

In some scenarios we want to run tests as part of a data movement workflow. We want to either check that the data is in good shape at the source, before we copy it, or check that the data is in good shape after it got copied to the destination. Let's call this type of execution on demand. This type of test execution is integrated in the ETL pipeline.

In other scenarios we want to run the tests on a schedule, as we expect the data to be available. For example, if data should have arrived in our SQL database by 7 AM every morning, we want to run our data quality tests at 7 AM to make sure everything is in good shape. In this case, we don't really care how the data gets here, so we are running independently of the ETL pipeline that brings the data in. Let's call this scheduled execution.

A good data quality test framework should support both on demand and scheduled test execution, as there is room (and need) for both types of tests.
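
One way to picture supporting both modes is a single test registry with a trigger attribute: a pipeline step asks for the on-demand tests of the dataset it is moving, while a scheduler tick asks for whatever is due at the current time. The registry layout and function names below are hypothetical.

```python
from datetime import datetime, time

# Hypothetical test registry; in practice this would come from configuration.
TESTS = [
    {"name": "source_has_rows", "dataset": "telemetry", "trigger": "on_demand"},
    {"name": "arrived_by_7am", "dataset": "sales", "trigger": "scheduled",
     "at": time(7, 0)},
]

def on_demand_tests(dataset):
    """Tests an ETL step would invoke before or after copying a dataset."""
    return [t["name"] for t in TESTS
            if t["trigger"] == "on_demand" and t["dataset"] == dataset]

def scheduled_tests(now):
    """Tests a scheduler tick should run at this moment, whatever the ETL state."""
    return [t["name"] for t in TESTS
            if t["trigger"] == "scheduled"
            and (t["at"].hour, t["at"].minute) == (now.hour, now.minute)]

from_pipeline = on_demand_tests("telemetry")
from_scheduler = scheduled_tests(datetime(2020, 10, 8, 7, 0))
```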

Monitoring and Alerting

A data quality framework must be integrated with an alerting system such that whenever a data quality test fails, data engineers are proactively alerted so they can start investigating/mitigating the data issue.

A dashboard showing the overall health of the data platform is another important component. This should include data quality for all datasets under test, so stakeholders suspecting a data issue can quickly check when the latest tests ran for a given dataset and what the results were.

A slightly more advanced capability is lineage tracking -- if we identify an issue with a dataset, do we know what the impact is? Which analytics, machine learning models, or downstream processing are impacted? Tracking metadata like lineage is a deep topic in itself, but integrating it with a data quality framework enables very powerful observability.

Summary

In this post we covered some patterns for data quality testing:

Supporting multiple data fabrics via a plug-in model.
Types of tests, from simple availability to anomaly detection.
How to best specify tests as a mix of queries and configuration.
Test execution, both on a schedule and on-demand.
Monitoring and alerting to bubble up data quality issues.

A data quality test framework is a critical part of a data platform, giving confidence in the results produced by all workloads running on the platform.

Azure Data Engineering

Thu, 08 Oct 2020 00:00:00 -0700

Azure Data Engineering

I am happy to announce that my new book, Azure Data Engineering, launched in the Manning Early Access Program (MEAP). While still a work in progress, the first chapters are available online. As I keep working on the book and polishing the draft, more chapters will be added, and existing chapters will be updated. That being said, the preview is now live.

For the past few years, I had the opportunity to work as the architect for Azure's growth team. Ron Sielinski, our director of data science, describes how our team uses Azure to understand Azure in this great article on our Data Science @ Microsoft Medium publication.

Our engineering team maintains a big data platform, built fully on Azure, which supports all our team's workloads. After we launched our Medium publication, I contributed a bunch of articles describing some of our infrastructure, challenges, and solutions. I talked about how we use Azure Data Explorer, how we enabled self-serve analytics, how we scaled out our ML platform, common challenges I noticed across the industry, and data quality.

There are plenty of resources out there covering statistics, data science, and machine learning, but comparatively little covering the engineering aspects of working with big data. This book is what I wish I had available to read when joining the team, to help navigate this complex space and lessons I had to learn the hard way.

The engineering in data engineering

While many data science projects start as exploratory, once they show real value, they need to be supported in an ongoing, reliable fashion. In the software engineering world, this is the equivalent of taking a research, proof of concept, or hackathon project and graduating it into a fully production-ready solution. While a hack or a prototype can take many shortcuts and focus on the meat of the problem it addresses, a production-ready system does not cut any corners. This is where the engineering part of software engineering comes into play: the engineering rigor to build and run a reliable system. This includes a plethora of concerns like architecture and design, performance, security, accessibility, telemetry, debuggability, extensibility and so on.

Data engineering is the part of data science dealing with the practical applications of collecting and analyzing data. It aims to bring engineering rigor to the process of building and supporting reliable data systems.

Data engineering is surprisingly similar to software engineering and frustratingly different. While we can leverage a lot of the learnings from the software engineering world, as we will see in this book, there is a unique set of challenges we will have to address. Some of the common themes are making sure everything is tracked in source control, automatic deployments, monitoring and alerting. A key difference between data and code is that code is static: once the bugs are worked out, a piece of code is expected to consistently work reliably. On the other hand, data moves continuously into and out of a data platform and it is very likely for failures to occur due to various external reasons. Governance is another major topic which is specific to data: access control, cataloguing, privacy, and regulatory concerns are a big part of a data platform.

The main theme of the book is bringing some of the lessons learned from software engineering over the past few decades to the data space, so you can build a data platform exhibiting the properties of a solid software solution: scale, reliability, security, and so on.

Anatomy of a big data platform

A big data platform ingests data from multiple sources into a storage layer. Data is consumed from the storage layer to enable various workloads (data modeling, analytics, machine learning). Data is then distributed downstream to consumers. All the activity in a data platform needs to be orchestrated by an orchestration layer. Governance is extremely important. And, of course, DevOps is the key: deploying everything from source control.

The book is divided in 3 parts, each part looking at a big data platform through a different lens:

Part 1 of the book will focus on infrastructure, the core services of a data platform.
- We will start with storage, the backbone of any data platform. Chapter 2 will cover the requirements and common patterns for storing data in a data platform.
- Since our focus is on production systems, in chapter 3 we'll discuss DevOps and what DevOps means for data.
- Data is ingested into the system from multiple sources. Data flows into and out of the platform and various workflows are executed. All of this needs an orchestration layer to keep things running. We will talk about orchestration in chapter 4.
Part 2 will focus on the 3 main workloads that a data platform must support:
- Modeling: this includes aggregating and reshaping the data, standardizing schema, and any other processing of the raw input data. This makes the data easier to consume by the other two main processes: analytics and machine learning. We will talk about data modeling in chapter 5.
- Analytics: this covers all analysis and reporting on the data, deriving knowledge and insights. We will look at ways to support this in production in chapter 6.
- Machine learning: these are all machine learning models training on the data. We cover running ML at scale in chapter 7.
Part 3, governance, is a major topic with many aspects. We will cover governance in chapters 8, 9, and 10, touching on the key topics:
- Metadata: cataloguing and inventorying the data, tracking lineage, definitions and documentation is the subject of chapter 8.
- Data quality: how to test data and assess its quality is the topic of chapter 9.
- Compliance: honoring compliance requirements like the General Data Protection Regulation (GDPR), handling sensitive data, and controlling access is covered in chapter 10.
- After all the processing steps, data eventually leaves the platform to be consumed by other systems. We will cover the various patterns for distributing data in chapter 11.

The examples in the book are built on Azure, using a specific set of technologies, but the patterns should apply regardless of specific tech choices or even cloud providers. Check out the book here and follow me on LinkedIn or Twitter for updates.

Also posted on Medium.

Machine Learning at Scale

Mon, 27 Apr 2020 00:00:00 -0700

Machine Learning at Scale

This is a cross-post of the article I wrote for Data Science @ Microsoft, Running machine learning at scale.

Our team runs dozens of production machine learning models on a daily, weekly, and monthly basis. We recently went through a redesign of our ML infrastructure to increase its abilities to enable self-serve, scale to match computing needs, reduce impacts among models running on the same VM, and remove differences between dev and production environments. In this post, I will describe the challenges we faced with the previous infrastructure and how we addressed them with our Version 2 architecture.

Version 1

Our machine learning engineers use Python and R to implement models. Our Version 1 infrastructure used a custom XML format from which we generated Azure Data Factory (ADF) v1 pipelines to copy the model input data to blob storage. Then the models ran on a set of VMs our team maintained. The models read their input from and wrote their output back to blob storage. The ADF pipelines then copied the outputs to Azure Data Explorer and to our data distribution Data Lake.

The Control VM consumes XML from Git and generates ADF pipelines to orchestrate data movement and run ML code on a set of VMs.

The V1 infrastructure had several challenges we set out to overcome:

No self-serve: Much like how we implemented a self-serve environment for analytics, we wanted to do something similar for machine learning, so our ML engineers can create and deploy models without needing help from the data engineering team.
Auto-scaling: We have some compute-intensive models that run on a certain day of the month when upstream datasets become available. For a few days, we need large compute. Then, until the next month, our compute needs decrease significantly. The V1 infrastructure didn't account for this and we had a constant number of VMs running at all times.
Isolation: We used to pack multiple models on the same VM, so if one of them consumed, for example, too much RAM, it would impact all the other models running on the same VM. We needed better isolation.
Differences between dev and prod environments: One issue we kept hitting involved models that ran fine on the ML engineer's VM but failed when moving to production because of environment differences such as missing packages.

The combination of these issues created significant operational costs for the data engineering team: VM management, scaling issues, and having to re-run models that failed because of either resource constraints or bugs caused by missing packages in the production environment. With our machine learning engineers developing more and more models, we decided to invest in making our infrastructure more robust, scalable, and self-serve.

Version 2

Our Version 2 infrastructure aims to address all these issues and provide a scalable, self-serve platform for machine learning. Built fully on Azure, it is made up of the following components, which we'll discuss in turn:

Orchestration.
Storage and distribution.
Compute.
DevOps.
Monitoring and alerting.

ADF pipelines deployed from Git orchestrate data movement and running ML code, also deployed from Git, on Azure Machine Learning. Data is distributed through ADLS. The system is monitored using Azure Monitor.

Orchestration

For orchestration, we use Azure Data Factory V2. In contrast to our V1 infrastructure, we don't use XML to generate pipelines. Instead, we provide a set of templates that ML engineers can use to author the pipelines themselves. We have a dev ADF instance and a production one.

In general, an end-to-end machine learning pipeline has three steps: Move inputs, kick off compute, and move outputs. Templates make it easy to create and configure a pipeline.
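
To illustrate the shape of such a template, here is a hypothetical sketch that expands a model name and its datasets into the three chained steps. This is not the ADF pipeline format or our actual templates; the make_pipeline function and the dictionary layout are made up for the example.

```python
def make_pipeline(model_name, inputs, outputs):
    """Expand a model definition into the three standard steps, in order."""
    return {
        "name": f"{model_name}-pipeline",
        "steps": [
            # 1. Copy the model inputs to where compute can read them.
            {"name": "move_inputs", "kind": "copy",
             "datasets": inputs, "depends_on": []},
            # 2. Kick off the model run once the inputs have landed.
            {"name": "run_model", "kind": "compute",
             "model": model_name, "depends_on": ["move_inputs"]},
            # 3. Copy the outputs to their destination after the run.
            {"name": "move_outputs", "kind": "copy",
             "datasets": outputs, "depends_on": ["run_model"]},
        ],
    }

pipeline = make_pipeline(
    "churn-model",
    inputs=["subscriptions", "usage"],
    outputs=["churn_scores"],
)
step_order = [step["name"] for step in pipeline["steps"]]
```

With a template like this, an ML engineer only supplies the parts that differ between models, and the structure stays the same across pipelines.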

We use CI/CD for ADF: The dev ADF instance is synced with Git, and so an ML engineer can submit a pull request for review. Once approved and merged to master, ADF generates the ARM template we use to deploy the production ADF instance.

The two data factories are similar, except that they are connecting to different environments: Dev to the development storage and compute, and production to the production storage and compute, which is locked down. This addresses one of the limitations of our V1, as we now have similar dev and production environments and so graduating a model from dev to production is much more seamless.

Storage and distribution

For storage, we switched from blob to Azure Data Lake Storage (ADLS) gen2. ADLS gen2 is backed by blob storage, with a couple of important additional features: A hierarchical file system and granular access control. Both are key to our infrastructure.

The hierarchical file system allows us to create a common folder structure for our models, with separate folders for inputs and outputs for each run. The granular access control allows us to enforce who gets to see what data. This is an important aspect of our platform, since some models are trained on sensitive information such as Microsoft revenue that not everyone can view.

Because we are already distributing data through ADLS, we can skip a copy step: Instead of moving the model output to our data distribution Data Lake, we can share it in place, applying proper access control. The less data moves around, the less opportunity for issues in our system and the less network bandwidth our platform needs to use.

Compute

Compute is our biggest upgrade from V1: Instead of maintaining VMs, we switched to using Azure Machine Learning (AML). This is an important switch from IaaS to PaaS, where a lot of the infrastructure we spent time maintaining is now provisioned and handled by AML.

AML addresses two of the main problems we set out to solve: Auto-scaling and isolation for our models. We can run each model on dedicated compute and then AML takes care of spinning up the resources required for a run, winding them down once the run is over.

Because the configuration for compute is done via code, the dev and production environments are identical, meaning we don't run into any issues when graduating a model to production. We can also select the size of compute we want via this configuration, and so a model that is more resource intensive can be configured to run with more RAM and/or more CPU power. AML also gives us statistics on CPU and memory usage, which helps us right-size compute for each model.

DevOps

Both ADF and AML are synced with Git and deployed via two Azure DevOps release pipelines. One of them updates the production ADF instance, the other updates the AML production instance. We split the two because updating model code doesn't require any updates to the orchestration. This means, for example, that for a model bug fix it is enough to deploy to AML without touching the Data Factory.

Having everything in Git enables self-serve, and brings in the required engineering rigor: Changes are done through pull requests, we have a code review process, we don't make manual changes in the production environment, we have a history of all the changes, and we can rebuild an environment from scratch if needed.

Monitoring and alerting

We are running a production system, and so monitoring and alerting are key components. For monitoring, we use Azure Monitor/Log Analytics. ADF orchestrates all our model runs and it integrates natively with Azure Monitor, where we can define alerts for pipeline failures.

For alerting, we use IcM, the Microsoft-wide incident tracking system. Pipeline failures generate incident tickets, which alert engineering of live site issues, like all Azure production services. We are also providing a Power BI dashboard where stakeholders can see the status of all models.

Monitoring and alerting help us maintain our service-level agreements and operational excellence.

Summary

In this article we looked at our Version 2 machine learning infrastructure, going over its key components:

We use Azure Data Factory to orchestrate all data movement and model runs.
We use Azure Data Lake Storage to store both model inputs and outputs, which allows us to implement granular access control and easily distribute the data to teams downstream.
We use Azure Machine Learning for compute, which enables auto-scaling and isolation for model runs.
We use Azure DevOps to deploy from Git, which enables self-serve and reproducibility.
We use Azure Monitor for production environment monitoring and alerting.

This cloud-native architecture allows us to reliably run ML at scale with a self-serve environment for our machine learning team, increasing their productivity while decreasing the resources we need to spend.

Azure Data Explorer

Sun, 01 Mar 2020 00:00:00 -0800

Azure Data Explorer

This is a cross-post of the article I wrote for Data Science @ Microsoft, Azure Data Explorer at the Azure business scale.

As I mentioned in my previous post, Self-Serve Analytics, our team uses Azure Data Explorer (ADX) as our main data store. In this post, I will delve deeply into how we use ADX.

Use Case

The ADX documentation describes it as a fast, fully managed data analytics service for real-time analysis on large volumes of data streaming from applications, websites, IoT devices, and more. ADX streams terabytes of data and enables real-time analytics to be performed on it. In many cases, this ADX-enabled data is used in the context of ingesting and analyzing telemetry from various services or endpoints. Our use case for ADX is different: We use it as the main data store for our big data platform.

We rely on the ingestion capabilities of ADX to pull in terabytes of data pertaining to the Azure business from various sources. Our data science team then leverages the fast query capabilities offered by ADX to explore the data in real time and perform modeling work, which leads to a better understanding of our customers.

Some of the data points our team uses are already stored in ADX when we access them, in different clusters managed by other teams. We use the cross-cluster query capabilities provided by ADX to join these external data sets with local data.

Our engineering team also relies on the same fast query capabilities of ADX to power some of our web APIs.

Ingestion

We have a well-defined process to bring new data sets into our cluster. First, we take a one-time snapshot of a potential data set and store it in a separate cluster (our acquisition cluster), where access is even further restricted to the small set of individuals tasked with exploring this new data set. This initial exploration gives us a good sense of which parts of the data are useful for ingesting on a regular cadence, and what our data model should look like. We can then create a data contract with the upstream team to define SLAs and start automating the data pull.

All data movement is set up in Azure Data Factory and actively monitored.

DevOps and Analytics

As I mentioned in my previous post on our data environment, we use the Azure DevOps ADX Task to deploy objects from git. Tables are set up using a .create-merge table script while functions are set up using a .create-or-alter function script. Both commands are idempotent so we can replay them even if objects already exist.

As a team, we've standardized on ADX functions for analytics, so all the reports, KPIs, and metrics our team produces end up implemented as functions stored in git and deployed using Azure DevOps. The ability to organize objects in a folder structure helps us group them by focus area.

Customer Model

Not only do we ingest large amounts of data into our main ADX cluster, we are also processing and enhancing it to build what we call the customer model.

The customer model consists of three components:

A keyring, which helps us tie together various identifiers used across the business, enabling us to understand, for example, which company a subscription belongs to.
A set of customer properties, which you can think of as key-value pairs attached to an identity in our system.
An activity model, which represents a timeline view of various relevant events for an identity in our system. For example, for a subscription identifier we have events such as created and closed.

We use a set of Logic Apps and CosmosDB to process and enhance raw data into our customer model, which consists of a keyring, customer properties, and an activity model.

The customer model is continuously updated as we ingest new data points and represents an enhanced view of the raw data. It is implemented as a small set of (very large) tables and multiple functions to improve navigation. The expressive ADX function syntax allows us to create functions that can be combined to produce very complex queries of the data model.

The workflow of building the model is orchestrated by Logic Apps, which run ADX functions to join and enhance the raw data. The keyring is an exception: We build it using CosmosDB, namely the Gremlin API, which can perform graph traversal. We load all identifiers as vertices and known connections as edges, and then we group each connected component of the graph into a key group. This gives us the association across all identities within our system. The output is written back to ADX.
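
The grouping step is a connected-components computation. As an illustration only, the sketch below does it in memory with a union-find structure instead of graph traversal in CosmosDB; the identifiers and the build_key_groups function are hypothetical.

```python
from collections import defaultdict

def build_key_groups(identifiers, connections):
    """Group identifiers (vertices) linked by connections (edges)."""
    parent = {identifier: identifier for identifier in identifiers}

    def find(x):
        # Walk up to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the components on both ends of every known connection.
    for a, b in connections:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for identifier in identifiers:
        groups[find(identifier)].add(identifier)
    return list(groups.values())

identifiers = ["sub-1", "sub-2", "tenant-A", "company-X", "sub-3"]
connections = [("sub-1", "tenant-A"), ("sub-2", "tenant-A"),
               ("tenant-A", "company-X")]
key_groups = build_key_groups(identifiers, connections)
```

Here sub-1 and sub-2 end up in the same key group as company-X even though neither is directly connected to it, which is exactly the association the keyring provides.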

We consume the customer model through ADX functions. As an example, the GetRelatedKeysByType() function takes as arguments an identifier value and an identifier type name and returns all identifiers related to it from the keyring. We can pass the result of this call to the GetActivities() function, which also takes as arguments a startDate datetime and an endDate datetime, to get all activities for the given ID group within that time range.

Different activities are described by different properties. For example, a subscription created activity contains, among other things, an Offer ID, an Offer Type, and a flag indicating whether the subscription was created as a trial. As another example, a daily usage activity contains the name of the sold service and consumption units. We use the ADX pack() function to store these properties as dynamic objects in the underlying data model, allowing us to maintain a standard schema.
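
The idea of a standard schema with a dynamic property bag can be sketched outside ADX as well. In the snippet below a JSON string plays the role of the dynamic column; the column names, activity type names, and property values are assumptions for illustration.

```python
import json

def make_activity(identity, activity_type, timestamp, **properties):
    """Build an activity row: fixed columns plus one dynamic property bag."""
    return {
        "identity": identity,
        "activity_type": activity_type,
        "timestamp": timestamp,
        # Activity-specific fields live here, so the row schema never changes.
        "properties": json.dumps(properties, sort_keys=True),
    }

created = make_activity(
    "sub-1", "SubscriptionCreated", "2020-01-15",
    offer_id="offer-42", offer_type="PayAsYouGo", is_trial=False,
)
usage = make_activity(
    "sub-1", "DailyUsage", "2020-01-16",
    service="Storage", units=12.5,
)
same_schema = created.keys() == usage.keys()
```

Adding a new activity type then requires no schema change, only a new set of keys inside the property bag.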

Compliance

Because we store some high business impact data sets, such as billing data for Azure services, we must govern who can see different parts of the data. We set role-based access control (RBAC) at the database level, so we can place sensitive data sets in dedicated databases.

We can also mark tables as restricted, which limits users to those with the UnrestrictedViewer role. In ADX, a Viewer role can view any table in a database except those marked as restricted. The UnrestrictedViewer role can view any table in a database regardless of whether it is restricted or not. The ADX team is also working on enabling table-level access control, which will allow even more granular RBAC assignments.

We are also leveraging ADX retention policies to ensure data doesn't stick around forever. In some cases, this is a requirement of the Microsoft data handling standards that are mandatory across the company. In other cases, we ensure prototypes and proofs-of-concept are cleaned up so they don't make their way into our production boundary. I detailed this in my previous post, where I discussed how we move analytics from the prototype Scratch database (with its 30-day retention policy) to WorkArea and then to Production.

Scaling Out

As more and more workloads are served by our main ADX cluster, we need to start thinking about performance and scale. We are addressing this in two main ways: With our approach to data distribution and by looking into follower clusters.

Scaling out from a single ADX cluster serving all workloads to multiple follower clusters supporting different workloads and ADLS for low frequency, high volume data movement.

We used to simply grant access to our data in ADX to teams interested in consuming it. The problem with this approach is that external teams might end up running expensive queries against our cluster and disrupt other operations. This happened frequently in the common scenario of bulk data movement of the large data sets our team produces. Because of this, we are no longer granting ADX access to any service principals. We allow individuals to come in and explore our data sets but when they want to start copying them on a regular cadence, we use a different storage solution: Azure Data Lake Storage (ADLS).

Because our data sets are updated on a daily, weekly, or monthly cadence, we only need to copy them to ADLS once after an update, and then other teams can pick them up from there without having an impact on the performance of our ADX cluster. ADLS provides large scale storage at very low cost, so it is ideal for this scenario.

The other scaling method we are considering is setting up follower clusters. A follower cluster can replicate data from the followed cluster, which would enable us to offload some workloads to separate compute. By default, everything is followed, which is redundant for the amount of data we have, but a follower can be configured to mirror only a subset of the followed data. We can do this by starting with a caching policy of 0 (which prevents any data replication), and then selectively overriding it for the databases and tables we want to replicate.

Summary

In this post, I've discussed our team's use of Azure Data Explorer:

Many of our scenarios involve data exploration. That activity, combined with the large data ingestion and cross-cluster capabilities of ADX, makes ADX a great data store solution.
We bring data into our cluster via a clearly defined process so that data loads can be consistently performed and monitored.
We use DevOps to deploy objects to production from git.
We enhance our raw data with a Customer Model, a curated data set consisting of three major pieces: A keyring, a set of customer properties, and an activity model. We use ADX functions as an interface to this data set.
For compliance, we place data in different databases depending on its classification, and we have granular access control for each database.
Scaling out, we offload large copy jobs to Azure Data Lake Storage, and we can create follower clusters to partition the compute load.

Self-Serve Analytics

Sat, 01 Feb 2020 00:00:00 -0800

Self-Serve Analytics

This is a cross-post of the article I wrote for Data Science @ Microsoft, How we built self-serve data environment tools with Azure. Many thanks to my colleague Casey Doyle for editing this into good shape.

Our team not only helps the engineering org build Azure - we use it, too, in our data science work. Our team consists of program managers, data scientists, and data engineers. In this post I describe how our data engineering team developed a scalable, self-serve analytics platform for our data science organization.

We maintain a big data platform with all the required data points to view and understand the Azure business. While we work with multiple data fabrics (such as Azure SQL and Azure Data Lake Storage), our main storage solution is Azure Data Explorer (ADX). We use ADX for several reasons, but two key ones involve the scale of the data we are dealing with and the exploratory nature of the work our data scientists are doing. ADX excels at quickly running queries across huge volumes of data.

Reproducibility

When we were a small team, data scientists produced ad hoc reports and analyses, building queries and running them from their own machines. This worked for a while, but soon we hit issues of reproducibility: If a set of queries exists only on one person's machine and that person goes on vacation, nobody else on the team can reproduce their work.

Of course, this is a well known issue in the software engineering world, with an industry-standard solution: Source control. As a first step, we asked everyone on the team to use Git to store their ADX scripts. This not only enabled capturing canonical queries in a public repository, it also allowed us to adopt other good practices such as mandatory code reviews.

Bringing engineering best practices to analytics was the first step toward a reliable analytics platform.

Environments

Another problem we ran into was around the interdependencies among various data sets. Some reports or metrics took dependencies on the output of other metrics. But consider a situation in which the dependent metric consists of one-off exploratory output, while the other is a monthly report. Without a systematic way of keeping track of what depends on what, things start to look like Jenga - maybe a critical piece of an important artifact disappears because the original author didn't realize anyone cared about it.

To solve this problem, we split our ADX cluster into three different environments: Scratch, Work Area, and Production.

Changes flow from Scratch to Work Area to Production. Production is read-only for everyone except engineering.

Scratch is an area open to everyone on the team to do anything they want, with one important rule: No production dependencies are allowed. Scratch is used for prototyping, proof-of-concepts, and other exploratory work. To enforce this, we set a 30-day retention policy. This ensures that nothing beyond prototypes exists there. Only our team has access to this area.

Work Area is the place data scientists use once they are done prototyping and have a good idea of what they need to do next. They still have full access to Work Area but unlike Scratch, data scientists can share work-in-progress with external stakeholders for user acceptance testing. If the artifact is a one-time analysis, it stops here. If it is recurring, for example a monthly report, it graduates to Production.

Production is a locked down environment and only a few data engineers have Write access to it. This is the equivalent of a production services environment, where access is restricted such that nobody can accidentally cause an impact to a live application. In our case, nobody can accidentally modify a recurring report or key metric others depend on.

Moving work from Scratch to Work Area to Production ensures dependencies can only flow in one direction (Scratch can depend on something in Production, but not vice versa). Quality gates like mandatory code reviews ensure that whatever makes it to Production meets a high quality bar.
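The one-way dependency rule can be sketched as a small check. This is only an illustration: the Environment type, the stability ranking, and canDependOn() are hypothetical names, not part of our actual tooling.

```typescript
// Hypothetical model of the three environments described above.
type Environment = "Scratch" | "WorkArea" | "Production";

// Higher numbers mean a more stable, more locked-down environment.
const stability: Record<Environment, number> = {
    Scratch: 0,
    WorkArea: 1,
    Production: 2,
};

// An artifact may only depend on artifacts that live in an environment
// at least as stable as its own.
function canDependOn(consumer: Environment, dependency: Environment): boolean {
    return stability[dependency] >= stability[consumer];
}
```

With this rule, canDependOn("Scratch", "Production") is true, while canDependOn("Production", "Scratch") and canDependOn("Production", "WorkArea") are both false.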

We also created explicit guidelines for what should go into Production: Queries should be packaged into functions, tables and functions should be created with .create-or-alter, and so on.

Self-Serve

Because the Production environment is restricted to data engineers, graduating something to Production first involved a hand-off process: A data scientist would have to ask the data engineering team to create or update an entity in the Production environment. This usually meant creating a work item that had to be prioritized against others.

Our data engineering team is much smaller than our data science organization (we have about one engineer for every five data scientists), and so this approach didn't scale very well. To optimize this, we invested in enabling a self-serve model.

We created an Azure DevOps release pipeline that uses the ADX Task to execute ADX scripts in the git repo against the Production database. We release from the master branch, but in order for a new script to make it to master, it must be submitted as a pull request and code review signoff is required from a member of the engineering team.

With this model, data scientists send a pull request and, once reviewed by a maintainer of the Production environment, it is automatically merged and deployed. In contrast to the original hand-off process (via work items), engineering involvement can be as simple as approving a pull request. The code review process still ensures that the engineers who operate the Production environment have a say in what makes it into that environment and can always request changes to scripts if they believe, for example, a query can be further optimized.

Data Movement

Similar to the analytics needs described above, we needed to orchestrate data movement. This includes copying data from external sources into our ADX cluster and running functions on a schedule to produce reports.

Production data movement was originally operated by the engineering team, and so requests again used to come in the form of work items to be prioritized and scheduled. But because our data scientists are familiar with Azure Data Factory (ADF) and use it for data movement in the Work Area environment, we realized we could enable self-serve capabilities for data movement too, by leveraging the ADF continuous integration and delivery setup. This way, scientists can create ADF pipelines and submit pull requests that engineers deploy to production.

We are using Azure Monitor integrated with IcM, the company-wide incident management system, to monitor the production pipelines. Azure Monitor integrates natively with ADF. Our support model entails the engineering team looking at tickets generated by pipeline failures, which are then restarted to recover from transient issues. In case of persistent failures, we involve the original pipeline authors to help debug. This way, the engineering team operates all Production data movement without having full ownership of the pipelines.

We deploy both Kusto objects and Data Factory pipelines from ADO git using ADO pipelines. The Production environment is operated by engineering.

Data Contracts

But there is still more to it than just deploying and monitoring ADF pipelines: Bringing external data into the production area of our cluster means taking an upstream dependency on that data. There are several properties to be defined for such a dependency, including things like the service level agreement (SLA) with the upstream provider and the data classification, which ensures our system stays compliant. The topic of maintaining compliance in a big data platform deserves an article by itself, so I won't go into details in this one. The key point is that in this case, self-serve is not enough - we also need data contracts.

A data contract specifies all the details of a dependency, whether upstream or downstream of our platform. As part of the quality gates for pull requests, we thoroughly review proposed pipelines and in general we don't allow new connectors to be added with this model. A new connector implies a new dependency, so before deploying to production we need to ensure we have a contract in place and that the connector uses the production service principals the engineering team is maintaining.
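As a rough sketch, a contract could be captured as a simple record, and a review step could flag sources that have no contract on file. The field names and the findMissingContracts() helper below are assumptions made for illustration; they don't describe our actual contract format.

```typescript
// Hypothetical shape of a data contract; the field names are illustrative.
interface DataContract {
    dataset: string;                       // Data set the contract covers
    direction: "upstream" | "downstream";  // Side of our platform the dependency is on
    slaHours: number;                      // Agreed maximum delay for fresh data
    classification: string;                // Data classification label used for compliance
}

// Returns the data sets a proposed pipeline reads from
// that have no upstream contract on file.
function findMissingContracts(
    sourceDatasets: string[],
    contracts: DataContract[]): string[] {
    return sourceDatasets.filter(dataset =>
        !contracts.some(contract =>
            contract.direction === "upstream" && contract.dataset === dataset));
}

const knownContracts: DataContract[] = [
    { dataset: "billing", direction: "upstream", slaHours: 24, classification: "HBI" },
];

// "telemetry" has no contract, so it is the only entry returned.
const missingContracts = findMissingContracts(["billing", "telemetry"], knownContracts);
```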

Summary

In this article we reviewed several pieces of infrastructure and the processes built by our data engineering team:

We brought engineering rigor to our analytics through source control, code reviews, and automated deployment.
We enabled self-serve by leveraging Azure DevOps.
We similarly enabled self-serve for data movement using ADF CI/CD.
We have processes in place to ensure our data is compliant and that production dependencies are properly documented.

More importantly, we achieved all of the above without maintaining any custom application code: Our entire solution is built on Azure PaaS services, which frees up our engineering team to tackle other challenges, ones we will discuss in future articles.

Time and Complexity

Sun, 19 Jan 2020 00:00:00 -0800

Time and Complexity

2020 being a leap year, it's a good opportunity to talk about how we track time. We'll start with that, but this post is a reflection on the inherent complexity of the physical world and human societies.

There is a famous blog post, Falsehoods programmers believe about time, which covers some well known pitfalls like years have 365 days or February is always 28 days long. Unfortunately, a lot of software has such assumptions hardcoded and things go awry.

But before talking about software, let's step back and look at how we measure time.

Atomic Clocks

Atomic clocks provide an extremely precise measure of the passage of time. These clocks don't gain or lose a single second over hundreds of millions of years. These devices are beautiful: they can provide a monotonic count of the passage of time with incredible precision.

In fact, we have the International Atomic Time standard, or TAI, which is defined, according to Wikipedia, by the weighted average of 400 atomic clocks from over 50 laboratories across the world. This is an extremely precise measure of time passing on Earth.

It is also not practical enough to be used as the basis of our calendars. Even with a very accurate, high-resolution, atomic clock, we still need to account for the fact that the Earth orbits around the Sun (so we get seasons), and spins around its axis (so we get day and night). Let's see why there isn't a simple mathematical function from atomic clock tick to year-month-day-hour-minute-second.

Leap Years

Leap years were introduced to account for the fact that Earth's orbit around the Sun is slightly longer than exactly 365 days. Without adjusting for this fact, the seasons would gradually shift around the calendar. But a leap year doesn't necessarily occur every 4 years. It turns out Earth's orbit is slightly shorter than 365 days and 6 hours, so adding one day every 4 years would cause seasons to drift in the opposite direction. The leap year rule is actually:

A leap year is every year divisible by 4, except for years divisible by 100, unless they are also divisible by 400.

So years like 2020, 2024, 2028 and so on are leap years. But years like 1700, 1800, and 1900 are not leap years, because they are divisible by 100. The exceptions are years like 1600 and 2000, which are divisible not only by 100 but also by 400, so they are leap years.
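The rule translates directly into code. Here is a minimal sketch in TypeScript (isLeapYear() is just an illustrative name):

```typescript
// Gregorian leap year rule: divisible by 4, except years divisible by 100,
// unless they are also divisible by 400.
function isLeapYear(year: number): boolean {
    if (year % 400 === 0) return true;   // 1600, 2000
    if (year % 100 === 0) return false;  // 1700, 1800, 1900
    return year % 4 === 0;               // 2020, 2024, 2028
}
```

Calling isLeapYear() with 1900, 2000, 2020, and 2021 gives false, true, true, and false respectively.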

We started with a simple model of a precise, monotonic atomic clock measuring ticks, but when we get to user-friendly time, we end up with complex business rules that aim to account for the physical world.

Leap Seconds and Standards

If leap years were all there is to it, we could easily map an atomic clock tick to a precise date time value. But it gets more complicated. It turns out Earth's rotation is not constant: it is irregular, and trends toward slowing down. Major earthquakes can affect the momentum of the rotation. Tidal interaction with the moon is also slowing down the speed of rotation over millions of years.

UT1, the Universal Time standard based on Earth's rotation, is drifting from the UTC, the Coordinated Universal Time, which uses atomic clocks to measure time. Because of this, UTC had to introduce leap seconds. A leap second aims to bring UTC time (based on atomic clock measurements) back in sync with UT1 time (time as observed astronomically), so they are not more than 1 second apart.

A leap second adds one second to a day, so we end up with a 61-second minute. Leap seconds are usually added at the end of the month, UTC time. For example, on June 30th 2015, the UTC time was, at some point, 23:59:60. This effectively makes a day 1 second longer.

There is no formula for this: as we measure time both astronomically and atomically, a standards body decides when a leap second is introduced and notifies the world every 6 months. In fact, the standard UTC time we use is, as of the time of this writing, 37 seconds behind the TAI.
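Since there is no formula, software has to carry a lookup table that gets updated whenever a new leap second is announced. Below is a minimal sketch of such a table. It only lists the three most recent changes as of this writing, and the names leapSecondTable and taiMinusUtc() are made up for the example.

```typescript
// Partial table of TAI - UTC offsets in seconds, keyed by the UTC instant
// at which each offset took effect. A complete table goes back to 1972.
const leapSecondTable: ReadonlyArray<{ effective: number; offset: number }> = [
    { effective: Date.UTC(2012, 6, 1), offset: 35 }, // after the June 30th 2012 leap second
    { effective: Date.UTC(2015, 6, 1), offset: 36 }, // after the June 30th 2015 leap second
    { effective: Date.UTC(2017, 0, 1), offset: 37 }, // after the December 31st 2016 leap second
];

// Looks up how many seconds UTC was behind TAI at a given UTC instant.
// Returns undefined for instants that predate this partial table.
function taiMinusUtc(utcMillis: number): number | undefined {
    let result: number | undefined = undefined;
    for (const entry of leapSecondTable) {
        if (utcMillis >= entry.effective) {
            result = entry.offset;
        }
    }
    return result;
}
```

For a date in January 2020 the lookup returns 37, matching the offset mentioned above, while a date in early 2016 returns 36.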

Other Requirements

Besides the physical realities of measuring time atomically and astronomically, we have multiple other requirements.

We have daylight saving time, which moves the clocks forward 1 hour in spring and 1 hour backward in the fall. This creates a 23 hour-long day in the spring and a 25 hour-long day in the fall (falsehood programmers believe about time: all days have 24 hours). This is also not standard across the world: some countries observe daylight saving while others don't.

We have time zones, which don't neatly divide the earth into 24 equal-width slices, but rather follow geopolitical boundaries. Time zone offsets aren't even necessarily multiples of 1 hour: India is 5 hours and 30 minutes ahead of UTC.
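One practical consequence is that code shouldn't store time zone offsets as a whole number of hours. A small sketch, assuming fixed offsets expressed in minutes (localTime() and pad2() are illustrative helpers, and the sketch ignores daylight saving entirely):

```typescript
// Formats a number as two digits, e.g. 4 -> "04".
function pad2(n: number): string {
    return (n < 10 ? "0" : "") + n;
}

// Returns the HH:MM wall-clock time at a fixed offset from UTC.
// The offset is in minutes so that zones like UTC+5:30 can be represented.
function localTime(utcMillis: number, offsetMinutes: number): string {
    const shifted = new Date(utcMillis + offsetMinutes * 60 * 1000);
    return pad2(shifted.getUTCHours()) + ":" + pad2(shifted.getUTCMinutes());
}

const noonUtc = Date.UTC(2020, 0, 19, 12, 0);
const indiaTime = localTime(noonUtc, 330); // "17:30"
```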

Also note that daylight saving and time zones get updated: Russia recently stopped observing daylight saving while China went from 5 different time zones to a single one, even though its geography hasn't changed.

Inherent Complexity

Even with an exact atomic clock, when taking into account year length and day and night cycles, we have to introduce additional rules to determine the date and time, like leap years and daylight saving. Not only that, international standards bodies determine when leap seconds occur, while countries are free to decide which time zone or time zones they are using.

I believe this is typical of any non-trivial problem space we tackle with software. When we try to model the physical world, things get messy. They get messier with humans in the system: laws, standards, and expectations introduce other arbitrary rules. Software needs to be complex to handle the real world.

Accidental Complexity

The above conclusion might seem to stand against pretty much everything I wrote on this blog, where I try to argue for clean and simple code. Why bother if any real world piece of software is destined to grow complex? The reason is that there is enough complexity inherent in dealing with the real world, without us having to introduce more. We don't need to make things worse than they are. To quote a couple of lines from The Zen of Python:

Simple is better than complex.

Complex is better than complicated.

We try to keep things simple. Sometimes simple is not enough, and we need complex solutions to complex problems. But at least let's not make them complicated. A clean, well-crafted system, with rules properly encapsulated, can still be fairly easy to work with. If developers introduce additional complexity which stems not from the problem domain but from coding practices, then the ability to reason about and maintain the system drops precipitously. This is called accidental complexity.

We should always ask ourselves whether the complexity we are dealing with is inherent or accidental. The former is unavoidable, the latter should be avoided at all cost.

Variance

Fri, 27 Dec 2019 00:00:00 -0800

Variance

This blog post is an excerpt from my book, Programming with Types. The code samples are in TypeScript. If you enjoy the article, you can use the discount code vlri40 for a 40% discount on the book.

Subtyping Relationships

We know that if Triangle extends Shape, then Triangle is a subtype of Shape. Let's try to answer a few trickier questions:

What is the subtyping relationship between the sum types Triangle | Square and Triangle | Square | Circle?
What is the subtyping relationship between an array of triangles (Triangle[]) and an array of shapes (Shape[])?
What is the subtyping relationship between instances of a generic data structure like List<T>, for example List<Triangle> and List<Shape>?
What about the function types () => Shape and () => Triangle?
Conversely, what about the function type (argument: Shape) => void and the function type (argument: Triangle) => void?

These are important questions, as they tell us which of these types can be substituted for their subtypes. Whenever we see a function that expects an argument of one of these types, we should understand whether we can provide a subtype instead.

The challenge in the above examples is that things aren't as straightforward as Triangle extends Shape. We are looking at types which are defined based on Triangle and Shape. Triangle and Shape are either part of the sum types, or the types of elements of a collection, or a function's argument types or return types.

Subtyping and Sum Types

Let's take the simplest example first, the sum type. Let's say we have a draw() function which can draw a Triangle, a Square, or a Circle. Can we pass a Triangle or Square to it? As you might have guessed, the answer is yes. We can check that such code compiles:

declare const TriangleType: unique symbol; 
class Triangle {
    [TriangleType]: void;
    /* Triangle members */
}

declare const SquareType: unique symbol;
class Square {
    [SquareType]: void;
    /* Square members */
}

declare const CircleType: unique symbol;
class Circle {
    [CircleType]: void;
    /* Circle members */
}

declare function makeShape(): Triangle | Square;
declare function draw(shape: Triangle | Square | Circle): void;

draw(makeShape());

makeShape() returns a Triangle or a Square while draw() accepts a Triangle, a Square or a Circle (implementations omitted).

We enforce nominal subtyping throughout these examples since we're not providing full implementations for these types. In practice, they would have various different properties and methods to distinguish them. We simulate that with unique symbols for our examples, as leaving the classes empty would make all of them equivalent due to TypeScript's structural subtyping.

As expected, this code compiles. The opposite doesn't: if we can draw a Triangle or a Square and we attempt to draw a Triangle, Square, or Circle, the compiler will complain because we might end up passing a Circle to the draw() function, which wouldn't know what to do with it. We can confirm that the below code doesn't compile:

declare function makeShape(): Triangle | Square | Circle;
declare function draw(shape: Triangle | Square): void;

draw(makeShape());

We flipped the types so makeShape() could also return a Circle, while draw() no longer accepts a Circle. This no longer compiles.

This means that Triangle | Square is a subtype of Triangle | Square | Circle: we can always substitute a Triangle or Square for a Triangle, Square, or Circle, but not the other way around. This might seem counterintuitive, since Triangle | Square is less than Triangle | Square | Circle. Whenever we use inheritance, we end up with a subtype that has more properties than its supertype. For sum types it works the opposite way: the supertype has more types than the subtype.

Say we have an EquilateralTriangle which inherits from Triangle:

declare const EquilateralTriangleType: unique symbol; 
class EquilateralTriangle extends Triangle {
    [EquilateralTriangleType]: void;
    /* EquilateralTriangle members */
}

As an exercise, check what happens when we mix sum types with inheritance. Does makeShape() returning EquilateralTriangle | Square and draw() accepting Triangle | Square | Circle work? What about makeShape() returning Triangle | Square and draw() accepting EquilateralTriangle | Square | Circle?

Subtyping and Collections

Now let's look at types which contain a set of values of some other type. Let's start with arrays: can we pass an array of Triangle objects to a draw() function which accepts an array of Shape objects, if Triangle is a subtype of Shape?

class Shape {
    /* Shape members */
}

declare const TriangleType: unique symbol; 
class Triangle extends Shape {
    [TriangleType]: void;
    /* Triangle members */
}

declare function makeTriangles(): Triangle[];
declare function draw(shapes: Shape[]): void;

draw(makeTriangles());

Triangle is a subtype of Shape. makeTriangles() returns an array of Triangle objects. draw() accepts an array of Shape objects. We can use an array of Triangle objects as an array of Shape objects.

This might not be surprising, but it is an important observation: arrays preserve the subtyping relationship of the underlying types they are storing. As expected, the opposite doesn't work: if we try to pass an array of Shape objects where an array of Triangle objects is expected, the code won't compile.

Arrays are basic types that come out-of-the-box in many programming languages. What if we define a custom collection, say a LinkedList?

class LinkedList<T> {
    value: T;
    next: LinkedList<T> | undefined = undefined;

    constructor(value: T) {
        this.value = value;
    }

    append(value: T): LinkedList<T> {
        this.next = new LinkedList(value);
        return this.next;
    }
}

declare function makeTriangles(): LinkedList<Triangle>;
declare function draw(shapes: LinkedList<Shape>): void;

draw(makeTriangles());

LinkedList<T> is a generic linked list collection. makeTriangles() now returns a linked list of triangles. draw() accepts a linked list of shapes. This code compiles.

Even without an out-of-the-box type, TypeScript correctly establishes that LinkedList<Triangle> is a subtype of LinkedList<Shape>. Like before, the opposite doesn't compile - we can't pass a LinkedList<Shape> as a LinkedList<Triangle>.

Covariance

A type which preserves the subtyping relationship of its underlying type is called covariant. An array is covariant, because it preserves the subtyping relationship: Triangle is a subtype of Shape, so Triangle[] is a subtype of Shape[].

Various languages behave differently when dealing with arrays and collections like LinkedList<T>. For example, in C# we would have to explicitly state covariance for a type like LinkedList<T> by declaring an interface and using the out keyword (ILinkedList<out T>), otherwise the compiler will not deduce the subtyping relationship.

An alternative to covariance is to simply ignore the subtyping relationship between two given types and consider a LinkedList<Shape> and a LinkedList<Triangle> as types with no subtyping relationship between them (neither is a subtype of the other). This is not the case in TypeScript, but it is in C#, where a List<Shape> and a List<Triangle> have no subtyping relationship.

Invariance

A type which ignores the subtyping relationship of its underlying type is called invariant. A C# List<T> is invariant: it ignores the fact that Triangle is a subtype of Shape, so List<Shape> and List<Triangle> have no subtype-supertype relationship.

Now that we have looked at how collections relate to each other in terms of subtyping and seen two common types of variance, let's see how function types relate to each other.

Subtyping and Function Return Types

We'll start with the simpler case first: see what substitutions we can make between a function that returns a Triangle and a function that returns a Shape. We'll declare two factory functions, a makeShape() which returns a Shape and a makeTriangle() which returns a Triangle.

We'll then implement a useFactory() function which takes a function of type () => Shape as argument and returns a Shape. We'll try passing makeTriangle() to it:

declare function makeTriangle(): Triangle;
declare function makeShape(): Shape;

function useFactory(factory: () => Shape): Shape {
    return factory();
}

let shape1: Shape = useFactory(makeShape);
let shape2: Shape = useFactory(makeTriangle);

useFactory() takes a function with no arguments which returns a Shape and calls it. Both makeTriangle() and makeShape() can be used as arguments to useFactory().

Nothing out of the ordinary here: we can pass a function that returns a Triangle as a function that returns a Shape, because the return value (a Triangle) is a subtype of Shape, so we can assign it to a Shape.

The opposite doesn't work: if we change our useFactory() to expect a () => Triangle argument and try to pass it makeShape(), the code won't compile:

declare function makeTriangle(): Triangle;
declare function makeShape(): Shape;

function useFactory(factory: () => Triangle): Triangle {
    return factory();
}

let shape1: Shape = useFactory(makeShape);
let shape2: Shape = useFactory(makeTriangle);

We replaced Shape with Triangle in the useFactory() definition. The code fails to compile: we can't use makeShape() as a () => Triangle.

This is again pretty straightforward: we can't use makeShape() as a function of type () => Triangle because makeShape() returns a Shape object. That object could be a Triangle, but it might be a Square. useFactory() promises to return a Triangle, so it can't return a supertype of Triangle. It could, of course, return a subtype, like EquilateralTriangle, given a makeEquilateralTriangle().

Functions are covariant in their return types. In other words, if Triangle is a subtype of Shape, a function type like () => Triangle is a subtype of a function () => Shape. Note that the function types don't have to describe functions that don't take any arguments. If makeTriangle() and makeShape() both took a couple of number arguments, they would still be covariant as we just saw.
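To illustrate the point about arguments, here is a small runnable sketch in which both factories take two number arguments. The width, height, describe(), and area() members are made up for this example so the classes have something to distinguish them.

```typescript
class Shape {
    constructor(readonly width: number, readonly height: number) {}

    describe(): string {
        return `shape ${this.width}x${this.height}`;
    }
}

class Triangle extends Shape {
    describe(): string {
        return `triangle ${this.width}x${this.height}`;
    }

    // Triangle-specific member, so a plain Shape can't stand in for a Triangle.
    area(): number {
        return (this.width * this.height) / 2;
    }
}

function makeShape(width: number, height: number): Shape {
    return new Shape(width, height);
}

function makeTriangle(width: number, height: number): Triangle {
    return new Triangle(width, height);
}

// Expects a factory that takes two numbers and returns a Shape.
function useFactory(factory: (width: number, height: number) => Shape): Shape {
    return factory(3, 4);
}

const shape1: Shape = useFactory(makeShape);    // describes itself as "shape 3x4"
const shape2: Shape = useFactory(makeTriangle); // describes itself as "triangle 3x4"
```

Both calls compile because the argument lists match and only the return types differ, with Triangle being a subtype of Shape.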

This is the behavior followed by most mainstream programming languages. The same rules are followed for overriding methods in inherited types, changing their return type. If we implement a ShapeMaker class which provides a make() method that returns a Shape, we can override it in a derived class TriangleMaker to return Triangle instead. The compiler will allow this, as calling either of the make() methods will give us a Shape object:

class ShapeMaker {
    make(): Shape {
        return new Shape();
    }
}

class TriangleMaker extends ShapeMaker {
    make(): Triangle {
        return new Triangle();
    }
}

This is, again, allowed behavior in most mainstream programming languages, as most consider functions covariant in their return type. Let's now see what happens to function types whose argument types are subtypes of each other.

Subtyping and Function Argument Types

We'll turn things inside out, so instead of a function that returns a Shape and a function that returns a Triangle, we'll take a function that takes a Shape as argument and a function that takes a Triangle as argument. We'll call these drawShape() and drawTriangle(). How do (argument: Shape) => void and (argument: Triangle) => void relate to one another?

Let's introduce another function, render(), which takes as arguments a Triangle and an (argument: Triangle) => void function. It simply calls the given function with the given Triangle:

declare function drawShape(shape: Shape): void;
declare function drawTriangle(triangle: Triangle): void;

function render(
    triangle: Triangle,
    drawFunc: (argument: Triangle) => void): void {
    drawFunc(triangle);
}

drawShape() takes a Shape argument, drawTriangle() takes a Triangle argument. render() expects a Triangle and a function that takes a Triangle as argument. render() simply calls the provided function passing it the triangle it received.

Here comes the interesting bit: in this case, we can safely pass drawShape() to the render() function! That means we can use a (argument: Shape) => void where an (argument: Triangle) => void is expected.

Logically it makes sense: we have a Triangle and we pass it to a drawing function which can use it as an argument. If the function itself expects a Triangle, like our drawTriangle() function, then of course it works. But it should also work for a function which expects a supertype of Triangle: drawShape() wants a shape - any shape - to draw. Since it doesn't use anything that's triangle-specific, it is more general than drawTriangle(): it can accept any shape as argument, be it a Triangle or a Square. So in this particular case, the subtyping relationship is reversed.
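We can check this with a runnable version of the example. The implementations below are placeholders invented for illustration (kind(), isRightAngled(), and the drawLog array are not part of the original declarations):

```typescript
class Shape {
    kind(): string {
        return "shape";
    }
}

class Triangle extends Shape {
    kind(): string {
        return "triangle";
    }

    isRightAngled(): boolean {
        return false; // Placeholder implementation
    }
}

// Records what was drawn so we can inspect it afterwards.
const drawLog: string[] = [];

function drawShape(shape: Shape): void {
    drawLog.push(`drawShape: ${shape.kind()}`);
}

function drawTriangle(triangle: Triangle): void {
    drawLog.push(`drawTriangle: right-angled ${triangle.isRightAngled()}`);
}

function render(
    triangle: Triangle,
    drawFunc: (argument: Triangle) => void): void {
    drawFunc(triangle);
}

render(new Triangle(), drawTriangle); // Exact match
render(new Triangle(), drawShape);    // Also fine: drawShape() accepts any Shape
```

After both calls, drawLog contains "drawTriangle: right-angled false" followed by "drawShape: triangle".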

Contravariance

A type which reverses the subtyping relationship of its underlying type is called contravariant. In most programming languages, functions are contravariant with regards to their arguments. A function which expects a Triangle as argument can be substituted with a function which expects a Shape as argument. The relationship of the functions is the reverse of the relationship of the argument types: if Triangle is a subtype of Shape, the type of a function taking a Triangle as an argument is a supertype of the type of a function taking a Shape as an argument.

We said most programming languages in the definition above. A notable exception is TypeScript. In TypeScript, we can also do the opposite: pass a function which expects a subtype instead of a function which expects a supertype. This was an explicit design choice, to facilitate common JavaScript programming patterns. It can lead to runtime issues though. Let's look at an example. We'll first define a method isRightAngled() on our Triangle type, which would determine whether a given instance describes a right-angled triangle. The implementation of the method is not important:

class Shape {
    /* Shape members */
}

declare const TriangleType: unique symbol; 
class Triangle extends Shape {
    [TriangleType]: void;

    isRightAngled(): boolean {
        let result: boolean = false;

        /* Determine whether it is a right-angled triangle */

        return result;
    } 

    /* More Triangle members */
}

Now let's reverse the drawing example and let's say our render() function expects a Shape instead of a Triangle, and a function which can draw shapes (argument: Shape) => void instead of a function which can only draw triangles (argument: Triangle) => void:

declare function drawShape(shape: Shape): void;
declare function drawTriangle(triangle: Triangle): void;

function render(
    shape: Shape,
    drawFunc: (argument: Shape) => void): void {
    drawFunc(shape);
}

drawShape() and drawTriangle() are just like before. render() now expects a Shape and a function that takes a Shape as argument.

Here's how we can cause a runtime error: we can define drawTriangle() to actually use something that is triangle-specific, like the isRightAngled() method we just added. We then call render() with a Shape object (not a Triangle) and drawTriangle().

Now drawTriangle() will receive a Shape object and attempt to call isRightAngled() on it, but since the Shape is not a Triangle, this will cause an error:

function drawTriangle(triangle: Triangle): void {
    console.log(triangle.isRightAngled());
    /* ... */
}

function render(
    shape: Shape,    
    drawFunc: (argument: Shape) => void): void {    
    drawFunc(shape);  
}

render(new Shape(), drawTriangle);

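We can pass a Shape and drawTriangle() to render(). This code will compile but it will fail at runtime with a JavaScript error, since the runtime won't be able to find isRightAngled() on the Shape object we gave to drawTriangle(). This is not ideal but, as mentioned before, it was a conscious decision made during the implementation of TypeScript. Note that newer versions of the compiler offer a strictFunctionTypes option (enabled as part of strict) which checks the parameters of function types contravariantly and rejects this call; parameters of methods declared with method syntax remain bivariant.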

In TypeScript, if Triangle is a subtype of Shape, a function of type (argument: Shape) => void and a function of type (argument: Triangle) => void can be substituted for each other. Effectively, they are both subtypes of each other. This property is called bivariance.

Bivariance

Types are bivariant if, from the subtyping relationship of their underlying types, they become subtypes of each other. In TypeScript, if Triangle is a subtype of Shape, the function types (argument: Shape) => void and (argument: Triangle) => void are subtypes of each other.

Again, the bivariance of functions with respect to their arguments in TypeScript allows incorrect code to compile. This weakens what we usually rely on static type checking for: catching such errors at compile time instead of at runtime. For TypeScript, it was a deliberate design decision to enable common JavaScript programming patterns.
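When we can't rely on the compiler to catch this, one possible defensive approach is to have the drawing function accept a Shape and check at runtime whether it received a Triangle. The sketch below is just one way of doing this; describeForDrawing() and the name() method are invented for the example, and the functions return strings instead of drawing so the result is easy to inspect.

```typescript
class Shape {
    name(): string {
        return "shape";
    }
}

class Triangle extends Shape {
    name(): string {
        return "triangle";
    }

    isRightAngled(): boolean {
        return false; // Placeholder implementation
    }
}

// Accepts any Shape and only uses triangle-specific members
// after confirming at runtime that it actually got a Triangle.
function describeForDrawing(shape: Shape): string {
    if (shape instanceof Triangle) {
        return `triangle, right-angled: ${shape.isRightAngled()}`;
    }
    return `generic ${shape.name()}`;
}

function render(
    shape: Shape,
    drawFunc: (argument: Shape) => string): string {
    return drawFunc(shape);
}

const plain = render(new Shape(), describeForDrawing);       // "generic shape"
const triangle = render(new Triangle(), describeForDrawing); // "triangle, right-angled: false"
```

The trade-off is that the check moves from compile time to runtime, which is exactly what static typing is meant to avoid, so this is a workaround and not a substitute for a sound signature.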

Summary

We looked at what types can be substituted with what other types. While subtyping is straightforward when dealing with simple inheritance, things get more complicated when we add types parameterized on other types. These could be collections, function types, or other generic types. The way the subtyping relationships of these parameterized types are removed, preserved, reversed, or made two-way based on the relationship of their underlying types is called variance.

Invariant types ignore the subtyping relationship of their underlying types.
Covariant types preserve the subtyping relationship of their underlying types. If Triangle is a subtype of Shape, an array of type Triangle[] is a subtype of an array of type Shape[]. In most programming languages, function types are covariant in their return types.
Contravariant types reverse the subtyping relationship of their underlying types. If Triangle is a subtype of Shape, the function type (argument: Shape) => void is a subtype of the function type (argument: Triangle) => void in most languages. This is not true for TypeScript, where function types are bivariant with regards to their argument types.
Bivariant types are subtypes of each other when their underlying types are in a subtyping relationship. If Triangle is a subtype of Shape, the function type (argument: Shape) => void and the function type (argument: Triangle) => void are subtypes of each other (functions of both types can be substituted for one another).

While some common rules exist across programming languages, there is no one way to support variance. You should understand what the type system of your programming language does and how it establishes subtyping relationships. This is important to know, as these rules tell us what can be substituted for what. Do you need to implement a function to transform a List<Triangle> into a List<Shape>, or can you just use the List<Triangle> as-is? It all depends on the variance of List<T> in your programming language of choice.

Notes on Data Engineering

Sun, 08 Dec 2019 00:00:00 -0800

Notes on Data Engineering

For over a year now I've been the architect for the Customer Growth and Analytics team in Azure, scaling out our big data platform as our team grows and matures. I'm going to share a few observations on some of the main problems we've been solving, problems which I believe are fairly universal: I attended the Strata Data Conference this year and speakers from many other companies were talking about similar problems and the solutions they were implementing.

Not too long ago, the major challenges were around storing and processing data at scale. Since this has been more or less commoditized in the past few years, especially with the emergence of cloud providers, it's interesting to think about how the big data landscape evolved and what are some of the present challenges. I list some of them below.

Compliance

Proper handling of data assets is a top priority for Microsoft and should be for everyone. There are several aspects I consider to be part of compliance.

First, there are regulatory obligations, probably the best-known example being GDPR. In order to be compliant with GDPR, a data platform needs to have the ability to forget about a user if the user so desires. A data platform needs to support all capabilities required by regulations of the countries it gets its data from. There are many other regulations, and new ones can come up at any time. Staying compliant is maybe the most important work for a data platform.

Next, there is access control: who gets to see what data. There are various types of data for which access should be restricted. Personally Identifiable Information (PII) is data that can be used to identify a particular person, like name or email address. High Business Impact (HBI) is data relevant to the business, like revenue numbers, devices sold etc. Different companies use different taxonomies to classify their data assets, but regardless, in the big data space, it is a nontrivial problem to ensure that only people who are allowed to access a certain dataset are able to do so. If access to sensitive datasets is on a need-to-know basis, and we create one security group per dataset, just managing those security groups is hard in itself. On top of that, people move and organizations change, and that can impact who has access to the data too.

I believe there is a lot of room to improve and innovate in this space. There are many existing solutions to handle both regulatory and access control requirements, but nothing that feels like it just works, and definitely little in terms of industry standards.

Metadata

As the data volume grows, organizing it becomes a big challenge. There is a need for an additional layer of data about the data, or metadata. There are several very important pieces of information which are not readily available in the data itself that a big data platform must provide.

First, there are simply the descriptions of the various datasets: what the various columns are, how often the data is refreshed, how fresh the data is etc. There also needs to be an ability to search this metadata in order to find relevant data available in the system.

Next, there is data lineage. Where did the data come from and how was it sourced? This has compliance implications: for example if end users agree to share telemetry data for the purpose of improving the product, that data should not be used for other purposes.

Also for compliance purposes, various datasets and columns have to be tagged as containing PII or other sensitive information so the system can automatically lock down or purge this type of sensitive data when needed.

Organizing data at scale also requires some amount of information architecture. One aspect of this is controlled taxonomies: clear definitions of what various business terms and data pieces mean, so everyone working in the space shares the same understanding.

Azure Data Catalog is the Azure offering in this space.

Heterogeneity by Design

There is no one-size-fits-all data fabric. Each storage solution has some workflows it was optimized for and some it's not so great at. It's next to impossible to say that absolutely everything will be running on one single solution, be that SQL, NoSQL, HDFS, or something else. Some workflows need massive scale (processing terabytes of data), some workflows need fast reads (serving a website). Teams upstream will expose data from different storage solutions while teams downstream will expect it in different storage solutions...

Standardizing on a single storage solution is infeasible, so the next best thing to do is to standardize on the tooling to move the data around and ensure that it is easy to operate: make it easy to create a new data movement pipeline, provide monitoring, alerting etc. Since data movement is a necessity, it must be as reliable as possible.

Our team uses Azure Data Factory for orchestrating data movement at scale.

DevOps

Another major bucket of work is bringing engineering rigor to workflows supported by other disciplines like data science and machine learning engineering. Again, with a small team, it is relatively easy to create ad-hoc reports and run ad-hoc ML, but this approach doesn't scale once production systems depend on the output.

This is a solved problem in the software engineering field: source control, code reviews, continuous integration and so on. But non-engineering disciplines are not accustomed to this type of workflow so there is definitely a need to educate, support, and create similar devops workflows. Analytics and ML ultimately reduce to code (SQL queries, Python, R etc.) and should be handled just as production code.

Our team supports these types of workflows using Azure DevOps with pipelines that can deploy ML and analytics from git to our production environment.

Data Quality

The last topic I will cover is data quality. The quality of all analytics and machine learning outputs depends on the quality of the underlying data.

There are multiple aspects to data quality. One set of definitions is given by the article Data Done Right: 6 Dimensions of Data Quality:

Completeness - the dataset is not missing any required data.
Consistency - the data is consistent across multiple datasets.
Conformity - all data is in the right format, within the right value ranges etc.
Accuracy - the data accurately represents the domain being modelled.
Integrity - the data is valid across all relationships and datasets.
Timeliness - the data is available when expected and datasets are not delayed.

A reliable data platform can run various types of data quality tests on the managed datasets, both at scheduled times and during ingress. Issues have to be reported and the overall state of the data quality made visible through a dashboard, so stakeholders can easily see which datasets currently have problems and what the potential implications are.
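As a rough illustration of what such tests can look like, here is a small TypeScript sketch of a completeness check and a conformity check. The Row shape, the column names, the value range, and the TestResult record are all assumptions made up for this example, not part of any particular test framework.

```typescript
// Hypothetical row shape and checks, for illustration only.
interface Row {
    userId: string | null;
    country: string | null;
    revenue: number | null;
}

interface TestResult {
    name: string;
    passed: boolean;
    failingRows: number;
}

// Completeness: a required column has no missing values.
function checkCompleteness(rows: Row[], column: keyof Row): TestResult {
    const failingRows = rows.filter(row => row[column] === null).length;

    return {
        name: `completeness(${column})`,
        passed: failingRows === 0,
        failingRows,
    };
}

// Conformity: values that are present fall within the expected range.
function checkRevenueRange(rows: Row[], min: number, max: number): TestResult {
    const failingRows = rows.filter(row =>
        row.revenue !== null && (row.revenue < min || row.revenue > max)).length;

    return {
        name: "conformity(revenue)",
        passed: failingRows === 0,
        failingRows,
    };
}

const rows: Row[] = [
    { userId: "u1", country: "US", revenue: 10 },
    { userId: null, country: "DE", revenue: 25 },
    { userId: "u3", country: "FR", revenue: -5 },
];

// The results could feed a report or a dashboard.
const results: TestResult[] = [
    checkCompleteness(rows, "userId"),
    checkCompleteness(rows, "country"),
    checkRevenueRange(rows, 0, 1000),
];
```

Each check returns a small record instead of throwing, so a runner can execute every test and report all failures at once.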

As of today, this seems to be a big gap in terms of industry-wide standard solutions. Data engineering teams develop their bespoke data test runners for their scenarios. There are many open source solutions, but we don't have the equivalent of JUnit yet, nor a common language for specifying tests and assertions.

Conclusions

In the following years, I expect we will have both better tooling for some of these problems and better defined industry-wide standards. As I mentioned at the beginning of this post, not long ago, just storing and processing huge amounts of data was a hard problem. Today, the main challenges are around organizing and managing data. I fully expect that in the near future we will have out-of-the-box solutions for all these problems and a new set of challenges will emerge.

Unit Testing 101

Mon, 18 Nov 2019 00:00:00 -0800

Unit Testing 101

I wrote a while back about unit testing from a philosophical perspective. This post is going to be more pragmatic. My team is currently doing some MQ work which includes improving our unit test story across the codebase. I put together a short unit tests 101 presentation outlining some key principles:

Run with each build.
100% reliability.
Test the public interface.

Run with Each Build

Unit tests that don't run aren't very useful. I've seen projects before where a unit test project does exist but the tests only run if manually executed.

The problem with this approach is that tests can stay not running for days/weeks/months, and when they finally run, a bunch of them fail. Good luck finding the change that introduced the regression. And wait, were we running with that behavior all this time?

The biggest bang for the buck is making sure unit tests run as part of a continuous integration build and pull requests get auto-rejected if a unit test fails.

100% Reliability

Once unit tests run with each build, the next most important thing to look into is ensuring they pass consistently. Flaky unit tests are bad because there's no easy way to tell if a test run failed because of a regression or a flaky test. Worse, if flaky tests are the standard, engineers start ignoring the results. It's hard to distill the signal from the noise in those situations. Merge policies become more lax - after all, we can't demand 100% green if some unit tests randomly fail.

But stepping back, when are tests flaky? When they perform IO. Hitting the network, connecting to a database, reading a file, these are all cases in which a transient issue outside of our control can cause a test to fail. That's why unit tests shouldn't perform IO, rather they should work against mocks.

Let's take, as an example, a method which performs a GET request and logs to the console whether the request was successful:

class Example
{
    public void Get()
    {
        var client = new HttpClient();
        var response = client.GetAsync(
            "https://www.example.com").Result;

        Console.WriteLine(response.IsSuccessStatusCode);
    }
}

In its current form, the method isn't really testable. Writing code without thinking about testability yields such methods. We can refactor this to be more testable. First, let's put all IO behind interfaces:

interface IHttpClient
{
    Task<HttpResponseMessage> GetAsync(string url);
}

interface IOutput
{
    void WriteLine(bool value);
}

We can update our Example to use these interfaces instead of directly working with HttpClient and Console:

class Example
{
    private IHttpClient client;
    private IOutput output;

    public Example(IHttpClient client, IOutput output)
    {
        this.client = client;
        this.output = output;
    }

    public void Get()
    {
        var response = client.GetAsync(
            "https://www.example.com").Result;

        output.WriteLine(response.IsSuccessStatusCode);
    }
}

We add adapters between the interfaces and the actual implementations:

class HttpClientWrapper : IHttpClient
{
    private HttpClient client = new HttpClient();

    public Task<HttpResponseMessage> GetAsync(string url)
        => client.GetAsync(url);
}

class ConsoleOutput : IOutput
{
    public void WriteLine(bool value)
        => Console.WriteLine(value);
}

With these adapters, in our production code we can put together an instance of Example that works just like the original, but which is componentized enough that we can actually test it:

var example = new Example(
    new HttpClientWrapper(),
    new ConsoleOutput());
example.Get();

In our test code, we can use a framework like Moq¹ to set up mocks and verify that the expected calls happen:

var mockClient = new Mock<IHttpClient>();
mockClient.Setup(
    client => client.GetAsync("https://www.example.com"))
    .Returns(Task.FromResult(
        new HttpResponseMessage {
            StatusCode = HttpStatusCode.OK
        }));

var mockOutput = new Mock<IOutput>();
mockOutput.Setup(
    output => output.WriteLine(
        It.Is<bool>(value => value == true)));

var example = new Example(
    mockClient.Object,
    mockOutput.Object);
example.Get();

mockOutput.VerifyAll();

The above code sets up an IHttpClient mock implementation so that when GetAsync() is called with the argument https://www.example.com it returns a Task wrapping an HttpResponseMessage with a StatusCode of HttpStatusCode.OK. The code also sets up an IOutput mock which expects a WriteLine() call with a true argument.

We can initialize an instance of Example with these mocks, call Get(), then verify mockOutput was used as expected.

Design for Testability

The general steps for making code testable:

Extract interface (if one doesn't exist already).
Create adapters if concrete implementation doesn't implement an interface.
Initialize class with real implementations in production.
Initialize class with mocks in tests.
Set up mocks to behave as required by each test.
Verify mocks.
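The same steps carry over to other languages. Here is a minimal TypeScript sketch that uses hand-written mocks instead of a mocking framework. TextSource, Output, LineCounter, and the mock classes are hypothetical names invented for this illustration, and the IO is kept synchronous to keep the example short.

```typescript
// Hypothetical interfaces standing in for the IO the class depends on.
interface TextSource {
    read(path: string): string;
}

interface Output {
    writeLine(value: string): void;
}

// The class under test only knows about the interfaces.
class LineCounter {
    private readonly source: TextSource;
    private readonly output: Output;

    constructor(source: TextSource, output: Output) {
        this.source = source;
        this.output = output;
    }

    report(path: string): void {
        const lineCount = this.source.read(path).split("\n").length;
        this.output.writeLine(`${path}: ${lineCount} lines`);
    }
}

// Hand-written mocks: no file system or console involved.
class MockTextSource implements TextSource {
    readonly requestedPaths: string[] = [];
    private readonly content: string;

    constructor(content: string) {
        this.content = content;
    }

    read(path: string): string {
        this.requestedPaths.push(path);
        return this.content;
    }
}

class MockOutput implements Output {
    readonly written: string[] = [];

    writeLine(value: string): void {
        this.written.push(value);
    }
}

// Set up the mocks, run the code, then verify the recorded interactions.
const mockSource = new MockTextSource("first\nsecond\nthird");
const mockOutput = new MockOutput();

new LineCounter(mockSource, mockOutput).report("notes.txt");
```

After report() runs, a test can assert on mockSource.requestedPaths and mockOutput.written, which plays the same role as VerifyAll() in the Moq example.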

I will not talk about dependency injection in this post, but once all components of the system expect several interfaces to run, it is worth thinking about leveraging a DI framework to handle putting things together.

With this approach, we can make any component testable except the adapters. By their nature, our adapters perform IO. We can't reliably test HttpClientWrapper. But such adapters shouldn't contain any application logic, they should be extremely thin, simply forwarding calls to the real implementation. It's perfectly fine to not test such trivial code.

Seams

Depending on the language, we can have several other ways to inject mocks. In C++, for example, we can do it at compile-time, at link-time, or at run-time.

At compile-time, we can use a template parameter as the interface, have the production version of the code instantiate it with one concrete implementation and have the tests instantiate it with a mock:

template <typename TImpl>
class Example
{
private:
    TImpl impl;

public:
    void Do() {
        impl.Do();
    }
};

class ConcreteImpl
{
public:
    void Do() {
        // Concrete implementation
    }
};

class MockImpl
{
public:
    void Do() {
        // Mock implementation
    }
};

// ...

Example<ConcreteImpl> ex;

At link-time, we can link against the concrete implementations in production and against mock implementations in tests:

class Example
{
private:
    Impl impl;

public:
    void Do() {
        impl.Do();
    }
};

// In concrete implementation file:
class Impl
{
public:
    void Do() {
        // Concrete implementation
    }
};

// In mock implementation file:
class Impl
{
public:
    void Do() {
        // Mock implementation
    }
};

At run-time, we can do something similar to our C# example above. There are pros and cons to each approach. The run-time approach is what most languages do, so it is easy to understand, though it adds more overhead. The link-time approach is lean, but can end up being confusing: we have to check makefiles to understand which code ends up in the binary and which code doesn't. The compile-time approach makes the code uglier and requires making the implementation public.

Test The Public Interface

This one I did mention in my previous blog post too. The key point here is that, while test frameworks usually provide various unnatural ways to access an object's internals, tests should focus on the public members.

The public members define the contract that a class provides. Tests should ensure the contract is respected and not worry about the implementation. With this approach, the implementation can easily be refactored and we know things still work as expected as long as all tests pass. On the other hand, if we have tests that cover various implementation details, they might break if we move things around, even though the class still behaves correctly. In general, having to update tests whenever we make tweaks to the implementation is not ideal.

The other way to look at it is that if we have some code deep in the implementation that can't be reached through the public members, then it is likely dead code.
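As a small illustration, here is a TypeScript sketch of a test that only goes through public members. The Stack class and the test function are hypothetical and exist only for this example.

```typescript
// Hypothetical class: the backing array is an implementation detail.
class Stack<T> {
    private items: T[] = [];

    push(item: T): void {
        this.items.push(item);
    }

    pop(): T | undefined {
        return this.items.pop();
    }

    get size(): number {
        return this.items.length;
    }
}

// The test exercises only push(), pop(), and size, never the items array.
function testStackContract(): boolean {
    const stack = new Stack<number>();
    stack.push(1);
    stack.push(2);

    return stack.size === 2
        && stack.pop() === 2
        && stack.pop() === 1
        && stack.pop() === undefined
        && stack.size === 0;
}

const contractHolds: boolean = testStackContract();
```

If Stack were later reimplemented on top of a linked list, this test would not need to change, since it never looks at how the items are stored.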

Summary

Unit tests should run as part of continuous integration, otherwise they aren't really useful.
Unit tests have to be 100% reliable. We achieve this by isolating IO and mocking it in tests.
Testability recipe:
- Code against interfaces, declare interfaces if none are available
- Use thin adapters to make any concrete implementation compatible with any interface
- Use concrete implementations in production and mocks in tests
In some languages there are multiple seams where we can inject mocks. In C++ we can do it at compile-time, at link-time, and at run-time. Each has its pros and cons.
Test the public interface not the implementation.

¹ Moq is my favorite C# mocking library: https://github.com/moq/moq4.

Programming with Types RTM

Wed, 16 Oct 2019 00:00:00 -0700

Programming with Types RTM

Programming with Types is going to the printer today and should be available as a print book in a couple of weeks. I figured I'd write a blog post about writing the book for the occasion. I wrote in a previous post how the book came to be, but didn't really cover the implementation details.

Prototype

I started writing the book before even thinking about publishing. The best way I can put it is I felt I had a book in me. I learned a lot of material during the past few years and I wanted to synthesize it in book form, hopefully useful to others. I created a repository on GitHub, outlined the topics I wanted to cover, and for a few weeks I spent a few minutes every morning over coffee filling in some of the sections. As a fun fact, the original title of the book was Practical Types. This was August 2018.

Pitch

In September, when I realized I am indeed writing a book, I figured I should pitch it to publishers. No harm in trying. I reached out to O'Reilly and Manning. O'Reilly promptly dismissed my proposal. Manning put me in touch with a publisher and we scheduled a couple of calls to discuss some of the details.

I got some invaluable feedback even from those initial calls - suggestions on how to improve the table of contents, how to make the book more accessible to readers, how to split topics up into chapters based on the difficulty of the topic and so on. After those initial calls, I was fortunate to get a contract.

Contract

The most important part of the contract is the manuscript delivery. There are 3 key milestones, each 4 months apart. The milestones were:

Deliver a third of the manuscript no later than February 2019.
Deliver two thirds of the manuscript no later than June 2019.
Deliver the complete draft of the manuscript no later than October 2019.

I got paid an advance delivered in two batches: the first half upon delivery of the first third of the manuscript, the second half upon delivery of the complete draft. I will also be getting a percentage of sales of the book. The way advances work, the book needs to sell enough to cover the advance before I will receive any more royalties. I will pause here to say that I don't expect to get rich off of this book. Another fun fact I learned is that most books don't even recover the advance, so if you ever want to write a book simply for the money, you might want to reconsider.

Training

After the paperwork was signed, I was assigned a developmental editor and the first thing we did was a new author training where I was introduced to the Manning method of writing books. Since Manning has been publishing high quality technical books for many years, they have a set of best practices authors are encouraged to follow. These include things like starting with a concrete example then generalizing to the abstract, using figures to better explain things, tracking the prerequisites of each chapter and making sure there is a smooth progression and so on. These trainings were delivered by my editor over a few Skype calls.

Feedback Loop

Throughout the writing stage, I worked closely with two editors. After completing a draft of each chapter, I would upload it and get the initial round of feedback from them. The developmental editor would make sure my writing is up to par and the text makes sense; the technical developmental editor would keep me honest for the technical aspects of the book, and make sure I didn't mess anything up there.

With each milestone (1/3, 2/3, 3/3 of the book), Manning also conducted an external review with around 20 other volunteers who would go over the manuscript and provide feedback. When the external feedback came in, we analyzed it for trends to identify areas of improvement - for example, if multiple people were saying a particular section is hard to grok, it was clear it needed improvement.

As the book was nearing completion, I also got a reviewer for the code samples. He made sure all the code makes sense, compiles, and the GitHub version matches the book version.

Writing a book is definitely a team effort. Without all the great feedback I received it would've been a pretty crappy book. Many thanks to Michael Stephens, Elesha Hyde, Mike Shepard, German Gonzales, and all the other great folks at Manning for all their help. I'm also grateful to all the external reviewers for their time.

I learned that the first chapter is the one that usually requires the most refactoring, which was definitely true in my case.

Writing

Because I started with a list of topics I wanted to cover and had a general idea of what to write in each chapter, things went pretty smoothly. I was able to write a chapter in a weekend so I was ahead of schedule most of the time.

Turns out that writing a chapter is not the hard part, the grind comes when you have to edit, re-edit, and incorporate all the rounds of feedback. The editing part is much less creative than starting from a blank page, but is equally if not more important.

I completed the draft mid-May and incorporated the last round of external feedback by July. Beyond the chapters, there are also the front matter, inside covers, and appendices, which I worked on in July/August.

All in all it took about a year from when I started outlining the book on GitHub to having the final version ready to go to production.

Production

From then on, the book goes to production and another set of people goes over the spelling, layout, indexing and so on. At this stage, my responsibility was mostly to review changes and suggest minor updates.

Last week I received the final PDF version of what the book will look like in print, and today I was notified it is being sent to the printer.

Writing the book was a great experience and I got the chance to work with some wonderful people. I hope the final product is a good book which leads to better software.

Higher Kinded Types: Monads

Sat, 07 Sep 2019 00:00:00 -0700

Higher Kinded Types: Monads

Make sure to read the previous post first, Higher Kinded Types: Functors.

Monads

You have probably heard the term monad, as it's been getting a lot of attention lately. Monads are making their way into mainstream programming, so you should know one when you see it. Building on the previous blog post, in this post we will explain what a monad is and how it is useful. We'll start with a few examples and then look at the general definition.

Result or Error

In the previous post, we had a readNumber() function that returned number | undefined. We used functors to sequence processing with square() and stringify(), so that if readNumber() returns undefined, no processing happens, and the undefined is propagated through the pipeline.

This type of sequencing works with functors as long as only the first function - in this case, readNumber() - can return an error. But what happens if any of the functions we want to chain can error out? Let's say that we want to open a file, read its content as a string, and then deserialize that string into a Cat object.

We have an openFile() function that returns an Error or a FileHandle. Errors can occur if the file doesn't exist, if it is locked by another process, or if the user doesn't have permission to open it. If the operation succeeds, we get back a handle to the file.

We have a readFile() function that takes a FileHandle and returns either an Error or a string. Errors can occur if the file can't be read, perhaps due to being too large to fit in memory. If the file can be read, we get back a string.

Finally, the deserializeCat() function takes a string and returns an Error or a Cat instance. Errors can occur if the string can't be deserialized into a Cat object, perhaps due to missing properties.

All these functions follow the return result or error pattern, which suggests returning either a valid result or an error from a function, but not both. The return type will be an Either:

declare function openFile(
    path: string): Either<Error, FileHandle>;

declare function readFile(
    handle: FileHandle): Either<Error, string>;

declare function deserializeCat(
    serializedCat: string): Either<Error, Cat>;

We are omitting the implementations, as they are not important. Let's also quickly see the implementation of Either:

class Either<TLeft, TRight> {
    private readonly value: TLeft | TRight;
    private readonly left: boolean;

    private constructor(value: TLeft | TRight, left: boolean) {
        this.value = value;
        this.left = left;
    }

    isLeft(): boolean {
        return this.left;
    }

    getLeft(): TLeft {
        if (!this.isLeft()) throw new Error();

        return <TLeft>this.value;
    }

    isRight(): boolean {
        return !this.left;
    }

    getRight(): TRight {
        if (!this.isRight()) throw new Error();

        return <TRight>this.value;
    }

    static makeLeft<TLeft, TRight>(value: TLeft) {
        return new Either<TLeft, TRight>(value, true);
    }

    static makeRight<TLeft, TRight>(value: TRight) {
        return new Either<TLeft, TRight>(value, false);
    }
}

The type wraps a value of either TLeft or TRight and a flag to keep track of which type is used. It has a private constructor, as we need to make sure that the value and boolean flag are in sync. Attempting to get a TLeft when we have a TRight, or vice versa, throws an error. The factory functions call the constructor and ensure that the boolean flag is consistent with the value.

Now let's see how we could chain these functions together into a readCatFromFile() function that takes a file path as an argument and returns an Error if anything went wrong along the way, or a Cat instance:

function readCatFromFile(path: string): Either<Error, Cat> {
    let handle: Either<Error, FileHandle> = openFile(path);

    if (handle.isLeft()) return Either.makeLeft(handle.getLeft());

    let content: Either<Error, string> = readFile(handle.getRight());

    if (content.isLeft()) return Either.makeLeft(content.getLeft());

    return deserializeCat(content.getRight());
}

This function is very similar to the first implementation of process() in the previous blog post. There, we provided an updated implementation that removed all the branching and error checking from the function and delegated those tasks to map(). Let's see what a map() for Either would look like. We will follow the convention Right is right; left is error, which means that TLeft contains an error, so map() will just propagate it. map() will apply a given function only if the Either contains a TRight:

namespace Either {
    export function map<TLeft, TRight, URight>(
        value: Either<TLeft, TRight>,
        func: (value: TRight) => URight): Either<TLeft, URight> {
        if (value.isLeft()) return Either.makeLeft(value.getLeft());

        return Either.makeRight(func(value.getRight()));
    }
}

There is a problem with using map(), though: the type of function it expects as an argument is incompatible with the functions we are using. With map(), after we call openFile() and get back an Either<Error, FileHandle>, we would need a function (value: FileHandle) => string to read its content. That function can't itself return an Error, just like square() or stringify(). But in our case, readFile() itself can fail, so it doesn't return string, it returns Either<Error, string>. If we attempt to use it in our readCatFromFile(), we get a compilation error:

function readCatFromFile(path: string): Either<Error, Cat> {
    let handle: Either<Error, FileHandle> = openFile(path);

    let content: Either<Error, string> = Either.map(handle, readFile);

    /* ... */
}

This fails to compile due to a type mismatch. The error message we get is

Type 'Either<Error, Either<Error, string>>' is not assignable to type 'Either<Error, string>'.

Our functor falls short here. Functors can propagate an initial error through the processing pipeline, but if every step in the pipeline can fail, functors no longer work. In the following figure, the black square represents an Error, and the white and black circles represent two types, such as FileHandle and string.

We can't use a functor in this case because the functor is defined to map a function from a white circle to a black circle. Unfortunately, our function returns a type already wrapped in an Either (an Either<Error, string>). We need an alternative to map() that can deal with this type of function.

map() from Either<Error, FileHandle> would need a function from FileHandle to string to produce an Either<Error, string>. Our readFile() function, on the other hand, is from FileHandle to Either<Error, string>.

This problem is easy to fix. We need a function similar to map() that accepts a function going from T to Either<Error, U>. The standard name for such a function is bind():

namespace Either {
    export function bind<TLeft, TRight, URight>(
        value: Either<TLeft, TRight>,
        func: (value: TRight) => Either<TLeft, URight>
        ): Either<TLeft, URight> {
        if (value.isLeft()) return Either.makeLeft(value.getLeft());

        return func(value.getRight());
    }
}

func() has a different type from the func() in map(). We can simply return the result of func(), as it has the same type as the result of bind().

As we can see, the implementation is even simpler than the one for map(): after we unpack the value, we simply return the result of applying func() to it. Let's use bind() to implement our readCatFromFile() function and get the desired branchless error propagation behavior:

function readCatFromFile(path: string): Either<Error, Cat> {
    let handle: Either<Error, FileHandle> = openFile(path);

    let content: Either<Error, string> =
        Either.bind(handle, readFile);

    return Either.bind(content, deserializeCat);
}

Unlike the map() version, this code works. Applying readFile() to handle gives us back an Either<Error, string>. deserializeCat() has the same return type as readCatFromFile(), so we simply return the result of bind().

This version seamlessly chains together openFile(), readFile(), and deserializeCat() so that if any of the functions fails, the error gets propagated as the result of readCatFromFile(). Again, branching is encapsulated in the bind() implementation, so our processing function is linear.
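To see the error propagation run end to end, here is a self-contained sketch. It makes several assumptions that are not in the original code: Either is written as a compact discriminated union rather than the class shown earlier, bind() is a free function rather than a namespace member, and openFile(), readFile(), and deserializeCat() are hypothetical stand-ins that work against an in-memory map instead of the file system.

```typescript
// A compact stand-in for Either, written as a discriminated union so the
// sketch stays short; it is not the Either class shown earlier.
type Either<TLeft, TRight> =
    | { kind: "left"; value: TLeft }
    | { kind: "right"; value: TRight };

function makeLeft<TLeft, TRight>(value: TLeft): Either<TLeft, TRight> {
    return { kind: "left", value };
}

function makeRight<TLeft, TRight>(value: TRight): Either<TLeft, TRight> {
    return { kind: "right", value };
}

function bind<TLeft, TRight, URight>(
    value: Either<TLeft, TRight>,
    func: (value: TRight) => Either<TLeft, URight>): Either<TLeft, URight> {
    if (value.kind === "left") return makeLeft<TLeft, URight>(value.value);

    return func(value.value);
}

interface FileHandle { path: string; }
interface Cat { name: string; }

// Hypothetical in-memory "file system", so the sketch performs no real IO.
const files = new Map<string, string>([
    ["cat.json", '{"name": "Tom"}'],
    ["empty.json", "{}"],
]);

function openFile(path: string): Either<Error, FileHandle> {
    return files.has(path)
        ? makeRight<Error, FileHandle>({ path })
        : makeLeft<Error, FileHandle>(new Error("File not found"));
}

function readFile(handle: FileHandle): Either<Error, string> {
    const content = files.get(handle.path);

    return content === undefined
        ? makeLeft<Error, string>(new Error("Cannot read file"))
        : makeRight<Error, string>(content);
}

function deserializeCat(serializedCat: string): Either<Error, Cat> {
    const parsed = JSON.parse(serializedCat);

    return typeof parsed.name === "string"
        ? makeRight<Error, Cat>({ name: parsed.name })
        : makeLeft<Error, Cat>(new Error("Missing name"));
}

function readCatFromFile(path: string): Either<Error, Cat> {
    return bind(bind(openFile(path), readFile), deserializeCat);
}

const tom = readCatFromFile("cat.json");       // succeeds at every step
const missing = readCatFromFile("nope.json");  // fails in openFile()
const invalid = readCatFromFile("empty.json"); // fails in deserializeCat()
```

Whichever step fails, readCatFromFile() returns that step's Error and the remaining steps are skipped, without any explicit branching in the pipeline itself.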

Difference between map() and bind()

Before moving on to define monads, let's take another simplified example and contrast map() and bind(). We'll again use Box<T>, a generic type that simply wraps a value of type T. Although this type is not particularly useful, it is the simplest generic type we can have. We want to focus on how map() and bind() work with values of types T and U in some generic context, such as Box<T>, Box<U> (or T[], U[]; or Optional<T>, Optional<U>; or Either<Error, T>, Either<Error, U> etc.).

For a Box<T>, a functor (map()) takes a Box<T> and a function from T to U and returns a Box<U>. The problem is that we have scenarios in which our functions are directly from T to Box<U>. This is what bind() is for. bind() takes a Box<T> and a function from T to Box<U> and returns the result of applying the function to the T inside Box<T>.

If we have a function stringify() that takes a number and returns its string representation, we can map() it on a Box<number> and get back a Box<string>:

namespace Box {
    export function map<T, U>(
        box: Box<T>,
        func: (value: T) => U): Box<U> {
        return new Box<U>(func(box.value));
    }
}

function stringify(value: number): string {
    return value.toString();
}

const s: Box<string>
    = Box.map(new Box(42), stringify);

If instead of stringify(), which goes from number to string, we have a boxify() function that goes from number directly to Box<string>, map() won't work. We'll need bind() instead:

namespace Box {
    export function bind<T, U>(
        box: Box<T>,
        func: (value: T) => Box<U>): Box<U> {
        return func(box.value);    
    }
}

function boxify(value: number): Box<string> {
    return new Box(value.toString());
}

const b: Box<string> =
    Box.bind(new Box(42), boxify);

The result of both map() and bind() is still a Box<string>. We still go from Box<T> to Box<U>; the difference is how we get there. In the map() case, we need a function from T to U. In the bind() case, we need a function from T to Box<U>.
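To make the contrast concrete, here is a small sketch of what happens if we pass boxify() to map() anyway. The standalone names mapBox() and bindBox() are only illustrative stand-ins for Box.map() and Box.bind(). The call type-checks, but the result is a box inside a box, which is not what we want:

```typescript
// Minimal Box type, mirroring the one used in this post.
class Box<T> {
    value: T;

    constructor(value: T) {
        this.value = value;
    }
}

// Illustrative standalone equivalents of Box.map() and Box.bind().
function mapBox<T, U>(box: Box<T>, func: (value: T) => U): Box<U> {
    return new Box<U>(func(box.value));
}

function bindBox<T, U>(box: Box<T>, func: (value: T) => Box<U>): Box<U> {
    return func(box.value);
}

function boxify(value: number): Box<string> {
    return new Box(value.toString());
}

// map() wraps the function's result again, so we get two layers of Box.
const nested: Box<Box<string>> = mapBox(new Box(42), boxify);

// bind() returns the function's result as-is, so we get a single layer.
const flat: Box<string> = bindBox(new Box(42), boxify);

console.log(nested.value.value); // needs two unwraps to reach the string
console.log(flat.value);         // needs one unwrap
```

Both expressions reach the same string; bind() simply avoids the extra layer of wrapping.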

The Monad Pattern

A monad consists of bind() and one more, simpler function. This other function takes a value of type T and wraps it into the generic type, such as Box<T>, T[], Optional<T>, or Either<Error, T>. This function is usually called return() or unit().

A monad allows structuring programs generically while encapsulating away boilerplate code needed by the program logic. With monads, a sequence of function calls can be expressed as a pipeline that abstracts away data management, control flow, or side effects.

Let's look at a few examples of monads. We can start with our simple Box type and add unit() to it to complete the monad:

namespace Box {
    export function unit<T>(value: T): Box<T> {
        return new Box(value);
    }

    export function bind<T, U>(
        box: Box<T>, 
        func: (value: T) => Box<U>): Box<U> {
        return func(box.value);    
    }
}

unit() simply calls Box's constructor to wrap the given value into an instance of Box. bind() unpacks the value from Box and calls func() on it.

The implementation is very straightforward. Let's look at the Optional monad functions:

namespace Optional {
    export function unit<T>(value: T): Optional<T> {
        return new Optional(value);
    }

    export function bind<T, U>(
        optional: Optional<T>,
        func: (value: T) => Optional<U>): Optional<U> {
        if (!optional.hasValue()) return new Optional();

        return func(optional.getValue());
    }
}

unit() takes a value of type T and wraps it into an Optional<T>. If the optional is empty, bind() returns an empty optional of type Optional<U>. If the optional contains a value, bind() returns the result of calling func() on it.

Very much as with functors, if a programming language can't express higher kinded types, we don't have a good way to specify a Monad interface. Instead, let's think of monads as a pattern:

A monad is a generic type H<T> for which we have a function like unit(), that takes a value of type T and returns a value of type H<T>, and a function like bind() that takes a value of type H<T> and a function from T to H<U>, and returns a value of type H<U>.

Bear in mind that because most languages use this pattern, without a way to specify an interface for the compiler to check, in many instances the two functions, unit() and bind(), may show up under different names. You may hear the term monadic, as in monadic error handling, which means that error handling follows the monad pattern.
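As a rough sketch of the pattern in use, the following chains two functions that can each fail to produce a value. The Optional class here is a compact variant of the one from the functors post, and unitOptional(), bindOptional(), parseNumber(), and reciprocal() are hypothetical names chosen for this example:

```typescript
// Compact Optional, assumed for this sketch only.
class Optional<T> {
    private value: T | undefined;
    private assigned: boolean;

    constructor(value?: T) {
        this.value = value;
        this.assigned = value !== undefined;
    }

    hasValue(): boolean { return this.assigned; }

    getValue(): T {
        if (!this.assigned) throw Error("Optional is empty");
        return this.value as T;
    }
}

function unitOptional<T>(value: T): Optional<T> {
    return new Optional<T>(value);
}

function bindOptional<T, U>(
    optional: Optional<T>,
    func: (value: T) => Optional<U>): Optional<U> {
    if (!optional.hasValue()) return new Optional<U>();

    return func(optional.getValue());
}

// Returns an empty Optional if the text is not a number.
function parseNumber(text: string): Optional<number> {
    const parsed = parseFloat(text);
    return isNaN(parsed) ? new Optional<number>() : new Optional(parsed);
}

// Returns an empty Optional for 0, which has no reciprocal.
function reciprocal(value: number): Optional<number> {
    return value === 0 ? new Optional<number>() : new Optional(1 / value);
}

// No branching here: bindOptional() decides whether to continue.
function reciprocalOf(text: string): Optional<number> {
    return bindOptional(
        bindOptional(unitOptional(text), parseNumber),
        reciprocal);
}

console.log(reciprocalOf("4").getValue());   // 0.25
console.log(reciprocalOf("0").hasValue());   // false
console.log(reciprocalOf("abc").hasValue()); // false
```

If either step comes back empty, the rest of the pipeline is skipped, which is the same branchless propagation we saw with Either.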

Next, we'll look at a few other examples.

The Continuation Monad

A promise represents the result of a computation that will happen sometime in the future. Promise<T> is the promise of a value of type T. We can schedule execution of asynchronous code by chaining promises, using the then() function.

Let's say we have a function that determines our location on the map. Because this function will work with the GPS, it may take longer to finish, so we make it asynchronous. It will return a promise of type Promise<Location>. Next, we have a function that, given a location, will contact a ride-sharing service to get us a Car:

declare function getLocation(): Promise<Location>;
declare function hailRideshare(
    location: Location): Promise<Car>;

let car: Promise<Car> = getLocation().then(hailRideshare);

When the promise returned by getLocation() resolves, hailRideshare() will be invoked with its result. This should look very familiar to you at this point. then() is just how Promise spells bind()!

We can also create an instantly resolved promise by using Promise.resolve(). This takes a value and returns a resolved promise containing that value, which is the Promise equivalent of unit().

Turns out chaining promises, an API available in virtually all mainstream programming languages, is monadic. It follows the same pattern that we saw in this section, but in a different domain. While dealing with error propagation, our monad encapsulated checking whether we have a value that we can continue operating on or have an error that we should propagate. With promises, the monad encapsulates the intricacies of scheduling and resuming execution. The pattern is the same, though.

The List Monad

Another commonly used monad is the list monad. Let's look at an implementation over sequences: a divisors() function that takes a number n and returns an array containing all of its divisors except 1 and n itself.

This straightforward implementation starts from 2 and goes up to half of n, and adds all numbers it finds that divide n without a remainder. There are more efficient ways to find all divisors of a number, but we'll stick to a simple algorithm in this case:

function divisors(n: number): number[] {
    let result: number[] = [];

    for (let i = 2; i <= n / 2; i++) {
        if (n % i == 0) {
            result.push(i);
        }
    }

    return result;
}

Now let's say we want to take an array of numbers and return an array containing all their divisors. We don't need to worry about dupes. One way to do this is to provide a function that takes an array of input numbers, applies divisors() to each of them, and joins the results of all the calls to divisors() into a final result:

function allDivisors(ns: number[]): number[] {
    let result: number[] = [];

    for (const n of ns) {
        result = result.concat(divisors(n));
    }

    return result;
}

It turns out that this pattern is common. Let's say that we have another function, anagram(), that generates all permutations of a string and returns an array of strings. If we want to get the set of all anagrams of an array of strings, we would end up implementing a very similar function:

declare function anagram(input: string): string[];

function allAnagrams(inputs: string[]): string[] {
    let result: string[] = [];

    for (const input of inputs) {
        result = result.concat(anagram(input));
    }

    return result;
}

allAnagrams() is very similar to allDivisors().

Now let's see whether we can replace allDivisors() and allAnagrams() with a generic function. This function would take an array of Ts and a function from T to an array of Us, and return an array of Us:

function bind<T, U>(
    inputs: T[],
    func: (value: T) => U[]): U[] {
    let result: U[] = [];

    for (const input of inputs) {
        result = result.concat(func(input));
    }

    return result;
}

function allDivisors(ns: number[]): number[] {
    return bind(ns, divisors);
}

function allAnagrams(inputs: string[]): string[] {
    return bind(inputs, anagram);
}

As you've probably guessed, this is the bind() implementation for the list monad. In the case of lists, bind() flattens the arrays returned by each call of the given function into a single array. While the error-propagating monad decides whether to propagate an error or apply a function and the continuation monad wraps scheduling, the list monad combines a set of results (a list of lists) into a single flat list. In this case, the box is a sequence of values.

The unit() implementation is trivial. Given a value of type T, it returns a list containing just that value. This monad generalizes to all kinds of lists: arrays, linked lists, and iterator ranges.
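A minimal sketch of the two list monad functions working together might look like the following. The names unitList() and bindList() are illustrative; bindList() is the same bind() as above, and divisors() is repeated so the snippet stands on its own:

```typescript
// unit() for lists: wrap a single value in a one-element array.
function unitList<T>(value: T): T[] {
    return [value];
}

// bind() for lists: apply func to each element and flatten the results.
function bindList<T, U>(
    inputs: T[],
    func: (value: T) => U[]): U[] {
    let result: U[] = [];

    for (const input of inputs) {
        result = result.concat(func(input));
    }

    return result;
}

function divisors(n: number): number[] {
    const result: number[] = [];

    for (let i = 2; i <= n / 2; i++) {
        if (n % i === 0) {
            result.push(i);
        }
    }

    return result;
}

// Divisors of 12.
const once: number[] = bindList(unitList(12), divisors);

// Divisors of the divisors of 12, flattened into a single list.
const twice: number[] = bindList(once, divisors);

console.log(once);  // [2, 3, 4, 6]
console.log(twice); // [2, 2, 3]
```

JavaScript's built-in Array.prototype.flatMap() has the same shape as bindList(): it maps a function over an array and flattens the results by one level.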

Category theory

Functors and monads come from category theory, a branch of mathematics that deals with structures consisting of objects and arrows between these objects. With these small building blocks, we can build up structures such as functors and monads. We won't go into its details now; we'll just say that multiple domains, like set theory and even type systems, can be expressed in category theory.

Haskell is a programming language that took a lot of inspiration from category theory, so its syntax and standard library make it easy to express concepts such as functors, monads, and other structures. Haskell fully supports higher kinded types.

Maybe because the building blocks of category theory are so simple, the abstractions we've been talking about are applicable across so many domains. We just saw that monads are useful in the context of error propagation, asynchronous code, and sequence processing.

Although most mainstream languages still treat monads as patterns instead of proper constructs, they are definitely useful structures that show up over and over in different contexts.

Other Monads

A couple of other common monads, which are popular in functional programming languages with pure functions (functions that don't have side effects) and immutable data, are the state monad and the IO monad. We'll provide only a high-level overview of these monads, but if you decide to learn a functional programming language such as Haskell, you will likely encounter them early in your journey.

The state monad encapsulates a piece of state that it passes along with a value. This monad enables us to write pure functions that, given a current state, produce a value and an updated state. Chaining these together with bind() allows us to propagate and update state through a pipeline without explicitly storing it in a variable, enabling purely functional code to process and update state.
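To give a feel for this, here is one possible way to sketch a state monad in TypeScript. Everything in it (the Stateful type, unitState(), bindState(), and the ticket example) is an assumption made for illustration, not a standard API. A stateful computation is modeled as a function from the current state to a pair of value and next state:

```typescript
// A computation that takes the current state and returns a value
// together with the updated state.
type Stateful<S, A> = (state: S) => [A, S];

// unit(): produce a value without touching the state.
function unitState<S, A>(value: A): Stateful<S, A> {
    return (state: S) => [value, state];
}

// bind(): run a computation, then feed its value and the updated
// state into the next computation.
function bindState<S, A, B>(
    computation: Stateful<S, A>,
    func: (value: A) => Stateful<S, B>): Stateful<S, B> {
    return (state: S) => {
        const [value, nextState] = computation(state);
        return func(value)(nextState);
    };
}

// Hypothetical example: hand out sequential ticket numbers.
// The state is the next free number.
function takeTicket(label: string): Stateful<number, string> {
    return (next: number) => [`${label}-${next}`, next + 1];
}

// The counter is never stored in a variable; bindState() threads it through.
const twoTickets: Stateful<number, string[]> =
    bindState(takeTicket("A"), (first) =>
        bindState(takeTicket("B"), (second) =>
            unitState<number, string[]>([first, second])));

const [tickets, finalState] = twoTickets(1);

console.log(tickets);    // ["A-1", "B-2"]
console.log(finalState); // 3
```

All the functions involved are pure: running twoTickets(1) again gives the same result, because the state only exists as arguments and return values.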

The IO monad encapsulates side effects. It allows us to implement pure functions that can still read user input or write to a file or terminal because the impure behavior is removed from the function and wrapped in the IO monad.

Higher Kinded Types: Functors

Fri, 06 Sep 2019 00:00:00 -0700

Higher Kinded Types: Functors

An Even More General Map

In the previous post we saw a generic map() implementation working on iterators. Iterators abstract data structure traversal, so map() can apply a function to elements in any data structure.

In the figure, map() takes an iterator over a sequence, in this case a list of circles, and a function which transforms a circle. map() applies the function to each element in the sequence, and produces a new sequence with the transformed elements.

function* map<T, U>(
    iter: Iterable<T>,
    func: (item: T) => U): IterableIterator<U> {
    for (const value of iter) {
        yield func(value);
    }
}

This implementation works on iterators, but we should be able to apply a function of the form (item: T) => U to other types too. Let's take, as an example, an Optional type:

class Optional<T> {
    private value: T | undefined;
    private assigned: boolean;

    constructor(value?: T) {
        if (value !== undefined) {
            this.value = value;
            this.assigned = true;
        } else {
            this.value = undefined;
            this.assigned = false;
        }
    }

    hasValue(): boolean {
        return this.assigned;
    }

    getValue(): T {
        if (!this.assigned) throw Error();

        return <T>this.value;
    }
}

It feels natural to be able to map a function (value: T) => U over an Optional<T>. If the optional contains a value of type T, mapping the function over it should return an Optional<U> containing the result of applying the function. On the other hand, if the optional doesn't contain a value, mapping would result in an empty Optional<U>.

Let's sketch out an implementation for this. We'll put this function in a namespace. TypeScript doesn't let us define several standalone functions with the same name and separate implementations, so a namespace lets us reuse the name map() while making it clear to the compiler which function we are calling. Here's the Optional map() implementation:

namespace Optional {
    export function map<T, U>(
        optional: Optional<T>,
        func: (value: T) => U): Optional<U> {
        if (optional.hasValue()) {
            return new Optional<U>(func(optional.getValue()));
        } else {
            return new Optional<U>();
        }
    }
}

export simply makes the function visible outside the namespace. If the optional has a value, we extract it, pass it to func(), and use its result to initialize an Optional<U>. If the optional is empty, we create a new empty Optional<U>.
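Here is a short usage sketch. To keep it self-contained, it uses a compact Optional and a standalone mapOptional() function, both assumed for this example, in place of the class and Optional.map() shown above:

```typescript
// Compact Optional, assumed for this sketch only.
class Optional<T> {
    private value: T | undefined;
    private assigned: boolean;

    constructor(value?: T) {
        this.value = value;
        this.assigned = value !== undefined;
    }

    hasValue(): boolean { return this.assigned; }

    getValue(): T {
        if (!this.assigned) throw Error("Optional is empty");
        return this.value as T;
    }
}

// Same logic as Optional.map(), written as a standalone function.
function mapOptional<T, U>(
    optional: Optional<T>,
    func: (value: T) => U): Optional<U> {
    if (optional.hasValue()) {
        return new Optional<U>(func(optional.getValue()));
    } else {
        return new Optional<U>();
    }
}

const greeting = new Optional<string>("hello");
const nothing = new Optional<string>();

// The same function is mapped over a full and an empty optional.
const greetingLength = mapOptional(greeting, (text) => text.length);
const nothingLength = mapOptional(nothing, (text) => text.length);

console.log(greetingLength.getValue()); // 5
console.log(nothingLength.hasValue());  // false
```

The lambda never has to consider the empty case; it only ever sees a string.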

We can do something very similar with the TypeScript sum type T or undefined. The Optional we just saw is a DIY version of such a type that works even in languages which don't support sum types natively, but TypeScript does. Let's see how we can map over a native optional type T | undefined.

Mapping a function (value: T) => U over T | undefined should apply the function and return its result if we have a value of type T, or return undefined if we start with undefined:

namespace SumType {
    export function map<T, U>(
        value: T | undefined,
        func: (value: T) => U): U | undefined {
        if (value == undefined) {
            return undefined;
        } else {
            return func(value);
        }
    }
}

These types can't be iterated over, but it still makes sense for a map() function to exist for them. Let's define another simple generic type, Box<T>. This type simply wraps a value of type T:

class Box<T> {
    value: T;

    constructor(value: T) {
        this.value = value;
    }
}

Can we map a function (value: T) => U over this type? We can. As you might have guessed, map() for Box<T> would return a Box<U>: it will take the value T out of Box<T>, apply the function to it, and put the result back into a Box<U>.

namespace Box {
    export function map<T, U>(
        box: Box<T>,
        func: (value: T) => U): Box<U> {
        return new Box<U>(func(box.value));
    }
}

There are many generic types over which we can map functions. Why is this useful? It's useful because map(), just like iterators, provides another way to decouple types which store data from functions which operate on that data.

Processing Results or Propagating Errors

As a concrete example, let's take a couple of functions which process a numerical value. We'll implement a simple square(), a function which takes a number as an argument and returns its square. We'll also implement stringify(), a function which takes a number as an argument and returns its string representation:

function square(value: number): number {
    return value ** 2;
}

function stringify(value: number): string {
    return value.toString();
}

Now let's say we have a function readNumber(), which reads a numeric value from a file. Since we are dealing with input, we might run into some problems: what if the file doesn't exist or can't be opened? In that case, readNumber() will return undefined. We won't look at the implementation of this function; the important thing for our example is its return type:

function readNumber(): number | undefined {
    /* Implementation omitted */
}

If we want to read a number and process it by applying square() to it first, then stringify(), we need to ensure we actually have a numerical value as opposed to undefined. A possible implementation is to convert from number | undefined to just number using if statements wherever needed:

function process(): string | undefined {
    let value: number | undefined = readNumber();

    if (value == undefined) return undefined;

    return stringify(square(value));
}

We have two functions that operate on numbers, but since our input can also be undefined, we need to explicitly handle that case. This is not particularly bad, but in general the less branching our code has, the less complex it is. It is easier to understand and to maintain, and there are fewer opportunities for bugs. Another way to look at this is that process() itself simply propagates undefined; it doesn't do anything useful with it. It would be better if we could keep process() responsible for processing, and let someone else handle error cases. How can we do this? With the map() we implemented for sum types:

namespace SumType {
    export function map<T, U>(
        value: T | undefined,
        func: (value: T) => U): U | undefined {
        if (value == undefined) {
            return undefined;
        } else {
            return func(value);
        }
    }
}

function process(): string | undefined {
    let value: number | undefined = readNumber();

    let squaredValue = SumType.map(value, square);

    return SumType.map(squaredValue, stringify);
}

Instead of explicitly checking for undefined, we call map() to apply square() on the value. If it is undefined, map() will give us back undefined. Just like with square(), we map() our stringify() function on the squaredValue. If it is undefined, map() will return undefined.

Now our process() implementation has no branching -- the responsibility of unpacking number | undefined into a number and checking for undefined is handled by map(). map() is generic and can be used across many other types (like string | undefined) and in many other processing functions.
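As a quick illustration of that reuse, the same map() works unchanged for string | undefined. In this sketch mapSumType() is a standalone copy of SumType.map(), and findNickname() and shout() are hypothetical helpers:

```typescript
// Standalone copy of SumType.map().
function mapSumType<T, U>(
    value: T | undefined,
    func: (value: T) => U): U | undefined {
    if (value === undefined) {
        return undefined;
    } else {
        return func(value);
    }
}

// Hypothetical lookup which may not find anything.
const nicknames: { [name: string]: string } = { robert: "Bob" };

function findNickname(name: string): string | undefined {
    return nicknames[name];
}

// A function that only knows about strings, not about undefined.
function shout(text: string): string {
    return text.toUpperCase() + "!";
}

const found = mapSumType(findNickname("robert"), shout);
const missing = mapSumType(findNickname("alice"), shout);

console.log(found);   // "BOB!"
console.log(missing); // undefined
```

shout() stays a plain string function, and the undefined case is handled once, inside mapSumType().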

In our case, since square() is guaranteed to return a number, we can create a small lambda which chains square() and stringify(), and pass that to map():

function process(): string | undefined {
    let value: number | undefined = readNumber();

    return SumType.map(value,
        (value: number) => stringify(square(value)));
}

This is a functional implementation of process(), in which the error propagation is delegated to map(). We'll talk more about error handling in a later blog post, when we will discuss monads. For now, let's look at another application of map().

Mix-and-match Function Application

Without the map() family of functions, if we have a square() function which squares a number, we would have to implement some additional logic to get a number from a number | undefined sum type. Similarly, we would have to implement some additional logic to get a value from a Box<number>, and package it back in a Box<number>:

function squareSumType(value: number | undefined)
    : number | undefined {
    if (value == undefined) return undefined;

    return square(value);
}

function squareBox(box: Box<number>): Box<number> {
    return new Box(square(box.value));
}

So far this isn't too bad. But what if we want something similar with stringify()? We'll again end up writing two functions which look a lot like the previous ones:

function stringifySumType(value: number | undefined)
    : string | undefined {
    if (value == undefined) return undefined;

    return stringify(value);
}

function stringifyBox(box: Box<number>): Box<string> {
    return new Box(stringify(box.value));
}

This starts to look like duplicate code, which we would rather avoid. If we have map() functions available for number | undefined and Box<number>, they provide the abstraction to remove the duplicate code. We can pass either square() or stringify() to either SumType.map() or to Box.map(), no additional code needed:

let x: number | undefined = 1;
let y: Box<number> = new Box(42);

console.log(SumType.map(x, stringify));
console.log(Box.map(y, stringify));

console.log(SumType.map(x, square));
console.log(Box.map(y, square));

Now let's define this family of map() functions.

Functors and Higher Kinded Types

What we just talked about in this section are functors.

A functor is a generalization of functions that perform mapping operations. For any generic type like Box<T>, a map() operation which takes a Box<T> and a function from T to U and produces a Box<U> is a functor.

In the figure we have a generic type H<T> which contains 0, 1, or more values of some type T, and a function from T to U. In this case T is an empty circle and U is a full circle. The map() functor unpacks the T or Ts from the H<T> instance, applies the function, then places the result back into an H<U>.

Functors are extremely powerful concepts, but most mainstream languages do not have a good way to express them. That's because the general definition of a functor relies on higher kinded types.

A generic type is a type which has a type parameter, for example a generic type T, or a type like Box<T>, which has a type parameter T. A higher kinded type, just like a higher-order function, represents a type parameter with another type parameter. For example, T<U> or Box<T<U>> have a type parameter T which, in turn, has a type parameter U.

Since we don't have a good way to express higher kinded types in TypeScript, C#, or Java, we can't define a construct using the type system to express a functor. Languages like Haskell and Idris, with more powerful type systems, make these definitions possible. In our case though, since we can't enforce this capability through the type system, we can think of it more as a pattern.

We can say a functor is any type H with a type parameter T (H<T>) for which we have a function map() which takes an argument of type H<T>, and a function from T to U, and returns a value of type H<U>.

Alternately, if we want to be more object-oriented, we can make map() a member function and say H<T> is a functor if it has a method map() which takes a function from T to U and returns a value of type H<U>.

To see exactly where the type system is lacking, we can try to sketch out an interface for it. Let's call this interface Functor and have it declare map():

interface Functor<T> {
    map<U>(func: (value: T) => U): Functor<U>;
}

We can update Box to implement this interface:

class Box<T> implements Functor<T> {
    value: T;

    constructor(value: T) {
        this.value = value;
    }

    map<U>(func: (value: T) => U): Box<U> {
        return new Box(func(this.value));
    }
}

This code compiles; the only problem is that it isn't specific enough. Calling map() on Box<T> returns an instance of type Box<U>. But if we work with Functor interfaces, we see that the map() declaration specifies it returns a Functor<U>, not a Box<U>. We need a way to specify, when we declare the interface, exactly what the return type of map() will be (in this case Box<U>).

We would like to be able to say this interface will be implemented by a type H with a type argument T. The following code shows what this declaration would look like if TypeScript supported higher kinded types. It doesn't compile:

interface Functor<H<T>> {
    map<U>(func: (value: T) => U): H<U>;
}

class Box<T> implements Functor<Box<T>> {
    value: T;

    constructor(value: T) {
        this.value = value;
    }

    map<U>(func: (value: T) => U): Box<U> {
        return new Box(func(this.value));
    }
}

Lacking this, let's just think of our map() implementations as a pattern for applying functions to generic types, or values in some box.

Functors for Functions

Note that we also have functors over functions. Given a function with any number of arguments that returns a value of type T, we can map a function which takes a T and produces a U over it, and end up with a function which takes the same inputs as the original function and returns a value of type U. map() in this case is simply function composition.

Mapping a function over another function composes the two functions. The result is a function which takes the same arguments as the original function and returns a value of the second function's return type. The two functions need to be compatible -- the second function must expect an argument of the same type as the one returned by the original function.

As an example, let's take a function which takes two arguments of type T, and produces a value of type T and implement its corresponding map(). This will return a function which takes two arguments of type T and returns a value of type U:

namespace Function {
    export function map<T, U>(
        f: (arg1: T, arg2: T) => T,
        func: (value: T) => U): (arg1: T, arg2: T) => U {
        return (arg1: T, arg2: T) => func(f(arg1, arg2));
    }
}

map() takes a function (T, T) => T, and a function T => U to map over it. It returns a lambda function (T, T) => U.

Let's map stringify() over a function add(), which takes two numbers and returns their sum. The result is a function which takes two numbers and returns a string, the stringified result of adding the two numbers:

function add(x: number, y: number): number {
    return x + y;
}

function stringify(value: number): string {
    return value.toString();
}

const result: string = Function.map(add, stringify)(40, 2);

Summary

map() generalizes beyond iterators, to other generic types.
Functors encapsulate data unboxing, with applications in composition and error propagation.
With higher kinded types, we can express constructs like functors using generics which themselves have type parameters.

Common Algorithms

Sat, 10 Aug 2019 00:00:00 -0700

Common Algorithms

A Few Common Algorithms

There are many algorithms commonly used to process a sequence of data. Let's list a few of them. We will not look at the implementation, just describe what arguments besides the iterable they expect and how they process the data. We'll also mention some synonyms under which the algorithm might appear.

map() takes a sequence of T values, a function (value: T) => U and returns a sequence of U values obtained by applying the function to all the elements in the sequence. It is also known as fmap() or select().
filter() takes a sequence of T values, a predicate (value: T) => boolean and returns a sequence of T values containing all the items for which the predicate returns true. It is also known as where().
reduce() takes a sequence of T values, an initial value of type T, and an operation which combines two T values into one (x: T, y: T) => T. It returns a single value T after combining all the elements in the sequence using the operation. It is also known as fold(), collect(), accumulate(), aggregate().
any() takes a sequence of T values and a predicate (value: T) => boolean. It returns true if any one of the elements of the sequence satisfies the predicate.
all() takes a sequence of T values and a predicate (value: T) => boolean. It returns true if all of the elements of the sequence satisfy the predicate.
none() takes a sequence of T values and a predicate (value: T) => boolean. It returns true if none of the elements of the sequence satisfy the predicate.
take() takes a sequence of T values and a number n. It returns a sequence consisting of the first n elements of the original sequence. It is also known as limit().
drop() takes a sequence of T values and a number n. It returns a sequence consisting of all the elements of the original sequence except the first n. The first n elements are dropped. It is also known as skip().
zip() takes a sequence of T values and a sequence of U values. It returns a sequence containing pairs of T and U values, effectively zipping together the two sequences.
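Although in practice we would reach for a library, a rough sketch of a few of these helps pin down the descriptions above. The following generator-based take(), drop(), and zip() are one possible implementation, not how any particular library does it; in particular, having zip() stop at the end of the shorter sequence is a design choice assumed here:

```typescript
// Yields the first n elements of the sequence.
function* take<T>(items: Iterable<T>, n: number): IterableIterator<T> {
    if (n <= 0) return;

    let taken = 0;
    for (const item of items) {
        yield item;
        taken++;
        if (taken >= n) return;
    }
}

// Skips the first n elements and yields the rest.
function* drop<T>(items: Iterable<T>, n: number): IterableIterator<T> {
    let skipped = 0;
    for (const item of items) {
        if (skipped < n) {
            skipped++;
            continue;
        }
        yield item;
    }
}

// Pairs up elements from both sequences, stopping when either runs out.
function* zip<T, U>(
    first: Iterable<T>,
    second: Iterable<U>): IterableIterator<[T, U]> {
    const firstIter = first[Symbol.iterator]();
    const secondIter = second[Symbol.iterator]();

    while (true) {
        const x = firstIter.next();
        const y = secondIter.next();
        if (x.done || y.done) return;
        yield [x.value, y.value];
    }
}

const numbers: number[] = [1, 2, 3, 4, 5];

console.log(Array.from(take(numbers, 2)));          // [1, 2]
console.log(Array.from(drop(numbers, 2)));          // [3, 4, 5]
console.log(Array.from(zip(numbers, ["a", "b"])));  // [[1, "a"], [2, "b"]]
```

Because these are generators, nothing is computed until the caller iterates over the result, so take() can also be applied to a very long or unbounded sequence.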

There are many more algorithms for sorting, reversing, splitting and concatenating sequences. The good news is that, because these algorithms are so useful and generally applicable, we don't need to implement them. Most languages have libraries which provide these algorithms and more. For JavaScript, there is the underscore.js package and the lodash package, both providing a plethora of such algorithms (at the time of writing, these libraries don't support iterators, only the JavaScript built-in array and object types). In Java, they are found in the java.util.stream package. In C# they are in the System.Linq namespace. In C++ they are found in the standard library <algorithm> header.

Algorithms Instead of Loops

While you might be surprised, a good rule of thumb is to check, whenever you find yourself writing a loop, whether there is a library algorithm or a pipeline that can do the job. Usually we write loops to process a sequence - exactly what the algorithms we talked about do.

The reason to prefer library algorithms to custom code in loops is that there is less opportunity for mistakes: library algorithms are tried and tested, implemented efficiently, and the code we end up with is easier to understand as the operations are spelled out.
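As a small illustration, here is the same computation (the sum of the squares of the odd numbers in an array) written first as a loop and then with the built-in array methods filter(), map(), and reduce(). The variable names are just for this example:

```typescript
const inputs: number[] = [1, 2, 3, 4, 5, 6];

// Hand-written loop: filtering, transforming, and accumulating
// are all tangled together in the loop body.
let loopTotal = 0;
for (const input of inputs) {
    if (input % 2 === 1) {
        loopTotal += input * input;
    }
}

// Pipeline of library algorithms: each step is spelled out by name.
const pipelineTotal = inputs
    .filter((input) => input % 2 === 1)
    .map((input) => input * input)
    .reduce((x, y) => x + y, 0);

console.log(loopTotal);     // 35
console.log(pipelineTotal); // 35
```

Both versions compute the same value; the second one reads as a description of what is being done rather than how.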

Implementing a Fluent Pipeline

Most libraries also provide a fluent API to chain algorithms together into a pipeline. Fluent APIs are APIs based on method chaining, making the code much easier to read. To see the difference between a fluent and a non-fluent API, let's take a look at a simple filter/reduce pipeline.

Let's start with a simple implementation of the two algorithms. To implement filter() we can use a generator. We take an Iterable<T> as the input sequence and a predicate from T to boolean, and return another sequence as an IterableIterator<T>. IterableIterator is the return type of all generators in TypeScript. The function will simply traverse the sequence and for each element, if the predicate returns true, yield the element to the caller:

function *filter<T>(
    items: Iterable<T>,
    pred: (x: T) => boolean)
    :IterableIterator<T> {
    for (const item of items) {
        if (pred(item)) {
            yield item;
        }
    }
}

reduce() takes an Iterable<T> as the input sequence and an initial value of type T. It also takes a function (T, T) => T which combines (reduces) two values of type T into one. This function iterates over the sequence and reduces all the elements to a single value, which it returns:

function reduce<T>(
    items: Iterable<T>,
    init: T,
    op: (x: T, y: T) => T)
    : T {    
    let result: T = init;

    for (const item of items) {
        result = op(result, item);    
    }

    return result;
}

Now let's look at how we could combine these algorithms into a pipeline which sums up all even values of an array. We will pass the array to filter() first, with a predicate which returns true for even numbers. Next, we will reduce the resulting sequence using an initial value of 0 and the function (x, y) => x + y:

const sequence: number[] = [1, 2, 3, 4, 5, 6];

const result: number = 
    reduce(
        filter(
            sequence,
            (value) => value % 2 == 0),
        0,
        (x, y) => x + y);

console.log(result);

Even though we apply filter() first, then pass the result to reduce(), if we read the code from left to right, we see reduce() before filter(). It's also a bit hard to make sense of which arguments go with which function in the pipeline. Fluent APIs make the code much easier to read. Currently, all our algorithms take an iterable as the first argument and return an iterable. We can use object-oriented programming to improve our API. We can put all our algorithms into a class which wraps an iterable. Then we can call any of them without explicitly providing an iterable as the first argument - the iterable is a member of the class. Let's do this for map(), filter(), and reduce(), by grouping them into a new FluentIterable class wrapping an iterable:

class FluentIterable<T> {
    iter: Iterable<T>;

    constructor(iter: Iterable<T>) {
        this.iter = iter;
    }

    *map<U>(func: (item: T) => U)
        : IterableIterator<U> {
        for (const value of this.iter) {
            yield func(value);
        }
    }

    *filter(pred: (item: T) => boolean)
        : IterableIterator<T> {
        for (const value of this.iter) {
            if (pred(value)) {
                yield value;
            }
        }
    }

    reduce(init: T, op: (x: T, y: T) => T)
        : T {
        let result: T = init;

        for (const value of this.iter) {
            result = op(result, value);
        }

        return result;
    }
}

We can create a FluentIterable<T> out of an Iterable<T>, so we can rewrite our filter/reduce pipeline into a more fluent form. We create a FluentIterable, call filter() on it, then we create a new FluentIterable out of its result, and call reduce() on it:

const sequence: number[] = [1, 2, 3, 4, 5, 6];

const result: number =
    new FluentIterable(
        new FluentIterable(
            sequence  
        ).filter((value) => value % 2 == 0)    
    ).reduce(0, (x, y) => x + y);    

console.log(result);

Now filter() appears before reduce(), and it's very clear which arguments go to which function. The only problem is we need to create a new FluentIterable after each function call. We can improve our API by having our map() and filter() functions return a FluentIterable instead of the default IterableIterator. Note we don't need to change reduce(), because reduce() returns a single value of type T, not an iterable.

Since we're using generators, we can't simply change the return type. Generators give us convenient syntax for writing functions that produce sequences, but they always return an IterableIterator. What we can do instead is move the implementations to a couple of private methods, mapImpl() and filterImpl(), and handle the conversion from IterableIterator to FluentIterable in the public map() and filter() methods:

class FluentIterable<T> {
    iter: Iterable<T>;

    constructor(iter: Iterable<T>) {
        this.iter = iter;
    }

    map<U>(func: (item: T) => U)
        : FluentIterable<U> {
        return new FluentIterable(this.mapImpl(func));    
    }

    private *mapImpl<U>(func: (item: T) => U)
        : IterableIterator<U> {
        for (const value of this.iter) {    
            yield func(value);
        }
    }

    filter(pred: (item: T) => boolean)
        : FluentIterable<T> {
        return new FluentIterable(this.filterImpl(pred));    
    }

    private *filterImpl(pred: (item: T) => boolean)
        : IterableIterator<T> {
        for (const value of this.iter) {    
            if (pred(value)) {
                yield value;
            }
        }
    }

    reduce(init: T, op: (x: T, y: T) => T)
        : T {    
        let result: T = init;

        for (const value of this.iter) {
            result = op(result, value);
        }

        return result;
    }
}

With this updated implementation, we can more easily chain the algorithms, as each returns a FluentIterable, which contains all the algorithms as methods:

const sequence: number[] = [1, 2, 3, 4, 5, 6];

const result: number =
    new FluentIterable(sequence)
        .filter((value) => value % 2 == 0)    
        .reduce(0, (x, y) => x + y);    

console.log(result);

Now, in true fluent fashion, the code reads easily from left to right and we can chain any number of algorithms that make up our pipeline with a very natural syntax. Most algorithm libraries take a similar approach, making it as easy as possible to chain multiple algorithms together.
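To illustrate chaining more than two steps, here is a minimal, self-contained sketch. It repeats the FluentIterable class from above in condensed form so it can run on its own, and adds a map() step to the pipeline. The helper name sumOfEvenSquares is made up for this example.

```typescript
// Condensed copy of the FluentIterable class from above.
class FluentIterable<T> {
    iter: Iterable<T>;

    constructor(iter: Iterable<T>) {
        this.iter = iter;
    }

    map<U>(func: (item: T) => U): FluentIterable<U> {
        return new FluentIterable(this.mapImpl(func));
    }

    private *mapImpl<U>(func: (item: T) => U): IterableIterator<U> {
        for (const value of this.iter) {
            yield func(value);
        }
    }

    filter(pred: (item: T) => boolean): FluentIterable<T> {
        return new FluentIterable(this.filterImpl(pred));
    }

    private *filterImpl(pred: (item: T) => boolean): IterableIterator<T> {
        for (const value of this.iter) {
            if (pred(value)) {
                yield value;
            }
        }
    }

    reduce(init: T, op: (x: T, y: T) => T): T {
        let result: T = init;
        for (const value of this.iter) {
            result = op(result, value);
        }
        return result;
    }
}

// Hypothetical helper: keep the even values, square them, then add them up.
function sumOfEvenSquares(values: Iterable<number>): number {
    return new FluentIterable(values)
        .filter((value) => value % 2 === 0)
        .map((value) => value * value)
        .reduce(0, (x, y) => x + y);
}

console.log(sumOfEvenSquares([1, 2, 3, 4, 5, 6])); // 4 + 16 + 36 = 56
```

Because map() and filter() are backed by generators, no intermediate arrays are built: each value flows through the whole chain only when reduce() iterates.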

Depending on the programming language, one downside of a fluent API approach is that our FluentIterable ends up containing all the algorithms, so it is difficult to extend - if it is part of a library, calling code can't easily add a new algorithm without modifying the class. C# provides extension methods, which enable us to add methods to a class or interface without modifying its code. Not all languages have such features though. That being said, in most situations you should be using an existing algorithm library, not implementing a new one from scratch.
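TypeScript has no extension methods, but a rough approximation is possible: declare the extra method on an interface with the same name as the class (declaration merging), then attach the implementation to the prototype. The sketch below is only an illustration under stated assumptions: it uses a pared-down FluentIterable and a hypothetical take() algorithm, neither of which comes from a real library. If the class lived in another module, the interface would need to go inside a declare module block instead.

```typescript
// Pared-down stand-in for a library class that calling code cannot edit.
class FluentIterable<T> {
    iter: Iterable<T>;

    constructor(iter: Iterable<T>) {
        this.iter = iter;
    }

    reduce(init: T, op: (x: T, y: T) => T): T {
        let result: T = init;
        for (const value of this.iter) {
            result = op(result, value);
        }
        return result;
    }
}

// Calling code: declare the new method on an interface with the same name.
// TypeScript merges this declaration into the class type.
interface FluentIterable<T> {
    take(count: number): FluentIterable<T>;
}

// Hypothetical algorithm: yield at most `count` values from the sequence.
function* takeImpl<T>(iter: Iterable<T>, count: number): IterableIterator<T> {
    if (count <= 0) {
        return;
    }
    let taken = 0;
    for (const value of iter) {
        yield value;
        taken += 1;
        if (taken >= count) {
            return;
        }
    }
}

// Attach the implementation at run time.
FluentIterable.prototype.take = function <T>(
    this: FluentIterable<T>,
    count: number
): FluentIterable<T> {
    return new FluentIterable(takeImpl(this.iter, count));
};

const firstThreeSum: number = new FluentIterable([1, 2, 3, 4, 5, 6])
    .take(3)
    .reduce(0, (x, y) => x + y);

console.log(firstThreeSum); // 1 + 2 + 3 = 6
```

The trade-off is that the prototype is patched globally, so two pieces of calling code adding a method with the same name would conflict. A plain function that takes and returns a FluentIterable avoids that, at the cost of breaking the chain.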

A Switchless State Machine

Tue, 16 Jul 2019 00:00:00 -0700

A Switchless State Machine

This blog post is an excerpt from my book, Programming with Types. The code samples are in TypeScript.

Early Programming with Types

While working on an early draft of the book, I wrote a small script to help me keep the source code in sync with the text. The draft was written in the popular Markdown format. I kept the source code in separate TypeScript files so I could compile them and ensure that, even if I update the code samples, they still work.

I needed a way to ensure that the Markdown text always contains the latest code samples. The code samples always appear between a line containing ```ts and a line containing ```. When generating HTML from the Markdown source, ```ts is interpreted as the beginning of a TypeScript code block, which gets rendered using TypeScript syntax highlighting, while ``` marks the end of that code block. The contents of these code blocks had to be inlined from actual TypeScript source files which I could compile and validate outside of the text.

The figure shows two TypeScript (.ts) files containing code samples which should be inlined in the Markdown document between ```ts and ``` markers. The comments annotate the code samples for my script.

To determine which code sample goes where, I relied on a small trick: Markdown allows raw HTML in the document text. I annotated each code sample with an HTML comment identifying the sample. HTML comments do not get rendered, so when converting Markdown to HTML, these became invisible. On the other hand, my script could use them to determine which code sample to inline where.

Once all code samples were loaded from disk, I had to process each Markdown document of the draft and produce an updated version as follows:

In text processing mode, simply copy each line of the input text to the output document as-is. Once a marker (one of the HTML comments) is encountered, grab the corresponding code sample and switch to marker processing mode.
In marker processing mode, again copy each line of the input text to the output document until we encounter a code block marker (```ts). Once the code marker is encountered, output the latest version of the code sample as loaded from the TypeScript file and switch to code processing mode.
In code processing mode, we already ensured the latest version of the code is in the output document, so we can skip over the potentially outdated version in the code block. That means we skip each line until we encounter the end of code block marker (```). Then we switch back to text processing mode.

With each run, the existing code samples in the document that are preceded by a marker get updated to the latest version of the TypeScript files on disk. Other code blocks that aren't preceded by a marker don't get updated, as they are processed in text processing mode.

As an example, let's take a helloWorld.ts code sample:

console.log("Hello world!");

We want to embed this in Chapter1.md and make sure it's kept up to date.

# Chapter 1

Printing "Hello world!".

```ts
console.log("Hello");
```

This is not quite up to date: the string here is "Hello", which does not match helloWorld.ts.

This document gets processed line by line as follows:

In text processing mode, "# Chapter 1" is copied to the output as-is.
"" (blank line) is copied to the output as-is.
"Printing "Hello world!"." is copied to the output as-is.
The marker line (the HTML comment referencing helloWorld.ts) is copied to the output as-is. Because this is a marker, we keep track of the code sample to be inlined (helloWorld.ts) and switch to marker processing mode.
"```ts" is copied to the output as-is. This is a code block marker, so immediately after copying it to the output we also output the contents of helloWorld.ts. We also switch to code processing mode.
"console.log("Hello");" is skipped. We don't copy lines in code processing mode, as we are replacing them with the latest in the code sample file.
"```" is an end of code block marker. We copy it to the output, then switch back to text processing mode.
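The walk-through above can be reproduced with a short, self-contained sketch. It is an illustration only and rests on a few assumptions that are not part of the original script: the marker is taken to be an HTML comment whose text names the sample (for example `<!-- helloWorld -->`), the samples are passed in as an in-memory Map instead of being read from disk, and the names Mode, inlineSamples, and samples are made up for this example.

```typescript
// Hypothetical names for the three processing modes described above.
type Mode = "text" | "marker" | "code";

// Three backticks, built at run time to keep this sample easy to embed.
const fence: string = "`".repeat(3);

// Assumed marker format: an HTML comment naming the sample.
// Returns the sample name, or null if the line is not a marker.
function parseMarker(line: string): string | null {
    const trimmed = line.trim();
    if (trimmed.startsWith("<!--") && trimmed.endsWith("-->")) {
        return trimmed.slice(4, -3).trim();
    }
    return null;
}

function inlineSamples(
    lines: string[],
    samples: Map<string, string[]>
): string[] {
    const result: string[] = [];
    let mode: Mode = "text";
    let codeSample: string[] = [];

    for (const line of lines) {
        if (mode === "text") {
            // Copy the line as-is; a marker selects the sample to inline.
            result.push(line);
            const name = parseMarker(line);
            if (name !== null) {
                // An unknown sample name falls back to an empty sample.
                codeSample = samples.get(name) ?? [];
                mode = "marker";
            }
        } else if (mode === "marker") {
            // Copy lines until the code block opens, then emit the sample.
            result.push(line);
            if (line.startsWith(fence + "ts")) {
                result.push(...codeSample);
                mode = "code";
            }
        } else {
            // Skip the outdated code; copy only the closing fence.
            if (line.startsWith(fence)) {
                result.push(line);
                mode = "text";
            }
        }
    }

    return result;
}

const chapter: string[] = [
    "# Chapter 1",
    "",
    'Printing "Hello world!".',
    "<!-- helloWorld -->",
    fence + "ts",
    'console.log("Hello");',
    fence,
];

const samples = new Map<string, string[]>([
    ["helloWorld", ['console.log("Hello world!");']],
]);

const updated: string[] = inlineSamples(chapter, samples);
console.log(updated.join("\n"));
```

Running this on the Chapter1.md lines keeps everything except the stale console.log("Hello"); line, which is replaced by the contents of the helloWorld sample.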

State Machines

The behavior of our text processing script is best modeled as a state machine. A state machine has a set of states and a set of transitions between pairs of states. The machine starts in a given state, also known as the start state, then if certain conditions are met, it can transition to another state.

This is exactly what our text processor does, with its three processing modes. Input lines are processed a certain way when in text processing mode. When some condition is met (a marker is encountered), our processor transitions to the marker processing mode. Again, when some other condition is met (```ts code block marker encountered), it transitions to code processing mode. When the end of the code block marker is encountered (```), it transitions back to text processing mode.

The figure shows a text processing state machine with the three states (text processing, marker processing, code processing) and transitions between the states based on input. Text processing is the initial state or start state.

Now that we have modeled the solution, let's look at how we would implement it. One way to implement a state machine is to define the set of states as an enumeration, keep track of the current state, and get the desired behavior with a switch statement that covers all possible states. In our case, we can define a TextProcessingMode enum.

Our TextProcessor class will keep track of the current state in a mode property, and implement the switch statement in a processLine() method. Depending on the state, this method will in turn invoke one of the three processing methods: processTextLine(), processMarkerLine(), or processCodeLine(). These methods will implement the text processing and, when appropriate, transition to another state by updating the current state.

Processing a Markdown document consisting of multiple lines of text means processing each line in turn using our state machine then returning the final result to the caller:

enum TextProcessingMode {
    Text,
    Marker,
    Code,
}

class TextProcessor {
    private mode: TextProcessingMode = TextProcessingMode.Text;
    private result: string[] = [];
    private codeSample: string[] = [];

    processText(lines: string[]): string[] {
        this.result = [];
        this.mode = TextProcessingMode.Text;

        for (const line of lines) {
            this.processLine(line);
        }

        return this.result;
    }

    private processLine(line: string): void {
        switch (this.mode) {
            case TextProcessingMode.Text:
                this.processTextLine(line);
                break;
            case TextProcessingMode.Marker:
                this.processMarkerLine(line);
                break;
            case TextProcessingMode.Code:
                this.processCodeLine(line);
                break;
        }
    }

    private processTextLine(line: string): void {
        this.result.push(line);

        if (line.startsWith("

User ID	Sessions	Page views	Total spent	High spender
1	10	45	100	Yes
2	5	10	30	Yes
3	1	5	10	No
4	2	2	0	No
5	9	33	95	Yes
6	7	5	5	No
7	19	31	95	Yes
8	1	20	0	No
9	2	17	0	No
10	8	25	40	Yes
