Using processes for better resilience

Jan 08, 2025

In early 2020, I read the book Programming Elixir 1.6. At that time I had one goal: to get an introduction to the actor model with a language that supports it by design, in this case Elixir. I think it was a good read and I achieved my goal, even though I didn’t feel able to design a complete system using this pattern.

However, I realized I have been using some actor model concepts for a few years now. In my previous post, I mentioned the types of effects produced in a CQRS/ES system; one of them is triggering new processes.

The actor model in a few lines

Here’s my attempt to explain the actor model in a very simple and coarse way:

In the actor model, the main building blocks are processes (aka actors). There are different kinds of actors: some execute business logic, some store data, others monitor their children and spawn new actors when needed. Each of them can interact with other actors by sending messages and handling the messages it receives. This mechanism provides a very high level of isolation between actors.
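
To make this more concrete, here is a minimal Elixir sketch (the Greeter module is purely illustrative) of two processes interacting only through messages:

```elixir
defmodule Greeter do
  # A tiny actor: it waits for messages and replies to the sender.
  def loop do
    receive do
      {:greet, name, sender} ->
        send(sender, {:greeting, "Hello, #{name}!"})
        loop()
    end
  end
end

# Spawn the actor, send it a message, and wait for the reply.
pid = spawn(Greeter, :loop, [])
send(pid, {:greet, "world", self()})

receive do
  {:greeting, text} -> IO.puts(text)
end
```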

Let it crash

In a regular code base, we have to take care of error handling and design mechanisms for recovery. What is worse than an uncaught exception going up the call stack and eventually crashing our application?

The error handling philosophy is very different when using the actor model; it can be summed up as “let it crash”.

Indeed, thanks to process isolation, when an actor crashes it cannot break the other actors (at least not as spectacularly as an exception would). Then we have to choose how to recover. There are several strategies, including spawning a new actor or doing nothing at all. Spawning a new actor has the advantage of starting from a known and clean state rather than an unknown and potentially corrupted one.
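
In Elixir, this recovery logic is usually delegated to a supervisor. Here is a minimal sketch, assuming a hypothetical Invoicing.Worker process:

```elixir
defmodule Invoicing.Supervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    children = [
      # Hypothetical worker: if it crashes, the supervisor restarts it
      # from a clean state instead of letting the error propagate.
      Invoicing.Worker
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```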

This is one of the main reasons why systems based on the actor model are often considered to be very stable.

Triggering new processes

Back to my CQRS/ES architecture and my effects!

When handling an event, we may want to do some business operations like issuing an invoice, sending an email, executing a new command, etc. Such operations come with several issues: they can be slow to execute and/or error-prone. This can affect the overall execution of our software: we don’t want it to be blocked by a bottleneck, or to crash because of something that could have been executed asynchronously.

That’s why, in my company, we decided to execute most operations other than database calls in isolated processes. To do so, when handling an event, instead of running the business operation right away, we enqueue what we call a job. Such a job is then executed asynchronously and in isolation, and if it fails we just let it crash.
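
Here is a rough sketch of the idea, with hypothetical JobQueue and event handler modules (not our actual code):

```elixir
defmodule Billing.EventHandler do
  # Instead of issuing the invoice inline, the event handler only
  # enqueues a job describing the work to be done.
  def handle_event(%{type: :order_placed, order_id: order_id}) do
    JobQueue.enqueue(:invoicing, {:issue_invoice, order_id})
  end
end
```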

In case of a crash, the job is flagged as failed, with the associated error code or exception attached to it. With this information, our team can monitor production and analyze errors with less pressure (the website still behaves normally for our customers; they will just receive their invoice with some delay). Some errors are transient (like an unavailable third-party API) and the jobs are simply retried later; for others, we may need to patch our software before trying again. As every business operation is isolated in a dedicated job, we can replay it without worrying about running other operations several times.
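
A sketch of the execution side, assuming hypothetical JobStore and Invoicing modules: the runner records the outcome so failed jobs can be inspected and replayed later.

```elixir
defmodule JobRunner do
  def run(job) do
    result = execute(job)
    JobStore.mark_succeeded(job, result)
  rescue
    # On crash, flag the job as failed with the error attached to it.
    error -> JobStore.mark_failed(job, Exception.message(error))
  end

  defp execute({:issue_invoice, order_id}), do: Invoicing.issue(order_id)
end
```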

Throttling processes

Enqueuing these jobs gives us a lot of flexibility: we can choose between several strategies for executing them. Some jobs require a high priority, some can be parallelized, others require sequential execution. To do so, we use dedicated channels depending on the job type.

The principle is straightforward: we use job handlers to execute our jobs. For sequential execution, we use a single handler instance, so only one job runs at a time.

[Diagram: event handlers enqueue jobs into a channel, which is consumed by a single job handler]
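
A minimal sketch of a sequential handler, reusing the hypothetical JobRunner from above: a single GenServer processes its mailbox one message at a time, so jobs naturally run one after another.

```elixir
defmodule SequentialJobHandler do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Enqueuing is a cast: the caller is never blocked by job execution.
  def enqueue(pid, job), do: GenServer.cast(pid, {:run, job})

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_cast({:run, job}, state) do
    JobRunner.run(job)
    {:noreply, state}
  end
end
```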

For parallel execution, we want to dispatch the jobs across several handler instances. The dispatcher logic also provides flexibility: we can choose how many jobs to execute concurrently and how to dispatch them.

[Diagram: event handlers enqueue jobs into a channel; a dispatcher distributes them across several job handlers]
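
For the parallel case, a rough sketch using Task.async_stream to bound concurrency (the jobs list and JobRunner are still the hypothetical pieces from above):

```elixir
# Execute the jobs pulled from a channel with at most four concurrent
# handlers; the dispatcher controls the concurrency level.
jobs
|> Task.async_stream(&JobRunner.run/1, max_concurrency: 4, timeout: 30_000)
|> Stream.run()
```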

Circuit breakers

These channels act as buffers: there is some delay between the moment a job is enqueued and the moment it is executed. We can choose to increase this delay on purpose to preserve our system.

This is the core principle behind a pattern called the circuit breaker. Sometimes our jobs face a high failure rate for various reasons: a bug, an unavailable API, etc. When it detects such a high failure rate, the circuit breaker opens and stops executing jobs (what’s the point if we know they will fail anyway?). This has the double benefit of relieving pressure on the system (or the third-party API) and giving us time to investigate and fix the issue. Once the issue is resolved, we can close the circuit breaker and resume job processing. After being open for a while, smart circuit breakers can even probe the system’s state by attempting to run a job, and close themselves if it doesn’t fail.
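
Here is a deliberately simplified circuit breaker sketch (real implementations track failure rates over time and reopen or half-open automatically; all names are illustrative):

```elixir
defmodule CircuitBreaker do
  use GenServer

  @failure_threshold 5

  def start_link(opts), do: GenServer.start_link(__MODULE__, :ok, opts)
  def run(pid, fun), do: GenServer.call(pid, {:run, fun})
  def close(pid), do: GenServer.call(pid, :close)

  @impl true
  def init(:ok), do: {:ok, %{failures: 0}}

  @impl true
  def handle_call({:run, _fun}, _from, %{failures: n} = state)
      when n >= @failure_threshold do
    # The circuit is open: skip execution instead of hammering a failing dependency.
    {:reply, {:error, :circuit_open}, state}
  end

  def handle_call({:run, fun}, _from, state) do
    case fun.() do
      {:ok, result} -> {:reply, {:ok, result}, %{state | failures: 0}}
      {:error, reason} -> {:reply, {:error, reason}, %{state | failures: state.failures + 1}}
    end
  end

  # Once the issue is fixed, close the circuit and resume processing.
  def handle_call(:close, _from, state), do: {:reply, :ok, %{state | failures: 0}}
end
```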

Even though the primary focus of this blog post is not cross-software integration, all these patterns are well described in the book Enterprise Integration Patterns.

The threats of asynchronous processing

Be aware that there are two threats with asynchronous job execution.

First, job execution can become a bottleneck in our software: event handlers can enqueue jobs faster than job handlers can process them. This means we’ll observe increasing delays before a job is processed. For parallel execution, this can often be fixed by adding more computing power (more job handlers). For sequential execution, it requires some rework of the code architecture.

Second, when processing jobs, especially with parallel calls, we have to make sure we’re not overwhelming the capacity of external dependencies (like APIs). In that case, these dependencies become the bottleneck of our system. This has two consequences: our system executes in a suboptimal way, and we risk breaking the dependency.

From my understanding, it is because of these threats that Elixir and Erlang developers do not use asynchronous actor communication by default.

Conclusion

This pattern brings a lot of stability to our software: it protects it from cascading failures. Thanks to the “let it crash” philosophy, we’re not forced to overcomplicate these sections of the code with a defensive coding style.

In case of failure in these processes, the overall impact on the business remains relatively low, as it only delays some operations until the issue is solved. This brings more serenity to the development team and to the whole company.

Finally, having the capability to observe jobs gives us a good view of our production environment. We can see what types of processes are triggered, how many there are, how they’re distributed over time, why some of them fail, etc. This is a key feature for operating software in a production environment.

Note to self: maybe I will read this book (releasing in winter 2025): Real-World Event Sourcing


Comments

Wish to comment? Please add your comment by sending me a pull request.
