I recently had an interesting discussion with a colleague around some of the “unconventional” ways of utilizing stream processing methodologies. I briefly touched on a use case I was previously involved in: migrating a large monolithic stateful service into a set of cohesive microservices, while still having to maintain full availability and architectural integrity.
Why is it important?
I find that maintaining high availability in a service transformation/migration scenario is highly relevant to a lot of today’s software user experiences. For example, don’t you think it’s painful having to wait for a banking app to finish its 4-hour maintenance – on a weekend afternoon while you are traveling overseas – before you can even transfer balances across your own accounts? Yes, I am looking at you – you know who you are 🙂 Or imagine your work email being entirely down because a server upgrade is going on. If you are a software engineer like me, imagine swapping in your favorite DevOps tool here. I figure it’s worth the effort to share how we managed to pull this off – on a global-scale stateful distributed system – with the hope of seeing better software experiences for all of us 🙂
Please do note, the solution outlined below is just one among many possible approaches. Large distributed systems are inherently complex, and deciding on a solution is all about trade-offs.
First of all, let’s briefly brush up on a few keywords we are going to focus on, just to make sure we are all on the same page.
Stateful Service – A service is stateful if the processing of external requests/events can trigger changes (or mutations) in its internal state. In a distributed system environment, to provide availability and consistency guarantees while tolerating failures, the state is often persisted and replicated across the network. In most web architectures today, more often than not, backend services store their state in an external transactional database layer. In this case, the serving components that take requests, process business logic and return responses are considered stateless, while the transactional layer that provides read/write guarantees is considered stateful. For the sake of simplicity, let’s treat the union of these two layers as stateful. In other parts of the distributed systems world, there are certain services that simply cannot rely on an external storage service; they have to maintain the full lifecycle of their own state. Such systems usually leverage an in-process, in-memory storage engine and then rely on some sort of distributed consensus algorithm (e.g. Paxos) to guarantee consistency.
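To make the distinction concrete, here is a toy sketch (all class and function names are hypothetical, not from any real system) contrasting a stateless handler that delegates its state to an external store with a service that owns its state in-process:

```python
class ExternalStore:
    """Stands in for the external transactional database layer."""
    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)


def stateless_handler(store: ExternalStore, request: dict) -> str:
    # The handler keeps nothing between calls; the store is the stateful layer.
    store.put(request["key"], request["value"])
    return "ok"


class StatefulService:
    """Owns the full lifecycle of its state in-process, in-memory."""
    def __init__(self):
        self.state = {}

    def handle(self, request: dict) -> str:
        # Processing the request mutates internal state directly.
        self.state[request["key"]] = request["value"]
        return "ok"
```

In the first shape, the serving tier can be killed and restarted freely; in the second, any migration must also move `self.state`, which is exactly where the difficulty discussed below comes from.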
Stream processing – Processing streams of ordered, immutable events to generate one or more materialized views on top. As a matter of fact, this simple idea is already leveraged across different communities, each with its own name for the same processing paradigm. In the Domain-Driven Design (DDD) community, you may have heard of the Event Sourcing and CQRS architecture patterns. In parts of the database world, this is called Log Shipping. Some may also call it Complex Event Processing (CEP).
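The core idea is just a fold over an ordered event stream. A minimal sketch (the event shape and account/balance example are made up for illustration):

```python
def materialize(events):
    """Fold an ordered stream of immutable events into a materialized view:
    here, a running balance per account."""
    view = {}
    for event in events:  # events are applied strictly in order
        account = event["account"]
        view[account] = view.get(account, 0) + event["amount"]
    return view


events = [
    {"account": "alice", "amount": 100},
    {"account": "bob", "amount": 50},
    {"account": "alice", "amount": -30},
]
# materialize(events) -> {"alice": 70, "bob": 50}
```

Replaying the same events always rebuilds the same view – this determinism is what the migration trick later in the post relies on.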
Commit log – The data structure for storing an ordered log of immutable events. Each event can be a logical representation of a successful request, a completed transaction, etc.
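An in-memory toy version (a hypothetical API, sketched for illustration only – real commit logs such as Kafka are durable and distributed) shows the two operations that matter here – appending an event at the tail and replaying from a given offset:

```python
class CommitLog:
    """Append-only sequence of immutable events, each addressed by a
    monotonically increasing offset."""
    def __init__(self):
        self._entries = []

    def append(self, event) -> int:
        """Append an event; return the offset it was written at."""
        self._entries.append(event)
        return len(self._entries) - 1

    def read_from(self, offset: int):
        """Replay all events starting at the given offset, in order."""
        return list(self._entries[offset:])


log = CommitLog()
log.append({"type": "order_placed", "order_id": 1})
log.append({"type": "order_shipped", "order_id": 1})
```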
High availability during migration – In most cases, stateful services are the infrastructure foundation of any software product. Unfortunately, stateful services are also more susceptible to availability hits during migrations or state movements, e.g. version upgrades, cross-region failovers or simple deployments. In some re-architecting cases, it’s also harder to decouple a monolithic stateful service into finer-grained microservices. The challenge usually lies in the trade-off between the complexity of continuing to take user traffic (and thus triggering internal state changes) while the migration takes place, and simply stopping user traffic so that it’s much easier to move relatively stable state. I think swapping an engine mid-flight is the proper metaphor to describe the complexity involved.
Define the problem
As you probably have guessed, we are now going to explore how to leverage stream processing techniques to maintain high availability during a stateful service migration using a commit log.
Let’s imagine a hypothetical scenario, where a stateful service gets close to its scale limit, with all the complexity associated with managing business logic around various types of state. The company decides to re-architect the entire backend and split the monolithic service into a couple of finer-grained microservices, each managing its own domain-specific state separately and more efficiently (e.g. an online retail store wants to separate order management and shipping management).
Let’s add some more color to the current architecture. The monolithic service is placed behind an API gateway for service discovery and mid-tier load balancing. A sequence of user requests is routed to the stateful service to be processed through the business logic layer. Each of these requests may cause the internal states to mutate. The blue requests are associated with orders while the yellow ones are associated with shipping. The previous design somehow magically allows a clean, straightforward decoupling effort in developing the new microservices 🙂
Now our goal is to prevent downtime when bringing up the newly developed microservices while still ensuring the correctness of internal states.
Swap engine mid-flight
Luckily, thanks to the popular Event Sourcing pattern, the current architecture already leverages a distributed commit log to track all successful requests. If a request fails and causes no side effects on state, it is not recorded in the commit log.
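This invariant – only requests that actually mutated state land in the log – is what makes replay safe. A minimal sketch of the idea (the validation rule and names are invented for illustration):

```python
def apply_business_logic(state, request):
    """Stand-in for the real business logic; raises on an invalid request."""
    if "item" not in request:
        raise ValueError("invalid request")
    state.append(request["item"])  # the successful state mutation


def handle_request(state, log, request):
    try:
        apply_business_logic(state, request)
    except ValueError:
        return "rejected"   # no side effect on state, so nothing is logged
    log.append(request)     # success: record the request in the commit log
    return "ok"


state, log = [], []
handle_request(state, log, {"item": "book"})    # mutates state and is logged
handle_request(state, log, {"bad": "request"})  # rejected, leaves no trace
```

Because the log contains exactly the requests that changed state, replaying it from offset zero reproduces the state – no more, no less.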
The request events tracked by the commit logs are already widely used in the company:
- Recommendation algorithms
- A/B testing
- Performance monitoring
So why not leverage this commit log for the service migration as well? Let’s get started.
- Spin up the two microservices, with Microservice A managing only user orders and Microservice B managing shipping states.
- The microservices are initialized without registering themselves as discoverable, so the API gateway will not route requests to them just yet.
- The microservices initialize their internal states by replaying all the requests from the commit log using exactly-once processing semantics. Microservice A subscribes to the commit log and processes only blue requests, while Microservice B does the same but only for yellow requests. This step may take a long time to complete, but that’s OK as long as they have the capacity to catch up.
- Once the microservices have caught up (hopefully to within a few log offsets), a control request is manually injected into the request stream. Request M in the figure below represents the injected control event.
- The monolithic service continues processing until it receives the control event M. At that point, it forwards M to the commit log, unregisters itself from the discovery service, and proceeds to reject all subsequent requests.
- Both microservices will eventually receive the control event M as they process through the stream. They then register themselves to become discoverable, stop subscribing to the event stream, and start producing new commits into the stream as the gateway routes new requests to them.
- If finer control is required (e.g. failback), different control events can be implemented.
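The whole cutover can be sketched end to end as a small simulation. Everything here is hypothetical – event shapes, the color tags, and the `Microservice` class are toy stand-ins for the real architecture:

```python
CONTROL_EVENT = {"type": "CONTROL_M"}  # the injected control request M


class Microservice:
    def __init__(self, name, color):
        self.name, self.color = name, color
        self.state = []        # materialized view of processed requests
        self.offset = 0        # next log offset to read (exactly-once replay)
        self.discoverable = False

    def replay(self, log):
        """Catch up on the commit log; register once M is observed."""
        while self.offset < len(log):
            event = log[self.offset]
            self.offset += 1
            if event == CONTROL_EVENT:
                self.discoverable = True  # register with the discovery service
                return
            if event.get("color") == self.color:
                self.state.append(event)  # process only this service's color


# Existing traffic already captured by the monolith's commit log.
log = [
    {"color": "blue", "op": "place_order"},
    {"color": "yellow", "op": "create_shipment"},
    {"color": "blue", "op": "cancel_order"},
]

a = Microservice("orders", "blue")
b = Microservice("shipping", "yellow")

# Steps 1-4: spin up undiscoverable, replay the log, catch up.
a.replay(log)
b.replay(log)
assert not a.discoverable and not b.discoverable

# Step 5: inject the control event; step 6: the monolith forwards it
# to the commit log, then unregisters and rejects further requests.
log.append(CONTROL_EVENT)

# Step 7: both microservices observe M and flip to discoverable.
a.replay(log)
b.replay(log)
assert a.discoverable and b.discoverable
```

Note how the per-service `offset` cursor is what delivers the exactly-once replay mentioned in step 3: each event is consumed once and never revisited.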
That’s it. We did it – we replaced the jet engine while in the air the whole time, kept the rest of the system completely transparent to the change, and described it all in only seven steps. Now I want to start seeing more pauseless service transformations 🙂