Managing Domain Events in a Microservices Environment

Best practices for implementing domain events in a microservices environment and tips for event-driven architecture, including event sourcing and event bus.

Łukasz Rudnicki
Bits and Pieces


Some time ago, I wrote a blog post about domain event management and architecture for my teammates because I was a bit tired of answering the same questions over and over again (which happened in every company I worked for ;)). The time has come to share it with the world. I’ve actually covered (quite generally) many more topics here, like the basic principles of microservices. I tried to explain this rather complicated part of the architecture in a very simple way, perhaps even oversimplifying it. I hope my scribbles will help someone avoid the mistakes I encounter so often in many companies.

I have frequently collaborated with clients who proudly claimed to have an excellent microservices architecture but had in fact created a large, hard-to-maintain distributed monolith (and then blamed the whole concept for it). Don’t go this way.

There is no shame in not using microservices. A poorly designed architecture can cause more damage and sometimes even blocks further development of the system. You can find many examples of this on the Internet, e.g. one of the most famous articles about switching from a monolith to microservices and back to a monolith. These failures were mainly the result of a misunderstanding of what microservice architecture actually is (e.g. looking at the system only from a technical point of view, ignoring business, strategic and development-flow aspects); besides, it’s not really much different from what used to be called SOA ;). As a result, microservice architecture has become one massive buzzword.

In this post, I have included only a small fragment of what I consider to be the best rules for designing domain events and the services themselves. The topic is huge, but maybe someone will find here an answer or a clue to the questions that bother them.

Introduction

In the beginning, let’s define “microservice” (not to be confused with a nanoservice, an anti-pattern whose existence is the result of a misunderstanding of what microservices architecture is — and no, I’m not talking about serverless):

A microservice is a standalone, independent application built around a specific domain or subdomain.

💡 Find out more about microservices here and here.

The most important rule to remember while breaking a monolith into microservices is the Single Responsibility Principle (SRP). Each service should be responsible for one business need. The responsibilities of microservices should not overlap — we must maintain full encapsulation and isolation here.

If we put SRP in the context of EDA (Event Driven Architecture), we can define a new principle like this:

Responsibility for propagating events of a specific type is assigned to a single service — Designing Event-Driven Systems by Ben Stopford

(I also recommend this short blog post by Richy Great.)

Also worth quoting is this excerpt from Ben’s article:

Assigning responsibility for event propagation is important because these aren’t just ephemeral events, or transient chit-chat. They represent shared facts, the data-on-the-outside. As such, services need to take responsibility for curating these shared datasets over time: fixing errors, handling situations where schemas change etc.

Overall, you should treat a Kafka/RabbitMQ/etc. topic as your service’s private database. Only one service can modify it. If you allow the same event type to be sent from many different applications, you’re breaking the most important SOLID rule (the one the “S” stands for: the Single Responsibility Principle) and you will end up with a huge mess. As a result, the application will become unmaintainable in the future. I’ll use an example from Martin Thompson’s blog post to better illustrate the problem:

So how can we drive towards single writer designs? I’ve found it is a very natural thing. Consider how humans, or any other autonomous creatures of nature, operate with their model of the world.

We all have our own model of the world contained in our own heads, i.e. We have a copy of the world state for our own use. We mutate the state in our heads based on inputs (events/messages) we receive via our senses. As we process these inputs and apply them to our model we may take action that produces outputs, which others can take as their own inputs. None of us reach directly into each other’s heads and mess with the neurons. If we did this it would be a serious breach of encapsulation!

Originally, Object Oriented (OO) design was all about message passing, and somehow along the way we bastardised the message passing to be method calls and even allowed direct field manipulation — Yuk! Who’s bright idea was it to allow public access to fields of an object? You deserve your own special hell.

Having multiple writers as the standard is a typical error. In computer science, there are two fundamental, extensively studied methods for addressing this contention. One grants mutual exclusion to the contended resource during the mutation phase; the other takes an optimistic approach, works on a new copy, and swaps in the changes only if the original resource has remained unaltered in the meantime. I highly recommend reading this blog post: https://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html

Let’s imagine you are creating a new, great, innovative e-commerce platform. You want to integrate it with several payment providers. For each of them, you create a separate adapter as a standalone service (which is also not the best architecture, because it’s a straight road to nano-services hell — though that may not be the case here ;) ).

So, there might be a service for cards, a service for bank transfers, one for mobile payments, etc. You want to keep the domain simple and not add new entities, so all apps emit the same event, e.g. Payment (it could be PaymentCreated etc., depending on the chosen strategy).

However, let’s say that at some point in the future, you need to modify the message schema to support pigeon transfers (e.g. add pigeon-specific fields). I believe you already get my point. If there are multiple services writing the same event, it will be much harder to schedule a non-backward-compatible upgrade without a conjoined release. Additionally, due to transfer-type-specific parameters, you may end up with a bloated event containing a significant number of null values.
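
To make the problem tangible, here is a sketch (in Java, using Avro’s Schema.Parser) of where such a shared Payment event tends to end up: every provider-specific attribute has to become nullable. The event name and all fields below are made up for illustration.

import org.apache.avro.Schema;

public class SharedPaymentSchema {
    // Anti-pattern sketch: one event type written by many provider adapters,
    // so every provider-specific attribute ends up optional and mostly null.
    public static final Schema SCHEMA = new Schema.Parser().parse("""
        {
          "type": "record",
          "name": "PaymentCreated",
          "namespace": "com.example.payments",
          "fields": [
            {"name": "paymentId", "type": "string"},
            {"name": "amount",    "type": "long"},
            {"name": "currency",  "type": "string"},
            {"name": "cardToken",        "type": ["null", "string"], "default": null},
            {"name": "iban",             "type": ["null", "string"], "default": null},
            {"name": "mobileOperator",   "type": ["null", "string"], "default": null},
            {"name": "pigeonRingNumber", "type": ["null", "string"], "default": null}
          ]
        }
        """);
}

Every new provider adds another nullable field, and every adapter that writes this event has to be released in step with the schema change.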

You can find more about the Single Writer Principle (SWP) on the Confluent blog (the company founded by Kafka’s creators):

https://www.confluent.io/blog/build-services-backbone-events/
https://www.confluent.io/blog/apache-kafka-for-service-architectures/

But wait… I have multiple services that emit the same event.

Well, actually (to put it a little maliciously), based on the example above, it means that your body has two heads. As far as I know, such creatures can only be met in mythology, they were not very nice, and it was best to avoid them (dragons, hydras, cerberi). But seriously, you should rethink what you wanted to achieve with such a design. Are the expected benefits really worth the huge software development overhead and its terrible complexity? And if you have asked yourself this question, then you should ask yourself these as well:

  • Are the events broadcast by two different services actually the same events? Maybe you unnecessarily put them in the same bag when in fact they have distinct lifecycles and can be modified separately. Additionally, many of their attributes may differ significantly from one another. Maybe they are actually different events, because they are the result of processing in a different context.
  • Or should my two services be merged into one? Maybe it doesn’t make sense to separate them into completely different applications, because they share a data model, part of the code, and the same purpose of existence. Maybe they are so dependent on (inseparable from) each other that they have to share configuration. Such an artificial division only complicates subsequent maintenance and development, resulting in a nightmare when you try to change this part of the system. In this case, your services are not microservices but nanoservices, and as I mentioned earlier, this is the most common reason for microservice failures (check out these two articles: https://betterprogramming.pub/microservice-macroservice-f483f9797246 and https://segment.com/blog/goodbye-microservices/ ). This means that you have designed a distributed monolith. Certainly, when referring to “microservices” in this context, we’re talking about distinct, independently running instances that can be accessed over a network. However, it’s worth noting that even monolithic systems can be organized in a “service-oriented” manner. E.g. OSGi was a popular choice for achieving this in Java, but any system with the capability to dynamically load shared objects or plugins can adopt a similar approach. The advantage of defining services in this way is that if your application experiences a traffic spike and requires horizontal scaling, the communication pathways between the services — known as service interfaces — are already clearly defined.

Remember:

Not using microservices is not a shame. The only problem is when you think you are using it and you set up your application like if you were doing so. This may lead to overkill solutions or misunderstandings in the development team. — Daniele Fontani (How Much Micro Is Needed To Create a Microservice?)

However, using domain events (remember — “domain events”, not plain “events”) is a very good idea, because it gives us a new tool that provides the next level of extensibility (and business flexibility) and gives us the power (mana) to refactor our system more easily in the future.

What does it mean for our project architecture?

When I work with different clients, I usually start off by defining a set of guidelines with my architects to avoid recreating distributed monoliths in the refactoring process. Of course, some of these rules may not be applicable to your project, and you can adopt a different strategy. I’m not an oracle here — as with anything related to architecture, it depends. However, these guidelines are quite general in nature — if you feel that you have to completely ignore them (sorry to be a bit arrogant), it may be a sign that there are some problems with your system design.

In this article, I will be focusing on just one aspect of this list of guidelines — events management.

The following list is tailored to the needs of the domain/company I recently worked with. The structure of directories and their names can be freely changed. In our case, we used Kafka encapsulated in an additional service as a message broker (in this case, Allegro Hermes — next time I will tell you more about the reasons for this choice and why it is not the best one). Therefore, in this article I refer directly to Kafka, but all the advice gathered here can be applied to any other message broker (RabbitMQ, Kinesis, Pulsar etc.).

Let’s start with our guidelines.

1. Use schemas everywhere as a common description of events.

It doesn’t matter how you describe the event structure: AsyncAPI, CloudEvents or something else. Do not broadcast events that do not have a defined and described structure. In our case, we used AVRO as the message format, described in AVDL (from which the AVRO schema was generated) or directly as an AVRO schema, but you may use e.g. Protobuf. Choose one standard and enforce its use. It will save your ass many times and significantly facilitate integration with BigData systems (and won’t create a Data Swamp). I will describe the details later.
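
For illustration only, a minimal event schema of this kind might look like the sketch below (defined here through Avro’s Java Schema.Parser; the OrderPlaced event and its fields are hypothetical):

import org.apache.avro.Schema;

public class OrderPlacedSchema {
    // Hypothetical domain event; in our setup the source of truth would be
    // an AVDL file in the service repository, compiled to this AVRO schema.
    public static final Schema SCHEMA = new Schema.Parser().parse("""
        {
          "type": "record",
          "name": "OrderPlaced",
          "namespace": "com.example.orders",
          "fields": [
            {"name": "orderId",     "type": "string"},
            {"name": "customerId",  "type": "string"},
            {"name": "tenantId",    "type": "string"},
            {"name": "totalAmount", "type": "long"},
            {"name": "currency",    "type": "string"},
            {"name": "placedAt",    "type": {"type": "long", "logicalType": "timestamp-millis"}}
          ]
        }
        """);
}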

2. Each schema file must be stored in the same repository (or directory, if you use a mono-repo) as the service source code.

As mentioned above, in our project we used AVRO, so in this case the AVRO/AVDL files should be stored in an “avro” directory in the root folder. The AVRO schema files (not the AVDL ones) must be published to the artifact repository along with Docker images and other artifacts (to the “avro” subdirectory). They must be tightly coupled to the service version.
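
As a rough sketch (the service and file names are only an assumption), the repository of a single service could then look like this:

payment-card-service/            <- hypothetical adapter service
  avro/
    payment_card.avdl            <- AVDL source of truth, lives next to the code
  src/
    main/java/...
  Dockerfile
  pom.xml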

Why should we do it this way?

Event schemas should be considered part of your code. They are an integral part of it, because they define the data model of a given service (they can be used, for example, to generate POJO classes), so they are largely related to the business logic of the application. An event schema (e.g. an AVRO schema) does not differ from an API — it plays the same role. It is a contract between services.

The Single Writer Principle says that a specific event must have an owner, and that owner is a single application (the very concept of a bounded context is quite flexible, and its boundaries can be modified frequently — most often it is divided, not merged). One of the reasons for doing this is that it makes schema upgrades simpler. If you had many services writing the same event to the same topic, it would be much harder to schedule a non-backward-compatible upgrade without a conjoined release.

If we keep the schemas in one repository shared by many services, we effectively create a monolith consisting of many services that cannot live without each other — they are tightly coupled. Such a design often has no technical or business justification. Kafka topics should be treated like a database. It is a very bad idea to share it (the schema repo and the schema) among many applications, because managing the code and its changes leads to a huge increase in complexity. It will very quickly cause the business logic of one service to leak into the others. If changing the code of one service means that you must also change the code of another (e.g. you must force the regeneration of POJO/DTO classes on the subscriber side), then they are not independent and should really be merged into one service.

The truth is that each service, while working on the same data, may need a completely different data model, mainly because the operations performed on that data can be completely different. Otherwise, you will run into implementation problems as well as performance problems in your code.

By keeping event schema files in your service repository (where you have your source code), you clearly define who owns the event and also simplify the introduction of any automation around domain events, the message bus, deployment etc. However, if you keep all the schema files in e.g. one common repository, you will have to check it out with each build (e.g. in Jenkins). In the meantime, someone will be able to change the structure of the event (because this repository is shared by many teams) and I’m sure that your service will either not build or not start. This is the charm of canonical models: you have just created one and gone back to the software engineering of the beginning of this century (I can boldly call the canonical model an anti-pattern and I will fight to defend this opinion ;) ).

And once again, please remember — highly distributed systems introduce a whole class of new problems that are very difficult to solve. Breaking down services into very small entities only makes sense if your application has a very large footprint and there is a very strong performance and business justification. Let’s not go further down the path of nano-services.

3. Schemas must be versioned the same as the service itself.

When you are releasing the service, the schema files must be published to the artifact repository along with e.g. Docker images and other artifacts (however, I don’t have a strong opinion on whether you should maintain a separate type/instance of e.g. a Maven repository, or use a shared Confluent Schema Registry, Glue etc. and maintain a copy of the registry per environment — it depends on the use case). So, the schema is part of the release. It is best to use the same versioning scheme for it as for the service. This way you know that a given schema version is compatible with the service, e.g. if you deploy the Docker image of Service 1.1, you know that you need to deploy the schema with the same version.

Therefore, you don’t need to add a version number to the file name stored in the repository (it should be the same as the topic name), just like you don’t do that when updating your application code (*.java files) or configuration files (to check the history of changes, use Git). The schema is part of your source code. Keep naming simple.

Of course, you can say that it’s silly to bump the version of the service when you “only” change the schema. I do not agree. As I mentioned, you should treat the schema file as part of your code. It is something that defines your data model / API. Changing its structure is most often an important business change. It’s as if you changed the structure of the database or the API and forgot to notify the clients about it. Sounds a bit like a failure. ;)

4. Schemas must be published to the artifact repository

Well, it’s self-explanatory. I know I’m repeating myself, but it needs to be re-emphasized here. Don’t forget to publish the schema files to the artifact repository (or a central schema registry) after each build.

5. Do not share schemas between services.

The reason is described in point 2. If you keep dependencies between schemas, you couple the services: each change in service A will impact service B. In effect, you introduce the Canonical Data Model, which in the context of distributed systems is considered an anti-pattern.

What does this mean? Avoid inheriting from schemas that are not owned by your service.

In a microservices architecture, data duplication is “good”. It helps you to keep services independent (of course it introduces a lot of other problems, but it is not the topic of this paper). Do not try to introduce OOP outside service boundaries (do not share anything between services).

Read this:

https://www.infoq.com/news/2015/04/canonical-data-models/
https://www.innoq.com/en/blog/thoughts-on-a-canonical-data-model/

6. The schema must be the same for all environments (don’t create different schemas for different environments)

It is not about versioning. It is related to the “single codebase” rule from the Twelve-Factor App methodology: https://en.wikipedia.org/wiki/Twelve-Factor_App_methodology. Do not create separate branches/forks/codebases for each business client or environment. There should be only one codebase that is configured during deployment.

If you maintain one schema (per service version) in every environment, it will be easier to test your application and find bugs. For example, if you deploy my_service 1.1 on dev and test, the AVRO schema must be the same in both.

7. A new schema for an existing topic must be backward compatible. What does schema compatibility mean?

In our case, we had Confluent Schema Registry deployed on each environment and we used AVRO. I will refer directly to its configuration here. However, you can successfully transfer this way of managing compatibility to any other schema management system. This is a very generic description.

In my experience, it’s best to use the FULL_TRANSITIVE compatibility mode in the Confluent Schema Registry (it works in most cases, but of course it depends on your requirements): https://docs.confluent.io/platform/current/schema-registry/avro.html#summary

FULL_TRANSITIVE compatibility means schemas are both backward and forward compatible. Schemas evolve in a fully compatible way: old data can be read with the new schema, and new data can also be read with the last schema. For example, if there are three schemas for a subject that change in order X-2, X-1, and X then FULL_TRANSITIVE compatibility ensures that consumers using the new schema X can process data written by producers using schema X, X-1, or X-2, and that data written by producers using the new schema X can be processed by consumers using schema X, X-1, or X-2.

· FULL_TRANSITIVE: backward and forward compatible between schemas X, X-1, and X-2

In Avro, you can define fields with default values. In that case, adding or removing a field with a default value is a fully compatible change. Compatibility rules for supported schema types are described in “Compatibility Checks in Formats, Serializers, and Deserializers”.
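
If you want to verify this rather than trust it, Avro’s Java library ships a compatibility checker. Below is a minimal sketch (the User schemas are made up) that confirms that adding an optional field with a default is compatible in both directions:

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class CompatibilityCheck {

    public static void main(String[] args) {
        Schema v1 = new Schema.Parser().parse("""
            {"type": "record", "name": "User", "fields": [
              {"name": "userId", "type": "string"}]}""");

        // v2 only adds an optional field with a default value.
        Schema v2 = new Schema.Parser().parse("""
            {"type": "record", "name": "User", "fields": [
              {"name": "userId", "type": "string"},
              {"name": "nickname", "type": ["null", "string"], "default": null}]}""");

        // Backward: can a consumer already using v2 read data written with v1?
        SchemaCompatibilityType backward = SchemaCompatibility
                .checkReaderWriterCompatibility(v2, v1).getType();
        // Forward: can a consumer still on v1 read data written with v2?
        SchemaCompatibilityType forward = SchemaCompatibility
                .checkReaderWriterCompatibility(v1, v2).getType();

        System.out.println("backward: " + backward + ", forward: " + forward);
        // Both print COMPATIBLE; an incompatible change would print INCOMPATIBLE.
    }
}

If either direction reported INCOMPATIBLE, you would be in “new topic” territory (see the next point).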

Let’s summarize. If you want to change an existing schema without breaking other client services (so they will not have to update their code or their local copy of the schema), you can only:

· Add optional fields

· Delete optional fields

If you are not able to keep to these rules, the easiest way out is to create a new topic (check the next point). This approach causes the fewest problems and minimizes the risk of breaking downstream services.

8. Breaking schema compatibility — if the schema is not compatible, create a new topic, e.g. my_topic-vX, and use the “grace period” approach.

Ok… There are many approaches. The one described below is not the Holy Grail, but it works in most cases and is a good starting point. Most of the time I treat it as a reference and at the same time as the simplest approach.

In the world of software development, it’s common practice to maintain backward compatibility when updating services. But sometimes changes to the underlying message format are necessary and can’t be avoided, like when you’re trying to fit a square peg into a round hole. For instance, let’s say you want to restructure the way you model a Customer’s interactions, and now you need to capture not just Orders but also Returns, Refunds, and Exchanges. Therefore, you need to change the Order event (e.g. change field types).

However, this event is consumed by other services, and such a change would be incompatible with the old schema, which can cause all sorts of headaches for the services that depend on it. To make things easier, it’s a good idea to create two separate topics — one for the old schema and one for the new schema — so that services have time to migrate over to the new schema without causing too much disruption.

There are a couple of ways to go about this. One option is to have the service publish messages in both schemas simultaneously, like a DJ spinning two records at the same time. Alternatively, you can add a process that down-converts messages from the new schema to the old one, like a translator smoothing out any misunderstandings between two people speaking different languages.

Either way, the goal is the same: to give services a chance (time) to upgrade without causing any major hiccups.
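
A minimal sketch of the second option, a down-converter, could look like this (the topic names my_topic-v1/my_topic-v2, the bootstrap address and the downConvert() mapping are all assumptions; error handling, offset management and serialization details are left out):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderDownConverter {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "my_topic-down-converter");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("my_topic-v2"));
            while (true) {
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofMillis(500))) {
                    // Map the new-schema payload back to the legacy schema and republish
                    // it on the old topic for consumers that have not migrated yet.
                    byte[] legacyPayload = downConvert(record.value());
                    producer.send(new ProducerRecord<>("my_topic-v1", record.key(), legacyPayload));
                }
            }
        }
    }

    private static byte[] downConvert(byte[] newSchemaPayload) {
        // Hypothetical mapping: in practice you would deserialize with the v2 schema
        // and re-serialize with the v1 schema (dropping or renaming the changed fields).
        return newSchemaPayload;
    }
}

The first option (publishing in both schemas from the owning service) needs no extra process, but it keeps the dual-write logic inside the producer instead.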

Here is a beautiful diagram illustrating this solution which I took from https://www.confluent.io/blog/apache-kafka-for-service-architectures/

I really advise you to read this post. It’s mostly about Kafka, but most of the design patterns are transferable to other message brokers.

9. Topics per domain action (domain state change) vs. one topic for all actions

The answer is: it depends. However, let’s explain the differences between these types of events.

Read this: https://verraes.net/2019/05/patterns-for-decoupling-distsys-fat-event/ or watch https://www.youtube.com/watch?v=vDtK-ccQWkw

9.1. Fat events vs. thin events vs. delta events.

Fat event

Essentially, the event contains all the necessary data that a subscriber might require, eliminating the need for subscribers to make API calls. As a result, subscribers are less dependent on the publisher’s domain during runtime. However, you should be very careful when using this pattern. Events can become very large and therefore difficult to modify. It could quickly become a mess.

Thin event

The event only consists of IDs, which means that subscribers will need to utilize APIs from other domains to retrieve the latest information required to update their local state.

I do not recommend this approach, because it will cause a lot of implementation problems later.

Delta event

It is similar to a thin event, but with the addition of information regarding the modifications made, enabling subscribers to utilize it in most scenarios for updating their local state.

9.2. What to choose?

As I said. It depends.

There is no strict rule here. In the simplest case, it is best if one service is responsible for the state of one domain (only it can modify the domain state). So we can have a UserService that manages the user’s data (in this case, the domain is User) and its life cycle. In this case, instead of creating new topics in the style of UserDataChanged, UserDeleted, UserCreated etc., it is better to have one User topic where events with the same structure are emitted (although there will be many optional fields), differing only in the type of the action. Therefore, remember to add an “eventType” field and specify the action type there: UserCreated, UserChanged, UserDeleted etc.
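
A sketch of what such a single User topic schema might look like (the record name, the eventType values and the optional fields are assumptions):

import org.apache.avro.Schema;

public class UserEventSchema {
    // One topic, one owner (UserService), one schema with an eventType discriminator.
    public static final Schema SCHEMA = new Schema.Parser().parse("""
        {
          "type": "record",
          "name": "UserEvent",
          "namespace": "com.example.users",
          "fields": [
            {"name": "eventType",
             "type": {"type": "enum", "name": "UserEventType",
                      "symbols": ["USER_CREATED", "USER_CHANGED", "USER_DELETED"]}},
            {"name": "userId",      "type": "string"},
            {"name": "tenantId",    "type": "string"},
            {"name": "email",       "type": ["null", "string"], "default": null},
            {"name": "displayName", "type": ["null", "string"], "default": null}
          ]
        }
        """);
}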

However, if it turns out that it is better to break our UserService into many small ones, where one is responsible for changing the data and another for deleting it (a very synthetic example, but maybe there would be some justification for it), then for each of these services we should create a separate topic.

There is no denying that subscribers benefit from having a large event, as they do not have to ask other services for the additional data they need. Unfortunately, fat events are very prone to accidental out-of-order delivery (Kafka only guarantees ordering within a partition) — read the links from the beginning of this point.

By the way — if you want to add additional functionality, e.g. calculating a user engagement factor, it is not a good idea to expand UserService. Instead, create a new service that will broadcast a new event to a new topic, because this feature can be quite complicated and it does not make sense to turn our UserService into a Big Dumpster or even a Big Ball of Mud (a user rating is something different than raw user data).

10. How to work with multi-tenancy?

Some companies tend to create many versions of their services for some of their clients (the big fish).

I’m going way off-topic, but I have to make a small digression. If you work in a product company that sells its software to various clients (it does not have to be cloud; it can be e.g. on-prem), it is worth assuming from the beginning that each of your services is multi-tenant. Of course, we can now hear people saying “I only send my system to customers on a pendrive, I do not provide SaaS, so why complicate my life”. Sure, but believe me, a company’s business strategy can change very quickly — history is full of such cases. If at the beginning you assume that it is worth taking the management of tenant configuration outside (externalizing the tenant configuration), i.e. making it a cross-cutting concern (a global problem solved by the platform), then you will save the company a lot of time and money in the future. Assuming multi-tenancy from the outset doesn’t require much additional effort, but refactoring towards it in the future can be a considerable hurdle. At the beginning, you don’t have to write any service, set up Spring Cloud Config or deploy any heavyweight tools like Consul, Zookeeper etc. (ConfigMaps may not fit here, as the configuration should be loaded dynamically), but simply assume in the code that the tenant configuration can be provided externally.

When it comes to event management, you can stick to such simple and even obvious rules:

  • Do not create new patterns for particular tenants.
  • Do not broadcast events for different tenants on separate topics.
  • Each event should have a standard field, e.g. tenantId, that clearly identifies the customer in whose context the event occurred (see the sketch after this list).
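
As a sketch of the last rule (the TenantConfig lookup and its fields are hypothetical; in practice the configuration would come from an external, dynamically refreshed source), a consumer can branch on the tenantId carried by the event instead of having tenant-specific topics or schemas:

import java.util.Map;
import org.apache.avro.generic.GenericRecord;

public class TenantAwareHandler {

    public record TenantConfig(String name, boolean featureXEnabled) {}

    private final Map<String, TenantConfig> tenantConfigStore;

    public TenantAwareHandler(Map<String, TenantConfig> tenantConfigStore) {
        // In a real system this would be loaded from an external, dynamically
        // refreshed configuration source rather than a static map.
        this.tenantConfigStore = tenantConfigStore;
    }

    public void handle(GenericRecord event) {
        // tenantId is a mandatory field on every event (see the rules above).
        String tenantId = event.get("tenantId").toString();
        TenantConfig config = tenantConfigStore.get(tenantId);
        if (config == null) {
            throw new IllegalStateException("Unknown tenant: " + tenantId);
        }
        // ... apply tenant-specific behaviour based on the externalized config ...
    }
}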

11. Managing subscriptions — I want to subscribe to a topic. How do I manage schemas for consumers?

As I mentioned before, each service is a completely independent entity. We cannot create any artificial dependencies between them, especially shared schemas (see point 5).

Duplication is a “good” pattern in a microservice architecture (of course, it causes a lot of other problems). If you want to subscribe to a topic and need its schema, e.g. to generate stubs/POJOs, just copy it into your repo, e.g. into the src/main/resources directory. There is nothing bad about it (as opposed to using any kind of commons). Read this:

“… but wait! If the other team upgrades their schema, my service will crash!”

This is not true. Read the section about backward compatibility. If you use my advice, here’s what happens:

  • If someone adds a new field, it will simply be ignored by your service,
  • If someone removes an optional field, remember that this field was optional, so your service must already be prepared for its absence. This field may not be necessary for your service anyway.

It is good practice not to update the schema of the topic you are subscribing to if you don’t need to (of course, only as long as the schema compatibility rules are kept — go to point 7), e.g. when you don’t need the new fields. This way you will avoid unnecessary clutter in your service. Additionally, if you make your service dependent on schema code stored in another repository (the Git repository of another service), then every time it is updated you will have to update the code of your service, because new stubs will have to be regenerated. Of course, you will only find out about it while building the service, e.g. during a release. Not cool.
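
A minimal sketch of the consumer side (the User reader schema is a made-up local copy kept e.g. in src/main/resources; the writer schema would typically come from the schema registry or the message envelope): Avro’s schema resolution skips fields the reader does not know about and fills removed optional fields from the reader’s defaults.

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;

public class LocalSchemaConsumer {

    // Local copy of the "User" schema owned by another service; the schema
    // itself is made up for the example.
    private static final Schema READER_SCHEMA = new Schema.Parser().parse("""
        {"type": "record", "name": "User", "fields": [
          {"name": "userId", "type": "string"}]}""");

    // The writer schema is whatever the producing service actually used.
    public static GenericRecord decode(byte[] payload, Schema writerSchema) throws IOException {
        DatumReader<GenericRecord> reader = new GenericDatumReader<>(writerSchema, READER_SCHEMA);
        Decoder decoder = DecoderFactory.get().binaryDecoder(payload, null);
        return reader.read(null, decoder);
    }
}

This is exactly why copying the schema into your own repository is safe, as long as the compatibility rules from point 7 are respected.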

