Design Search Service at Scale (Master-Worker Architecture)

An overview of how to approach a distributed microservice in terms of scalability and fault tolerance.

Kamini Kamal
Bits and Pieces

--

When designing a search service, or any other microservice at scale, a few questions come up around reliability and dynamic scaling. There are multiple ways of achieving a fault-tolerant and reliable service. In this article, I will discuss one way to achieve this: parallel computation combined with data partitioning.

Let us list the key elements of our design:

  1. We want the service to be scalable, fault-tolerant, fast, and reliable.
  2. We will follow the master-worker architecture: compute in parallel across the workers and aggregate the results before sending the response to the client.
  3. We will have the loosely coupled, scalable components of our system register themselves with the corresponding service registry.
  4. We will partition the data across a sharded database.

Figure: System architecture for a scalable Search Service

Figure: Leader and Workers interaction

The leader registers itself with the Coordinator Service Registry, while the workers register themselves with the Worker Service Registry. Whenever an instance goes down, it is deregistered from its Service Registry.

A Service Registry is a way for the Load Balancers to discover the healthy instances of a service.
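
To make the registration and discovery flow concrete, here is a minimal, in-memory sketch of such a registry with heartbeat-based health tracking. It is purely illustrative: the class and method names are my own assumptions, and in practice this role is usually played by a coordination service such as ZooKeeper, etcd, or Consul.

```python
import time

# A hypothetical in-memory service registry; real deployments would typically
# back this with ZooKeeper, etcd, or Consul rather than a local dict.
class ServiceRegistry:
    def __init__(self, heartbeat_timeout_secs=10):
        self.heartbeat_timeout_secs = heartbeat_timeout_secs
        self.instances = {}  # instance_id -> timestamp of last heartbeat

    def register(self, instance_id):
        # Called by a leader or worker when it starts up.
        self.instances[instance_id] = time.time()

    def deregister(self, instance_id):
        # Called on graceful shutdown, or when an instance is marked unhealthy.
        self.instances.pop(instance_id, None)

    def heartbeat(self, instance_id):
        # Instances periodically report that they are still alive.
        self.instances[instance_id] = time.time()

    def healthy_instances(self):
        # Load balancers (and the leader) call this to discover live instances.
        now = time.time()
        return [i for i, last in self.instances.items()
                if now - last <= self.heartbeat_timeout_secs]
```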

The Leader receives the query and delegates tasks to each of the worker nodes after checking their health status from the Worker Service Registry.
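
As a rough sketch of that scatter-gather step, assuming the `ServiceRegistry` above and a hypothetical `search_partition(worker, query)` client call (the `score` field used to merge results is also an assumption), the leader's query handling could look like this:

```python
from concurrent.futures import ThreadPoolExecutor

def handle_query(query, worker_registry, search_partition):
    # Discover healthy workers before delegating any work.
    workers = worker_registry.healthy_instances()
    if not workers:
        raise RuntimeError("no healthy workers available")

    # Scatter: one search task per healthy worker.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(search_partition, w, query) for w in workers]
        partial_results = [f.result() for f in futures]

    # Gather: merge the partial results before replying to the client.
    merged = [hit for part in partial_results for hit in part]
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)
```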

The datasets are huge, so they are partitioned based on a key or a combination of keys. When the Worker nodes receive a task from the Leader Node, they use the task description to determine the exact partition to read the data from. This partition assignment can be computed by the leader using a hash function.
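
A minimal sketch of such a hash-based lookup is shown below; the fixed partition count, the key scheme, and the modulo assignment of partitions to workers are all assumptions for illustration, and a production system might prefer consistent hashing to limit reshuffling when workers join or leave.

```python
import hashlib

NUM_PARTITIONS = 64  # assumed fixed partition count for the dataset

def partition_for(key: str) -> int:
    # A stable hash (not Python's process-randomized hash()) keeps the
    # key-to-partition mapping consistent across restarts and machines.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

def worker_for(partition: int, healthy_workers: list) -> str:
    # Simple modulo assignment of partitions to the currently healthy workers.
    return healthy_workers[partition % len(healthy_workers)]
```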

How do we handle failures in this design?

  1. If the leader goes down, leader election promotes a new leader from among the available worker nodes and registers it with the Coordinator Service Registry.
  2. If a worker goes down, the partitioning logic in the Leader node reassigns its tasks to the remaining workers so that no partition is left unread.
  3. The load balancer checks the Coordinator Service Registry and routes traffic to the available clusters. If we see that a few clusters are heavily loaded with requests, we can implement logic on the load balancer's end to distribute traffic uniformly across the available clusters.
  4. The Leader waits a fixed time period for the workers to return their computed search results. If a worker does not respond within that window, the Leader retries; if the retries also fail, the defective worker is thrown out of the quorum or marked unhealthy by the Leader in the Worker Service Registry (see the timeout-and-retry sketch after this list).
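
Here is the timeout-and-retry sketch referenced in point 4, for a single worker call. The names are hypothetical, and deregistering the worker stands in for whatever "mark unhealthy" operation the Worker Service Registry actually exposes.

```python
from concurrent.futures import ThreadPoolExecutor

def call_worker_with_retry(search_partition, worker, task, worker_registry,
                           timeout_secs=2.0, max_retries=2):
    # Each attempt gets its own thread so a hung attempt cannot block a retry.
    pool = ThreadPoolExecutor(max_workers=max_retries + 1)
    try:
        for attempt in range(max_retries + 1):
            future = pool.submit(search_partition, worker, task)
            try:
                # Wait only for the agreed response window.
                return future.result(timeout=timeout_secs)
            except Exception:
                continue  # retry on timeout or on a worker-side error
        # All retries failed: drop the defective worker from the quorum.
        worker_registry.deregister(worker)
        return None  # caller reassigns this partition to a healthy worker
    finally:
        pool.shutdown(wait=False)
```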

💡 By using an open-source toolchain like Bit, you can further improve reliability in a microservice architecture by making your components independent, then versioned, tested, and shared across different microservices, with automatic semver and a dependency graph for each. This reduces the risk of failures caused by inconsistent dependencies or code duplication.

Summary

In this article, we got an overview of how to approach a distributed microservice in terms of scalability and fault tolerance. In upcoming articles, we will dive deeper into certain aspects of the search service, such as the search algorithm, scaling up the database replicas, load-balancing logic, and strategies to deploy the Worker Service Registry. In a previous article, I discussed designing a search service at scale using the Scatter-Gather pattern: https://medium.com/@kamini.velvet/search-service-using-scatter-gather-pattern-13fab5fe71ad

Thank you!
