Multi-Tenant Data Fairness:
"Noisy Neighbor Problem"

Multi-tenancy allows a cloud service provider to serve multiple customers using the same resources. A single instance of the software runs; however, to each tenant it appears as if there is a dedicated share of the resources. Each tenant's data is kept separate from that of other tenants.

This provides several advantages for both service providers and customers -

  • Lower Cost - As some of the resources are shared across tenants, cloud providers can offer services at a lower cost when compared to a dedicated environment.

  • Flexibility - Tenants don't own any hardware resources and as such can add or remove services based on usage and make faster changes as their business evolves. 

  • Mobility - Applications in the cloud can be accessed from anywhere, allowing for easier and greater employee flexibility.

  • Better utilization of resources - Compute resources are shared across all tenants which leads to better utilization.


However, along with these advantages, cloud services also come with a few disadvantages -

  • Security - As the data is hosted in the cloud, outside of the owner's direct control, it can be seen as carrying additional risk of theft, loss or hacking.

  • Compliance - Compliance laws regarding privacy and ownership of data present an extra layer of challenge when user data is hosted in the cloud.

  • "Noisy Neighbor Effect" - If there is a sudden surge of incoming messages from one tenant, the processing of messages for all other tenants can be impacted due to the shared nature of resources.


In this article I go over the details of how we designed and developed a solution that mitigates the impact of the "Noisy Neighbor Effect".


Platform Architecture

 

Almost all Software as a Service (SaaS) platforms provide support for multiple tenants. Multiple tenants use shared services as part of the data flow through the application. The figure below provides a simplified data flow diagram. The user data first traverses a gateway, is then processed by compute nodes, caches and message brokers, and is finally stored in the tenant's database.

Figure: Simplified-Arch.png
Shared Resources:

 

As user data traverses the system, it is processed by a number of services - Gateway, Application Pods, Message Broker and Caches - that are shared across tenants. This sharing of resources can lead to the "noisy-neighbor" effect described in [1][2], where one tenant can consume a large share of resources, impacting the performance of all other tenants.


Data Fairness

 

A message broker decouples the processing of application data. It is one of the first shared resources in the data path, and it often becomes the first place where a bottleneck occurs when a tenant starts publishing data at an extraordinarily high rate.


To achieve fairness of data usage for all tenants, a throttling capability was added to the platform. The mechanism calculates the request rate per tenant; if the rate is above a pre-configured threshold, the messages are held in a tenant-specific ‘Waiting Room’, similar to a waiting room when queuing for a service. A delay is calculated based on the amount of data seen from the high-rate tenant, and once the delay expires the messages are released from the waiting room back to the message broker and delivered to the consumer for processing. This achieves the primary goal of load-leveling under surge.

Figure: Storage-Queue.png


The logic for load-leveling is as follows (a simplified sketch appears after the list):

  • The algorithm keeps a count of the number of messages per tenant, per time period, going to the message broker.

    • If this is above a configured threshold value,  the message is diverted to a waiting room.

  • Each message released from the waiting room is delayed by a fixed amount relative to the publish time of the previous message released to the main message flow.

  • Once the ‘Delay Time’ for a message in the waiting room has elapsed, the message is diverted back to the Message Broker and follows the regular processing logic.

  • Once the incoming message rate for the high-rate tenant drops back below the configured threshold value, the ‘Delay Time’ is reduced; when the ‘Delay Time’ drops back to zero, the messages are no longer sent to the waiting room and are no longer subject to delays.
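The sketch below illustrates this logic in Java. It is a minimal, illustrative version rather than the production implementation: the class name, the fixed per-message delay and the simple time-window reset are assumptions made for the example.

// Simplified sketch of the load-leveling decision described above.
// Names, the windowing scheme and the delay calculation are illustrative.
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class TenantLoadLeveler {

    private final long maxMessagesPerWindow;   // configured threshold per tenant
    private final Duration windowLength;       // e.g. one minute
    private final Duration perMessageDelay;    // fixed delay applied in the waiting room

    private final Map<String, AtomicLong> countsInWindow = new ConcurrentHashMap<>();
    private volatile long windowStartMillis = System.currentTimeMillis();

    public TenantLoadLeveler(long maxMessagesPerWindow, Duration windowLength, Duration perMessageDelay) {
        this.maxMessagesPerWindow = maxMessagesPerWindow;
        this.windowLength = windowLength;
        this.perMessageDelay = perMessageDelay;
    }

    /** Returns the delay to apply before publishing; Duration.ZERO means publish directly. */
    public Duration delayFor(String tenantId) {
        rollWindowIfNeeded();
        long seen = countsInWindow
                .computeIfAbsent(tenantId, id -> new AtomicLong())
                .incrementAndGet();
        if (seen <= maxMessagesPerWindow) {
            return Duration.ZERO;              // under threshold: straight to the message broker
        }
        // Over threshold: divert to the waiting room with a delay that grows with the
        // number of excess messages, so releases back to the broker are spread out.
        long excess = seen - maxMessagesPerWindow;
        return perMessageDelay.multipliedBy(excess);
    }

    /** Reset counters when the time window elapses; delays drop back to zero with the rate. */
    private void rollWindowIfNeeded() {
        long now = System.currentTimeMillis();
        if (now - windowStartMillis >= windowLength.toMillis()) {
            countsInWindow.clear();
            windowStartMillis = now;
        }
    }
}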

Technical Details

       

The SaaS solution I worked on uses Kafka as the primary Message Broker. One of the limitations of Kafka is that it has no native multi-tenant support. The initial idea was to create a backup Kafka topic for each existing topic and use it as the ‘Waiting Room’ topic. However, this presented additional challenges around how to delay the processing of the messages so as to reduce the load on the main topic and on the Kafka broker itself.

Azure Service Bus (ASB) and Azure Storage Queue (ASQ) have a ‘Visibility Delay’ feature that is not present in Kafka. This feature allows messages to be hidden and not made available to consumers for a set duration. ASB has the related concept of scheduled messages (see "Azure Service Bus message sequencing and timestamps"). Using these features, buffering can be achieved by simply scheduling the message to appear at a delayed time.
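As a minimal sketch of how the ASQ side of this might look with the azure-storage-queue Java SDK, the message is enqueued with an initial visibility timeout so that consumers cannot see it until the delay expires. The class name and queue naming are illustrative, not taken from the actual platform.

// Sketch: delaying a message via ASQ's initial visibility timeout.
import com.azure.core.util.Context;
import com.azure.storage.queue.QueueClient;
import com.azure.storage.queue.QueueClientBuilder;

import java.time.Duration;

public class WaitingRoomPublisher {

    private final QueueClient queueClient;

    public WaitingRoomPublisher(String connectionString, String queueName) {
        this.queueClient = new QueueClientBuilder()
                .connectionString(connectionString)
                .queueName(queueName)        // e.g. a tenant-specific waiting-room queue
                .buildClient();
    }

    /** Enqueue a message that stays invisible to consumers until the delay expires. */
    public void sendDelayed(String messageBody, Duration delay) {
        queueClient.sendMessageWithResponse(
                messageBody,
                delay,                       // visibility timeout: the "waiting room" delay
                null,                        // time-to-live: use the service default
                null,                        // operation timeout
                Context.NONE);
    }
}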

An advantage of using Azure cloud technology is the reduction of load on the Kafka cluster. Under a surge of incoming messages, the Kafka brokers themselves can start experiencing overload; by moving the throttling mechanism to Azure, the load on the broker is reduced and the rest of the system can continue to function with minimal impact. For these reasons I looked at a 'hybrid' solution of brokers, i.e. using Azure technology in conjunction with the existing Kafka brokers to address the 'Noisy Neighbor' problem.


ASQ vs ASB:


I evaluated Azure Storage Queue (ASQ) and Azure Service Bus (ASB) to see which would best serve the use-case we were trying to solve - introducing "load-leveling" or "buffering" for the "noisy-neighbor" traffic. Microsoft's documentation provides a detailed comparison of ASQ vs ASB [3]. After the evaluation, we decided to use ASQ as the solution for buffering the traffic from tenants who exceed the configured threshold. The main reasons for this were:

  • Ease of use: A number of services were already using the Azure Storage account for Blob Storage and Tables. Adding another service that is already available in the account required minimal effort from the Operations and Development teams to provision and start using.

  • Cost:  The cost of the new service was minimal given the storage account was already in use.


ASQ has a message size limit of 64 KB. This limitation was addressed by storing messages larger than that in Blob Storage and storing a pointer to the message in ASQ.
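A hedged sketch of that "pointer in the queue, payload in the blob" approach is shown below; the container name, pointer format and size check are assumptions made for illustration.

// Sketch: store oversized payloads in Blob Storage and enqueue only a pointer.
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobContainerClientBuilder;
import com.azure.storage.queue.QueueClient;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class LargeMessageWriter {

    private static final int MAX_QUEUE_MESSAGE_BYTES = 64 * 1024;

    private final QueueClient queueClient;
    private final BlobContainerClient blobContainer;

    public LargeMessageWriter(QueueClient queueClient, String connectionString) {
        this.queueClient = queueClient;
        this.blobContainer = new BlobContainerClientBuilder()
                .connectionString(connectionString)
                .containerName("waiting-room-overflow")    // illustrative container name
                .buildClient();
    }

    public void send(String payload) {
        byte[] bytes = payload.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= MAX_QUEUE_MESSAGE_BYTES) {
            queueClient.sendMessage(payload);              // small enough: enqueue directly
            return;
        }
        // Too large for ASQ: upload the payload to Blob Storage and enqueue a pointer.
        String blobName = UUID.randomUUID().toString();
        BlobClient blob = blobContainer.getBlobClient(blobName);
        blob.upload(new ByteArrayInputStream(bytes), bytes.length);
        queueClient.sendMessage("blob-ref:" + blobName);   // consumer resolves the pointer
    }
}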


Resilience4j - RateLimiter:


The first step in managing high-traffic tenants is to define the criteria for detecting them. This is typically done via configuration based on the capacity of the system. A threshold on the number of messages per minute from a tenant was set.

The Resilience4j Java library provides mechanisms to rate-limit calls to an API. Using this library we can define the allowed threshold. The rate limiter uses the concept of a count of ‘available permissions’ per time period. Once the permits are consumed, the library throws a ‘Request not permitted’ exception that can be handled to take appropriate action; in our case the subsequent messages were sent to ASQ. To maintain some level of ordering by arrival time, all further messages for that tenant were diverted to ASQ until the queue was completely drained.
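Below is a minimal sketch of how this could be wired up with Resilience4j. The one-minute refresh period and the limit of 1,000 permits are placeholders for the configured threshold, and the publish/waiting-room callbacks stand in for the real Kafka and ASQ code.

// Sketch: per-tenant rate limiting with Resilience4j; limits are placeholders.
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;
import io.github.resilience4j.ratelimiter.RateLimiterRegistry;
import io.github.resilience4j.ratelimiter.RequestNotPermitted;

import java.time.Duration;

public class TenantRateLimiter {

    private final RateLimiterRegistry registry;

    public TenantRateLimiter() {
        RateLimiterConfig config = RateLimiterConfig.custom()
                .limitRefreshPeriod(Duration.ofMinutes(1))  // permits refresh every minute
                .limitForPeriod(1000)                       // placeholder per-tenant threshold
                .timeoutDuration(Duration.ZERO)             // don't block when no permit is available
                .build();
        this.registry = RateLimiterRegistry.of(config);
    }

    /** Publish to Kafka while permits remain; divert to the waiting room once exhausted. */
    public void handle(String tenantId, Runnable publishToKafka, Runnable sendToWaitingRoom) {
        RateLimiter limiter = registry.rateLimiter("tenant-" + tenantId);
        try {
            RateLimiter.decorateRunnable(limiter, publishToKafka).run();
        } catch (RequestNotPermitted e) {
            // Threshold exceeded for this refresh period: send to ASQ instead.
            sendToWaitingRoom.run();
        }
    }
}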


Shared Nothing Architecture:


In a SaaS microservice-based platform, scalability and reliability are achieved by running multiple instances of the same service. A gateway routes the incoming requests to the various instances using a simple round-robin algorithm, so tenant requests are spread across all the instances. Setting a global tenant-level rate-limiter configuration would have required all instances of a service to access shared common data, which introduces additional complexity. Instead, the approach taken was to divide the total rate limit across the number of instances, requiring no shared data. This has a downside because the number of instances can vary; however, only a cumulative approximation of the rate limit is required rather than an exact number. Accepting an approximate value for rate limiting was a worthwhile trade-off for a simpler architecture.
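A tiny sketch of the per-instance split, assuming each instance can learn the replica count from its deployment (here via a hypothetical REPLICA_COUNT environment variable):

// Sketch: derive a per-instance limit from the global tenant limit, no shared state.
public final class PerInstanceLimit {

    private PerInstanceLimit() {
    }

    /** Divide the global per-tenant limit evenly across instances, rounding up. */
    public static int perInstanceLimit(int globalLimitPerMinute, int instanceCount) {
        return (globalLimitPerMinute + instanceCount - 1) / instanceCount;
    }

    public static void main(String[] args) {
        // Instance count source is a placeholder, e.g. set by the deployment.
        int instances = Integer.parseInt(System.getenv().getOrDefault("REPLICA_COUNT", "3"));
        // With a global limit of 6,000 msgs/min and 3 instances, each instance enforces 2,000.
        System.out.println(perInstanceLimit(6000, instances));
    }
}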


Challenges:


The use of the Azure SDK brought in the azure-core library, which in turn pulled in the jackson-dataformat-xml library, which introduced a new HTTP message converter. With this converter present, the default content type for RestTemplate calls became XML. The issue is reported here: "Azure library transitive dependency issue" - https://github.com/Azure/azure-sdk-for-java/issues/7694


To overcome this, a library was created that reordered the message converters and moved the XML converter to the end. This Stack Overflow reference provides the general idea of the workaround implemented in the library - https://stackoverflow.com/questions/57706610/how-to-set-default-messageconverter-to-json-with-jackson-dataformat-xml-added
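The sketch below shows the general shape of that workaround for a plain RestTemplate, based on the Stack Overflow approach rather than on the internal library itself; the helper class name is made up for the example.

// Sketch: push the Jackson XML converter to the end so JSON stays the default.
import org.springframework.http.converter.HttpMessageConverter;
import org.springframework.http.converter.xml.MappingJackson2XmlHttpMessageConverter;
import org.springframework.web.client.RestTemplate;

import java.util.ArrayList;
import java.util.List;

public final class ConverterReordering {

    private ConverterReordering() {
    }

    public static RestTemplate withXmlConverterLast(RestTemplate restTemplate) {
        List<HttpMessageConverter<?>> converters = new ArrayList<>(restTemplate.getMessageConverters());
        List<HttpMessageConverter<?>> xmlConverters = new ArrayList<>();
        // Pull out the XML converter(s) introduced by jackson-dataformat-xml ...
        converters.removeIf(converter -> {
            if (converter instanceof MappingJackson2XmlHttpMessageConverter) {
                xmlConverters.add(converter);
                return true;
            }
            return false;
        });
        // ... and append them at the end so the JSON converter is preferred.
        converters.addAll(xmlConverters);
        restTemplate.setMessageConverters(converters);
        return restTemplate;
    }
}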


 

Metrics & Dark Launch:

 

As part of any feature development, recording metrics that provide observability is key to understanding how the feature is functioning. To this end, Micrometer metrics were recorded to count how many messages were sent to ASQ and received from ASQ, and to track the amount of time messages spent in the queue and the average message size.
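A sketch of how those instruments might be registered with Micrometer is shown below; the metric and tag names are illustrative, not the ones used in the platform.

// Sketch: Micrometer instruments for the waiting-room feature.
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.DistributionSummary;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.time.Duration;

public class WaitingRoomMetrics {

    private final Counter sentToAsq;
    private final Counter receivedFromAsq;
    private final Timer timeInQueue;
    private final DistributionSummary messageSize;

    public WaitingRoomMetrics(MeterRegistry registry, String tenantId) {
        this.sentToAsq = Counter.builder("waitingroom.asq.sent")
                .tag("tenant", tenantId)
                .register(registry);
        this.receivedFromAsq = Counter.builder("waitingroom.asq.received")
                .tag("tenant", tenantId)
                .register(registry);
        this.timeInQueue = Timer.builder("waitingroom.asq.time_in_queue")
                .tag("tenant", tenantId)
                .register(registry);
        this.messageSize = DistributionSummary.builder("waitingroom.asq.message_size")
                .baseUnit("bytes")
                .tag("tenant", tenantId)
                .register(registry);
    }

    /** Record a message diverted to the waiting room. */
    public void recordSent(int sizeBytes) {
        sentToAsq.increment();
        messageSize.record(sizeBytes);
    }

    /** Record a message released from the waiting room and how long it waited. */
    public void recordReceived(Duration queuedFor) {
        receivedFromAsq.increment();
        timeInQueue.record(queuedFor);
    }
}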

         Dark launch sends a copy of the real production traffic to the new service. The results from the new service are discarded. The goal is to see if the new service is ready to handle the real-world traffic and would function without causing major outages on launch. 

Once the ‘load-leveling’ feature was ready, it was introduced as a ‘Dark Launch’ feature. A new service was created that consumed from the same high-volume Kafka topic and posted to a new dummy topic. The dark-launch service configuration was set to trigger the feature when the posting rate to the dummy topic exceeded the threshold for a given tenant.

The ‘Dark Launch’ of the feature was monitored using the Micrometer metrics created for it, and based on these metrics the configuration was tweaked to adjust the load-leveling detection and delays.

 

Conclusion:

 

The ‘Noisy Neighbor’ problem is a common occurrence in a multi-tenant environment. It can come in different forms, allowing resources to be consumed by one tenant and leaving all other tenants ‘starved’. Using the ‘Waiting Room’ concept to hold messages from a high-rate tenant provides an excellent way to mitigate the impact of that tenant on the rest of the system. Further, this solution used a mix of technologies, Azure Storage Queue and Kafka, to reduce the system load. In general, isolating a tenant as soon as its resource utilization exceeds a pre-defined threshold reduces the impact of a ‘Noisy Neighbor’.



References:



  1. Multitenancy: https://www.cloudflare.com/learning/cloud/what-is-multitenancy/

  2. Multitenancy: https://en.wikipedia.org/wiki/Multitenancy

  3. Azure Storage Queue vs Azure Service Bus: https://docs.microsoft.com/en-us/azure/service-bus-messaging/service-bus-azure-and-service-bus-queues-compared-contrasted

  4. Resilience4j RateLimiter: https://resilience4j.readme.io/docs/ratelimiter

  5. Resilience4j RateLimiter: https://reflectoring.io/rate-limiting-with-resilience4j/

  6. Shared-Nothing Architecture: https://en.wikipedia.org/wiki/Shared-nothing_architecture

  7. Azure library transitive dependency issue: https://github.com/Azure/azure-sdk-for-java/issues/7694

  8. Stack Overflow reference: https://stackoverflow.com/questions/57706610/how-to-set-default-messageconverter-to-json-with-jackson-dataformat-xml-added

  9. Micrometer metrics: https://spring.io/blog/2018/03/16/micrometer-spring-boot-2-s-new-application-metrics-collector

  10. Dark launch: https://cloud.google.com/blog/products/gcp/cre-life-lessons-what-is-a-dark-launch-and-what-does-it-do-for-me

