top of page
Messaging System Improvements for Scalability and Reliability

 

The Threat Intelligence Mandiant Graph platform is designed

with a modular architecture comprising three main subsystems:

​

  • Data Import Systems - This includes Automated ingestion from various systems that collect access logs, file scans etc. and manual user actions.

  • Data processing System  -  This system interfaces with the Janusgraph and ElasticSearch systems and provides a  data abstraction layer for storing and querying the data. It also supports a custom query language that maps to Gremlin query language.

  • Data Export and Enrichment services - The stored data is analyzed and enriched using manual analysis, internal and external threat intelligence systems. The various indicators are scored and published to customers. In addition, profiles for various threat entities like indicators, threat actors, and malwares are generated and made available to customers.

​

​

The platform used a mix of technology for messaging between the Data processing systems and the Data export and enrichment services.  Kafka was used for publishing graph changes, however, gRPC was used for pushing the data downstream. This resulted in an impedance mismatch between Kafka and gRPC. Some of the consuming clients were slow consumers This resulted in a backlog of messages and a buildup of memory over time. Eventually, this results in an Out-of-Memory error for the service responsible for converting Kafka events and sending them via gRPC. This  downgraded the overall reliability of the system. Further, the system was not horizontally scalable.

​

From Stuttering Stream to Reliable Flow

​

 While the initial setup leveraged technologies like Kafka and gRPC, a mismatch between them created performance bottlenecks. Here is the break down the challenge we faced and the journey - 

​

The Technology Mix:

​

  • Kafka: A powerful tool for managing high-volume data streams efficiently.

  • gRPC: A framework enabling fast and efficient communication between applications.

​

The Friction:

​

  • Event Conversion: Data streamed through Kafka needed conversion to gRPC format before reaching clients.

  • Uneven Client Consumption: Some clients struggled to keep pace with the incoming gRPC messages.

  • Memory Overload: The conversion service became overwhelmed by a backlog of unprocessed messages, leading to memory exhaustion.

  • System Instability: Memory overload in the conversion service ultimately rendered the entire system unreliable.

 

The Fix:

​

  • Migration to Kafka: Replacing gRPC with Kafka for all communication channels ensured a consistent and efficient data flow.

  • Conversion Removed: By eliminating the conversion step, we streamlined processing and reduced overhead.

  • Enhanced Scalability: Kafka's inherent scalability allows it to adapt to clients with varying processing speeds, preventing memory issues.

  • Boosted Reliability: Removing the conversion service as a single point of failure, coupled with Kafka's built-in redundancy, significantly improved overall system reliability.

 

The Impact:

​

This transformation resulted a messaging system that is not only efficient but also dependable.

The System on average could handle a message rate of 1 million/hour, with these changes, it is able to handle 10X the traffic compared to before.

​

bottom of page