Migrating Python to Rust: Channel Groups Engineering Notes
In real-time communication, scalability and adaptability are paramount, and Rust is a key part of how we deliver both. At PubNub, we treat this as an engineering challenge, constantly refining our systems to accommodate emerging trends in digital interactions. Our Channel Group infrastructure is one such area, where we are expanding not just channel volume but also performance.
It is worth keeping in mind the sheer scale of our operations: we connect over a billion devices sending trillions of messages. These devices have heterogeneous subscription patterns, for example inline multiplexed channel subscriptions, wildcard channel subscriptions, filter expressions, and groups of channels. This is why we stay focused on high performance, scalability, and reliability.
This document covers a number of tactical considerations, starting with our overall approach. We have built a globally distributed database that stores a list of your channels. Each group of channels is identified by a name, the channel-group-id. In the PubNub SDK, you subscribe to that identifier to connect to the list of channels the group represents.
You can, and are encouraged to, modify a group from either your server code or your client. This lets you maintain an active subscription without altering the client connection directly, giving you the ability to remotely manage channel message delivery in real time. Users can share the same group, allowing you to change the feeds an audience receives and direct the flow of messages.
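As a rough sketch of the idea (not PubNub's actual implementation), a channel group can be modeled as a named set of channel names; subscribing to the group id resolves to the current channel list, so mutating the group changes delivery without touching the client connection:

```rust
use std::collections::{HashMap, HashSet};

/// Minimal, hypothetical channel-group registry for illustration.
/// The real service stores these lists in a globally replicated database.
struct ChannelGroupRegistry {
    groups: HashMap<String, HashSet<String>>, // channel-group-id -> channel names
}

impl ChannelGroupRegistry {
    fn new() -> Self {
        Self { groups: HashMap::new() }
    }

    /// Add a channel to a group; an active group subscription picks this up
    /// without the client connection being altered directly.
    fn add_channel(&mut self, group_id: &str, channel: &str) {
        self.groups
            .entry(group_id.to_string())
            .or_insert_with(HashSet::new)
            .insert(channel.to_string());
    }

    fn remove_channel(&mut self, group_id: &str, channel: &str) {
        if let Some(chs) = self.groups.get_mut(group_id) {
            chs.remove(channel);
        }
    }

    /// Subscribing to the group id resolves to the channels it represents.
    fn resolve(&self, group_id: &str) -> Vec<String> {
        let mut chs: Vec<String> = self
            .groups
            .get(group_id)
            .map(|s| s.iter().cloned().collect())
            .unwrap_or_default();
        chs.sort();
        chs
    }
}

fn main() {
    let mut reg = ChannelGroupRegistry::new();
    reg.add_channel("sports-feeds", "nba.scores");
    reg.add_channel("sports-feeds", "nfl.scores");
    reg.remove_channel("sports-feeds", "nfl.scores");
    println!("{:?}", reg.resolve("sports-feeds")); // ["nba.scores"]
}
```

The names here (`ChannelGroupRegistry`, `resolve`) are illustrative only; the SDK exposes its own channel-group API.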
Before: Client ══► Subscribe to Channel Group ══► (Python App: Channel Group Lookup) ══► Cassandra DB
After:  Client ══► Subscribe to Channel Group ══► (Rust App: Channel Group Lookup) ══► DynamoDB Global Tables

Legend:
• ══► Arrow indicating flow
• (Parentheses) indicate a Docker container
From a service architecture perspective, the Channel Group list is globally replicated to all availability zones PubNub operates in on AWS. Today, a Python application runs in a Docker container on Kubernetes. We are replacing that Python container with a Rust container, and we are replacing the database too: moving away from Cassandra to DynamoDB with global table replication enabled.
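One way to structure such a swap, sketched here as an assumption rather than our actual code, is to put the channel-group lookup behind a trait so the subscribe path does not care whether Cassandra or DynamoDB sits underneath:

```rust
use std::collections::HashMap;

/// Hypothetical abstraction: the subscribe path depends only on this trait,
/// so the storage backend (Cassandra -> DynamoDB global tables) can change
/// without touching callers.
trait ChannelGroupStore {
    fn lookup(&self, group_id: &str) -> Option<Vec<String>>;
}

/// In-memory stand-in for illustration; a real implementation would call
/// the DynamoDB API.
struct InMemoryStore {
    data: HashMap<String, Vec<String>>,
}

impl ChannelGroupStore for InMemoryStore {
    fn lookup(&self, group_id: &str) -> Option<Vec<String>> {
        self.data.get(group_id).cloned()
    }
}

/// Resolve a group subscription against whatever backend is configured.
fn resolve_subscription(store: &dyn ChannelGroupStore, group_id: &str) -> Vec<String> {
    store.lookup(group_id).unwrap_or_default()
}

fn main() {
    let mut data = HashMap::new();
    data.insert("alerts".to_string(), vec!["ch-a".to_string(), "ch-b".to_string()]);
    let store = InMemoryStore { data };
    println!("{:?}", resolve_subscription(&store, "alerts")); // ["ch-a", "ch-b"]
}
```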
Technical Impacts

Increasing Channels per Group
In a recent brainstorming session, our engineers tackled the question of increasing supported channels per group from 2,000 to potentially 5,000. This decision wasn't taken lightly: an increase of this magnitude could lead to higher latency, a substantial increase in memory usage, or other unforeseen complications.
Number of channels per group: 2,000 (low impact) → 5,000 (high impact?)

Considerations:
- Latency
- Memory usage
- Queue depth
- Multiple subscriptions
- C code optimization
- Channel Group events

Key details:
- Moving from 2,000 to 5,000 channels per group
- Main concerns are latency, memory usage, and queue depth
- Solutions involve C code changes, multiple subscriptions, and Channel Group events
The crux of this concern rests on the current subscription model. No matter how high the channel count, a single subscribe call only returns messages up to the account key's configurable queue depth. This is typically a non-issue, since the rate of message delivery per connected device is well within limits, but adding more channels increases the likelihood that large channel group use cases will need adjustments. At high volumes, you can ask the PubNub team to increase your queue size.
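To see why queue depth matters as channel counts grow, here is a toy model of a bounded per-subscription queue. The drop-oldest overflow policy is an assumption for illustration, not the actual server behavior:

```rust
use std::collections::VecDeque;

/// Toy per-subscription message queue with a configurable depth.
/// Overflow policy (drop oldest) is an illustrative assumption only.
struct MessageQueue {
    depth: usize,
    buf: VecDeque<String>,
    dropped: usize,
}

impl MessageQueue {
    fn new(depth: usize) -> Self {
        Self { depth, buf: VecDeque::new(), dropped: 0 }
    }

    fn push(&mut self, msg: String) {
        if self.buf.len() == self.depth {
            self.buf.pop_front(); // oldest message is lost
            self.dropped += 1;
        }
        self.buf.push_back(msg);
    }
}

fn main() {
    // If 5,000 channels each publish once between subscribe calls, a small
    // queue depth can be exceeded, so some messages would be dropped.
    let mut q = MessageQueue::new(100);
    for i in 0..5_000 {
        q.push(format!("msg-{i}"));
    }
    println!("buffered={} dropped={}", q.buf.len(), q.dropped); // buffered=100 dropped=4900
}
```

This is the scenario a larger configured queue depth addresses.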
Another option is to use separate subscriptions, spreading out the rate of message delivery when multiple channels are publishing concurrently. This matters because scaling up the channel limit could exacerbate queue pressure, an outcome we aim to mitigate. We have already done this successfully for high-volume messaging customers, and we remain focused on maintaining this capability advantage for our customers.
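The "separate subscriptions" approach can be sketched as simple partitioning: split a large channel list into chunks, with one subscription per chunk, so no single subscription's queue absorbs the full publish rate (function name and chunk size here are illustrative):

```rust
/// Split a large channel list across several subscriptions so each
/// subscription's queue sees only a fraction of the total publish rate.
fn split_into_subscriptions(channels: &[String], per_sub: usize) -> Vec<Vec<String>> {
    channels.chunks(per_sub).map(|c| c.to_vec()).collect()
}

fn main() {
    let channels: Vec<String> = (0..5_000).map(|i| format!("ch-{i}")).collect();
    // 5,000 channels at 2,000 per subscription -> 3 subscriptions.
    let subs = split_into_subscriptions(&channels, 2_000);
    println!("{} subscriptions", subs.len()); // 3 subscriptions
}
```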
We are taking a look at our globally proactive-replicated message bus, the core IP we have developed in C over the last decade, and strategically considering the mechanics behind large-scale subscriptions. Examining and modifying the C code is a potential way to handle greater data volumes while accommodating an increase in channels without compromising system performance.
Channel Groups Performance Tradeoffs
Highlighting the need for flexibility in managing Channel Groups, Craig Conover introduced the concept of Channel Group events that mimic Presence API events. This mechanism would track ongoing modifications to a Channel Group and notify users when a channel limit is reached or when a channel is added or removed. The addition improves data management by providing a more precise view of channel behavior and greater control over data transmission.
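A hypothetical shape for such events, modeled loosely on Presence-style events (the shipped event names and fields may differ), might look like this:

```rust
/// Hypothetical Channel Group event shapes, for illustration only.
#[derive(Debug, PartialEq)]
enum ChannelGroupEvent {
    ChannelAdded { group: String, channel: String },
    ChannelRemoved { group: String, channel: String },
    LimitReached { group: String, limit: usize },
}

/// Mutate a group's channel list and emit the corresponding event,
/// refusing the add once the channel limit is reached.
fn add_with_event(
    channels: &mut Vec<String>,
    group: &str,
    channel: &str,
    limit: usize,
) -> ChannelGroupEvent {
    if channels.len() >= limit {
        return ChannelGroupEvent::LimitReached { group: group.into(), limit };
    }
    channels.push(channel.to_string());
    ChannelGroupEvent::ChannelAdded { group: group.into(), channel: channel.into() }
}

fn main() {
    let mut channels = Vec::new();
    let ev = add_with_event(&mut channels, "sports-feeds", "nba.scores", 2_000);
    println!("{:?}", ev);
}
```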
Number of channels per group: 1 → 2,000 → 10,000 → 50,000
Operation complexity (color): green = O(1) (best), yellow = O(2) (good), red = O(n) (concerning)

Key details:
- Currently support up to 10 groups × 2,000 channels = 20,000 total
- O(1) complexity for a user subscribing to a group
- O(2) complexity for publishing with channel replication
- More groups and channels cause performance issues
- Major architecture changes needed beyond 10 groups
An interesting perspective was raised regarding the balance between the number of channels and the associated performance implications. Presently, we support up to 10 Channel Groups with 2,000 channels each per user subscription instance, aggregating to a total of 20,000 channels per subscription. That's a lot of channels!
This can be extended even further by creating a second or third subscription on the user device, giving customers the freedom to subscribe to a very large number of channels via Channel Groups, with a potential 20,000 channels per subscription. We are careful in this matter: the potential strain on database performance and the requirement for architectural changes were paramount considerations in this scenario.
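The arithmetic behind these limits is straightforward; the figures below come from the limits stated above (10 groups × 2,000 channels per subscription), and the helper name is just for illustration:

```rust
/// Total channels reachable from one device, given the per-subscription
/// limits stated in these notes.
fn max_channels(groups_per_sub: usize, channels_per_group: usize, subscriptions: usize) -> usize {
    groups_per_sub * channels_per_group * subscriptions
}

fn main() {
    // Current limit: 10 groups x 2,000 channels on a single subscription.
    println!("{}", max_channels(10, 2_000, 1)); // 20000
    // A second and third subscription on the same device scale this linearly.
    println!("{}", max_channels(10, 2_000, 3)); // 60000
}
```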
Operation Complexity in our C Codebase
An overview of the performance characteristics of our core message bus today is necessary. From the perspective of a user subscribing to a Channel Group with 2,000 channels, PubSub is an 'O(1)' operation: the group simply translates to one channel for the user, the most efficient scenario.
Our core message bus treats the group as a single channel, offering ideal performance. From a replication perspective, when a user publishes data into a channel that exists in a multiplex, one extra memory pointer is added: a message pointer reference is captured in the channel group, keyed off the originating source channel.
The good news is that this scales well, to an 'O(2)' operation: one extra constant-time step per message. In the vast majority of cases this slight increase in operational complexity is already commonplace and performs admirably today with our current Channel Group capabilities.
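The pointer-reference idea above can be sketched with reference counting: the message body is stored once, and the channel group's queue holds only another pointer to it, never a copy (the queue layout here is an illustrative assumption):

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Sketch of why group delivery costs one extra constant-time step:
/// the message body is allocated once, and each queue (source channel
/// and channel group) stores only a pointer to it.
fn main() {
    let mut queues: HashMap<String, Vec<Rc<String>>> = HashMap::new();

    // Publish into the source channel: O(1).
    let msg = Rc::new(String::from("score update"));
    queues.entry("nba.scores".into()).or_default().push(Rc::clone(&msg));

    // The channel also exists in a group: one extra pointer, not a copy.
    queues.entry("group:sports-feeds".into()).or_default().push(Rc::clone(&msg));

    // Both queues reference the same allocation: the original handle
    // plus two queue entries.
    println!("refs = {}", Rc::strong_count(&msg)); // refs = 3
}
```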
Operational Complexity
 O(n) ^
      |
      |   ✅ O(3)
      |   ✅ O(2)
      |   ✅ O(1)
      +-----------------------------------------------
          5        2,000        10,000        50,000
                 Number of Channels per Group

O(1) subscribe + O(2) publish replication = O(3) max
Optimizing the database solution is a key goal. Our current design allows roughly half a megabyte worth of channel names per group, which can easily extend beyond 5,000 channels, and Rust is an important part of this. We want to ensure we maintain blazing-fast, high-performance message delivery under all possible subscription patterns. Resourceful channel naming, short and distinct, is meaningful while making maximum use of available resources.
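As a back-of-envelope illustration of why short names matter, assume a half-megabyte budget for the name list and one separator byte per name (the separator and the exact storage format are assumptions, not the actual encoding):

```rust
/// Rough estimate of how many channel names fit in a fixed byte budget,
/// assuming one separator byte per name (an illustrative assumption).
fn channels_in_budget(budget_bytes: usize, avg_name_len: usize) -> usize {
    budget_bytes / (avg_name_len + 1)
}

fn main() {
    let budget = 512 * 1024; // half a megabyte

    // Longer names still clear the 5,000-channel mark...
    println!("{}", channels_in_budget(budget, 64)); // 8065

    // ...but short, distinct names stretch the same budget much further.
    println!("{}", channels_in_budget(budget, 16)); // 30840
}
```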
The journey to enhance the Channel Groups infrastructure with Rust at PubNub stretches beyond expanding the numbers. It's an intricate dance of managing operational complexity, preventing data loss, optimizing data replication, and reducing latency. We have not yet released this Rust-powered update; we expect the upgrade to be completed this year. As we continue to innovate and improve our real-time network, we are committed to ensuring a seamless, scalable, and efficient service for our users.