Why We Didn't Use AI for Channel Pattern Recognition

Mateusz Chrapczyński on May 2, 2025

In the previous blog post, we explored the concept and benefits of "the forest and the trees" in analytics, highlighting the new Group by Channel Patterns feature in PubNub Insights. This feature lets you shift seamlessly between big-picture trends and granular individual metrics by aggregating individual channel metrics into group metrics based on channel patterns, so you can optimize your platform and build stronger decision-making strategies.

While building this feature, we took into account the following factors:

  • Our customers span a wide range of industries, each with its own unique, non-standardized naming conventions for channels and use cases.

  • The volume of channels for each customer varies significantly, ranging from hundreds to hundreds of thousands.

  • We wanted to ensure that real-time performance at scale remained fast across a wide range of customer data volumes.

  • Most of our users are also unfamiliar with their own channel patterns.

What is channel pattern grouping?

In PubNub, channels can follow structural naming conventions—like chat-private-user-123, chat-private-user-456, or live-video-room-abc. These patterns represent functional themes or workflows and are all defined by the customers. Up until now, analyzing them meant diving into each channel and manually piecing together insights. You’d add the metrics from every channel that starts with chat-private-user.

We already show metrics for each channel in Insights, so the question became how to group the metrics from these individual channels based on patterns that make sense for each customer, while helping to prevent human error.

Behind the scenes: Requirements and options explored

When developing this feature, an important goal was to offer suggested patterns as helpful starting points for customers to dive into their data. This meant finding the best way to identify patterns to suggest from each customer’s data, subject to the following requirements:

  • Performance: The suggested patterns, and the metrics generated after selecting patterns to group by, should load quickly and be easy to troubleshoot if any issues arise.

  • Processing cost: Given the volume of channels being processed, the processing cost has to be kept as low as possible and predictable for future forecasting.

  • Available immediately: The suggested patterns should be available to everyone immediately without additional input from the user.

Based on these requirements, we explored several approaches to detect meaningful patterns in channel names: Suffix Trees / Suffix Arrays, Trie-Based Approaches, Naïve Log/Template Discovery, and Tokenization + TF-IDF. We chose Tokenization + TF-IDF because it met all three requirements: performance, processing cost, and immediate availability.

Why not use Artificial Intelligence?

One might immediately think to use a Large Language Model (LLM) to generate these pattern suggestions. After all, LLMs can capture context and token relationships at a scale no human analyst could match, offering potentially deeper insights.

However, we found that LLMs and deep learning were overkill, and opted not to use AI in production for two key reasons:

  1. Customer Privacy & Consent: Many of our customers are cautious about how their data is used — especially when it involves AI technology. Using an LLM would require explicit consent to share data for processing and machine learning, adding unnecessary friction.

  2. Overkill for the Problem: While LLMs are powerful, they’re not always the best fit. Deep learning models take longer to process a large number of channels before producing pattern suggestions, and the cost is significantly higher because they require compute-intensive GPUs or TPUs.

It’s worth noting that LLMs are just one kind of AI model. Other AI systems that can extract patterns — often called extractor models — are lighter, faster, and more purpose-built for specific tasks like this. While LLMs — and even lighter-weight entity extractor models — might produce high-quality results, quality isn’t our only criterion. We need results that are not just good, but fast at scale. When you’re working at the scale of billions of channels, speed becomes a non-negotiable requirement.

That said, we may revisit LLM-powered suggestions as an optional, opt-in AI-powered feature in the future.

In case you’re curious, here’s one of the tests we ran with an LLM. For the following sample list of individual channels:

The LLM identified the following patterns:

Tokenization + TF-IDF: The winning combo

The most effective solution we found combines tokenization with TF-IDF (Term Frequency-Inverse Document Frequency). This approach analyzes the structure of channel names by breaking them into "tokens": individual components separated by delimiters like dashes or underscores.

For example, a channel like chat-private-user-848-user-173 might be tokenized into:

  • chat

  • private

  • user

  • 848

  • user

  • 173

Then, TF-IDF helps highlight the most significant tokens across your entire channel set, surfacing recurring patterns. The result? Suggestions that often align with how you've logically structured your channels, even if you didn’t consciously plan it that way. Our TF-IDF + tokenization approach is lightweight, transparent, and fast — and it gives great results without the overhead of integrating AI-driven models.
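As an illustration, here’s a minimal Python sketch of this idea (not our production code). Each channel name is treated as a tiny "document" of tokens, and inverse document frequency separates structural tokens from variable IDs. The channel list, threshold, and helper names here are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(channel):
    # Split a channel name on common delimiters like "-", "_", and "."
    return [t for t in re.split(r"[-_.]", channel) if t]

def token_idf(channels):
    # Treat each channel name as one "document" and compute the
    # inverse document frequency of every token across the set.
    docs = [set(tokenize(c)) for c in channels]
    df = Counter(tok for doc in docs for tok in doc)
    n = len(docs)
    return {tok: math.log(n / count) for tok, count in df.items()}

channels = [
    "chat-private-user-848-user-173",
    "chat-private-user-456",
    "chat-private-user-123",
    "live-video-room-abc",
]
idf = token_idf(channels)
# Tokens shared across many channels (low IDF) are structural
# candidates; rare tokens (high IDF) are likely variable IDs.
structural = sorted(t for t, score in idf.items() if score < math.log(2))
# structural == ["chat", "private", "user"]
```

The threshold of log(2) (token present in more than half the channels) is an arbitrary choice for the sketch; in practice the cutoff would be tuned against real channel sets.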

The conclusion? Tokenization works.

The quality of pattern recognition depends almost entirely on how well we break down the channel names. That’s why we invested time fine-tuning our tokenization logic with smart defaults and RegExp handling.
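To make the RegExp side concrete, here is a hypothetical sketch of the kind of normalization this enables (these rules are illustrative, not our production defaults): variable-looking tokens collapse into wildcards so that channels sharing a structure map to the same pattern string.

```python
import re

# Hypothetical normalization rules; real defaults would be tuned
# against actual customer channel data.
NUMERIC = re.compile(r"^\d+$")
HEXLIKE = re.compile(r"^[0-9a-f]{6,}$")

def to_pattern(channel):
    # Split on delimiters but keep them (capturing group), so the
    # resulting pattern preserves the original channel structure.
    parts = re.split(r"([-_.])", channel)
    return "".join(
        "*" if NUMERIC.match(p) or HEXLIKE.match(p) else p
        for p in parts
    )

print(to_pattern("chat-private-user-848-user-173"))
# chat-private-user-*-user-*
```

Channels without variable-looking tokens, like live-video-room-abc, pass through unchanged under these rules.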

Here’s one of the tests from the Tokenization + TF-IDF process.

The following is a sample list of individual channels from our test data:

TF-IDF + Tokenization identified the following patterns:

Other Methods We Explored

Here are the other alternatives we tested:

  1. Suffix Trees / Suffix Arrays: A Suffix Tree is a compressed trie (prefix tree) of all suffixes of a string. For example, for chat-private-user-848-user-173: You build a tree where each path from root to leaf represents a suffix (e.g., chat-private-user-848-user-173, hat-private-user-848-user-173, at-private-user-848-user-173, etc.).

Results: Suffix trees are not feasible for very large datasets; both suffix arrays and suffix trees require substantial processing time and memory. Hence, we decided not to go with them.
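To make the cost concrete, here is a naive Python sketch of suffix-array construction (illustrative only, not something we shipped): every suffix start position is sorted by its suffix text, which is where the time and memory pressure comes from.

```python
def suffix_array(s):
    # Sort all suffix start positions by their suffix text. This naive
    # version compares O(n^2) characters in the worst case; even
    # optimized constructions were too heavy for our scale.
    return sorted(range(len(s)), key=lambda i: s[i:])

print(suffix_array("banana"))
# [5, 3, 1, 0, 4, 2] -> suffixes "a", "ana", "anana", "banana", "na", "nana"
```

Common substrings cluster together in the sorted order, which is what makes pattern lookup possible, but doing this per channel name across hundreds of thousands of channels adds up quickly.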

  2. Trie-Based Approaches (Common Prefix/Suffix Discovery): A Trie (pronounced "try") is a tree-like data structure used to store strings so that:

    • Each node represents a character.

    • Each path from the root to a leaf represents a word (or substring).

    • Common prefixes are shared, so it’s very space-efficient when many strings start the same way.

Results: This method uses a lot of memory, produces very deep tries, and does not scale to massive datasets. It didn’t yield useful or consistent results on real-world channel data, so it was rejected.
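A minimal character-level trie for prefix discovery might look like this sketch (hypothetical helper names, not our implementation). Note how one node per character makes the trie exactly as deep as the longest channel name, which is where the memory cost comes from.

```python
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = defaultdict(TrieNode)
        self.count = 0  # how many inserted names pass through this node

def insert(root, name):
    node = root
    for ch in name:  # one node per character, so tries get deep fast
        node = node.children[ch]
        node.count += 1

def common_prefix(root, min_count):
    # Walk down while the most popular child still covers at least
    # min_count names; the path walked is a shared prefix.
    prefix, node = [], root
    while node.children:
        ch, child = max(node.children.items(), key=lambda kv: kv[1].count)
        if child.count < min_count:
            break
        prefix.append(ch)
        node = child
    return "".join(prefix)

root = TrieNode()
for name in ["chat-private-user-123", "chat-private-user-456", "live-room-abc"]:
    insert(root, name)
print(common_prefix(root, min_count=2))
# chat-private-user-
```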

  3. Naïve Log/Template Discovery: This is a simple approach where you manually or heuristically group log lines into templates by:

    • Identifying constant parts of log messages.

    • Replacing variable parts (e.g., timestamps, IDs, user info) with placeholders like <*>.

Results: This method is not suited to diverse data and scales poorly to large datasets. It didn’t produce any meaningful patterns during testing, so it was rejected.
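As a sketch of what template discovery looks like in practice (the regex below only masks digit runs, a stand-in for the richer timestamp/ID heuristics such approaches rely on):

```python
import re
from collections import Counter

def to_template(channel):
    # Replace runs of digits with a "<*>" placeholder; real template
    # miners also handle timestamps, UUIDs, user info, and so on.
    return re.sub(r"\d+", "<*>", channel)

templates = Counter(
    to_template(c)
    for c in ["chat-private-user-848", "chat-private-user-173", "live-room-7"]
)
print(templates)
# Counter({'chat-private-user-<*>': 2, 'live-room-<*>': 1})
```

The weakness shows immediately: any variable part that isn’t covered by a hand-written rule produces a new template, so diverse naming conventions fragment into noise.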

Not one-size-fits-all — And that’s okay

Not every pattern suggested by Tokenization + TF-IDF will be perfect for every customer. Channel naming conventions vary wildly, and while our algorithm works well in most cases, edge cases do exist. That’s why we’ve built in flexibility — you can add your own freeform patterns or ignore suggestions that don’t make sense for your unique data.

We’re continuing to test and iterate based on feedback, and we’re always open to ideas that make insights more useful to you.

Try it today

The Group by Channel Patterns feature is now live in PubNub Insights. Jump in and start exploring your data in a whole new way.

If you're a current Insights user, go to the Admin Portal, navigate to Insights dashboards, and then view the Top channel metrics on the Channels dashboard to use this feature.

If you're not an Insights user, start a two-week free trial here and view the Insights dashboards to use the group by patterns feature.

Let us know how it works for you, what you discover, and how we can keep improving. Your feedback shapes our roadmap.