Engineering

Why AI-Powered Routing Beats Static Rules at 10M+ msg/sec

After years of building rule-based routers, we rewired OwlMQ's core to use embedding-based intent inference.


Priya Menon

Principal Engineer

January 15, 2025

12 min read

AI Routing · Performance · Architecture

When we started building OwlMQ, routing was simple: a message arrives on topic orders.created, it goes to the payment-service consumer group. End of story. Rule-based routing works fine at modest scale — until it doesn't.

The problem with static rules

At around 2M messages per second, our rule evaluation became the bottleneck. We had 14,000 routing rules across 400 tenants. Every incoming message had to traverse a trie of conditions. The P99 routing latency climbed to 18ms — unacceptable for a system promising sub-5ms end-to-end.
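To make the bottleneck concrete, here is a minimal sketch of what per-message static rule evaluation looks like: a trie keyed on dot-separated topic segments, walked once per message. The names (`RuleTrie`, `route`) are illustrative, not OwlMQ's actual API.

```python
class RuleTrie:
    """Toy static-routing trie: one node per topic segment."""

    def __init__(self):
        self.children = {}   # segment -> RuleTrie
        self.group = None    # consumer group if a rule terminates here

    def add_rule(self, topic, group):
        node = self
        for seg in topic.split("."):
            node = node.children.setdefault(seg, RuleTrie())
        node.group = group

    def route(self, topic):
        # every message pays one trie walk; misses return None
        node = self
        for seg in topic.split("."):
            node = node.children.get(seg)
            if node is None:
                return None  # no rule: the silent-lag failure mode
        return node.group

rules = RuleTrie()
rules.add_rule("orders.created", "payment-service")
print(rules.route("orders.created"))    # payment-service
print(rules.route("orders.refunded"))   # None -> messages lag until a config PR lands
```

The `None` branch is exactly the "new team forgets to configure routing" failure described below: the trie walk succeeds or fails silently, and nothing routes until someone writes a rule.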

More importantly, static rules are brittle. A new team adds a topic, forgets to configure routing, and messages silently lag. The on-call gets paged. The fix is a config PR. It's toil that compounds.

IAMP: the foundation for AI routing

The key insight was that messages already carry intent — it's just implicit in the topic name and payload structure. IAMP (Intent-Aware Message Protocol) makes it explicit. Every message now includes an intent field:

{
  "intent": "payment.process",
  "payload": { "orderId": "ord_123", "amount": 99.99 }
}

This single field unlocks everything. Our routing model is a fine-tuned embedding model (based on all-MiniLM-L6-v2) that maps intents to consumer groups using cosine similarity. New intents are handled automatically — the model finds the closest matching consumer group without any config changes.
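The nearest-group lookup can be sketched in a few lines. The vectors below are toy stand-ins; in production the embeddings would come from the fine-tuned all-MiniLM-L6-v2 model, and the group centroids (`GROUP_CENTROIDS` here is a hypothetical name) from the registered intents of each consumer group.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# consumer group -> centroid embedding of its registered intents (toy 3-d vectors)
GROUP_CENTROIDS = {
    "payment-service":  [0.9, 0.1, 0.0],
    "shipping-service": [0.1, 0.9, 0.0],
}

def route(intent_embedding):
    # pick the consumer group whose centroid is closest in cosine space
    return max(GROUP_CENTROIDS,
               key=lambda g: cosine(intent_embedding, GROUP_CENTROIDS[g]))

print(route([0.85, 0.2, 0.05]))  # payment-service
```

Because the decision is a nearest-neighbor match rather than an exact string lookup, a new intent that never appeared in any config still lands on the most semantically similar group.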

Performance: the numbers

After shipping AI routing to production:

  • Routing P99 latency dropped from 18ms to 1.2ms
  • Zero routing misconfigurations in 90 days (vs. 23 the previous quarter)
  • Throughput ceiling increased from 2M to 12M msg/sec on the same hardware
  • Cold-start routing for new intents: under 200ms (model inference + cache warm)

The architecture

The routing layer runs as a sidecar to each broker node. Embeddings are computed once per unique intent and cached in a Redis hash with a 24-hour TTL. For known intents, routing is a hash lookup — O(1), under 50μs. For unknown intents, we invoke the inference engine, which runs on a co-located GPU slice.

We maintain a fallback path to static rules for latency-critical intents where the SLA is under 1ms. These are configured explicitly and take priority over the AI router.
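The precedence rule is simple: explicit static rules are checked first and win. A sketch, where `ai_route` is a placeholder for the embedding router and `STATIC_RULES` is a hypothetical name for the explicitly configured latency-critical intents:

```python
# intents with a sub-1ms SLA are routed by explicit config, never the model
STATIC_RULES = {"payment.process": "payment-service"}

def ai_route(intent):
    return "inferred-group"  # placeholder for the embedding-based router

def route(intent):
    # static rules take priority; everything else goes to the AI router
    return STATIC_RULES.get(intent) or ai_route(intent)

print(route("payment.process"))  # payment-service
print(route("orders.created"))   # inferred-group
```

This keeps model inference entirely off the critical path for the intents that cannot afford it, while everything else gets the zero-config behavior.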

Lessons learned

The biggest surprise wasn't performance — it was reliability. Because the model generalizes, typos in intent tags ("payment.procsess") still route correctly. Because routing decisions are logged with confidence scores, debugging is dramatically simpler. And because new topics are handled automatically, we've reduced routing-related incidents to near zero.

The lesson: AI routing isn't magic. It works because we gave messages a structured semantic layer (IAMP) that the model can reason about. Without that, you're just doing keyword matching with extra steps.