The starting pile is every chat, call, and message between fifty million people and the various AI products our customers run. The chats are long and unstructured. A single chat often jumps between five different things. The job is to make sense of all of it, automatically, without anyone hand-labeling anything.
Here is what we actually do, in order.
Cut conversations at the seams
Most chats aren't about one thing. Somebody asks about pricing, then a feature, then a bug, then asks for a refund. If you treat that as a single thing, you get nothing useful out of it.
So we read every chat from top to bottom and ask, at each break, whether the topic just changed. Where the answer is yes, we make a cut. A chat that bounced between five things becomes five separate pieces, and each piece is small enough to mean one thing.
The hard part is the line itself. The point where one topic ends and another starts looks different in every chat. A formal bot reads differently from a casual one. Instead of picking one rule for everyone, we calibrate the line for each individual chat against a small set of chats a human has cut the right way. Nobody picks the threshold by hand.
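A toy sketch of the cutting and the calibration, with made-up two-number fingerprints standing in for real message embeddings. Every function name, vector, and candidate threshold here is illustrative, not the production system:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two message embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cut_at_seams(embeddings, threshold):
    """Split a chat into pieces wherever consecutive messages
    fall below `threshold` similarity (i.e. the topic changed)."""
    pieces, current = [], [0]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            pieces.append(current)
            current = []
        current.append(i)
    pieces.append(current)
    return pieces

def calibrate_threshold(graded_chats, candidates=np.linspace(0.2, 0.9, 15)):
    """Nobody picks the threshold by hand: score each candidate against
    chats a human has cut the right way, keep the best scorer.
    `graded_chats` is a list of (embeddings, true_pieces) pairs."""
    def score(t):
        return sum(cut_at_seams(e, t) == truth for e, truth in graded_chats)
    return max(candidates, key=score)

# Two topics: pricing-ish vectors, then bug-ish vectors.
chat = [np.array([1.0, 0.0]), np.array([0.9, 0.1]),
        np.array([0.0, 1.0]), np.array([0.1, 0.9])]
graded = [(chat, [[0, 1], [2, 3]])]
threshold = float(calibrate_threshold(graded))
pieces = cut_at_seams(chat, threshold)
# pieces → [[0, 1], [2, 3]]
```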
Boil each piece down to a sentence
A piece of chat has a lot of filler in it. "Got it, one moment, let me check that." The filler takes up just as much room as the actual point. So we boil each piece down to one sentence about what the person was trying to do. From here on, everything downstream sees the sentence, not the raw chat.
Turn each sentence into a fingerprint
Now we turn each sentence into a long string of numbers, about four thousand of them. Two sentences with similar meaning end up with similar numbers, even if they share zero words. "I want my money back" and "Can I get a refund" come out close. "I want my money back" and "Where do I find the menu" come out far apart.
Once a sentence is a string of numbers, you can do math on it. You can ask: how close are these two? How close are these thousand? That is the trick that makes the rest of this possible.
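The "how close are these two" math is cosine similarity. A minimal illustration with hand-made three-number vectors standing in for the real four-thousand-number fingerprints an embedding model would produce:

```python
import numpy as np

def similarity(a, b):
    """Cosine similarity: near 1.0 means same meaning-direction,
    near 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made stand-ins for real embeddings.
money_back = np.array([0.9, 0.1, 0.0])   # "I want my money back"
refund     = np.array([0.8, 0.2, 0.1])   # "Can I get a refund"
find_menu  = np.array([0.0, 0.1, 0.9])   # "Where do I find the menu"

# Similar meaning, zero shared words: the numbers come out close anyway.
assert similarity(money_back, refund) > similarity(money_back, find_menu)
```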
Stack the fingerprints into folders
We end up with a hundred and fifty million strings of numbers and we want them organized. We do this in three layers, like folders inside folders.
At the top, a dozen broad categories. Inside each one, a few dozen sub-folders. Inside those, the actual buckets, where each bucket is one specific behavior. "Refund because of a billing error." "Refund because a feature was promised and removed."
The grouping isn't done by a person. We let the math find tight neighborhoods, then find tight neighborhoods of those neighborhoods, and so on. Anything that doesn't fit a neighborhood gets held aside, on purpose, for later.
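One clustering layer of that process can be sketched in a few lines: points join a nearby neighborhood or start a new one, and anything left alone is held aside. Run the same pass again on the neighborhood centers and you get the layer above. The radius and the algorithm are stand-ins, not what production uses:

```python
import numpy as np

def leader_cluster(points, radius):
    """One-pass 'tight neighborhood' grouping: each point joins the
    first center within `radius`, otherwise it becomes a new center."""
    centers, members = [], []
    for i, p in enumerate(points):
        for c, m in zip(centers, members):
            if np.linalg.norm(p - c) <= radius:
                m.append(i)
                break
        else:
            centers.append(p)
            members.append([i])
    # Points that never found a neighborhood are held aside, on purpose.
    held = [m[0] for m in members if len(m) == 1]
    kept = [(c, m) for c, m in zip(centers, members) if len(m) > 1]
    return kept, held

points = np.array([[0.0, 0.0], [0.1, 0.0],    # one tight neighborhood
                   [5.0, 5.0], [5.1, 5.0],    # another
                   [99.0, 0.0]])              # fits nowhere
buckets, held_aside = leader_cluster(points, radius=0.5)
# Two buckets of two points each; index 4 goes to the hold-aside.
```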
Name each folder
Every bucket needs a name a human can scan in one second and know what is inside. The lazy way is to grab a few random sentences from the bucket and ask a model to summarize them. The names that come out are vague. A few random sentences from a bucket of a hundred thousand miss most of what is actually in there.
So we don't pick randomly. We pick the most central sentence in the bucket, then the sentence furthest from it, then the next sentence furthest from those two, and so on. This pulls in the edges, not just the middle. Names made from those sentences come out specific. The page says "refund disputes," not "customer inquiries about financial transactions."
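The picking order described above is farthest-point sampling. A sketch on toy fingerprints, where the function name and vectors are illustrative:

```python
import numpy as np

def pick_representatives(embeddings, k):
    """Start from the most central sentence, then repeatedly add the
    sentence farthest from everything picked so far. Pulls in the
    edges of the bucket, not just its middle."""
    center = embeddings.mean(axis=0)
    picked = [int(np.argmin(np.linalg.norm(embeddings - center, axis=1)))]
    while len(picked) < k:
        # Each sentence's distance to its nearest already-picked sentence.
        dists = np.min(
            [np.linalg.norm(embeddings - embeddings[p], axis=1) for p in picked],
            axis=0)
        picked.append(int(np.argmax(dists)))
    return picked

bucket = np.array([[0.0, 0.0], [0.1, 0.1], [-0.1, 0.0],   # the middle
                   [3.0, 0.0], [0.0, 3.0]])               # the edges
reps = pick_representatives(bucket, k=3)
# One sentence from the middle, then both edges.
```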
Send new chats to the right folder, right now
The big folder-building job runs once a week. New chats don't wait for it. Every new chat needs to land somewhere within seconds.
So when one shows up, we ask: which existing bucket does this look most like? If there is a clear winner, the chat goes there. If the answer is ambiguous, the chat sits in a holding area instead. Skipping is allowed. Guessing is not. The whole decision takes seven thousandths of a second.
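That decision rule — clear winner or no answer at all — might look like the sketch below. The similarity floor and the margin are illustrative numbers, not the tuned dials the system actually uses:

```python
import numpy as np

def route(fingerprint, centers, min_sim=0.6, min_margin=0.15):
    """Return the index of the winning bucket, or None to hold.
    Skipping is allowed. Guessing is not."""
    sims = centers @ fingerprint / (
        np.linalg.norm(centers, axis=1) * np.linalg.norm(fingerprint))
    order = np.argsort(sims)[::-1]
    best, runner_up = sims[order[0]], sims[order[1]]
    if best < min_sim or best - runner_up < min_margin:
        return None                      # ambiguous: holding area
    return int(order[0])

centers = np.array([[1.0, 0.0], [0.0, 1.0]])   # two bucket centers
clear = route(np.array([0.95, 0.05]), centers)  # clear winner: bucket 0
murky = route(np.array([0.70, 0.72]), centers)  # too close to call: held
```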
Watch the holding area for new behavior
The holding area is not a graveyard. Every hour, we look at it and ask whether the chats in there are starting to resemble each other. If they are, a new bucket forms there on its own.
This is where new behavior shows up first. Somebody shipped a feature on Monday, and by Wednesday people are using it in a way nobody planned for. That appears in the holding area before it appears anywhere else, because, by definition, new behavior doesn't fit any existing bucket.
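The hourly scan can be sketched as: group the held fingerprints by proximity, promote any group big enough to be a bucket, leave the rest waiting. Radius and minimum size here are made up:

```python
import numpy as np

def promote_new_buckets(held, radius=0.5, min_size=3):
    """Group held fingerprints that resemble each other; groups of at
    least `min_size` become new buckets, the rest keep waiting."""
    groups = []
    for p in held:
        for g in groups:
            if np.linalg.norm(p - np.mean(g, axis=0)) <= radius:
                g.append(p)
                break
        else:
            groups.append([p])
    new_buckets = [np.array(g) for g in groups if len(g) >= min_size]
    still_held = [p for g in groups if len(g) < min_size for p in g]
    return new_buckets, still_held

held = [np.array([5.0, 5.0]), np.array([5.1, 5.0]), np.array([4.9, 5.1]),
        np.array([0.0, 0.0])]            # three look alike, one doesn't
new_buckets, still_held = promote_new_buckets(held)
# One new bucket of three chats forms on its own; one chat keeps waiting.
```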
Compare last week's folders to this week's
Every week, we redo the whole folder structure from scratch. Then we line last week's folders up next to this week's and ask: what carried over, what is new, what disappeared, what split, what merged?
When two confusion buckets merge into one smaller bucket the week after a redesign, the redesign worked. When a refund bucket splits in two the week after a pricing change, the pricing change broke something specific, and we can read which thing. This is a different kind of signal from a number going up or down.
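The week-over-week questions — carried over, new, disappeared, split, merged — fall out of comparing which chats each bucket shares with last week's buckets. A sketch on buckets as sets of chat IDs, with an illustrative overlap cutoff:

```python
def diff_weeks(last, this, min_overlap=0.3):
    """Match each old bucket to new buckets by shared members.
    0 matches: disappeared. 1: carried over. 2+: split.
    A new bucket claimed by 2+ old ones is a merge; unclaimed is new."""
    def overlap(a, b):
        return len(a & b) / len(a | b)
    matches = {i: [j for j, nb in enumerate(this) if overlap(ob, nb) >= min_overlap]
               for i, ob in enumerate(last)}
    claimed = [j for js in matches.values() for j in js]
    report = {i: ('disappeared' if not js else
                  'carried over' if len(js) == 1 else 'split')
              for i, js in matches.items()}
    merged = [j for j in range(len(this)) if claimed.count(j) > 1]
    brand_new = [j for j in range(len(this)) if j not in claimed]
    return report, merged, brand_new

last_week = [{1, 2, 3, 4}, {5, 6, 7, 8}]
this_week = [{1, 2}, {3, 4}, {9, 10, 11}]   # first bucket split, second gone
report, merged, brand_new = diff_weeks(last_week, this_week)
# report → {0: 'split', 1: 'disappeared'}; brand_new → [2]
```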
Every dial tunes itself
Every threshold in this system was once picked by a human. Where to cut a chat. When to call a routing decision ambiguous. When to declare two buckets the same thing. None of those numbers ever survived contact with a fintech support bot and a developer tools agent in the same week.
So we stopped picking them. Every dial gets re-tuned automatically against a set of chats a human has hand-graded. The grading set keeps growing. The dials drift on their own. The only thing we still pick by hand is the question being asked. The numbers are downstream of that.
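The re-tuning itself is simple: for each dial, score candidate values against the human-graded set and keep the winner. The dial, the decision function, and the grades below are stand-ins:

```python
def tune_dial(candidates, graded, decide):
    """`graded` is a list of (input, correct_answer) pairs; keep the
    candidate value whose decisions agree with the humans most often."""
    def accuracy(value):
        return sum(decide(x, value) == truth for x, truth in graded) / len(graded)
    return max(candidates, key=accuracy)

# Toy dial: below the cutoff a routing score is "ambiguous" (None).
decide = lambda score, cutoff: 'route' if score >= cutoff else None
graded = [(0.9, 'route'), (0.8, 'route'), (0.4, None), (0.3, None)]
cutoff = tune_dial([0.2, 0.5, 0.7], graded, decide)
# cutoff → 0.5 (full agreement with the human grades)
```

As the graded set grows, re-running this is what lets the dials drift on their own.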
Putting it together
Cut conversations at the seams. Boil pieces into sentences. Turn sentences into fingerprints. Stack the fingerprints into folders. Name the folders. Route new chats into them in real time. Watch the holding area for new behavior. Compare folders week over week. Tune every dial against human-graded examples.
The other post tells you the names of the algorithms we use. None of them are new. The work was getting them to cooperate at this scale. This page is what the cooperation looks like.