Algorithms, data structures, design and code: Top k frequent queries from n machines

Let's say we have web search queries for a time window logged across n different servers. Design an algorithm to compute the top k most frequently searched queries.

Define a tuple with key and value
Get top k tuples from all the n machines
We would get > k tuples
Merge all the top k results from the n machines and build a frequency map
Find the k-th most frequent element. Let's say that tuple has key K and frequency F_K
Now your global top k list will have the kth key with frequency >= F_K
For any key K with frequency F_K, at least one machine should have frequency >= (F_K / n)
Ask all machines for keys with frequency >= (F_K / n)
Union the set of keys.
The keys for the top k most frequent elements would be within the above set
Query all the machines for keys in the above key set.
Aggregate results and find the keys with the top k frequencies.
The bottleneck in this problem is network - If you send distributed hashtables from n machines into a single machine, it is a lot of network activity and a lot of unnecessary work.

Algorithms, data structures, design and code

Monday, September 26, 2016

Top k frequent queries from n machines

No comments:

Post a Comment

About Me