Let's say we have web search queries for a time window logged across n different servers. Design an algorithm to compute the top k most frequently searched queries.
- Represent each query on a machine as a (key, frequency) tuple
- Get the top k tuples from each of the n machines
- Merged together, this yields up to n*k tuples
- Merge all the top k results from the n machines and build a frequency map
- Find the k-th most frequent element in this merged map; call its frequency F
- Since partial counts can only underestimate global counts, there are at least k keys with global frequency >= F, so every key in the true global top k has global frequency >= F
- For any key with global frequency >= F, at least one machine must hold it with local frequency >= (F / n) — by pigeonhole, if every machine had less, the global total would be below F
- Ask all machines for keys with local frequency >= (F / n)
- Take the union of these key sets
- By the pigeonhole argument above, every key in the global top k is contained in this candidate set
- Query all the machines for the exact counts of the keys in the candidate set.
- Aggregate the results and output the k keys with the highest total frequencies.
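The rounds above can be sketched in Python. This is a minimal single-process simulation, assuming each "machine" is represented by a `collections.Counter` of its local query frequencies; the function name `distributed_top_k` and the in-memory message passing are illustrative stand-ins for real RPCs.

```python
from collections import Counter


def distributed_top_k(machines, k):
    """Simulate the multi-round top-k protocol over n local Counters."""
    n = len(machines)

    # Round 1: each machine reports its local top-k (key, count) tuples;
    # the coordinator merges them into a partial frequency map.
    merged = Counter()
    for counts in machines:
        for key, cnt in counts.most_common(k):
            merged[key] += cnt

    # F = frequency of the k-th most frequent key in the merged map.
    # Every true global top-k key has global frequency >= F.
    F = merged.most_common(k)[-1][1]

    # Round 2: ask every machine for keys with local count >= F / n.
    # By pigeonhole, any key with global count >= F passes this filter
    # on at least one machine, so the candidate set covers the top k.
    candidates = set()
    for counts in machines:
        candidates.update(key for key, cnt in counts.items() if cnt >= F / n)

    # Round 3: query all machines for exact counts of the candidates.
    totals = Counter()
    for counts in machines:
        for key in candidates:
            totals[key] += counts.get(key, 0)

    return totals.most_common(k)
```

Note that only small summaries cross the "network" in each round: n*k tuples, then a threshold, then a candidate key set — never the full per-machine hash tables.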
- The bottleneck in this problem is the network: shipping the full distributed hash tables from n machines to a single machine would generate a lot of traffic and a lot of unnecessary work, which is why the protocol exchanges only small summaries and candidate sets.