Monday, September 26, 2016

Top k frequent queries from n machines

Let's say we have web search queries for a time window logged across n different servers. Design an algorithm to compute the top k most frequently searched queries.
  • Define a tuple with key and value
  • Get top k tuples from all the n machines
  • We would get > k tuples
  • Merge all the top k results from the n machines and build a frequency map
  • Find the k-th most frequent element. Let's say that tuple has key K and frequency F_K
  • Now your global top k list will have the kth key with frequency >= F_K
  • For any key K with frequency F_K, at least one machine should have frequency >= (F_K / n)
  • Ask all machines for keys with frequency >= (F_K / n)
  • Union the set of keys.
  • The keys for the top k most frequent elements would be within the above set
  • Query all the machines for keys in the above key set.
  • Aggregate results and find the keys with the top k frequencies.
  • The bottleneck in this problem is network - If you send distributed hashtables from n machines into a single machine, it is a lot of network activity and a lot of unnecessary work.

No comments:

Post a Comment