"Finding Heavy Hitters from Lossy or Noisy Data"

Russell Impagliazzo

UC San Diego

Monday, August 5st, 2013, 2:00 pm

EBU3B, Room 4217

Abstract:

Motivated by Dvir et al. and Wigderson and Yehudayoff we examine the question of discovering the set of heavy hitters of a distribution on strings (i.e., the set of strings with a certain minimum probability) from lossy or noisy samples. While the previous work concentrated on finding both the set of most probable elements and their probabilities, we consider enumeration, the problem of just finding a list that includes all the most probable elements without associated probabilities. Unlike Wigderson and Yehudayoff, we do not assume the underlying distribution has small support size, and our time bounds are independent of the support size.

For the enumeration problem, we give a polynomial time algorithm for the lossy sample model for any constant erasure probability $\mu < 1$, and a quasi-polynomial algorithm for the noisy sample model for any noise probability $\nu < 1/2$ of flipping bits. We extend the lower bound for the number of samples required for the reconstruction problem from Dvir et. al. to the enumeration problem to show that when $\mu = 1- o(1)$, no polynomial time algorithm exists. We also refine the analysis of the probability estimation algorithm from Dvir et. al. Finally, we give a ``Yao principle'' for estimation, showing that either there is an efficient algorithm of a special form or a pair of input distributions which witnesses a polynomially related lower bound on the number of samples needed.

Joint work with Lucia Batman, Cody Murray, and Ramamohan Paturi