"Boosting Simple Learners Simply"
Russell Impagliazzo (UCSD)
Monday, May 3rd 2021, 2-3pm
Boosting is a general technique in machine learning that combines weakly correlated predictions for an unknown function on different sub-distributions into a strong predictor that is correct almost everywhere (Schapire introduced boosting; see the textbook by Freund and Schapire). Boosting is important both in the theory of PAC-learning and in many applications of machine learning. There is also a connection to average-case computational complexity and derandomization via the Hardcore Distribution Lemma of Impagliazzo and Holenstein. This connection was made precise by Klivans and Servedio, who defined a sub-class of boosting algorithms, called smoothed boosting algorithms, that were implicit in the hardcore lemma and that also have other features useful for robust machine learning.
One important efficiency measure for a boosting algorithm is its round complexity: the number of times the weak learner must be called in order to get a strong predictor. It is well known that, for general weak learners, the optimal round complexity is $$\Theta(\log(1/\epsilon)/\gamma^2)$$, where $$\gamma$$ is the correlation guarantee for the weak learner and $$\epsilon$$ is the final error. This is achieved by boosting algorithms such as Freund and Schapire's AdaBoost, and for smoothed boosting by Barak, Hardt, and Kale's Bregman projection booster. Because this is ``optimal'' for most settings, while much effort has been devoted to matching this bound for different types of boosting, I don't think anyone tried to beat it until recently.
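To make the round-complexity measure concrete, here is a minimal sketch of AdaBoost with an exhaustive threshold-stump weak learner on a toy 1-D dataset; the dataset, the stump class, and the round cap are illustrative choices of mine, not anything from the talk. Each round is one call to the weak learner, and the loop stops once the weighted-majority vote fits the sample.

```python
import math

# Toy 1-D dataset: no single threshold stump classifies it perfectly,
# but a weighted vote of a few stumps does.
X = [0, 1, 2, 3, 4, 5]
Y = [+1, +1, -1, -1, +1, +1]

def stumps():
    # All threshold classifiers x -> pol * sign(t - x), both polarities.
    for t in range(7):
        for pol in (+1, -1):
            yield lambda x, t=t, pol=pol: pol if x < t else -pol

def weighted_error(h, w):
    return sum(wi for xi, yi, wi in zip(X, Y, w) if h(xi) != yi)

def adaboost(max_rounds=40):
    n = len(X)
    w = [1.0 / n] * n          # distribution over examples
    ensemble = []              # list of (alpha, hypothesis)
    for _ in range(max_rounds):
        # "Weak learner": exhaustive search for the min-error stump.
        h = min(stumps(), key=lambda g: weighted_error(g, w))
        eps = weighted_error(h, w)
        alpha = 0.5 * math.log((1 - eps) / max(eps, 1e-12))
        ensemble.append((alpha, h))
        # Re-weight: up-weight mistakes, down-weight correct points.
        w = [wi * math.exp(-alpha * yi * h(xi))
             for xi, yi, wi in zip(X, Y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
        # Strong predictor: sign of the weighted vote so far.
        F = lambda x: 1 if sum(a * g(x) for a, g in ensemble) >= 0 else -1
        if all(F(xi) == yi for xi, yi in zip(X, Y)):
            break
    return ensemble, F

ensemble, F = adaboost()
print("rounds (weak-learner calls):", len(ensemble))
print("training errors:", sum(F(xi) != yi for xi, yi in zip(X, Y)))
```

On this data the best stump always has weighted error at most $1/3$ (an edge $\gamma \geq 1/6$), so AdaBoost's error bound guarantees zero training error within a handful of rounds, matching the $O(\log(1/\epsilon)/\gamma^2)$ scaling.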
Recently, Alon, Gonen, Hazan and Moran showed that you CAN actually beat this round complexity for special classes of weak learners. Namely, if the weak learner is guaranteed to provide a correlating hypothesis from a class of small VC dimension, there is a boosting algorithm whose round complexity is $\mathrm{polylog}(1/\epsilon, 1/\gamma)/\gamma$, nearly a quadratic improvement over the general lower bound. Their algorithm has polynomial sample complexity that depends on the VC dimension.
In this talk, I will first give a simplified version of their algorithm and its analysis in the filtering boosting setting (as opposed to the batch boosting setting considered in their paper). Then I will present a related smoothed boosting algorithm with round complexity at most $O(\log(1/(\epsilon\gamma))/(\epsilon\gamma))$, which, while not optimal in $\epsilon$, still beats the general lower bound for many choices of the parameters. This leaves open the question of finding optimal smoothed boosting algorithms for small VC dimension weak learners, and perhaps using these to derive hardcore distribution lemmas and regularity lemmas for small VC dimension classes. This is unpublished work in progress.
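To get a feel for how these bounds compare, one can plug in sample parameters; constants are dropped everywhere, and the polylog in the small-VC bound is illustrated as $\log(1/\epsilon)\log(1/\gamma)$, a form I chose for concreteness rather than the exact statement from the paper.

```python
import math

def rounds(eps, gamma):
    """Illustrative round-complexity bounds (constants dropped; the
    polylog term is one plausible instantiation, not the exact one)."""
    return {
        "general":  math.log(1 / eps) / gamma**2,                      # Theta(log(1/eps)/gamma^2)
        "small-VC": math.log(1 / eps) * math.log(1 / gamma) / gamma,   # polylog(1/eps,1/gamma)/gamma
        "smoothed": math.log(1 / (eps * gamma)) / (eps * gamma),       # O(log(1/(eps*gamma))/(eps*gamma))
    }

# The smoothed bound beats the general one when eps is not too small
# relative to gamma, but loses when eps is tiny -- it is not optimal in eps.
for eps, gamma in [(0.1, 0.01), (0.001, 0.1)]:
    b = rounds(eps, gamma)
    print(f"eps={eps}, gamma={gamma}:", {k: round(v) for k, v in b.items()})
```

For instance, at $\epsilon = 0.1$, $\gamma = 0.01$ the smoothed bound is well below the general $\log(1/\epsilon)/\gamma^2$ bound, while at $\epsilon = 0.001$, $\gamma = 0.1$ the $1/\epsilon$ factor dominates and the general bound is smaller.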