David Kempe (Theory Seminar S15)

"Incentivizing Exploration"
David Kempe
(University of Southern California)
Monday, April 20th, 2015, 2:00 pm
EBU3B, Room 4258
Abstract:
We study a Bayesian multi-armed bandit (MAB) setting in which a principal seeks to maximize the sum of expected time-discounted rewards obtained by pulling arms, when the arms are actually pulled by selfish and myopic individuals.
Since such individuals pull the arm with highest expected posterior reward (i.e., they always exploit and never explore), the principal must incentivize them to explore by offering suitable payments. Among others, this setting models crowdsourced information discovery and funding agencies incentivizing scientists to perform high-risk, high-reward research.

We explore the tradeoff between the principal's total expected time-discounted incentive payments, and the total time-discounted rewards realized. Specifically, with a time-discount factor gamma in (0,1), let OPT denote the total expected time-discounted reward achievable by a principal who pulls arms directly without having to incentivize selfish agents, in a MAB problem. We call a (payment, reward) combination (b,rho) in [0,1]^2 achievable if for every MAB instance, using expected time-discounted payments of at most b*OPT, the principal can guarantee an expected time-discounted reward of at least rho*OPT. Our main result is a complete characterization of achievable (payment, reward) pairs: (b,rho) is achievable if and only if sqrt(b) + sqrt(1-rho) >= sqrt(gamma).

In proving this characterization, we analyze so-called time-expanded policies, which in each step let the agents choose myopically with some probability p, and incentivize them to choose "optimally" with probability 1-p. The analysis of time-expanded policies leads to a question that may be of independent interest: If the same MAB instance (without selfish agents) is considered under two different time-discount rates gamma, eta, how small can the ratio of OPT(eta) to OPT(gamma) be? We give a complete answer to this question, showing that OPT(eta) >= ((1-gamma)^2/(1-eta)^2) OPT(gamma), and that this bound is
tight.

Joint work with Peter Frazier, Jon Kleinberg and Robert Kleinberg