Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/172750
DC Field: Value
dc.title: Hedging the Drift: Learning to Optimize under Non-Stationarity
dc.contributor.author: CHEUNG WANG CHI
dc.contributor.author: Simchi-Levi, David
dc.contributor.author: Zhu, Ruihao
dc.date.accessioned: 2020-08-16T02:51:38Z
dc.date.available: 2020-08-16T02:51:38Z
dc.date.issued: 2020-08-06
dc.identifier.citation: CHEUNG WANG CHI, Simchi-Levi, David, Zhu, Ruihao (2020-08-06). Hedging the Drift: Learning to Optimize under Non-Stationarity. MANAGEMENT SCIENCE. ScholarBank@NUS Repository.
dc.identifier.issn: 0025-1909
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/172750
dc.description.abstract: We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for a collection of non-stationary stochastic bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and traffic network routing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) non-stationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Beginning with the linear bandit setting, we design and analyze a sliding-window upper confidence bound algorithm that achieves the optimal dynamic regret bound when the underlying variation budget is known. This budget quantifies the total amount of temporal variation of the latent environments. Boosted by the novel Bandit-over-Bandit framework that adapts to the latent changes, our algorithm can further enjoy nearly optimal dynamic regret bounds in a (surprisingly) parameter-free manner. We extend our results to other related bandit problems, namely the multi-armed bandit, generalized linear bandit, and combinatorial semi-bandit settings, which model a variety of operations research applications. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the "forgetting principle" in the learning processes, which is vital in changing environments. Extensive numerical experiments with synthetic datasets and a dataset of an online auto-loan company during the severe acute respiratory syndrome (SARS) epidemic period demonstrate that our proposed algorithms achieve superior performance compared to existing algorithms.
dc.publisher: Institute for Operations Research and the Management Sciences
dc.source: Elements
dc.subject: data-driven decision-making
dc.subject: non-stationary bandit optimization
dc.subject: parameter-free algorithm
dc.type: Article
dc.date.updated: 2020-08-13T15:51:42Z
dc.contributor.department: INDUSTRIAL SYSTEMS ENGINEERING AND MANAGEMENT
dc.description.sourcetitle: MANAGEMENT SCIENCE
dc.description.place: United States
dc.published.state: Published
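The sliding-window "forgetting principle" described in the abstract can be sketched in a few lines. The following is an illustrative sketch only, not the authors' implementation: it assumes a linear bandit with feature vectors `x`, observed rewards `r`, a ridge parameter `lam`, and a confidence width `beta`, and it restricts estimation to the most recent `window` observations so that stale data from earlier environments is forgotten.

```python
import numpy as np

def sw_ucb_select(history, arms, window=100, lam=1.0, beta=1.0):
    """Pick an arm via a sliding-window UCB rule (illustrative sketch).

    history: list of (feature_vector, reward) pairs, oldest first.
    arms:    array of candidate feature vectors, shape (K, d).
    Only the most recent `window` observations are used (forgetting).
    """
    d = arms.shape[1]
    recent = history[-window:]            # discard observations outside the window
    V = lam * np.eye(d)                   # regularized Gram matrix
    b = np.zeros(d)
    for x, r in recent:
        V += np.outer(x, x)
        b += r * x
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                 # windowed ridge estimate of theta
    # UCB score: estimated reward plus exploration bonus ||x||_{V^{-1}}
    bonus = np.sqrt(np.einsum('kd,dc,kc->k', arms, V_inv, arms))
    return int(np.argmax(arms @ theta_hat + beta * bonus))
```

With no history the rule reduces to pure exploration (the bonus term alone); as windowed data accumulates, the ridge estimate dominates, and shrinking `window` trades statistical efficiency for faster adaptation to drift.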
Appears in Collections: Staff Publications, Elements

Files in This Item:
File: non-stationary bandit optimization.pdf
Description: Accepted version
Size: 944.23 kB
Format: Adobe PDF
Access Settings: OPEN
Version: Post-print

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.