Please use this identifier to cite or link to this item: https://scholarbank.nus.edu.sg/handle/10635/172750
DC Field: Value
dc.title: Hedging the Drift: Learning to Optimize under Non-Stationarity
dc.contributor.author: CHEUNG WANG CHI
dc.contributor.author: Simchi-Levi, David
dc.contributor.author: Zhu, Ruihao
dc.date.accessioned: 2020-08-16T02:51:38Z
dc.date.available: 2020-08-16T02:51:38Z
dc.date.issued: 2020-08-06
dc.identifier.citation: CHEUNG WANG CHI, Simchi-Levi, David, Zhu, Ruihao (2020-08-06). Hedging the Drift: Learning to Optimize under Non-Stationarity. MANAGEMENT SCIENCE. ScholarBank@NUS Repository.
dc.identifier.issn: 0025-1909
dc.identifier.uri: https://scholarbank.nus.edu.sg/handle/10635/172750
dc.description.abstract: We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for a collection of non-stationary stochastic bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and traffic network routing in changing environments. We show how the difficulty posed by the (unknown a priori and possibly adversarial) non-stationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Beginning with the linear bandit setting, we design and analyze a sliding-window upper confidence bound algorithm that achieves the optimal dynamic regret bound when the underlying variation budget is known. This budget quantifies the total amount of temporal variation of the latent environments. Boosted by the novel Bandit-over-Bandit framework that adapts to the latent changes, our algorithm can further enjoy nearly optimal dynamic regret bounds in a (surprisingly) parameter-free manner. We extend our results to other related bandit problems, namely the multi-armed bandit, generalized linear bandit, and combinatorial semi-bandit settings, which model a variety of operations research applications. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the "forgetting principle" in the learning processes, which is vital in changing environments. Extensive numerical experiments with synthetic datasets and a dataset of an online auto-loan company during the severe acute respiratory syndrome (SARS) epidemic period demonstrate that our proposed algorithms achieve superior performance compared to existing algorithms.
dc.publisher: Institute for Operations Research and the Management Sciences
dc.source: Elements
dc.subject: data-driven decision-making
dc.subject: non-stationary bandit optimization
dc.subject: parameter-free algorithm
dc.type: Article
dc.date.updated: 2020-08-13T15:51:42Z
dc.contributor.department: INDUSTRIAL SYSTEMS ENGINEERING AND MANAGEMENT
dc.description.sourcetitle: MANAGEMENT SCIENCE
dc.description.place: United States
dc.published.state: Published
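The sliding-window "forgetting principle" described in the abstract can be sketched in a few lines. The following is an illustrative sketch only, not the authors' implementation: it assumes a linear bandit with feature vectors `x`, observed rewards `r`, a ridge parameter `lam`, and a confidence width `beta`, and it restricts estimation to the most recent `window` observations so that stale data from earlier environments is forgotten.

```python
import numpy as np

def sw_ucb_select(history, arms, window=100, lam=1.0, beta=1.0):
    """Pick an arm via a sliding-window UCB rule (illustrative sketch).

    history: list of (feature_vector, reward) pairs, oldest first.
    arms:    array of candidate feature vectors, shape (K, d).
    Only the most recent `window` observations are used (forgetting).
    """
    d = arms.shape[1]
    recent = history[-window:]            # discard observations outside the window
    V = lam * np.eye(d)                   # regularized Gram matrix
    b = np.zeros(d)
    for x, r in recent:
        V += np.outer(x, x)
        b += r * x
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                 # windowed ridge estimate of theta
    # UCB score: estimated reward plus exploration bonus ||x||_{V^{-1}}
    bonus = np.sqrt(np.einsum('kd,dc,kc->k', arms, V_inv, arms))
    return int(np.argmax(arms @ theta_hat + beta * bonus))
```

With no history the rule reduces to pure exploration (the bonus term alone); as windowed data accumulates, the ridge estimate dominates, and shrinking `window` trades statistical efficiency for faster adaptation to drift.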
Appears in Collections: Staff Publications, Elements

Files in This Item:
File: non-stationary bandit optimization.pdf
Description: Accepted version
Size: 944.23 kB
Format: Adobe PDF
Access Settings: OPEN
Version: Post-print

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.