We consider the upper confidence bound (UCB) strategy for Gaussian multi-armed
bandits with known control horizon size $N$ and derive its limiting description
as a system of stochastic differential equations and ordinary differential
equations. The rewards of the arms are assumed to have unknown expected values
and known variances. To verify the validity of the obtained description, we
performed a set of Monte Carlo simulations for the case of close reward
distributions, in which the mean rewards differ by a magnitude of order
$N^{-1/2}$, since this case yields the highest normalized regret. We also
estimated the minimal control horizon size for which the normalized regret is
not noticeably larger than its maximum possible value.
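For context, a UCB index of the kind studied here ranks each arm by its empirical mean plus an exploration bonus that shrinks as the arm is sampled. The sketch below is a generic illustration under stated assumptions, not the paper's exact strategy: the bonus form $\sigma_i\sqrt{2\ln t / n_i}$ and the function name `ucb_gaussian` are illustrative choices, and the known variances enter through the per-arm standard deviations `sigmas`.

```python
import math
import random

def ucb_gaussian(means, sigmas, horizon, seed=0):
    """Sketch of a UCB strategy for Gaussian arms with known variances.

    `means` holds the true expected rewards (unknown to the strategy);
    `sigmas` holds the known standard deviations. Returns the cumulative
    (pseudo-)regret accumulated over `horizon` pulls.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k        # number of pulls per arm
    sums = [0.0] * k        # sum of observed rewards per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1     # initialization: play each arm once
        else:
            # UCB index: empirical mean plus an exploration bonus
            # scaled by the arm's known standard deviation
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + sigmas[i] * math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = rng.gauss(means[arm], sigmas[arm])
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret
```

In the "close distributions" regime examined in the paper, the gap between the mean rewards would be taken on the order of $N^{-1/2}$, e.g. `ucb_gaussian([0.0, 1.0 / math.sqrt(1000)], [1.0, 1.0], 1000)`.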
Latest Change: May 14, 2023, 7:32 a.m.