Talk:Multi-armed bandit/Archives/2013


Epsilon greedy

The description of the ε-greedy strategy has been removed. Is it because it is expected that all strategies (there are a lot of them for the bandit problem) must be completed at once (or completed here before committing into the article)? --Vermorel 15:39, 8 October 2005 (UTC)
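
For reference, since the description was dropped from the article, here is a minimal sketch of the ε-greedy rule in Python. The Bernoulli arms and the value ε = 0.1 below are illustrative assumptions, not taken from the article.

    import random

    def epsilon_greedy(pull, n_arms, n_rounds, epsilon=0.1):
        """With probability epsilon pull a random arm (explore); otherwise
        pull the arm with the best empirical mean so far (exploit)."""
        counts = [0] * n_arms      # pulls per arm
        means = [0.0] * n_arms     # empirical mean reward per arm
        for _ in range(n_rounds):
            if random.random() < epsilon:
                arm = random.randrange(n_arms)                    # explore
            else:
                arm = max(range(n_arms), key=lambda i: means[i])  # exploit
            reward = pull(arm)
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]     # running mean
        return means, counts

    # Illustrative usage: a two-armed Bernoulli bandit with success rates 0.3 and 0.7.
    means, counts = epsilon_greedy(lambda a: float(random.random() < (0.3, 0.7)[a]), 2, 1000)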

Gittins Index

I am told that a useful tool for Multi-armed Bandit Problems is the Gittins Index (whatever that is). Perhaps someone could add a definition and explanation. Encyclops 22:03, 27 January 2007 (UTC)

  • This paper should be a good reference: J.C. Gittins. A dynamic allocation index for the discounted multi-armed bandit problem. Biometrika (1979), pp. 580-597 Sancho (talk) 03:55, 14 March 2007 (UTC)
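
For readers landing here: the standard textbook definition (the notation r_i(t) for the reward stream of arm i and the discount factor 0 < β < 1 are the usual conventions, not something taken from this article) is

    \nu_i \;=\; \sup_{\tau \ge 1} \frac{\mathbb{E}\left[ \sum_{t=0}^{\tau-1} \beta^t \, r_i(t) \right]}{\mathbb{E}\left[ \sum_{t=0}^{\tau-1} \beta^t \right]},

where the supremum is taken over stopping times τ ≥ 1. Gittins' index theorem states that the discounted multi-armed bandit is solved optimally by always pulling an arm with the largest index.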

Strategies to be completed

There's no point in having "to be completed" work-in-progress notices for almost two years. I removed these:

to be completed: Probability matching strategies (softmax, gaussmatch, Exp3), Pricing strategies (interval estimation, poker)
Agentbla (talk) 22:45, 4 September 2007 (UTC)

Examples

This page could use some examples. Remember, not everyone who happens upon it is an expert who already knows what it is, or even what all the terms mean. —Preceding unsigned comment added by 96.242.191.15 (talk) 06:00, 1 November 2008 (UTC)

Real life examples would include:

  • an optimal strategy to make passengers give up their seats in an overbooking situation at minimal cost.
  • website optimization à la A/B testing, but with a large number of variants available (see the sketch after this comment).
  • advertising display optimization for a given identified user on a social website à la Facebook.

That said, the bandit model typically has to be enriched to fit real-life situations (the number of arms might be infinite, there might be relationships between arms, etc.). --Joannes Vermorel (talk) 14:57, 6 January 2010 (UTC)
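
A minimal sketch of the website-optimization example above, in Python. It uses Beta-Bernoulli Thompson sampling, which is one common choice but is not mentioned in this discussion, and the number of variants, visitor count and conversion rates are made up for illustration.

    import random

    def thompson_pick(successes, failures):
        """Sample a Beta(1+s, 1+f) conversion rate for each variant and
        pick the variant with the largest sample (Thompson sampling)."""
        samples = [random.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        return max(range(len(samples)), key=lambda i: samples[i])

    # Illustrative simulation: 20 page variants with hidden conversion rates.
    rates = [random.uniform(0.01, 0.10) for _ in range(20)]
    succ, fail = [0] * 20, [0] * 20
    for _ in range(10000):                 # 10,000 simulated visitors
        v = thompson_pick(succ, fail)      # which variant to show
        if random.random() < rates[v]:     # did the visitor convert?
            succ[v] += 1
        else:
            fail[v] += 1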

Negative expected value case

It's been a while, but the way I remember it is that in the case where the expected payoff of the gambles is negative, the optimal strategy is any strategy which runs down the income of the player logarithmically - i.e. going to infinity as slowly as possible (under some assumptions). Is this correct, or is this just true for the two-armed bandit version? Does it generalize?

(talk) 06:34, 19 May 2009 (UTC)

It does not matter whether the average payoff is positive or negative; shifting all rewards by a constant value (up or down) should not even alter the behavior of most bandit strategies. Also, in the usual bandit model (with 2 arms or more), the player has no control over the pull frequency. --Joannes Vermorel (talk) 13:44, 29 May 2009 (UTC)
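
A small sanity check of the point about constant shifts, assuming an ε-greedy player and a two-armed Gaussian bandit with negative means (all of these specifics are made up for illustration): adding the same constant to every reward adds that constant to every empirical mean, so the arm comparisons, and hence the sequence of pulls, are unchanged.

    import random

    def pulls(shift, seed=0, n_rounds=1000, epsilon=0.1):
        """Epsilon-greedy on a 2-armed Gaussian bandit whose rewards are all
        shifted by a constant; returns the sequence of arms pulled."""
        rng = random.Random(seed)
        # Pull each arm once so every empirical mean comes from observed data.
        means = [rng.gauss(m, 1.0) + shift for m in (-5.0, -4.0)]
        counts = [1, 1]
        choices = []
        for _ in range(n_rounds):
            if rng.random() < epsilon:
                arm = rng.randrange(2)
            else:
                arm = max(range(2), key=lambda i: means[i])
            reward = rng.gauss((-5.0, -4.0)[arm], 1.0) + shift
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]
            choices.append(arm)
        return choices

    # Purely negative payoffs vs. the same payoffs shifted up by 100: same behavior.
    assert pulls(shift=0.0) == pulls(shift=100.0)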

Isomorphism with Iterative Prisoner Dilemma Contest?

Presumably, the Multi-armed Bandit Problem can be mapped isomorphically onto a contest of algorithms competing in an iterated Prisoner's Dilemma contest à la Robert Axelrod. However, I am unfamiliar with any attempt to apply the results harvested from one problem to the other. Any comments?

--Philopedia (talk) 14:27, 11 December 2009 (UTC)

Not really; there is no adversarial behavior in the multi-armed bandit. The bandit is just the simplest expression of a system that is learned by trial and error. The system itself is Markovian: it does not keep memory of what happened before, contrary to the prisoner's dilemma. --Joannes Vermorel (talk) 14:51, 6 January 2010 (UTC)

Attention needed

  • Clarification - the article is perhaps a little too scholarly even for a reader with a mid-level education. Consider adding more explanation or examples to make it more easily understood by a layperson.

Chaosdruid (talk) 15:50, 7 August 2010 (UTC)

Missing algorithms

The article is badly missing a discussion of algorithms that implement the "optimism in the face of uncertainty" principle. These algorithms are optimal in many respects, unlike the other algorithms listed in the article (and they do work better than the algorithms listed in the article). A basic reference is: Peter Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002, which, based on previous work by Robbins and Monro and others, introduced the UCB1 algorithm among others. The machine learning community is very active in investigating various extensions of the basic model, such as when the number of actions is very large or infinite, or even adversarial bandits. Szepi (talk) 07:28, 17 March 2011 (UTC)
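
Since this thread cites the Auer, Cesa-Bianchi and Fischer paper, here is a minimal sketch in Python of the UCB1 rule it introduces; the Bernoulli test bandit at the end is an illustrative assumption. Each arm is pulled once, and afterwards the arm maximizing (empirical mean) + sqrt(2 ln n / n_i) is pulled, where n is the total number of pulls so far and n_i the number of pulls of arm i.

    import math
    import random

    def ucb1(pull, n_arms, n_rounds):
        """UCB1: pull each arm once, then always pull the arm maximizing
        empirical mean + sqrt(2 * ln(total pulls) / pulls of that arm)."""
        means = [pull(i) for i in range(n_arms)]   # one initial pull per arm
        counts = [1] * n_arms
        for t in range(n_arms, n_rounds):
            arm = max(range(n_arms),
                      key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
            reward = pull(arm)
            counts[arm] += 1
            means[arm] += (reward - means[arm]) / counts[arm]
        return means, counts

    # Illustrative usage: a two-armed Bernoulli bandit with success rates 0.4 and 0.6.
    means, counts = ucb1(lambda a: float(random.random() < (0.4, 0.6)[a]), 2, 5000)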