Randomized weighted majority algorithm
The randomized weighted majority algorithm is an algorithm in machine learning theory.[1] It improves the mistake bound of the weighted majority algorithm.
Example
Imagine that every morning before the stock market opens, we get a prediction from each of our "experts" about whether the stock market will go up or down. Our goal is to somehow combine this set of predictions into a single prediction that we then use to make a buy or sell decision for the day. The RWMA gives us a way to do this combination such that our prediction record will be nearly as good as that of the single best expert in hindsight.
Motivation
In machine learning, the weighted majority algorithm (WMA) is a meta-learning algorithm which "predicts from expert advice". It is not a randomized algorithm:
initialize all experts to weight 1
for each round:
add each expert's weight to the weighted sum of the option they predicted
predict the option with the largest weighted sum
multiply the weights of all experts that make a mistake by 1/2
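The pseudocode above can be sketched in Python as follows (a minimal sketch; the input format, with one list of expert predictions per round, is a hypothetical choice for illustration):

```python
def weighted_majority(expert_predictions, true_outcomes, penalty=0.5):
    """Deterministic weighted majority algorithm (WMA).

    expert_predictions: list of rounds, each a list with one prediction
                        per expert.
    true_outcomes: the correct answer for each round.
    Returns the number of mistakes the combined predictor makes.
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n              # initialize all experts to weight 1
    mistakes = 0
    for preds, truth in zip(expert_predictions, true_outcomes):
        # add each expert's weight to the weighted sum of its option
        sums = {}
        for w, p in zip(weights, preds):
            sums[p] = sums.get(p, 0.0) + w
        # predict the option with the largest weighted sum
        guess = max(sums, key=sums.get)
        if guess != truth:
            mistakes += 1
        # multiply the weights of all experts that made a mistake
        weights = [w * penalty if p != truth else w
                   for w, p in zip(weights, preds)]
    return mistakes
```

Note that the combined predictor can err even when it contains a perfect expert, as long as the wrong experts still hold the majority of the weight.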
Suppose there are $n$ experts and the best expert makes $m$ mistakes.
The weighted majority algorithm (WMA) makes at most $2.4(m + \log_2 n)$ mistakes,
which is not a very good bound and is especially problematic for highly error-prone experts (e.g. when even the best expert still makes a mistake 20% of the time).
Suppose we do $N = 100$ rounds using $n = 10$ experts.
If the best expert makes $m = 20$ mistakes,
we can only guarantee an upper bound of $2.4(20 + \log_2 10) \approx 56$
on our number of mistakes.
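A quick numeric check of this guarantee (a sketch, using $n = 10$ experts and $m = 20$ mistakes as illustrative values, with the constant $2.4$ from the WMA bound):

```python
import math

n, m = 10, 20                        # 10 experts; best expert errs 20 times
wma_bound = 2.4 * (m + math.log2(n))
print(f"WMA mistake bound: {wma_bound:.1f}")
```

Over a horizon of 100 rounds, the guarantee therefore allows the combined predictor to be wrong more than half the time even though the best expert is wrong only 20% of the time.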
As this is a known limitation of WMA, attempts to overcome this shortcoming have been explored in order to improve the dependence on $m$. We can do better by introducing randomization.
Randomized weighted majority algorithm (RWMA)
Instead of predicting based on majority vote, the weights are used as probabilities: hence the name randomized weighted majority. If $w_i$ is the weight of expert $i$, let $W = \sum_i w_i$. We will follow expert $i$ with probability $\frac{w_i}{W}$. Here is the algorithm:
initialize all experts to weight 1.
for each round:
add each expert's weight to the weighted sum of the option they predicted
normalize the weighted sums into probabilities of picking each option
make a prediction based on those probabilities
multiply the weights of all experts that make a mistake by β < 1
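A sketch of the randomized version in Python (same hypothetical input format as before; instead of normalizing the option sums explicitly, it picks an expert with probability proportional to its weight and follows that expert's prediction, which induces the same distribution over options):

```python
import random

def randomized_weighted_majority(expert_predictions, true_outcomes,
                                 beta=0.5, rng=None):
    """Randomized weighted majority (RWMA): follow expert i with
    probability w_i / W, then multiply the weights of all experts
    that made a mistake by beta."""
    rng = rng or random.Random()
    n = len(expert_predictions[0])
    weights = [1.0] * n              # initialize all experts to weight 1
    mistakes = 0
    for preds, truth in zip(expert_predictions, true_outcomes):
        # sample an expert with probability proportional to its weight
        total = sum(weights)
        r = rng.random() * total
        acc = 0.0
        for w, p in zip(weights, preds):
            acc += w
            if r <= acc:
                guess = p
                break
        if guess != truth:
            mistakes += 1
        # multiply the weights of all mistaken experts by beta
        weights = [w * beta if p != truth else w
                   for w, p in zip(weights, preds)]
    return mistakes
```

Because consistently wrong experts lose weight geometrically, the probability of following them shrinks quickly, and after a few rounds the algorithm almost always follows the best expert.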
The goal is to bound the worst-case expected number of mistakes, assuming that the adversary (the world) has to select one of the answers as correct before we make our coin toss. Why is this better in the worst case? Idea: the worst case for the deterministic algorithm (weighted majority algorithm) was when the weights split 50/50. But now it is not so bad, since we also have a 50/50 chance of getting it right. Also, to trade off between the dependence on $m$ and on $n$, we will generalize to multiply by $\beta < 1$, instead of necessarily by $\frac{1}{2}$.
Let $W_t$ denote the total weight and $F_t$ denote the fraction of weight on the wrong answers at round $t$. By definition, $F_t$ is the probability that the algorithm makes a mistake on round $t$. It follows, then, from the linearity of expectation, that if $M$ denotes the total number of mistakes made during the entire process, $E[M] = \sum_{t=1}^{T} F_t$.
Now, notice that after round $t$, the total weight is decreased by $(1-\beta)F_t W_t$, since all weights corresponding to a wrong answer are multiplied by $\beta$. It then follows that $W_{t+1} = W_t\bigl(1 - (1-\beta)F_t\bigr)$. By telescoping, since $W_1 = n$, it follows that the total weight after the process concludes is

$W_{T+1} = n \prod_{t=1}^{T} \bigl(1 - (1-\beta)F_t\bigr).$
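This telescoping identity can be checked numerically; the sketch below simulates the weight updates directly on a randomly generated mistake pattern and compares the final total weight against the product formula:

```python
import random

# Sanity check of the telescoping identity
# W_final = n * prod_t (1 - (1 - beta) * F_t)
rng = random.Random(0)
n, T, beta = 5, 30, 0.5
weights = [1.0] * n
product = 1.0
for _ in range(T):
    wrong = [rng.random() < 0.4 for _ in range(n)]   # who erred this round
    W = sum(weights)
    F = sum(w for w, bad in zip(weights, wrong) if bad) / W
    product *= 1 - (1 - beta) * F
    weights = [w * beta if bad else w for w, bad in zip(weights, wrong)]

assert abs(sum(weights) - n * product) < 1e-9
```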
On the other hand, suppose that $m$ is the number of mistakes made by the best-performing expert. At the end, this expert has weight $\beta^m$. It follows, then, that the total weight is at least this much; in other words, $W_{T+1} \geq \beta^m$. This inequality and the above result imply

$\beta^m \leq n \prod_{t=1}^{T} \bigl(1 - (1-\beta)F_t\bigr).$
Taking the natural logarithm of both sides yields

$m \ln \beta \leq \ln n + \sum_{t=1}^{T} \ln\bigl(1 - (1-\beta)F_t\bigr).$
Now, the Taylor series of the natural logarithm is

$\ln(1-x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - \cdots \qquad (0 \leq x < 1).$

In particular, it follows that $\ln\bigl(1 - (1-\beta)F_t\bigr) \leq -(1-\beta)F_t$.
Thus,

$m \ln \beta \leq \ln n - (1-\beta)\sum_{t=1}^{T} F_t.$
Recalling that $E[M] = \sum_{t=1}^{T} F_t$ and rearranging, it follows that

$E[M] \leq \frac{m \ln(1/\beta) + \ln n}{1-\beta}.$
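To see the trade-off concretely, this bound can be tabulated for a few values of $\beta$ (a sketch; $n = 10$ and $m = 20$ are illustrative values):

```python
import math

n, m = 10, 20                       # illustrative: 10 experts, 20 mistakes
bounds = {}
for beta in (0.5, 0.8, 0.95):
    bounds[beta] = (m * math.log(1 / beta) + math.log(n)) / (1 - beta)
    print(f"beta={beta}: E[mistakes] <= {bounds[beta]:.1f}")
```

With these values, $\beta = \frac{1}{2}$ already gives an expected-mistake bound of roughly 32, compared with the deterministic WMA guarantee of about $2.4(20 + \log_2 10) \approx 56$; note, though, that pushing $\beta$ toward 1 eventually hurts, because the $\ln n$ term is divided by $1-\beta$.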
Now, as $\beta \to 1$ from below, the first constant $\frac{\ln(1/\beta)}{1-\beta}$ tends to $1$; however, the second constant $\frac{1}{1-\beta}$ tends to $+\infty$. To quantify this relationship, define $\varepsilon = 1 - \beta$ to be the penalty associated with getting a prediction wrong. Then, again applying the Taylor series of the natural logarithm,

$\ln(1/\beta) = -\ln(1-\varepsilon) = \varepsilon + \frac{\varepsilon^2}{2} + \frac{\varepsilon^3}{3} + \cdots \leq \varepsilon + \varepsilon^2 \qquad \text{for } \varepsilon \leq \tfrac{1}{2}.$
It then follows that the mistake bound, for small $\varepsilon$, can be written in the form $(1+\varepsilon)m + \varepsilon^{-1}\ln n$.
In English: the less we penalize experts for their mistakes, the more the number of experts contributes to our initial mistakes, but the closer we get to capturing the predictive accuracy of the best expert as time goes on. In particular, given a sufficiently low value of $\varepsilon$ and enough rounds, the randomized weighted majority algorithm can get arbitrarily close to the correct prediction rate of the best expert.
Uses of Randomized Weighted Majority Algorithm (RWMA)
The randomized weighted majority algorithm can be used to combine multiple algorithms, in which case RWMA can be expected to perform nearly as well as the best of the original algorithms in hindsight.
Furthermore, one can apply the randomized weighted majority algorithm in situations where experts are making choices that cannot be combined (or can't be combined easily). For example, RWMA can be applied to repeated game-playing or the online shortest path problem. In the online shortest path problem, each expert is telling you a different way to drive to work. You pick one path using RWMA. Later you find out how well you would have done using all of the suggested paths and penalize appropriately. To do this right, we want to generalize from "losses" of 0 or 1 to losses in [0,1]. The goal is to have an expected loss not much larger than the loss of the best expert. We can generalize the RWMA by applying a penalty of $\beta^{\text{loss}}$ (i.e. two losses of one half result in the same weight as one loss of 1 and one loss of 0). The analysis given in the previous section does not change significantly.
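The generalization to losses in $[0,1]$ only changes the weight update, as the sketch below shows (the `losses` table, giving each expert's loss per round, is a hypothetical input format):

```python
import random

def rwma_real_losses(losses, beta=0.5, rng=None):
    """RWMA with losses in [0, 1]: follow expert i with probability
    w_i / W, incur that expert's loss for the round, and multiply
    every weight by beta ** loss."""
    rng = rng or random.Random()
    n = len(losses[0])
    weights = [1.0] * n
    total_loss = 0.0
    for round_losses in losses:
        # sample an expert with probability proportional to its weight
        total = sum(weights)
        r = rng.random() * total
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if r <= acc:
                total_loss += round_losses[i]
                break
        # penalty beta ** loss: two losses of 1/2 shrink a weight
        # exactly as much as one loss of 1 and one loss of 0
        weights = [w * beta ** l for w, l in zip(weights, round_losses)]
    return total_loss
```

Setting every loss to 0 or 1 recovers the update of the previous section, since $\beta^0 = 1$ and $\beta^1 = \beta$.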
Extensions
- Multi-armed bandit problem.
- Efficient algorithm for some cases with many experts.
- Sleeping experts/"specialists" setting.
References
- ^ Littlestone, N.; Warmuth, M. (1994). "The Weighted Majority Algorithm". Information and Computation. 108 (2): 212–261. doi:10.1006/inco.1994.1009.