Randomized weighted majority algorithm
The randomized weighted majority algorithm (RWMA) is an algorithm in machine learning theory.[1] It is useful in scenarios where a learner faces a sequence of trials, must make a prediction at every step, and has reason to believe that one of a pool of known algorithms will perform well, without knowing which one. It is a simple and effective method based on weighted voting that improves on the mistake bound of the deterministic weighted majority algorithm.
Example
Imagine that every morning before the stock market opens, we get a prediction from a series of "experts" about whether the stock market will go up or down during the coming day. Our goal is to aggregate the resulting set of predictions into a single prediction that can then be used to make a buy or sell decision for the day. The RWMA is a method for this aggregation such that, over time, the resulting record of predictions will be nearly as good as that of whichever expert, in hindsight, gave the most accurate predictions.
Motivation
In machine learning, the weighted majority algorithm (WMA) is a deterministic meta-learning algorithm for aggregating expert predictions. In pseudocode, the WMA is as follows:
initialize all experts to weight 1.
for each round:
    poll all the experts and predict based on a weighted majority vote of their predictions.
    cut in half the weights of all experts that make a mistake.
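As an illustration, a minimal Python sketch of the deterministic WMA might look as follows; the encoding of predictions and outcomes as 0/1 values and the function name are choices made for this example, and the halving rule follows the pseudocode above.

```python
def weighted_majority(expert_predictions, outcomes):
    """Deterministic weighted majority over binary (0/1) expert predictions.

    expert_predictions: list of rounds, each a list with one 0/1 prediction per expert.
    outcomes: list of the true 0/1 outcome for each round.
    Returns the total number of mistakes made by the aggregate predictor.
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n          # initialize all experts to weight 1
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        # weighted vote: compare the total weight on "1" against half of all weight
        weight_on_one = sum(w for w, p in zip(weights, preds) if p == 1)
        guess = 1 if weight_on_one >= sum(weights) / 2 else 0
        if guess != outcome:
            mistakes += 1
        # cut in half the weights of all experts that made a mistake
        weights = [w / 2 if p != outcome else w for w, p in zip(weights, preds)]
    return mistakes
```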
Suppose there are $n$ experts and the best expert makes $m$ mistakes. The weighted majority algorithm (WMA) makes at most $2.4(\log_2 n + m)$ mistakes. This bound is highly problematic for cases with highly error-prone experts. Suppose, for example, the best expert makes a mistake 20% of the time; that is, in $N = 100$ rounds using $n = 10$ experts, the best expert makes $m = 20$ mistakes. Then, the deterministic weighted majority algorithm only guarantees an upper bound of $2.4(20 + \log_2 10) \approx 56$ mistakes.
This is not a very good bound, and can be improved by introducing randomization.
Randomized weighted majority algorithm (RWMA)
The randomized weighted majority algorithm is an attempt to improve the dependence of the mistake bound of the WMA on the number of mistakes $m$ made by the best expert. Instead of predicting based on a majority vote, the weights are used as probabilities (hence the name randomized weighted majority).
Precisely, if $w_i$ is the weight of expert $i$, let $W = \sum_i w_i$. Then, the algorithm follows the prediction of expert $i$ with probability $\frac{w_i}{W}$.
As before, the goal is to bound the worst-case expected number of mistakes, assuming that the adversary (the world) must select an answer as correct before we randomly choose which expert to follow. This randomization helps because the worst case for the deterministic weighted majority algorithm occurs when the weights are split nearly 50/50, in which case it can be wrong on every round, whereas the randomized weighted majority algorithm still has roughly a 50/50 chance of getting the answer right in such cases.
Note: to trade off the dependence of the mistake bound on $m$ against its dependence on $\ln n$, we will consider an arbitrary constant $\beta \in (0, 1)$ by which to multiply the weight of an expert after a missed guess (instead of necessarily using $\tfrac{1}{2}$ as in the original description of the WMA).
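Combining the probabilistic prediction rule with the generalized penalty factor $\beta$ from the note above, a minimal Python sketch could look like the following; the 0/1 encoding and default $\beta = \tfrac{1}{2}$ are illustrative choices, not part of the algorithm's definition.

```python
import random

def randomized_weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Randomized weighted majority: follow expert i with probability w_i / W.

    beta is the multiplicative penalty applied to the weight of each expert
    that makes a mistake (beta = 1/2 recovers the halving rule of the WMA).
    Returns the number of mistakes made over the given rounds (a random quantity).
    """
    n = len(expert_predictions[0])
    weights = [1.0] * n
    mistakes = 0
    for preds, outcome in zip(expert_predictions, outcomes):
        # pick a single expert to follow, with probability proportional to its weight
        chosen = random.choices(range(n), weights=weights, k=1)[0]
        if preds[chosen] != outcome:
            mistakes += 1
        # multiply the weight of every mistaken expert by beta
        weights = [w * beta if p != outcome else w for w, p in zip(weights, preds)]
    return mistakes
```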
Analysis
Let $W_t$ denote the total weight and let $F_t$ denote the fraction of that weight placed on the wrong answer at round $t$. By definition, $F_t$ is the probability that the algorithm makes a mistake on round $t$. It follows, then, from the linearity of expectation, that if $M$ denotes the total number of mistakes made during the entire process, $\mathbb{E}[M] = \sum_t F_t$.
Now, notice that after round $t$, the total weight is decreased by $(1-\beta)F_t W_t$, since all weights corresponding to the wrong answer are multiplied by $\beta$. It then follows that $W_{t+1} = W_t\bigl(1 - (1-\beta)F_t\bigr)$. By telescoping, since the initial total weight is $W_1 = n$, it follows that the total weight after the process concludes (say, after $T$ rounds) is
$$W_{T+1} = n \prod_{t=1}^{T} \bigl(1 - (1-\beta)F_t\bigr).$$
On the other hand, suppose that $m$ is the number of mistakes made by the best-performing expert. At the end, this expert has weight $\beta^m$. It follows, then, that the total weight is at least this much; in other words, $W_{T+1} \ge \beta^m$. This inequality and the above result imply
$$\beta^m \le n \prod_{t=1}^{T} \bigl(1 - (1-\beta)F_t\bigr).$$
Taking the natural logarithm of both sides yields
$$m \ln \beta \le \ln n + \sum_{t=1}^{T} \ln\bigl(1 - (1-\beta)F_t\bigr).$$
Now, the Taylor series of the natural logarithm is
$$\ln(1 - x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - \cdots$$
In particular, since every term of the series is non-positive, it follows that $\ln\bigl(1 - (1-\beta)F_t\bigr) \le -(1-\beta)F_t$.
Thus,
$$m \ln \beta \le \ln n - (1-\beta)\sum_{t=1}^{T} F_t.$$
Recalling that $\mathbb{E}[M] = \sum_t F_t$ and rearranging, it follows that
$$\mathbb{E}[M] \le \frac{m \ln(1/\beta) + \ln n}{1 - \beta}.$$
Now, as $\beta \to 1$ from below, the first coefficient $\frac{\ln(1/\beta)}{1-\beta}$ tends to $1$; however, the second coefficient $\frac{1}{1-\beta}$ tends to $+\infty$. To quantify this trade-off, define $\varepsilon = 1 - \beta$ to be the penalty associated with getting a prediction wrong. Then, again applying the Taylor series of the natural logarithm,
$$\frac{\ln(1/\beta)}{1 - \beta} = \frac{-\ln(1 - \varepsilon)}{\varepsilon} = 1 + \frac{\varepsilon}{2} + \frac{\varepsilon^2}{3} + \cdots$$
It then follows that the mistake bound, for small $\varepsilon$, can be written in the form
$$\Bigl(1 + \frac{\varepsilon}{2}\Bigr) m + \frac{\ln n}{\varepsilon}.$$
In English, the less we penalize experts for their mistakes, the more the number of experts contributes to the initial mistakes (through the $\frac{\ln n}{\varepsilon}$ term), but the closer we get to capturing the predictive accuracy of the best expert as time goes on. In particular, given a sufficiently low value of $\varepsilon$ and enough rounds, the randomized weighted majority algorithm can get arbitrarily close to the correct prediction rate of the best expert.
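One way to see this behavior concretely is a small synthetic experiment along the following lines, reusing the randomized_weighted_majority sketch from above; the expert error rates, number of rounds, and choice of $\beta$ here are arbitrary values picked purely for illustration.

```python
import random

def simulate(num_rounds=10_000, error_rates=(0.2, 0.3, 0.4, 0.45, 0.5), beta=0.95, seed=0):
    """Synthetic experiment: each expert errs independently at a fixed rate.

    Compares the mistakes of the randomized weighted majority sketch above
    with the mistakes of the best expert in hindsight.
    """
    rng = random.Random(seed)
    outcomes = [rng.randint(0, 1) for _ in range(num_rounds)]
    # expert j predicts the true outcome except with probability error_rates[j]
    expert_predictions = [
        [o if rng.random() > r else 1 - o for r in error_rates]
        for o in outcomes
    ]
    best_expert_mistakes = min(
        sum(preds[j] != o for preds, o in zip(expert_predictions, outcomes))
        for j in range(len(error_rates))
    )
    rwma_mistakes = randomized_weighted_majority(expert_predictions, outcomes, beta=beta)
    print(f"best expert: {best_expert_mistakes}, RWMA: {rwma_mistakes}")
```

With a $\beta$ close to 1 and enough rounds, the mistake count of the randomized aggregator should land close to that of the best expert, as the bound above suggests.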
Revisiting motivation
Recall that the motivation for the randomized weighted majority algorithm was given by an example where the best expert makes a mistake 20% of the time. Precisely, in $N = 100$ rounds, with $n = 10$ experts, where the best expert makes $m = 20$ mistakes, the deterministic weighted majority algorithm only guarantees an upper bound of $2.4(20 + \log_2 10) \approx 56$ mistakes. By the analysis above, it follows that minimizing the worst-case number of expected mistakes is equivalent to minimizing the function
$$f(\beta) = \frac{20 \ln(1/\beta) + \ln 10}{1 - \beta}$$
over $\beta \in (0, 1)$.
Computational methods show that the optimal value is roughly $\beta \approx 0.641$, which results in a minimal worst-case number of expected mistakes of about $31.19$. When the number of rounds is increased while the accuracy rate of the best expert is kept the same, the improvement can be even more dramatic; the weighted majority algorithm guarantees only a worst-case mistake rate of 48.0%, but the randomized weighted majority algorithm, when properly tuned to the optimal value of $\beta$, achieves a worst-case mistake rate of 20.2%.
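The optimization referred to above can be reproduced in a few lines of Python; the grid search below is a simple stand-in for whatever computational method one prefers, and the constants $m = 20$ and $n = 10$ are those of the running example.

```python
import math

def expected_mistake_bound(beta, m=20, n=10):
    """Worst-case bound on the expected number of mistakes: (m ln(1/beta) + ln n) / (1 - beta)."""
    return (m * math.log(1 / beta) + math.log(n)) / (1 - beta)

# simple grid search over beta in (0, 1)
betas = [i / 10000 for i in range(1, 10000)]
best_beta = min(betas, key=expected_mistake_bound)
print(best_beta, expected_mistake_bound(best_beta))  # roughly 0.641 and 31.19
```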
Uses of the randomized weighted majority algorithm
Besides directly aggregating expert predictions, the randomized weighted majority algorithm can also be used to aggregate the predictions of other prediction aggregation algorithms. In this case, the analysis above demonstrates that the RWMA can be expected to perform nearly as well as the best of the original algorithms in hindsight.
The RWMA is also particularly useful for situations where experts are making choices that cannot be combined. Consider, for example, the online shortest path problem. In this problem, a series of experts are providing directions about how to drive to a destination. Using some prediction aggregation algorithm, you choose a path without global knowledge of the graph. Afterwards, you consider how you would have done following the best expert's suggested path, and penalize accordingly. The goal is to minimize this penalty, or, in other words, to minimize the length of the final path relative to the shortest proposed path. The randomized weighted majority algorithm can be applied to this problem to minimize this penalty.
Extensions
- Multi-armed bandit problem.
- Efficient algorithm for some cases with many experts.
- Sleeping experts/"specialists" setting.
References
- ^ Littlestone, N.; Warmuth, M. (1994). "The Weighted Majority Algorithm". Information and Computation. 108 (2): 212–261. doi:10.1006/inco.1994.1009.
Further reading
- Weighted Majority & Randomized Weighted Majority
- Avrim Blum (2004). Machine learning theory.
- Rob Schapire (2006). Foundations of Machine Learning.
- Predicting from Expert Advice.
- Uri Feige, Robi Krauthgamer, Moni Naor. Algorithmic Game Theory.
- Nika Haghtalab (2020). Theoretical Foundations of Machine Learning (notes).