How (not) to use SPRT ?

Code, algorithms, languages, construction...
Post Reply
BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

How (not) to use SPRT ?

Post by BB+ » Sat Oct 19, 2013 10:44 pm

I recently had some correspondence with Larry Kaufman regarding SPRT. Essentially, he had a result that was +150 Elo (71%) or so, that took longer than expected to stop with an Elo window of [-1.5,4.5] (as used by Stockfish testing). Since the TalkChess thread is a bit spammed by tangents (not all of them uninteresting), I will try to answer the gravamen of the matter here.

The point seems to be, at least as currently implemented with Stockfish testing, the SPRT test computes the following (given the experimental data, including the draw ratio):
a) Probability that the Elo difference is -1.5
b) Probability that the Elo difference is +4.5
When the ratio of these probabilities is sufficiently large/small, the test terminates.

In Larry's case, both of (a) and (b) are essentially infinitesimal, so it is then a (much) higher order effect that needs to kick in to terminate the test [comparison of tails of size (10±epsilon) sigma or something]. What one might want instead is:
c) Probability that the Elo difference is -1.5 or worse
d) Probability that the Elo difference is +4.5 or better
But I don't think this is what is actually implemented. In the code posted by Lucas Braesch, the comments conflate this point IMO.
#   alpha = max typeI error (reached on elo = elo0) 
#   beta = max typeII error for elo >= elo1 (reached on elo = elo1)
Yet when one looks at what the code does, the comment about "elo>=elo1" does not seem proper. To wit:
# Probability laws under H0 and H1
  P0 = bayeselo_to_proba(elo0, drawelo)
  P1 = bayeselo_to_proba(elo1, drawelo)

  # Log-Likelihood Ratio
  result['llr'] = (R['wins']*math.log(P1['win']/P0['win'])
                   + R['losses']*math.log(P1['loss']/P0['loss'])
                   + R['draws']*math.log(P1['draw']/P0['draw']))
So my understanding is that P0 and P1 are [win,loss,draw] probability vectors corresponding to elo0 (-1.5) and elo1 (4.5), and then the LLR is computed from those two vectors. That is, not including better/worse.
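Filled out as a self-contained sketch (the names follow the quoted fragment; the bayeselo_to_proba formulas are the standard BayesElo ones, and drawelo would in practice be estimated from the observed draw ratio rather than passed in):

```python
import math

def bayeselo_to_proba(elo, drawelo):
    """Win/loss/draw probabilities for a given BayesElo advantage and draw parameter."""
    win = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    return {'win': win, 'loss': loss, 'draw': 1.0 - win - loss}

def llr(R, elo0, elo1, drawelo):
    """Log-likelihood ratio of the observed counts under the two *point*
    hypotheses elo1 (numerator) and elo0 (denominator) -- no 'or better'."""
    P0 = bayeselo_to_proba(elo0, drawelo)
    P1 = bayeselo_to_proba(elo1, drawelo)
    return (R['wins'] * math.log(P1['win'] / P0['win'])
            + R['losses'] * math.log(P1['loss'] / P0['loss'])
            + R['draws'] * math.log(P1['draw'] / P0['draw']))
```

On the data discussed below (149-30-94 with drawelo near 165.7) this gives an LLR near log 20: the ratio is of moderate size even though each of the two point likelihoods is astronomically small.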

One can simulate the "better/worse" conditions by (say) taking the probabilities for 4.5, 4.55, 4.6, 4.65, etc., and combining them appropriately (maybe integrating over an expected patch distribution, similar to a previous discussion), but with the Stockfish testing environment, where most patches seem reasonably close to the "target" window, the nuance between (a) and (b) versus (c) and (d) might not be that great.
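A minimal sketch of that combination (the grid of Elo points and the flat weighting are purely illustrative, not anything the testing framework uses):

```python
import math

def bayeselo_to_proba(elo, drawelo):
    win = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    return win, loss, 1.0 - win - loss

def log_lik(counts, probs):
    """Trinomial log-likelihood of (wins, losses, draws); the multinomial
    coefficient is omitted since it cancels in any ratio."""
    return sum(n * math.log(p) for n, p in zip(counts, probs))

def averaged_lik(counts, elos, drawelo):
    """Log of the likelihood averaged over a grid of Elo values -- a crude
    stand-in for 'elo1 or better' under a flat prior on the grid."""
    logs = [log_lik(counts, bayeselo_to_proba(e, drawelo)) for e in elos]
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in logs) / len(logs))
```

With a one-point grid this reduces to the plain point hypothesis; with a grid extending upward from 4.5 it behaves like the "4.5 or better" condition.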

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: How (not) to use SPRT ?

Post by BB+ » Sun Oct 20, 2013 3:01 am

I guess I can make two more points, one practical (and short), and one theoretical (and long).

The practical point is: in the example at hand, the difference between 250 and (say) 100 games is probably not that important (as has been indicated by others).

The theoretical issue (or perhaps an explanatory musing on why omitting "better/worse" seems decent in practice) is:

*) The current Stockfish testing method, with its "goalposts" of say -1.5 and 4.5 Elo and no "better/worse" adjectives, can be said to correspond (though not directly) to something like evaluating a Bell-ish function B at a point sigma measured in standard deviations [here the relevant choices of sigma correspond to the goalposts].

*) Comparatively, if one were to include the "better/worse" adjectives (I guess assuming a constant prior distribution of submitted patches), then rather than evaluating B(sigma) this would be more like integrating B(t) from sigma to plus/minus infinity. When you integrate in the "proper" direction, B(sigma) is perhaps not all that far from the value of the integral, and so the results are decent. But if you instead integrate B(t) from t=-10 to +infinity, the result doesn't look much like the quite small B(-10); rather it is quite close to 1.

I think the function-evaluation versus value-of-integral nuance is in essence the difference between using the "better/worse" adjectives and not using them, though there could be more mathematical details that I am ignoring for now (and maybe the curve is logistic-ish, not Bell, but all that is minor). If the "goalpost" window is reasonably close to the observed values, then the function-evaluations (which ignore better/worse) are probably close enough to the values of the integrals (which include better/worse) that the resulting method is not much in error.
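A quick numeric illustration, taking B to be a standard normal density (just a stand-in for the "Bell-ish" curve above):

```python
import math

def pdf(t):
    """Standard normal density: the function-evaluation B(t)."""
    return math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)

def upper_tail(t):
    """Integral of the density from t to +infinity, via the complementary
    error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

# Near a sensible goalpost the point value and the tail are the same order:
#   pdf(2) is about 0.054, upper_tail(2) about 0.023.
# Ten sigma out on the wrong side they are nothing alike:
#   pdf(-10) is about 8e-23, while upper_tail(-10) is essentially 1.
```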

To do a more extensive analysis, one would really need to make some assumptions about what the patch distribution looks like (so you would then be integrating B(t)*P(t) where P(t) is the probability of a patch whose Elo corresponds to t deviations, I guess).

So in practice, I have to conclude that with "reasonable" parameter choices for windows, the approximate methods are probably good enough, and that to improve this likely requires much more work. Maybe after you have 1000+ patches tested (like Stockfish has by now), you can then see (or estimate) what the P(t) function looks like in practice, but even then I'm not sure how much it would help.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: How (not) to use SPRT ?

Post by BB+ » Sun Oct 20, 2013 11:06 pm

A final remark: it seems that in the current setting, with -1.5 and 4.5 Elo as goalposts, one is making an inherent assumption that the prior distribution has the same values at these points. That is, a patch worth -1.5 Elo and one worth 4.5 Elo are equally likely to occur. OTOH, even if this is not true, it mainly changes only the long-term behavior of how many patches will be accepted/rejected to the next stage, as the "95%" confidence level on stopping no longer means that 95% of the time an accepted patch would be worth 4.5 Elo (it would be reduced by an amount essentially equal to the ratio of -1.5 patches to 4.5 patches, I think).

lucasart
Posts: 201
Joined: Mon Dec 17, 2012 1:09 pm
Contact:

Re: How (not) to use SPRT ?

Post by lucasart » Mon Oct 21, 2013 12:18 pm

I don't understand what you are trying to say. When it comes to statistics and anything scientific, Larry is the last person you should listen to. I've read so much of his nonsense already on talkchess...

I am the one who wrote that code. I did a lot of numerical simulations, and the results agree perfectly with theoretical results. Theory gives a closed formula for typeI and typeII error as functions of elo. But you need to realize that elo is measured in bayeselo (not in logistic elo) and it is the true value of the parameter (which we do not know).

* alpha is (exactly) the max typeI error of the test; this typeI error is reached in the limit elo → elo0 (from below).
* beta is (exactly) the max typeII error of the test in the zone elo >= elo1; this value is reached in the limit elo → elo1 (from above).

In the range [elo0, elo1], the typeII error ranges from 1-alpha when elo=elo0 down to beta when elo=elo1.

Michel Van den Bergh also derived formulas for the hybrid SPRT test (i.e. SPRT with a cap on the number of games). Again the numerical simulator agrees with his formulas perfectly.
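A Monte Carlo check along those lines is easy to sketch. The hypotheses below (bayeselo 0 vs 20, drawelo 200) are made artificially wide so the demo runs quickly and are not the fishtest bounds; the cap on games makes it the hybrid variant:

```python
import math
import random

def bayeselo_to_proba(elo, drawelo):
    win = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    return win, loss, 1.0 - win - loss

def sprt_trial(true_p, p0, p1, alpha=0.05, beta=0.05, max_games=20000, rng=random):
    """Play games with outcome distribution true_p, accumulate the LLR against
    the point hypotheses p0/p1, and stop at Wald's bounds."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    inc = [math.log(a / b) for a, b in zip(p1, p0)]  # per-outcome LLR steps
    llr = 0.0
    for _ in range(max_games):
        u = rng.random()
        llr += inc[0] if u < true_p[0] else (inc[1] if u < true_p[0] + true_p[1] else inc[2])
        if llr >= upper:
            return 'H1'   # accept the patch
        if llr <= lower:
            return 'H0'   # reject
    return 'cap'          # hit the game cap without a decision
```

Running many trials with the true distribution set to p0, the fraction of 'H1' results should come out near alpha, which is the sense in which the simulator agrees with theory.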
"Talk is cheap. Show me the code." -- Linus Torvalds.

Michel Van den Bergh
Posts: 24
Joined: Thu Jun 10, 2010 4:30 pm

Re: How (not) to use SPRT ?

Post by Michel Van den Bergh » Mon Oct 21, 2013 6:44 pm

BB+ wrote:A final remark: it seems that in the current setting, with -1.5 and 4.5 Elo as goalposts, one is making an inherent assumption that the prior distribution has the same values at these points. That is, a patch worth -1.5 Elo and one worth 4.5 Elo are equally likely to occur. OTOH, even if this is not true, it mainly changes only the long-term behavior of how many patches will be accepted/rejected to the next stage, as the "95%" confidence level on stopping no longer means that 95% of the time an accepted patch would be worth 4.5 Elo (it would be reduced by an amount essentially equal to the ratio of -1.5 patches to 4.5 patches, I think).
Hi Mark,

My favorite book on this subject is

http://books.google.be/books/about/Sequ ... edir_esc=y

The SPRT belongs to a fascinating part of statistics called "sequential testing". The proof that the SPRT has the claimed Type I/II error levels is easy (it is due to Wald), but this is about the _only_ really easy result in sequential testing.

The SPRT is also one of the few results which is not asymptotic. On the other hand, asymptotic sequential testing is basically a solved problem. Any question you may ask can be answered, although sometimes the answer requires non-trivial numerical computations.

Michel

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: How (not) to use SPRT ?

Post by BB+ » Mon Oct 21, 2013 11:48 pm

I don't understand what you are trying to say.
Here it is in a nutshell. The current SPRT code you give has two alternative hypotheses, and for the stopping criterion computes the likelihood ratio of these:
a) The patch is worth -1.5 Elo.
b) The patch is worth +4.5 Elo.
Larry instead wanted (or thought) that these should be
c) The patch is worth -1.5 Elo or less.
d) The patch is worth +4.5 Elo or more.
I think we should agree on whether your code does (a&b) or (c&d) before proceeding to discuss which is better, how it relates to Type I/II error, or indeed if it matters much.

I claim your code does (a&b) because:
# Probability laws under H0 and H1
  P0 = bayeselo_to_proba(elo0, drawelo)
  P1 = bayeselo_to_proba(elo1, drawelo)

  # Log-Likelihood Ratio
  result['llr'] = (R['wins']*math.log(P1['win']/P0['win'])
                   + R['losses']*math.log(P1['loss']/P0['loss'])
                   + R['draws']*math.log(P1['draw']/P0['draw']))
So my understanding is that P0 and P1 are [win,loss,draw] probability vectors corresponding to elo0 (-1.5) and elo1 (4.5), and then the LLR is computed from those two vectors. Am I incorrect about what P0 and P1 are? Or are they really some surrogate measure for elo<=elo0 and elo>=elo1? I can agree that the inequalities could come into the picture when one analyses Type I/II rates from the stopping criterion (perhaps assuming a patch distribution), but the point in question seems to occur prior to such analysis --- I am just stating that the stopping criterion itself only depends on (a&b), not (c&d).

My first post also contained a suggested fix to implement (c&d) [essentially integration of P0 from elo0=-infinity to -1.5 and similarly for P1 with elo1=4.5 to infinity], though admittedly one probably needs some sort of prior distribution if one is heading toward such pedantry. My second post attempted to analyse why in the Stockfish setting, one does not see such "crazy" results as Larry saw (the value of the P0-integral is reasonably close to the evaluation of P0 at the endpoint in most cases). My third post indicated that one really needs to know the patch distribution to assert anything about Type I/II rates.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: How (not) to use SPRT ?

Post by BB+ » Tue Oct 22, 2013 12:27 am

Perhaps an easy way to distinguish whether the code uses (a&b) versus (c&d) is to use Larry's specific example.

What do you have as the likelihood of the data under P0? Similarly under P1. I ask for these separately rather than their ratio, as I think you will find that they are both quite small (thus indicating a&b), rather than P0 being nearly 0 and P1 being nearly 1 (which would be the c&d case).

The data in question are 149 wins, 30 losses, and 94 draws (273 games).

I get [0.546,0.110,0.344] as the WDL ratios, a "bayeselo" of 197.6 and "drawelo" of 165.7, then P0 is [0.276,0.280,0.444] and P1 is [0.283,0.273,0.444]. The likelihood [computed from the trinomial distribution] of observing [0.546,0.110,0.344] in 273 games under P0 is approximately zero (1.12e-24) as it is under P1 (2.195e-23). But their ratio is about 20, as expected.
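For the record, those numbers can be reproduced in a few lines (the conversion formulas are the standard BayesElo ones; lgamma supplies the trinomial coefficient):

```python
import math

def proba_to_bayeselo(pwin, ploss):
    """Invert the BayesElo model: observed win/loss fractions -> (bayeselo, drawelo)."""
    elo = 200.0 * math.log10(pwin / ploss * (1.0 - ploss) / (1.0 - pwin))
    drawelo = 200.0 * math.log10((1.0 - ploss) / ploss * (1.0 - pwin) / pwin)
    return elo, drawelo

def bayeselo_to_proba(elo, drawelo):
    win = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    return win, loss, 1.0 - win - loss

def trinomial_lik(w, l, d, p):
    """Exact trinomial probability of observing (w, l, d) under probabilities p."""
    logc = (math.lgamma(w + l + d + 1) - math.lgamma(w + 1)
            - math.lgamma(l + 1) - math.lgamma(d + 1))
    return math.exp(logc + w * math.log(p[0]) + l * math.log(p[1]) + d * math.log(p[2]))

w, l, d = 149, 30, 94
elo, drawelo = proba_to_bayeselo(w / 273.0, l / 273.0)  # roughly 197.6 and 165.7
lik0 = trinomial_lik(w, l, d, bayeselo_to_proba(-1.5, drawelo))
lik1 = trinomial_lik(w, l, d, bayeselo_to_proba(4.5, drawelo))
# Both likelihoods are astronomically small; their ratio is about 20.
```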
Michel Van den Bergh wrote:The SPRT belongs to a fascinating part of statistics called "sequential testing". The proof that the SPRT has the claimed Type I,II error levels is easy (it is due to Wald) but this is about the _only_ really easy result in sequential testing.
As you have indicated elsewhere, these tests really need to have a stringently posed problem for the relevant theory to apply. Larry's interpretation [or expectation] of the problem (the stopping criterion itself contained better/worse labels) is not the same as what Lucas had done (the stopping criterion does not contain such labels, though the ensuing Type I/II error analysis does). Neither is particularly "wrong", but one cannot interpret rather extreme data samples from the eyes of one paradigm when the other is what is being applied.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: How (not) to use SPRT ?

Post by BB+ » Tue Oct 22, 2013 7:44 pm

Geh, I seem to have stated the conditions backwards. They should be:
a) The probability that a -1.5 Elo patch gives the observed results
b) The probability that a 4.5 Elo patch gives the observed results
Sorry for any confusion that caused.

BB+
Posts: 1484
Joined: Thu Jun 10, 2010 4:26 am

Re: How (not) to use SPRT ?

Post by BB+ » Mon Oct 28, 2013 11:09 pm

If I understand the literature correctly, the use of "composite" hypotheses was already considered by Wald (reducing these to "simple" ones via weights where possible), and the desired properties in the case here follow by monotonicity. A more commonly described case is "x>A" versus "x<B" when A equals B, but the point of having a gap in the middle is to speed convergence in the don't-care region.

Post Reply