Asymptotic theory of sparse Bradley-Terry model

The Annals of Applied Probability

2020, Vol. 30, No. 5, 2491–2515

https://doi.org/10.1214/20-AAP1564


BY RUIJIAN HAN, ROUGANG YE, CHUNXI TAN AND KANI CHEN

Department of Mathematics, Hong Kong University of Science and Technology

The Bradley–Terry model is a fundamental model in the analysis of network data involving paired comparison. Assuming every pair of subjects in the network have an equal number of comparisons, Simons and Yao (Ann. Statist. 27 (1999) 1041–1060) established an asymptotic theory for statistical estimation in the Bradley–Terry model. In practice, when the size of the network becomes large, the paired comparisons are generally sparse. The sparsity can be characterized by the probability $p_n$ that a pair of subjects have at least one comparison, which tends to zero as the size of the network $n$ goes to infinity. In this paper, the asymptotic properties of the maximum likelihood estimate of the Bradley–Terry model are shown under minimal conditions of the sparsity. Specifically, the uniform consistency is proved when $p_n$ is as small as the order of $(\log n)^3/n$, which is near the theoretical lower bound $\log n/n$ by the theory of the Erdős–Rényi graph. Asymptotic normality and inference are also provided. Evidence in support of the theory is presented in simulation results, along with an application to the analysis of the ATP data.

1. Introduction. Is Roger Federer a better tennis player than John McEnroe? Is research

article A more inﬂuential than research article B, among a collection of all research articles in

a scientiﬁc ﬁeld? Is webpage A more important than webpage B, among the existing millions

of webpages? Is person A more popular than person B in a large social network, such as

Twitter users? These questions may be answered by analysis of paired comparison data in a

network. The paired comparison may be in terms of head-to-head match outcomes, citation

of a research article by another, a webpage containing a link of another webpage, a user

retweeting the tweet of another user, etcetera. When the size of the network, such as a total

number of webpages, becomes large, paired comparisons are generally sparse. The sparsity

may be described in different ways, such as the total number of observed comparisons divided

by the total number of subjects in the network. Throughout this paper, the size of the network under study is denoted as $n$, and we assume any pair has a comparison with probability $p_n$. The degree of sparsity is then characterized by the size of $p_n$: the smaller, the sparser.

For paired comparison, the Bradley–Terry model (Bradley and Terry (1952)) is one of the most commonly used models. Consider $n$ subjects in a network. Subject $i$ has merit $u_i$ for $i = 0, \ldots, n-1$, where $u_i \in \mathbb{R}$ and $u_i > 0$. The Bradley–Terry model assumes the probability that subject $i$ defeats $j$ as

(1.1) $p_{ij} = \dfrac{u_i}{u_i + u_j}, \quad i, j = 0, \ldots, n-1;\ i \ne j.$

The generalizations of the Bradley–Terry model can be seen in, for example, Luce (1959), Rao and Kupper (1967) and Agresti (1990), among many others. For estimation of the merits based on a set of paired comparison data, maximum likelihood estimation (MLE) is a common choice. It is much desired to justify the asymptotic properties of the MLE, particularly when the comparisons are sparse. A distinct feature of this problem is that the number of parameters, which is the same as the number of subjects, tends to infinity. Moreover, in the case of sparsity, the number of comparisons per pair is 0 with probability tending to 1.

Received June 2019; revised October 2019.

MSC2020 subject classifications. Primary 60F05; secondary 62E20, 62F12, 62J15.

Key words and phrases. Bradley–Terry model, sparsity, uniform consistency, asymptotic normality, maximum likelihood estimation.

Assuming any pair has a fixed positive number of comparisons, Simons and Yao (1999) proved the uniform consistency and asymptotic normality of the MLE for the Bradley–Terry model. Under this assumption, there would be at least $n(n-1)/2$ comparisons in total. Further extension was reported in Yan, Yang and Xu (2012) with relaxed conditions but still requiring the number of comparisons at the order of $n^2$. Both papers considered nonsparse cases, where $p_n$ has a positive lower bound and, as a result, does not go to 0. In this paper, we show the asymptotic properties of the MLE for sparse comparisons. In particular, when the maximum ratio of merits is bounded, the uniform consistency holds as long as $p_n$ is greater than $(\log n)^3/n$, and the asymptotic normality is true when $p_n$ is greater than the order of $(\log n)^{1/5}/n^{1/10}$.

The order of $(\log n)^3/n$ required for the uniform consistency is close to the necessary theoretical lower bound, $\log n/n$, below which a unique MLE does not exist. The network we consider can be regarded as the Erdős–Rényi graph $G(n, p_n)$ (Erdős and Rényi (1959)), where each node stands for a subject and each edge stands for the comparison between the two corresponding nodes. Erdős and Rényi (1960) showed that the Erdős–Rényi graph will be disconnected with positive probability if $p_n < \epsilon \log n/n$, for any $\epsilon < 1$. As will be seen, a disconnected graph fails the condition for the existence and uniqueness of the MLE of the Bradley–Terry model, implying that not all subjects are comparable. In the sense of sparsity, the theory established in this paper is nearly optimal.
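The role of the $\log n/n$ threshold can be explored numerically. The following is a minimal sketch, not part of the paper: it samples Erdős–Rényi graphs and estimates how often they are connected. The function names (`is_connected`, `sample_gnp`, `connected_fraction`) and the union-find approach are illustrative assumptions.

```python
import math
import random


def is_connected(n, edges):
    """Check connectivity of an undirected graph on nodes 0..n-1 with union-find."""
    parent = list(range(n))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    components = n
    for i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1
    return components == 1


def sample_gnp(n, p, rng):
    """Sample the edge list of G(n, p): each pair is an edge independently with probability p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


def connected_fraction(n, p, reps=200, seed=0):
    """Fraction of sampled graphs G(n, p) that are connected (a Monte Carlo estimate)."""
    rng = random.Random(seed)
    hits = sum(is_connected(n, sample_gnp(n, p, rng)) for _ in range(reps))
    return hits / reps


if __name__ == "__main__":
    n = 200
    for factor in (0.5, 1.0, 3.0):
        p = factor * math.log(n) / n
        print(factor, connected_fraction(n, p))
```

Running the loop with factors below and above 1 gives a rough empirical view of the threshold; the exact fractions depend on `n`, `reps` and the seed.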

Although we assume that each pair of subjects in the network has a comparison with the same probability $p_n$, one can follow our proof to extend it to the case with different comparison probabilities at the order of $p_n$. The main contribution of this article is to establish the asymptotic theory of the MLE when $p_n \to 0$ and to show how small $p_n$ can be for it to hold.

We note that Negahban, Oh and Shah (2012) and Maystre and Grossglauser (2015) showed the consistency of the MLE under the $\ell_2$ norm for sparse comparison data. The $\ell_2$ norm therein is normalized by $\sqrt{n}$. Since the network size goes to infinity, consistency under the $\ell_2$ norm does not ensure the consistency of the merits of any fixed number of subjects. In other words, with their normalized $\ell_2$ consistency, one cannot tell for sure that the estimate of the merit ratio of any given pair is accurate. In this sense, the uniform consistency is a much desired result.

This paper is organized as follows. In Section 2, we show the large sample properties of

the MLE. Simulation results and analysis of the ATP data are given in Section 3. Section 4

contains some concluding remarks. All proofs are relegated to the appendices.

2. Main results. Consider any two subjects $i$ and $j$ with $0 \le i, j \le n-1$. Let $t_{ij}$ denote the number of comparisons between subjects $i$ and $j$ and $a_{ij}$ denote the number of times that $i$ defeats $j$. Set $a_{ii} = t_{ii} = 0$ for simplicity of notation. Then $t_{ij} = t_{ji} = a_{ij} + a_{ji}$ for all $i, j$. For slightly more generality, throughout the sequel, we assume $t_{ij}$ follows a binomial distribution, $t_{ij} \sim \mathrm{Bin}(T, p_n)$, where $T$ is a positive integer not depending on $n$. Given $t_{ij}$, the Bradley–Terry model implies $a_{ij} \sim \mathrm{Bin}(t_{ij}, p_{ij})$, where $p_{ij} = u_i/(u_i + u_j)$. Without loss of generality, one can take $T$ as 1 for ease of understanding.
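The sampling scheme just described can be sketched in a few lines. This is an illustrative simulation under the stated model, not the authors' code; the names `bt_win_prob` and `simulate_comparisons` and the seeding convention are assumptions made here.

```python
import random


def bt_win_prob(u_i, u_j):
    """Bradley-Terry probability that a subject with merit u_i defeats one with merit u_j."""
    return u_i / (u_i + u_j)


def simulate_comparisons(u, p_n, T=1, seed=0):
    """Draw t[i][j] ~ Bin(T, p_n) for each pair, then a[i][j] ~ Bin(t[i][j], p_ij).

    Returns (t, a), where t is symmetric with zero diagonal and
    a[i][j] + a[j][i] == t[i][j] for all i, j.
    """
    rng = random.Random(seed)
    n = len(u)
    t = [[0] * n for _ in range(n)]
    a = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Number of comparisons between i and j: sum of T Bernoulli(p_n) draws.
            t_ij = sum(rng.random() < p_n for _ in range(T))
            # Number of those comparisons won by i.
            wins_i = sum(rng.random() < bt_win_prob(u[i], u[j]) for _ in range(t_ij))
            t[i][j] = t[j][i] = t_ij
            a[i][j] = wins_i
            a[j][i] = t_ij - wins_i
    return t, a
```

With `T=1`, the matrix `t` is the adjacency matrix of one draw of $G(n, p_n)$, which is the setting used in the simulations later in the paper.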

Based on the observations of paired comparisons $\{a_{ij} : 0 \le i, j \le n-1\}$, the likelihood function is

(2.1) $L(u) \propto \prod_{i,j=0,\ i \ne j}^{n-1} p_{ij}^{a_{ij}} = \dfrac{\prod_{i=0}^{n-1} u_i^{a_i}}{\prod_{0 \le i < j \le n-1} (u_i + u_j)^{t_{ij}}},$

where $a_i = \sum_{j=0}^{n-1} a_{ij}$ is the total number of comparisons that subject $i$ wins and $u = (u_0, \ldots, u_{n-1})$ is the merits vector. By the method of maximum likelihood estimation, the likelihood equations are

(2.2) $a_i = \sum_{j=0}^{n-1} \dfrac{t_{ij}\hat{u}_i}{\hat{u}_i + \hat{u}_j}, \quad i = 0, \ldots, n-1,$

where $\hat{u} = (\hat{u}_0, \hat{u}_1, \ldots, \hat{u}_{n-1})$ is the MLE of the merits vector $u$. Since the Bradley–Terry model is invariant under scaling of parameters, we assume that $u_0 = 1$, $\hat{u}_0 = 1$ for the purpose of identifiability.
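The likelihood equations (2.2) have no closed-form solution in general, but they can be solved by a simple fixed-point iteration. The sketch below uses the classical simultaneous update $u_i \leftarrow a_i / \sum_{j \ne i} t_{ij}/(u_i + u_j)$, followed by rescaling so that $\hat{u}_0 = 1$. The paper does not state which numerical method the authors used, so this is an assumption for illustration, and the name `bt_mle` is hypothetical. The iteration presumes that the MLE exists (Condition A below) and that every subject has at least one comparison.

```python
def bt_mle(t, a, iters=1000, tol=1e-12):
    """Solve the likelihood equations (2.2) by fixed-point iteration.

    t[i][j]: number of comparisons between i and j (symmetric, zero diagonal).
    a[i][j]: number of times i defeats j.
    Returns merits normalised so that the merit of subject 0 equals 1.
    Assumes the MLE exists; otherwise the iteration need not converge.
    """
    n = len(t)
    wins = [sum(a[i]) for i in range(n)]  # a_i in the paper's notation
    u = [1.0] * n
    for _ in range(iters):
        new_u = []
        for i in range(n):
            denom = sum(t[i][j] / (u[i] + u[j]) for j in range(n) if j != i)
            new_u.append(wins[i] / denom)
        # Rescale for identifiability: subject 0 is the baseline.
        new_u = [x / new_u[0] for x in new_u]
        change = max(abs(x - y) for x, y in zip(new_u, u))
        u = new_u
        if change < tol:
            break
    return u


def likelihood_residuals(t, u, a):
    """Left side minus right side of (2.2) for each subject; all zeros at the MLE."""
    n = len(u)
    return [
        sum(a[i]) - sum(t[i][j] * u[i] / (u[i] + u[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
```

For two subjects with four comparisons, three of them won by subject 0, the equations give $\hat{u}_0/(\hat{u}_0 + \hat{u}_1) = 3/4$, so $\hat{u}_1 = 1/3$ under the normalisation $\hat{u}_0 = 1$.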

As noted by Zermelo (1929) and Ford (1957), a necessary and sufficient condition for existence and uniqueness of the MLE is as follows.

CONDITION A. For every partition of subjects into two nonempty sets, a subject in the second set has defeated a subject in the first at least once.

When Condition A is not satisfied, there exists a nonempty set of subjects, say $\mathcal{A}$, such that the MLEs of the merits of members in $\mathcal{A}$ would be infinitely larger than those of the members not in $\mathcal{A}$. As a result, the MLE cannot be consistent. Under the sparsity condition stated there, Lemma 1 shows that Condition A holds with probability approaching 1 as $n \to \infty$. Some more notations are introduced here:
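Condition A can be checked directly on data. Viewing each win as a directed edge from winner to loser, the condition is equivalent to the resulting directed graph being strongly connected, which the following sketch tests with two graph searches from subject 0. The function name `condition_a_holds` is an assumption introduced here for illustration.

```python
def condition_a_holds(a):
    """Return True if the win matrix a satisfies Condition A.

    a[i][j] > 0 means subject i has defeated subject j at least once.
    Condition A holds exactly when every subject can be reached from subject 0
    along "defeated" edges, and subject 0 can be reached from every subject.
    """
    n = len(a)

    def reaches_all(has_edge):
        seen = {0}
        stack = [0]
        while stack:
            i = stack.pop()
            for j in range(n):
                if j not in seen and has_edge(i, j):
                    seen.add(j)
                    stack.append(j)
        return len(seen) == n

    forward = reaches_all(lambda i, j: a[i][j] > 0)   # follow winner -> loser
    backward = reaches_all(lambda i, j: a[j][i] > 0)  # follow loser -> winner
    return forward and backward
```

For example, a cycle in which 0 beats 1, 1 beats 2 and 2 beats 0 satisfies the condition, while any data set containing an undefeated subject does not.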

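(2.3) $M_n = \max_{0 \le i,j \le n-1} \dfrac{u_i}{u_j}, \quad \Delta_n = \sqrt{\dfrac{(\log n)^3}{n p_n [\log(n p_n)]^2}} \quad \text{and} \quad \Delta u_i = \dfrac{\hat{u}_i - u_i}{u_i}, \quad i = 0, \ldots, n-1,$

where $M_n$ is the largest ratio of $u_i$ and $u_j$ for all $i, j$, and will be called the largest ratio of merits.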

LEMMA 1. If

(2.4) $\dfrac{M_n \log n}{n p_n} \to 0$ as $n \to \infty$,

then $P(\text{Condition A is satisfied}) \to 1$ as $n \to \infty$.

REMARK 1. The largest ratio of merits, $M_n$, controls the spread of the merits in the network, while $p_n$ controls the possibility of comparisons. A large $M_n$ and a small $p_n$ both increase the likelihood of the existence of a group of subjects such that any member of this group always wins in a comparison with any member not in this group, which results in Condition A being violated.

Under condition (2.4), $p_n$ can be close to the order of $\log n/n$. Given $T = 1$, $(t_{ij})_{n \times n}$ can be regarded as the adjacency matrix of the Erdős–Rényi graph (Erdős and Rényi (1959)), denoted as $G(n, p_n)$, under our assumption. Erdős and Rényi (1960) showed that if $p_n < \epsilon \log n/n$, for any positive $\epsilon < 1$, $G(n, p_n)$ is disconnected, disagreeing with Condition A, with probability tending to 1. Therefore, in order to satisfy Condition A, it is necessary to require $p_n \ge \log n/n$. According to (2.4), when we fix $M_n$ as a constant, $p_n$ nearly meets the lower bound $\log n/n$.

Condition (2.4) of Lemma 1 ensures the existence and uniqueness of the MLE $(\hat{u}_0, \hat{u}_1, \ldots, \hat{u}_{n-1})$. The theorems in this paper all assume conditions that imply (2.4).

2.1. Uniform consistency. We first define two notations, $O_p$ and $o_p$, for the order of a sequence of random variables. Consider a sequence of random variables $\{X_n\}$ and a corresponding sequence of constants $\{a_n\}$. We say that $X_n = O_p(a_n)$ if for any $\epsilon > 0$, there exist a finite $M > 0$ and a finite $N > 0$ such that $P(|X_n/a_n| > M) < \epsilon$ for any $n > N$. Generally speaking, $X_n = O_p(a_n)$ denotes that $X_n/a_n$ is stochastically bounded. The other notation, $X_n = o_p(a_n)$, means that $X_n/a_n$ converges to zero in probability.

Based on these two notations, we have the following theorem.

THEOREM 2.1. If

(2.5) $M_n \Delta_n \to 0$ as $n \to \infty$,

then

(2.6) $\max_{i=0,\ldots,n-1} |\Delta u_i| = O_p(M_n \Delta_n) = o_p(1).$

REMARK 2. The condition imposed on the largest ratio of merits $M_n$ and sparsity $p_n$ in (2.5) ensures the uniform consistency of the MLE of the Bradley–Terry model when the comparisons may be sparse and the network is large. For a large value of $M_n$, the teams with relatively poor merits have very little chance to defeat those with relatively large merits, thereby making estimation difficult. Meanwhile, for a small $p_n$, teams have few opportunities to compete with others, which also leads to poor estimation.

To prove (2.6), we let

$i_1 = \arg\max_{0 \le i \le n-1} \hat{u}_i/u_i$ and $i_2 = \arg\min_{0 \le i \le n-1} \hat{u}_i/u_i$.

Since $\hat{u}_0 = u_0 = 1$, it suffices to show that the ratio of subject $i_1$, $\hat{u}_{i_1}/u_{i_1}$, and the ratio of $i_2$, $\hat{u}_{i_2}/u_{i_2}$, are very close.

Recall that the main idea of the previous work (Simons and Yao (1999), Yan, Yang and Xu (2012)) contains two parts. The first part is that the number of common neighbors between any two subjects is at least $cn$ for some constant $c$, by their dense assumption. The second part is that for subject $i = i_1$ or $i_2$, some subjects $j$ with $t_{ij} \ne 0$ have the ratio $\hat{u}_j/u_j$ close to the ratio of $i$. Then, it can be shown there exists at least one subject, say $s$ (one would suffice), who is a neighbor of $i_1$ with the ratio $\hat{u}_s/u_s$ close to $\hat{u}_{i_1}/u_{i_1}$ and is also a neighbor of $i_2$ with the ratio close to $\hat{u}_{i_2}/u_{i_2}$. Such a common neighbor serves as a middleman between subjects $i_1$ and $i_2$. As a result, $\hat{u}_{i_1}/u_{i_1}$ is close to $\hat{u}_{i_2}/u_{i_2}$. Thus, the uniform consistency holds.

However, in the sparse case, the number of common neighbors of any two subjects tends to 0 as $n$ increases to infinity. If we follow the previous proof directly, no common neighbors of subjects $i_1$ and $i_2$ may be found, let alone a common neighbor with the desired closeness of its ratio to $\hat{u}_{i_1}/u_{i_1}$ and $\hat{u}_{i_2}/u_{i_2}$. Due to the absence of such a middleman, this approach cannot be applied to sparse comparisons.

An obvious extension is to find a chain of subjects, say $l_1, \ldots, l_m$, serving as middlemen to bridge the subjects $i_1$ and $i_2$. Namely, $l_{i+1}$ is a neighbor of $l_i$ and they are close in terms of the ratios. An immediate difficulty, along with other technicalities, arises from the evaluation of this closeness, since the previous proof only works for subjects $i_1$ and $i_2$. Then, the second extension of our proof is to show that for any subject $i = 0, \ldots, n-1$, some subjects $j$ with $t_{ij} \ne 0$ have ratios close to the ratio of $i$, which is summarized in Lemma 3.

With these two extensions, we prove the uniform consistency by showing the existence of a nonempty intersection between two carefully designed sets. One is the set of subjects having the ratio close to the maximum ratio $\hat{u}_{i_1}/u_{i_1}$. The other is the set of subjects having the ratio close to the minimum ratio $\hat{u}_{i_2}/u_{i_2}$. For the first set, we start from $A = \{i_1\}$ and then repeatedly expand $A$ through the neighbors of the subjects in $A$ until $|A| > n/2$. Similarly, the size of the second set is also larger than $n/2$. Hence there must exist a subject, a middleman, with its ratio close to both $\hat{u}_{i_1}/u_{i_1}$ and $\hat{u}_{i_2}/u_{i_2}$. More details can be found in Remark 5.

The consistency presented in Theorem 2.1 also applies to the special case of nonsparse comparisons, where $p_n$ has a lower bound away from 0, as considered in the previous work. Moreover, the special case of $M_n$ being a constant sheds light on the sparsity required for the uniform consistency. These are summarized in the following corollaries.

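COROLLARY 1. If $M_n = C$ for some constant $C \ge 1$ and there exists an $n_0$ such that

(2.7) $p_n \ge \dfrac{(\log n)^3}{n}$

for all $n > n_0$, then

(2.8) $\max_{i=0,\ldots,n-1} |\Delta u_i| = O_p\left(\dfrac{1}{\log(n p_n)}\right) = o_p(1).$

COROLLARY 2. If $p_n = c$ for some constant $c \le 1$ and

(2.9) $M_n \sqrt{\dfrac{\log n}{n}} \to 0$ as $n \to \infty$,

then

(2.10) $\max_{i=0,\ldots,n-1} |\Delta u_i| = O_p\left(M_n \sqrt{\dfrac{\log n}{n}}\right) = o_p(1).$

REMARK 3. Corollary 1 shows the uniform consistency of the MLE holds when $M_n$ is a constant and $p_n \ge (\log n)^3/n$, close to the theoretical lower bound $\log n/n$.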

2.2. Asymptotic normality. Recall that $a_i = \sum_{j=0,\ j \ne i}^{n-1} a_{ij}$, where $a_{ij}$ is the number of times that subject $i$ prevails over $j$. Let $V_{n-1} = (v_{ij})_{i,j=1,\ldots,n-1}$ denote the covariance matrix of $a_1, \ldots, a_{n-1}$, where

(2.11) $v_{ii} = \sum_{k=0}^{n-1} \dfrac{t_{ik} u_i u_k}{(u_i + u_k)^2}, \quad v_{ij} = -\dfrac{t_{ij} u_i u_j}{(u_i + u_j)^2}, \quad i, j = 1, \ldots, n-1;\ i \ne j.$

Let $v_{00} = \sum_{i,j=1}^{n-1} v_{ij} = \sum_{k=1}^{n-1} [(t_{0k} u_k)/(1 + u_k)^2]$. Here $V_{n-1}$ is the Fisher information matrix for the parameterization $(\log u_1, \ldots, \log u_{n-1})$. Simons and Yao (1999) used a symmetric matrix $S_{n-1} = (s_{ij})_{(n-1) \times (n-1)}$ to approximate $V_{n-1}^{-1}$, where

(2.12) $s_{ij} = \dfrac{\delta_{ij}}{v_{ii}} + \dfrac{1}{v_{00}}, \quad i, j = 1, \ldots, n-1,$

and $\delta_{ij}$ is the Kronecker delta. With sparse comparisons, we shall re-evaluate the accuracy of this approximation in Lemma 7. The following theorem shows the asymptotic normality of the MLE.
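To make the matrices concrete, the sketch below builds $v_{ij}$ from a comparison matrix and a merits vector, and then forms the approximation $S_{n-1}$. It assumes the usual Bradley–Terry forms $v_{ii} = \sum_k t_{ik} u_i u_k/(u_i+u_k)^2$, $v_{ij} = -t_{ij} u_i u_j/(u_i+u_j)^2$ and $s_{ij} = \delta_{ij}/v_{ii} + 1/v_{00}$ with $u_0 = 1$; the function names are hypothetical and every subject is assumed to have at least one comparison.

```python
def fisher_information(t, u):
    """Return the n x n matrix (v_ij), i, j = 0..n-1, for merits u and comparison counts t.

    Off-diagonal: v_ij = -t_ij u_i u_j / (u_i + u_j)^2.
    Diagonal:     v_ii = sum over k != i of t_ik u_i u_k / (u_i + u_k)^2.
    The matrix V_{n-1} of the paper is the block with indices 1..n-1,
    and v[0][0] equals v_00 when u[0] == 1.
    """
    n = len(u)
    v = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w = t[i][j] * u[i] * u[j] / (u[i] + u[j]) ** 2
                v[i][j] = -w
                v[i][i] += w
    return v


def simons_yao_s(v):
    """Return S_{n-1} with s_ij = delta_ij / v_ii + 1 / v_00 for i, j = 1..n-1."""
    n = len(v)
    v00 = v[0][0]
    return [
        [(1.0 / v[i][i] if i == j else 0.0) + 1.0 / v00 for j in range(1, n)]
        for i in range(1, n)
    ]
```

A quick sanity check on any example is that the entries of the block with indices $1, \ldots, n-1$ sum to `v[0][0]`, which is the identity used to define $v_{00}$.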

THEOREM 2.2. If

(2.13) $\dfrac{M_n (\log n)^{1/5}}{p_n n^{1/10}} \to 0$ as $n \to \infty$,

then for each fixed $r \ge 1$, as $n \to \infty$, the vector $(\Delta u_1, \ldots, \Delta u_r)$ is asymptotically normally distributed with mean 0 and covariance matrix given by the upper left $r \times r$ block of $S_{n-1}$ defined in (2.12).

As expected, condition (2.13) involves $M_n$ and $p_n$. The following two corollaries deal with the special cases of $p_n$ and $M_n$ being constants, respectively.

COROLLARY 3. If $p_n = c$ for some constant $c \le 1$ and

(2.14) $M_n \dfrac{(\log n)^{1/5}}{n^{1/10}} \to 0$ as $n \to \infty$,

then for each fixed $r \ge 1$, as $n \to \infty$, the vector $(\Delta u_1, \ldots, \Delta u_r)$ is asymptotically normally distributed with mean 0 and covariance matrix given by the upper left $r \times r$ block of $S_{n-1}$ defined in (2.12).

COROLLARY 4. If $M_n = C$ for some constant $C \ge 1$ and

(2.15) $p_n \dfrac{n^{1/10}}{(\log n)^{1/5}} \to \infty$ as $n \to \infty$,

then for each fixed $r \ge 1$, as $n \to \infty$, the vector $(\Delta u_1, \ldots, \Delta u_r)$ is asymptotically normally distributed with mean 0 and covariance matrix given by the upper left $r \times r$ block of $S_{n-1}$ defined in (2.12).

REMARK 4. Corollary 3 is the theorem about asymptotic normality presented in Simons and Yao (1999) and Yan, Yang and Xu (2012), and Corollary 4 gives the sparsity condition to ensure asymptotic normality when the largest ratio of merits is bounded above.

3. Numerical studies.

3.1. Simulation. Simulations are carried out to evaluate the finite sample performance of the MLE of the Bradley–Terry model. We assume $T = 1$ in all simulations, which means that any pair has one comparison with probability $p_n$ and no comparison with probability $1 - p_n$.

The result of the first simulation study is shown in Table 1. In order to study the uniform consistency of the MLE, we present the mean and median of $\max_{i=0,\ldots,n-1} |\Delta u_i|$ based on 1000 repetitions. In this simulation, the size of network $n$ is taken to be 1000, 2000, 5000, 10,000, 15,000, the sparse probability $p_n$ is chosen as $\log n/n$, $(\log n)^3/n$, $10/\sqrt{n}$, and $M_n$ equals 1, which implies merits of all subjects are identical. When $p_n = \log n/n$, none of the repetitions produces the MLE since Condition A is not satisfied. In the case of $p_n = (\log n)^3/n$, both the mean and the median of $\max_{i=0,\ldots,n-1} |\Delta u_i|$ become closer to 0 with increasing $n$. For comparison, we also consider $p_n$ as large as $10/\sqrt{n}$. We multiply $1/\sqrt{n}$ by 10 to ensure there are more paired comparisons than with $p_n = (\log n)^3/n$ for values of $n$ in the simulation. The result shown in Table 1 indeed indicates the consistency of the MLE under the sparsity condition given in Theorem 2.1. This further supports that our sparsity condition nearly meets the lower bound of $p_n$ to ensure the existence of a unique MLE.

TABLE 1
The mean and median (in parentheses) of $\max_{i=0,\ldots,n-1} |\Delta u_i|$. In the third column, “–” means all repetitions fail Condition A. The three numbers in the parentheses in the first column represent respectively the average numbers of comparisons one subject has for $p_n = \log n/n$, $p_n = (\log n)^3/n$ and $p_n = 10/\sqrt{n}$

n                       M_n   p_n = log n/n   p_n = (log n)^3/n   p_n = 10/√n
1000 (7/330/316)        1     –               0.4784 (0.4365)     0.4862 (0.4427)
2000 (8/439/447)        1     –               0.4291 (0.4012)     0.4242 (0.3929)
5000 (9/618/707)        1     –               0.3765 (0.3593)     0.3511 (0.3327)
10,000 (9/781/1000)     1     –               0.3423 (0.3223)     0.2988 (0.2867)
15,000 (10/889/1225)    1     –               0.3226 (0.3085)     0.2727 (0.2595)

TABLE 2
Coverage probabilities and the probabilities that Condition A fails (in parentheses). In the first column, the numbers in the parentheses represent the average numbers of comparisons one subject has

n           (i, j)        M_n = 1          M_n = √n         M_n = n

p_n = 1/√n
100 (10)    (0, 1)        0.277 (0.703)    0.041 (0.958)    0.001 (0.999)
            (0, 99)       0.275 (0.703)    0.041 (0.958)    0.001 (0.853)
            (50, 51)      0.278 (0.703)    0.039 (0.958)    0.001 (0.853)
200 (14)    (0, 1)        0.697 (0.260)    0.124 (0.871)    0.001 (0.999)
            (0, 199)      0.696 (0.260)    0.123 (0.871)    0.001 (0.999)
            (100, 101)    0.692 (0.260)    0.120 (0.871)    0.001 (0.999)
500 (23)    (0, 1)        0.932 (0.010)    0.416 (0.562)    0.002 (0.998)
            (0, 499)      0.931 (0.010)    0.415 (0.562)    0.002 (0.998)
            (250, 251)    0.930 (0.010)    0.410 (0.562)    0.002 (0.998)

p_n = √(log n/n)
100 (21)    (0, 1)        0.938 (0.003)    0.888 (0.070)    0.501 (0.483)
            (0, 99)       0.940 (0.003)    0.886 (0.070)    0.491 (0.483)
            (50, 51)      0.943 (0.003)    0.877 (0.070)    0.483 (0.483)
200 (33)    (0, 1)        0.944 (0)        0.941 (0.011)    0.710 (0.262)
            (0, 199)      0.947 (0)        0.939 (0.011)    0.696 (0.262)
            (100, 101)    0.942 (0)        0.931 (0.011)    0.693 (0.262)
500 (56)    (0, 1)        0.949 (0)        0.949 (0)        0.897 (0.062)
            (0, 499)      0.945 (0)        0.948 (0)        0.886 (0.062)
            (250, 251)    0.947 (0)        0.944 (0)        0.890 (0.062)

p_n = 1
100 (99)    (0, 1)        0.951 (0)        0.954 (0)        0.954 (0)
            (0, 99)       0.952 (0)        0.954 (0)        0.950 (0)
            (50, 51)      0.941 (0)        0.948 (0)        0.948 (0)
200 (199)   (0, 1)        0.954 (0)        0.954 (0)        0.942 (0)
            (0, 199)      0.953 (0)        0.957 (0)        0.951 (0)
            (100, 101)    0.950 (0)        0.946 (0)        0.951 (0)
500 (499)   (0, 1)        0.951 (0)        0.953 (0)        0.949 (0)
            (0, 499)      0.953 (0)        0.959 (0)        0.950 (0)
            (250, 251)    0.955 (0)        0.948 (0)        0.950 (0)
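The parenthesized numbers in the first column of Table 1 are the average numbers of comparisons per subject, which under $T = 1$ are approximately $n p_n$. The short sketch below recomputes them for the three choices of $p_n$; the function name `expected_comparisons` is introduced here only for illustration, and natural logarithms are assumed.

```python
import math


def expected_comparisons(n):
    """Approximate average number of comparisons per subject, n * p_n, for each choice of p_n."""
    log_n = math.log(n)  # natural logarithm
    return {
        "log n / n": n * (log_n / n),
        "(log n)^3 / n": n * (log_n ** 3 / n),
        "10 / sqrt(n)": n * (10 / math.sqrt(n)),
    }


if __name__ == "__main__":
    for n in (1000, 2000, 5000, 10000, 15000):
        rounded = {name: round(value) for name, value in expected_comparisons(n).items()}
        print(n, rounded)
```

The rounded values can be compared with the entries such as 7/330/316 for $n = 1000$; they also show how close $(\log n)^3/n$ and $10/\sqrt{n}$ are at these sample sizes.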

The second simulation is done to measure the coverage probabilities of confidence intervals based on the MLE, and the result based on 5000 repetitions is given in Table 2. By applying Theorem 2.2, we construct the approximate $1 - \alpha$ confidence interval for $\log(u_i/u_j)$ as

$\log(\hat{u}_i/\hat{u}_j) \pm z_{1-\alpha/2} \sqrt{1/\hat{v}_{ii} + 1/\hat{v}_{jj}},$

where $z_{1-\alpha/2}$ refers to the quantile of the standard normal distribution at level $1 - \alpha/2$. The asymptotic variances are based on (2.11), and $\hat{v}_{ii}$ and $\hat{v}_{jj}$ are computed using $\hat{u}$, the MLE of merits $u$. We present the coverage probabilities of 95% confidence intervals of some pairs of merits (the first two merits, the middle two merits, the first and the last merits when they are sorted in ascending order) when Condition A is met. The frequencies that Condition A fails are also reported. In this simulation, we choose the size of network $n = 100, 200, 500$, the sparse probability $p_n = 1/\sqrt{n}, \sqrt{\log n/n}, 1$ and the largest ratio of merits $M_n = 1, \sqrt{n}, n$.

TABLE 3
Coverage probabilities and the probabilities that Condition A fails (in parentheses) when $M_n = 1$ and $n = 1000, 2000$. In the first column, the two numbers in the parentheses represent the numbers of comparisons one subject has for $p_n = \sqrt{\log n/n}$ and $p_n = \log n/\sqrt{n}$

n                 (i, j)          p_n = √(log n/n)   p_n = log n/√n
1000 (83/218)     (0, 1)          0.942 (0)          0.938 (0)
                  (0, 999)        0.947 (0)          0.943 (0)
                  (500, 501)      0.943 (0)          0.946 (0)
2000 (123/340)    (0, 1)          0.945 (0)          0.954 (0)
                  (0, 1999)       0.948 (0)          0.950 (0)
                  (1000, 1001)    0.940 (0)          0.956 (0)

The average numbers of comparisons one subject has under different sparse probabilities are shown in the parentheses following the numbers of subjects $n$ in Table 2. For example, there are only around 20 and 50 comparisons for each of 500 subjects in the two cases with $p_n = 1/\sqrt{n}$ and $p_n = \sqrt{\log n/n}$, respectively.
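The confidence interval used in this simulation can be computed directly from fitted merits. The sketch below is an illustration rather than the authors' code: it assumes the plug-in variance $\hat{v}_{kk} = \sum_{m \ne k} t_{km} \hat{u}_k \hat{u}_m/(\hat{u}_k + \hat{u}_m)^2$ and the interval $\log(\hat{u}_i/\hat{u}_j) \pm z_{1-\alpha/2}\sqrt{1/\hat{v}_{ii} + 1/\hat{v}_{jj}}$, and the name `log_ratio_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def log_ratio_ci(u_hat, t, i, j, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for log(u_i / u_j).

    u_hat: fitted merits; t: symmetric matrix of comparison counts.
    Uses the plug-in variance 1 / v_ii + 1 / v_jj evaluated at u_hat.
    """
    n = len(u_hat)

    def v_hat(k):
        return sum(
            t[k][m] * u_hat[k] * u_hat[m] / (u_hat[k] + u_hat[m]) ** 2
            for m in range(n)
            if m != k
        )

    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    center = math.log(u_hat[i] / u_hat[j])
    half_width = z * math.sqrt(1 / v_hat(i) + 1 / v_hat(j))
    return center - half_width, center + half_width
```

Coverage in a simulation is then the fraction of repetitions in which the interval contains the true $\log(u_i/u_j)$, counted over repetitions where the MLE exists.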

From Table 2, we see coverage probabilities become closer to the nominal level as $p_n$ increases or $M_n$ decreases. With $n$ increasing, the coverage probabilities approach the nominal level and the probabilities that Condition A fails decrease. These phenomena agree with the theoretical asymptotic properties given in Theorem 2.2. Condition A fails mostly when $p_n = 1/\sqrt{n}$ and $M_n = n$, due to extremely sparse comparisons and the large range of merits. Furthermore, Table 3 reports coverage probabilities for larger networks, with $n = 1000$ and 2000. In this simulation, we let $p_n = \sqrt{\log n/n}, \log n/\sqrt{n}$ and $M_n = 1$, which means all subjects have equal merits. One can conclude from Table 3 that the coverage probabilities are close to the nominal level, showing evidence in support of the theory.

3.2. The ATP dataset. We present results of the Bradley–Terry model applied to the 2017

ATP data and then compare its ranking with the ofﬁcial ATP ranking. The ATP matches of

one year include four Grand Slams, the ATP World Tour Masters 1000, the ATP World Tour

500 series and other tennis series of the year. There are 203 players in total after removing

those who never win or lose for Condition A to be satisﬁed. Besides, we exclude walkovers

and only consider finished games. The results are reported in Table 4. The estimated merits of the Bradley–Terry model are given in the fifth column, and the ranks based on them are given in the first column. The number of games played in 2017, winning rates and ATP rankings are also presented. The 10th player, Kei Nishikori, is taken as the baseline, with merit and estimated merit both set to 1.

TABLE 4
Results of the analysis of the ATP 2017 data

Rank   Player                  Games   Winning rate   Merit   ATP Ranking
1      Roger Federer           55      0.909          7.505   2
2      Rafael Nadal            76      0.855          4.085   1
3      Novak Djokovic          36      0.806          2.029   12
4      Juan Martin del Potro   53      0.698          1.440   11
5      Alexander Zverev        73      0.712          1.321   4
6      Grigor Dimitrov         65      0.708          1.303   3
7      Nick Kyrgios            38      0.684          1.287   21
8      Milos Raonic            39      0.718          1.136   24
9      Stan Wawrinka           36      0.694          1.043   9
10     Kei Nishikori           42      0.714          1.000   22

TABLE 5
Results of the analysis of the ATP 1968–2016 data

Rank   Player                  Games   Winning rate   Merit
1      Novak Djokovic          880     0.831          2.788
2      Rafael Nadal            968     0.818          2.286
3      Roger Federer           1287    0.814          2.128
4      Andy Murray             782     0.776          1.744
5      Ivan Lendl              1274    0.820          1.247
6      Pete Sampras            957     0.770          1.128
7      Andy Roddick            784     0.741          1.111
8      John McEnroe            1035    0.813          1.104
9      Juan Martin del Potro   481     0.709          1.066
10     Andre Agassi            1101    0.756          1.000

We note that there is a difference between ranking by the estimated merits and the ATP ranking. For example, the 7th player in our ranking list, Nick Kyrgios, ranked 21st in the ATP ranking. Yet he defeated Novak Djokovic twice and Rafael Nadal once in 2017. These wins count heavily in the Bradley–Terry model and less so in the points calculation on which the ATP ranking is based. Notice that Roger Federer and Rafael Nadal have reversed order in the two ranking systems. In fact, Rafael Nadal won 6 titles, finished as runner-up 4 times and earned more ATP points in 2017 than Roger Federer, who won 7 titles and finished as runner-up once. On the other hand, Federer defeated Nadal four times in 2017 and had an outstanding winning rate. As a result, the estimated merit of Federer is higher than that of Nadal.

Moreover, we applied the Bradley–Terry model to the ATP matches from 1968 to 2016.

There are 2877 players in total after data cleaning. All players are ranked by their estimated

merits, and top 10 of them are presented in Table 5. The 10th player, Andre Agassi, is taken as

the baseline. The Big Four, Novak Djokovic, Rafael Nadal, Roger Federer and Andy Murray,

rank top four in the ranking list. They are considered dominant in terms of ranking and

the tournament victories from 2004 onwards. With this dataset and the application of the

Bradley–Terry model, the estimated chance that Federer defeats McEnroe in a hypothetical

match is 2.128/(2.128+1.104) =0.6584, even though the two had very similar winning rates

in reality. We remark that the correctness of this answer is limited by the assumption that the

players’ merits are ﬁxed and may be viewed as averaged over time. Further analysis using

more sophisticated dynamic models, such as the whole history rating (Coulom (2008)), may

be more appropriate. The static model in this paper serves as a basis for further extensions.
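The hypothetical Federer versus McEnroe probability quoted above follows directly from (1.1) and the merits in Table 5. A minimal sketch, with the helper name `win_probability` introduced here for illustration:

```python
def win_probability(merit_a, merit_b):
    """Bradley-Terry probability that the player with merit_a defeats the player with merit_b."""
    return merit_a / (merit_a + merit_b)


# Estimated merits taken from Table 5 (ATP 1968-2016 data, Andre Agassi as baseline).
merits = {"Roger Federer": 2.128, "John McEnroe": 1.104}

federer_beats_mcenroe = win_probability(merits["Roger Federer"], merits["John McEnroe"])
print(round(federer_beats_mcenroe, 4))
```

Because only the ratio of merits matters, the choice of baseline player does not affect this probability.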

4. Discussion. This paper provides an asymptotic theory of the MLE of the Bradley–Terry model when comparisons between any pair of subjects are sparse. The uniform consistency and asymptotic normality of the MLE are shown under conditions (2.5) and (2.13), respectively. Two quantities, the largest ratio of merits, $M_n$, and the probability of paired comparison, $p_n$, contribute to the accuracy of the MLE. When $M_n$ is bounded, we prove the uniform consistency holds under a nearly minimal condition of sparsity. The results of this paper may have broad applications and can be further generalized to other models in the sparse case, such as the Plackett–Luce model (Luce (1959)), which fits multiple comparisons, and the Rao–Kupper model (Rao and Kupper (1967)), which allows paired comparisons with ties.

APPENDIX A: PROOF OF LEMMA 1

PROOF. Let $E_n$ denote the event that Condition A holds. We will show that under condition (2.4), $P(E_n^c) \to 0$ as $n \to \infty$, that is, the probability that there is some partition of subjects into two nonempty groups for which the subjects in the second group never defeat the subjects in the first group tends to 0. The proof consists of two steps. Step 1 is about estimating the number of comparisons between the first and second groups. Step 2 is about computing the probability of $E_n^c$ and its convergence to 0.

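Step 1. Let $\Omega = \{0, 1, \ldots, n-1\}$ be the set of all $n$ subjects, and let $\Omega_r$ denote any subset of $\Omega$ with $r$ subjects. The number of comparisons between $\Omega_r$ and $\Omega_r^c$, denoted by $N_{\Omega_r,\Omega_r^c}$, can be expressed as

$N_{\Omega_r,\Omega_r^c} = \sum_{i \in \Omega_r,\ j \in \Omega_r^c} t_{ij}.$

Recall that $t_{ij}$ is the number of comparisons between subjects $i$ and $j$, which follows a binomial distribution. $N_{\Omega_r,\Omega_r^c}$ can be viewed as the sum of $Tr(n-r)$ independent identically distributed Bernoulli random variables with common success probability $p_n$.

We first estimate $N_{\Omega_r,\Omega_r^c}$ for a fixed $r$. Condition (2.4) implies $\log n/(np_n) \to 0$ as $n \to \infty$, since $M_n \ge 1$. It follows from the Chernoff bound (Chernoff (1952)) that, for a fixed $r \in \{1, \ldots, \lfloor n/2 \rfloor\}$ and $n > 32\log n/(Tp_n)$,

$P\left(\min_{\Omega_r} N_{\Omega_r,\Omega_r^c} \le \frac{T}{2} r(n-r)p_n\right) \le \sum_{\Omega_r} P\left(N_{\Omega_r,\Omega_r^c} \le \frac{T}{2} r(n-r)p_n\right)$
$\le \binom{n}{r} \sup_{\Omega_r} P\left(N_{\Omega_r,\Omega_r^c} \le \frac{T}{2} r(n-r)p_n\right)$
$\le n^r \sup_{\Omega_r} P\left(N_{\Omega_r,\Omega_r^c} \le \frac{T}{2} r(n-r)p_n\right)$
$\le \exp\left(-\frac{T}{8} r(n-r)p_n + r\log n\right)$
$\le \exp\left(-\frac{T}{16} r(n-r)p_n\right)$
$\le \exp\left(-\frac{T}{16}(n-1)p_n\right).$

The next-to-last inequality holds due to $n > 32\log n/(Tp_n)$. The definition of $N_{\Omega_r,\Omega_r^c}$ implies the symmetry $\min_{\Omega_r} N_{\Omega_r,\Omega_r^c} = \min_{\Omega_{n-r}} N_{\Omega_{n-r},\Omega_{n-r}^c}$. Therefore, for any fixed $r \in \{1, \ldots, n-1\}$ and $n > 32\log n/(Tp_n)$,

$P\left(\min_{|\Omega_r| = r} N_{\Omega_r,\Omega_r^c} \le \frac{T}{2} r(n-r)p_n\right) \le \exp\left(-\frac{T}{16}(n-1)p_n\right).$

Next, we estimate the lower bound of $N_{\Omega_r,\Omega_r^c}$ for all $r \in \{1, \ldots, n-1\}$. Let $F_n$ be the event that $N_{\Omega_r,\Omega_r^c} > Tr(n-r)p_n/2$ holds for all $r = 1, \ldots, n-1$ and all partitions of $\Omega$ into $\Omega_r$ and $\Omega_r^c$. Then, for $n > 32\log n/(Tp_n)$,

$P(F_n) \ge 1 - \sum_{r=1}^{n-1} P\left(\min_{|\Omega_r| = r} N_{\Omega_r,\Omega_r^c} \le \frac{T}{2} r(n-r)p_n\right)$
$\ge 1 - \sum_{r=1}^{n-1} \exp\left(-\frac{T}{16}(n-1)p_n\right)$
$\ge 1 - \exp\left(-\frac{T}{16}(n-1)p_n + \log(n-1)\right).$

Therefore, $P(F_n) \to 1$ as $n \to \infty$, since Condition (2.4) ensures $n > 32\log n/(Tp_n)$ for all large $n$.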

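Step 2. Since $M_n = \max_{0 \le i,j \le n-1} u_i/u_j \ge 1$,

(A.1) $\max_{0 \le i,j \le n-1} p_{ij} = \max_{0 \le i,j \le n-1} \dfrac{1}{1 + u_j/u_i} \le \dfrac{1}{1 + 1/M_n} \le \left(\dfrac{1}{2}\right)^{1/M_n}.$

Let $G_n^{(r)}$ denote the event that Condition A fails with the first group containing $r$ subjects. Then, by the definition of $F_n$,

$P\left(G_n^{(r)} \mid F_n\right) \le \binom{n}{r} \left(\max_{0 \le i,j \le n-1} p_{ij}\right)^{Tr(n-r)p_n/2} \le \binom{n}{r} \left[\left(\dfrac{1}{2}\right)^{1/M_n}\right]^{Tr(n-r)p_n/2} = \binom{n}{r} \left(\dfrac{1}{2}\right)^{Tr(n-r)p_n/(2M_n)}.$

Recall that $E_n^c$ is the event that Condition A fails. Write

$P\left(E_n^c \mid F_n\right) = P\left(\bigcup_{r=1}^{n-1} G_n^{(r)} \mid F_n\right) \le \sum_{r=1}^{n-1} P\left(G_n^{(r)} \mid F_n\right)$
$\le \sum_{r=1}^{n-1} \binom{n}{r} \left(\dfrac{1}{2}\right)^{Tr(n-r)p_n/(2M_n)}$
$\le 2\sum_{r=1}^{\lfloor n/2 \rfloor} \binom{n}{r} \left(\dfrac{1}{2}\right)^{Tr(n-r)p_n/(2M_n)}$
$\le 2\sum_{r=1}^{\lfloor n/2 \rfloor} \binom{n}{r} \left(\dfrac{1}{2}\right)^{Trnp_n/(4M_n)}$
$\le 2\left[\left(1 + \left(\dfrac{1}{2}\right)^{Tnp_n/(4M_n)}\right)^{n} - 1\right],$

which tends to 0 as $n \to \infty$, ensured by Condition (2.4). With the law of total probability,

$P\left(E_n^c\right) = P\left(E_n^c \mid F_n\right) P(F_n) + P\left(E_n^c \mid F_n^c\right) P\left(F_n^c\right),$

where $F_n^c$ is the complementary event of $F_n$. Since $P(E_n^c \mid F_n) \to 0$ and $P(F_n) \to 1$ as $n \to \infty$, it follows that $P(E_n^c) \to 0$ and $P(E_n) \to 1$ as $n \to \infty$. The proof is complete.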

APPENDIX B: PROOF OF UNIFORM CONSISTENCY

In this appendix, we show the proof of uniform consistency described in Theorem 2.1.

Some notations need to be introduced ﬁrst.


We ﬁrst make a transformation on ˆu

for i = 0,...,n− 1. Let ˜u

=ˆu

/(max

ˆu

).As

a result, max

0≤i≤n−1

˜u

=1andweset ˜u

=1. In addition, let

(B.1)

K =



2logn

log(np

)



,φ

log(np

)

log n

(1 +M

)

T +φ

={j : t

> 0},d

=I(t

> 0), d

n−1



j=0

and t

n−1



j=0

where · is the ﬂoor function and I(·) is the indicator function. We also need a sequence of

increasing number {D

}

k=1

to present the level of the closeness in terms of the ratios,

=β(1 +φ

)



for k = 0,...,K −1,

=40βT (1 + φ

)



and a sequence of nested or increasing sets {A

}

k=1

to collect the subjects which have ratios

close to max

ˆu

in the level of D

(B.2)



j :

˜u

≥1 −β(1 +φ

)





for k = 0,...,K −1,



j :

˜u

≥1 −40βT (1 +φ

)





where the constant β = 20T .

To prove Theorem 2.1, we need four additional lemmas, whose proofs are given after the proof of Theorem 2.1. For ease of illustration, we use the phrase “for all large $n$, the condition $S_n$ holds” to mean that there exists $n_0$ such that $S_n$ holds for all $n > n_0$.

LEMMA 2. Assume condition (2.5) holds. For all large $n$,

(B.3) P



max

0≤i≤n−1



−Z

−



< 2



log n



≥1 −3n

−3

where

(B.4)



j∈C



˜u

+˜u

−





j∈C

˜u

−˜u

( ˜u

+˜u

)(u

)

−



j∈C

−



˜u

+˜u

−



=−



j∈C

−

˜u

−˜u

( ˜u

+˜u

)(u

)



j :

˜u

,j ∈ C



and C

−



j :

˜u

≤

˜u

,j ∈ C



LEMMA 3. Assume condition (2.5) holds. For any $i \in A_k$ where $k < K-1$, let

(B.5) C

∗



j :j ∈C

˜u

≥1 −β(1 +φ

)

k+1





Then for all large n,





∗



≥q



≥1 −3n

−3


where $q_n$ and $d_i$ are defined in (B.1). For any $i \in A_{K-1}$, let

(B.6) C

∗



j :j ∈C

˜u

≥1 −40βT (1 +φ

)





Then for all large n,





∗



≥



≥1 −3n

−3

LEMMA 4. Assume condition (2.5) holds. For a set $A \subset \{0,\dots,n-1\}$, let $s = |A|$ denote the size of $A$. Define the set $B = \{j : \text{there exists } i \in A \text{ such that } t_{ij} > 0\}$. If $sT < p_n^{-1}$, then for all large $n$,



\[
P\Bigl(|B| > \Bigl(1 - \frac{4\sqrt{\log n}}{\sqrt{np_n}}\Bigr)\bigl(sTnp_n - s^2T^2np_n^2\bigr)\Bigr) \ge 1 - 2n^{-3sT}.
\]

If $sT = p_n^{-1}$, then for all large $n$,
\[
P\Bigl(|B| > \Bigl(1 - \frac{4\sqrt{\log n}}{\sqrt{np_n}}\Bigr)(1 - e^{-1})\,n\Bigr) \ge 1 - 2n^{-3sT}.
\]

LEMMA 5. Assume condition (2.5) holds. Recall that $A_k$ is defined in (B.2). For all large $n$,
\[
P\bigl(|A_k| \ge (np_n)^{k/2}\bigr) \ge 1 - 6kn^{-2} \quad\text{for } k = 0,\dots,K-2, \tag{B.7}
\]



K−1

|≥



≥1 −6(K −1)n

−2

,(B.8)

and

(B.9) P



|≥

21n



≥1 −6Kn

−2

REMARK 5. We give some insight into the proof of Lemma 5, which uses the facts proved in Lemmas 2 to 4. In particular, Lemma 5 is proved by mathematical induction. For illustration, we assume that $u_i = 1$ for all $i$. Then $A_k$ is exactly the set of subjects whose estimators are close to the maximum estimator $\max_{0\le i\le n-1} \tilde u_i$ at the level $D_k$. We aim to show that there are more than $n/2$ subjects whose estimators are close to the maximum estimator at the level $D_K$. In the first step, we begin with $A_0 = \{i_0\}$. By Lemma 2, we know that $Z_{i_0}^- = 0$, so that $Z_{i_0}^+ = O(\sqrt{\log n/(np_n)})$. This means that there are some $\tilde u_j$ very close to $\tilde u_{i_0}$, where $j \in C_{i_0}$ ($j$ is a neighbor of $i_0$). Moreover, Lemma 3 gives the proportion of such $j$ in $C_{i_0}$ and Lemma 4 gives the size of $C_{i_0}$. Eventually, we put $i_0$ and such $j$ together to generate the set $A_1$, whose size is obtained from Lemma 5.

Next, by Lemma 2, given $i \in A_k$, the relevant quantities $Z_i^+$ and $Z_i^-$ associated with subjects in $C_i^+$ and $C_i^-$ are balanced. If $C_i^-$ has a large size, then $C_i^*$, the set of neighbors of $i$ with estimators large enough to be in $A_{k+1}$, will automatically be large; if $C_i^-$ does not have a large size, the balance of $Z_i^+$ and $Z_i^-$ dictates that $C_i^+$ still has a large subset in $A_{k+1}$. The detailed arguments are given in Cases 1 and 2 in the proof of Lemma 3. Then, we find the subjects in $A_{k+1}$ through the neighbors of the subjects in $A_k$. We repeat the process until the size of $A_k$ is larger than $n/2$.
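The induction sketched in Remark 5 grows a set of subjects through successive neighborhoods of the comparison graph. The following Python sketch is only an illustration under assumptions that are not from the paper: it ignores the estimators entirely, treats every neighbor as admissible (as if the admissible proportion were 1), and uses hypothetical helper names (`comparison_graph`, `expansion_sizes`). It shows how quickly repeated neighborhood expansion from a single subject can cover more than $n/2$ subjects, next to the value $K = \lfloor 2\log n/\log(np_n)\rfloor$.

```python
import math
import random


def comparison_graph(n, T, p, rng):
    """Adjacency sets of the comparison graph: i ~ j iff t_ij > 0,
    where t_ij ~ Binomial(T, p) independently for each pair i < j."""
    neighbours = [set() for _ in range(n)]
    p_edge = 1.0 - (1.0 - p) ** T  # P(t_ij > 0)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p_edge:
                neighbours[i].add(j)
                neighbours[j].add(i)
    return neighbours


def expansion_sizes(neighbours, start=0):
    """Sizes of the sets obtained by repeatedly adding all neighbours,
    starting from {start}, until the set stops growing."""
    current = {start}
    sizes = [1]
    while True:
        grown = set(current)
        for i in current:
            grown |= neighbours[i]
        if len(grown) == len(current):
            return sizes
        current = grown
        sizes.append(len(current))


rng = random.Random(0)
n, T, p = 500, 1, 0.04  # illustrative values only, n * p = 20
sizes = expansion_sizes(comparison_graph(n, T, p, rng))
K = math.floor(2 * math.log(n) / math.log(n * p))
steps_to_half = next((k for k, s in enumerate(sizes) if s > n / 2), None)
print("set sizes per step:", sizes)
print("K =", K, "; first step exceeding n/2:", steps_to_half)
```

In the actual proof only a proportion of the neighbors is retained at each step, so the growth per step is of order $\sqrt{np_n}$ rather than $np_n$; the sketch is meant only to convey the mechanism.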


B.1. Proof of Theorem 2.1.

PROOF. The proof mainly contains two parts. One is to show that the number of subjects whose ratios of estimated merits to real merits, $\hat u_i/u_i$, are close to the largest ratio $\max_{0\le i\le n-1} \hat u_i/u_i$ is larger than $n/2$. The other is to show that the number of subjects whose ratios are close to the smallest ratio is also larger than $n/2$.

Step 1. Observe that (B.9), with proof given in that of Lemma 5, implies, for all large n,

(B.10) P







j :

˜u

≥1 −40βT (1 +φ

)







≥

21n



≥1 −6Kn

−2

Under condition (2.5), it follows that

K ≤ 2logn and (1 +φ

)

≤e

Let λ = 40βT e

, from (B.10), we obtain for all large n,

(B.11) P







j :

˜u

≥1 −λM







≥

21n



≥1 −12n

−2

log n.

Notice that $\tilde u_i = \hat u_i/(\max_{0\le j\le n-1} \hat u_j/u_j)$. Thus, (B.11) can be written as







j :

ˆu

max

0≤i≤n−1

( ˆu

)

≥1 −λM







≥

21n



≥1 −12n

−2

log n.

It means that with probability approaching 1 as n →∞,

(B.12)





j :

ˆu

max

0≤i≤n−1

( ˆu

)

≥1 −λM







≥

21n

Step 2. Next, we will show that the number of subjects whose ratios are close to the smallest

ratio is also larger than n/2.

Let $\bar u_i = \hat u_i/(\min_{0\le j\le n-1} \hat u_j/u_j)$. Similar to (B.2), we define



j :

¯u

≤1 +β(1 +φ

)





for k = 0,...,K −1,



j :

¯u

≤1 +40βT (1 +φ

)





Compared with (B.9), we can obtain



|≥

21n



≥1 −6Kn

−2

by a proof similar to those of Lemmas 2 to 5. Similar to Step 1, with probability approaching 1 as $n \to \infty$,

(B.13)





j :

ˆu

min

0≤i≤n−1

( ˆu

)

≤1 +λM







≥

21n

Combining $\hat u_0/u_0 = 1$, (B.12) and (B.13), it can be shown that, with probability approaching 1 as $n \to \infty$,

1 −λM



1 +λM



≤ min

0≤i≤n−1

ˆu

≤ max

0≤i≤n−1

ˆu

≤

1 +λM



1 −λM



Consequently,

max

0≤i≤n−1



ˆu

−1



≤

2λM



1 −λM




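Since the right-hand side tends to 0 as $n \to \infty$ under condition (2.5), with probability approaching 1,
\[
\max_{0\le i\le n-1}\Bigl|\frac{\hat u_i}{u_i} - 1\Bigr| \to 0 \quad\text{as } n \to \infty.
\]
Except for the deferred proofs of Lemmas 2 to 5, the proof of Theorem 2.1 is complete.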
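To make the quantity controlled by Theorem 2.1 concrete, the following Python sketch simulates a small comparison network and computes $\max_i |\hat u_i/u_i - 1|$. It is a minimal illustration under assumptions that are not from the paper: the function names `simulate` and `fit_mle` are hypothetical, the parameter values are arbitrary, and the estimator is computed with a standard fixed-point (Zermelo/MM-type) iteration, which is a swapped-in computational choice rather than a method described in this appendix. The code assumes that Condition A holds for the simulated data, so that the maximum likelihood estimate exists.

```python
import random


def simulate(n, T, p, u, rng):
    """Draw t_ij ~ Binomial(T, p) and win counts a_ij ~ Binomial(t_ij, u_i / (u_i + u_j))."""
    t = [[0] * n for _ in range(n)]
    wins = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            t_ij = sum(rng.random() < p for _ in range(T))
            w_ij = sum(rng.random() < u[i] / (u[i] + u[j]) for _ in range(t_ij))
            t[i][j] = t[j][i] = t_ij
            wins[i][j], wins[j][i] = w_ij, t_ij - w_ij
    return t, wins


def fit_mle(t, wins, tol=1e-12, max_iter=5000):
    """Fixed-point iteration for a_i = sum_j t_ij u_i / (u_i + u_j),
    normalised so that the estimate of subject 0 equals 1.
    Assumes every subject has at least one win and one loss (Condition A)."""
    n = len(t)
    a = [sum(row) for row in wins]  # a_i: total wins of subject i
    u_hat = [1.0] * n
    for _ in range(max_iter):
        new = [
            a[i] / sum(t[i][j] / (u_hat[i] + u_hat[j]) for j in range(n) if t[i][j] > 0)
            for i in range(n)
        ]
        new = [x / new[0] for x in new]  # normalise: u_hat[0] = 1
        change = max(abs(x - y) for x, y in zip(new, u_hat))
        u_hat = new
        if change < tol:
            break
    return u_hat


rng = random.Random(2020)
n, T, p = 50, 3, 0.3  # illustrative values only
u = [1.0 + i / (n - 1) for i in range(n)]  # true merits in [1, 2], with u_0 = 1
t, wins = simulate(n, T, p, u, rng)
u_hat = fit_mle(t, wins)
a = [sum(row) for row in wins]
residual = max(
    abs(a[i] - sum(t[i][j] * u_hat[i] / (u_hat[i] + u_hat[j]) for j in range(n)))
    for i in range(n)
)
max_rel_err = max(abs(u_hat[i] / u[i] - 1.0) for i in range(n))
print("likelihood-equation residual:", residual)
print("max_i |u_hat_i / u_i - 1|:", max_rel_err)
```

With so few subjects the maximum relative error is not expected to be small; the theorem concerns the limit $n \to \infty$ under condition (2.5).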

B.2. Proof of Lemma 2.

PROOF. The proof of Lemma 2 contains three steps. In the first step, we find an upper bound of $|a_i - E(a_i \mid t_{ij}, 0 \le j \le n-1)|$ for fixed $i$. In the second step, we find an upper bound of $|Z_i^+ - Z_i^-|$ for fixed $i$ using the first step. In the third step, we find a uniform upper bound of $|Z_i^+ - Z_i^-|$ for $i = 0,\dots,n-1$.

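Step 1. Recall that $t_i = \sum_{0\le j\le n-1} t_{ij}$ and $a_i$ is the total number of wins of subject $i$ in $t_i$ comparisons. Since the outcome of each comparison is independent of the other comparisons, $a_i$ is the sum of $m_i$ independent Bernoulli random variables given $t_{ij}$, for $j = 0,\dots,n-1$, where $m_i = \sum_{j=0}^{n-1} t_{ij}$. With the use of Hoeffding's inequality (Hoeffding (1963)), we have
\begin{align*}
&P\Bigl(\bigl|a_i - E(a_i \mid t_{ij}, 0\le j\le n-1)\bigr| \ge \sqrt{2Tt_i\log n} \Bigm| t_{ij}, 0\le j\le n-1\Bigr)\\
&\quad= P\Bigl(\bigl|a_i - E(a_i \mid t_{ij}, 0\le j\le n-1)\bigr| \ge \sqrt{2Tm_i\log n} \Bigm| t_{ij}, 0\le j\le n-1\Bigr)\\
&\quad\le 2\exp\bigl\{-(4Tm_i\log n)/m_i\bigr\} = 2n^{-4T} \le 2n^{-4},
\end{align*}
where $E(a_i \mid t_{ij}, 0\le j\le n-1)$ is the conditional expectation given $t_{ij}$ for $0\le j\le n-1$. Note that the upper bound of the above probability does not depend on $m_i$. With the law of total probability, for fixed $i$,
\begin{align*}
&P\Bigl(\bigl|a_i - E(a_i \mid t_{ij}, 0\le j\le n-1)\bigr| \ge \sqrt{2Tt_i\log n}\Bigr)\\
&\quad= \sum_{t_{i,0}}\cdots\sum_{t_{i,n-1}} P(t_{ij}, 0\le j\le n-1)\\
&\qquad\times P\Bigl(\bigl|a_i - E(a_i \mid t_{ij}, 0\le j\le n-1)\bigr| \ge \sqrt{2Tt_i\log n} \Bigm| t_{ij}, 0\le j\le n-1\Bigr)\\
&\quad\le 2n^{-4}\sum_{t_{i,0}}\cdots\sum_{t_{i,n-1}} P(t_{ij}, 0\le j\le n-1) = 2n^{-4}.
\end{align*}
Thus, with probability at least $1 - 2n^{-4}$, for any fixed $i$,
\[
\bigl|a_i - E(a_i \mid t_{ij}, 0\le j\le n-1)\bigr| < \sqrt{2Tt_i\log n}. \tag{B.14}
\]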
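The arithmetic behind the bound $2n^{-4T}$ in Step 1 can be checked numerically. The short Python sketch below evaluates the two-sided Hoeffding bound $2\exp(-2x^2/m)$ for a sum of $m$ independent $[0,1]$-valued variables at the deviation $x = \sqrt{2Tm\log n}$; the helper name `hoeffding_tail_bound` and the values of $n$, $T$ and $m$ are illustrative assumptions.

```python
import math


def hoeffding_tail_bound(m, deviation):
    """Two-sided Hoeffding bound 2 * exp(-2 * deviation**2 / m) for a sum of
    m independent random variables taking values in [0, 1]."""
    return 2.0 * math.exp(-2.0 * deviation ** 2 / m)


n, T = 1000, 2
for m in (1, 10, 250):  # m plays the role of m_i, the number of comparisons of subject i
    deviation = math.sqrt(2 * T * m * math.log(n))
    bound = hoeffding_tail_bound(m, deviation)
    # The bound equals 2 * n**(-4T) whatever the value of m ...
    assert math.isclose(bound, 2.0 * n ** (-4 * T), rel_tol=1e-9)
    # ... and is therefore at most 2 * n**(-4) since T >= 1.
    assert bound <= 2.0 * n ** (-4)
    print(m, bound)
```

The cancellation of $m$ in the exponent is what allows the conditioning on the $t_{ij}$ to be removed by the law of total probability.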

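Step 2. Recall that the maximum likelihood estimator $\hat u_i$ satisfies
\[
a_i = \sum_{j=0}^{n-1} a_{ij} = \sum_{j=0}^{n-1} t_{ij}\frac{\hat u_i}{\hat u_i+\hat u_j}.
\]
Since $\tilde u_i = \hat u_i/(\max_{0\le j\le n-1} \hat u_j/u_j)$, we can rewrite the above equation as
\[
a_i = \sum_{j=0}^{n-1} a_{ij} = \sum_{j=0}^{n-1} t_{ij}\frac{\tilde u_i}{\tilde u_i+\tilde u_j}.
\]
Then,
\[
a_i - E(a_i \mid t_{ij}, 0\le j\le n-1) = \sum_{j=0}^{n-1} t_{ij}\Bigl(\frac{\tilde u_i}{\tilde u_i+\tilde u_j} - \frac{u_i}{u_i+u_j}\Bigr) = t_i\bigl(Z_i^+ - Z_i^-\bigr).
\]
Based on (B.14), we obtain that, with probability at least $1 - 2n^{-4}$,
\[
\bigl|Z_i^+ - Z_i^-\bigr| < \sqrt{\frac{2T\log n}{t_i}}.
\]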

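Step 3. We first find a uniform lower bound of $t_i$ for $i = 0,\dots,n-1$. Notice that $t_i$ is the sum of $n$ independent and identically distributed (i.i.d.) binomial random variables, $\mathrm{Bin}(T, p_n)$. It can also be regarded as the sum of $Tn$ i.i.d. Bernoulli random variables. With the use of the Chernoff bound (Chernoff (1952)), we have
\[
P\Bigl(\min_{0\le i\le n-1} t_i \le \frac{Tnp_n}{2}\Bigr) \le \sum_{i=0}^{n-1} P\Bigl(t_i \le \frac{Tnp_n}{2}\Bigr) \le n\exp\Bigl\{-\frac{Tnp_n}{12}\Bigr\}.
\]
Thus, with probability at least $1 - n\exp\{-(Tnp_n)/12\}$,
\[
\min_{0\le i\le n-1} t_i \ge \frac{Tnp_n}{2}, \tag{B.15}
\]
which means that every $t_i$ is at least of order $np_n$. According to the result of Step 2,
\[
P\Bigl(\max_{0\le i\le n-1}\bigl|Z_i^+ - Z_i^-\bigr| \ge \sqrt{2T\log n/\min_{0\le i\le n-1} t_i}\Bigr) \le \sum_{i=0}^{n-1} P\Bigl(\bigl|Z_i^+ - Z_i^-\bigr| \ge \sqrt{2T\log n/t_i}\Bigr) \le n\times 2n^{-4} = 2n^{-3}.
\]
By (B.15), with probability at least $1 - 2n^{-3} - n\exp\{-(Tnp_n)/12\}$,
\[
\max_{0\le i\le n-1}\bigl|Z_i^+ - Z_i^-\bigr| < 2\sqrt{\frac{\log n}{np_n}}.
\]
Meanwhile, condition (2.5) implies $(\log n)/(np_n) \to 0$ as $n \to \infty$. Therefore,
\[
n\exp\Bigl\{-\frac{Tnp_n}{12}\Bigr\} < n^{-3}
\]
for all large $n$. As a result, with probability at least $1 - 3n^{-3}$,
\[
\max_{0\le i\le n-1}\bigl|Z_i^+ - Z_i^-\bigr| < 2\sqrt{\frac{\log n}{np_n}}.
\]
This completes the proof.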

B.3. Proof of Lemma 3.

PROOF. We first consider the case when $k < K-1$. Recall that $A_k$ is defined in (B.2). For any $i \in A_k$, we aim to show that for all large $n$, with probability at least $1 - 3n^{-3}$,
\[
|C_i^*| \ge q_n d_i,
\]
where $C_i^*$ is defined in (B.5). Observe that $C_i^- \subset C_i^*$ for any $i \in A_k$.


Case 1. If $|C_i^-| \ge q_n d_i$, then we have
\[
|C_i^*| \ge |C_i^-| \ge q_n d_i.
\]

Case 2. If $|C_i^-| < q_n d_i$, we set $|C_i^-| = \alpha d_i$, where $\alpha < q_n$. We need to show that for all large $n$, with probability at least $1 - 3n^{-3}$,





j :j ∈C

˜u

> 1 −β(1 +φ

)

k+1







≥(q

−α)d

We use Lemma 2 to prove the above inequality, and our proof contains three steps. Recall that $Z_i^+$ and $Z_i^-$ are defined in (B.4). The first step is to find a lower bound of $Z_i^+$ and an upper bound of $Z_i^-$. The second step is to show that the number of subjects that have comparisons with $i$ and ratios close to $\hat u_i/u_i$ is larger than $(q_n - \alpha)d_i$, that is,





j :

˜u

> 1 −βφ

(1 +φ

)



,j ∈ C





≥(q

−α)d

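The last step is to prove that each subject $j$ included in the above set belongs to $C_i^*$.

Step 1. For $Z_i^-$, we have
\begin{align*}
Z_i^- &= -\sum_{j\in C_i^-}\frac{\tilde u_i u_j - \tilde u_j u_i}{(\tilde u_i+\tilde u_j)(u_i+u_j)}\cdot\frac{t_{ij}}{t_i}\\
&= \sum_{j\in C_i^-}\frac{\tilde u_j/u_j - \tilde u_i/u_i}{\bigl(\tilde u_i/u_i + \tilde u_j/u_j \times u_j/u_i\bigr)\bigl(1 + u_i/u_j\bigr)}\cdot\frac{t_{ij}}{t_i}.
\end{align*}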

Since M



→0asn →∞,

β(1 +φ

)



≤βe



≤

for all large n. It can be given that for i ∈A

, j ∈C

−

and for all large n,

≤1 −β(1 +φ

)



≤˜u

≤1.

Therefore, for all large n,

−

≤



j∈C

−

1 −(1 −β(1 + φ

)



)

(1 +u

)(1 +u

)

≤

αβ

(1 +φ

)



The last inequality is from

4 ≤



1 +



1 +



≤

(1 +M

)

Similarly, for Z

, we obtain



j∈C

˜u

−˜u

( ˜u

+˜u

)(u

)



j∈C

˜u

−˜u

( ˜u

+˜u

×u

)(1 +u

)

≥

(1 +M

)



j∈C



1 −

˜u




From above, we have bounds for Z

−

and Z

respectively, for all large n,

(B.16) Z

−

≤

αβ

(1 +φ

)



≥

(1 +M

)



j∈C



1 −

˜u



Step 2. According to Lemma 2, we have for all large $n$, with probability at least $1 - 3n^{-3}$,

−

αβ

(1 +φ

)



≤Z

−Z

−

≤2



log n

Since 

√

log n/(np

), for all large n, with probability at least 1 −3n

−3

≤

αβ

(1 +φ

)





log n



αβ

(1 +φ

)

+2φ





Then, from (B.16), it follows that for all large n, with probability at least 1 − 3n

−3

(1 +M

)



j∈C



1 −

˜u



≤



αβ

(1 +φ

)

+2φ





Noticing that $|C_i^+| = |C_i| - |C_i^-| = (1-\alpha)d_i$, we can rewrite the above inequality as follows: for all large $n$, with probability at least $1 - 3n^{-3}$,



j∈C



1 −

˜u



≤

T(1 + M

)

(1 −α)



αβ

(1 +φ

)

+2φ





Set the $x$th percentile of $\{(\tilde u_j/u_j)/(\tilde u_i/u_i) : j \in C_i^+\}$ to be $b$, where $x = (1-q_n)/(1-\alpha)$.



j∈C



1 −

˜u



=1 −



j∈C

˜u

it follows that for all large n, with probability at least 1 −3n

−3



j∈C

˜u

≥ 1 −

T(1 +M

)

(1 −α)



αβ

(1 +φ

)

+2φ





1 −x + b

x ≥ 1 −

T(1 +M

)

(1 −α)



αβ

(1 +φ

)

+2φ





≥ 1 −

T(1 +M

)

(1 −α)x



αβ

(1 +φ

)

+2φ





Since $x = (1-q_n)/(1-\alpha)$ and $0 \le \alpha < q_n$, where $q_n$ is defined in (B.1), we have for all large $n$, with probability at least $1 - 3n^{-3}$,

≥ 1 −

T(1 +M

)

(1 −q

)



αβ

(1 +φ

)

+2φ





> 1 −

T(1 +M

)

(1 −q

)



(1 +φ

)

+2φ






≥ 1 −

(1 +M

)

T +φ



2(1 +M

)

T +2φ

(1 +φ

)

+2φ





≥ 1 −



(1 +φ

)

(1 +M

)

T +φ

×2φ





≥ 1 −



(1 +φ

)

+8TM

+2φ





Given β =20T , we can rewrite above inequality as

> 1 −βφ

(1 +φ

)



which means





j :

˜u

> 1 −βφ

(1 +φ

)



,j ∈ C





≥(1 −x)(1 −α)d

=(q

−α)d

Step 3. Since i ∈ A

={j : ( ˜u

) ≥ 1 − β(1 +φ

)



},foranyj ∈{j : ( ˜u

( ˜u

)>1 − βφ

(1 + φ

)



,j ∈ C

}, it follows that for all large n, with probability

at least 1 −3n

−3

˜u



1 −βφ

(1 +φ

)





1 −β(1 +φ

)





≥ 1 −βφ

(1 +φ

)



−β(1 +φ

)



= 1 −β(1 +φ

)

k+1



which implies





j :

˜u

> 1 −β(1 +φ

)

k+1



,j ∈ C





≥(q

−α)d

Therefore, for all large n, with probability at least 1 −3n

−3



∗



≥



−





j :

˜u

> 1 −β(1 +φ

)

k+1



,j ∈ C





≥ αd

+(q

−α)d

= q

For $k = K-1$, we can obtain the result with the same proof as above, except for replacing $q_n$ with $19/20$. Hence, for all large $n$, with probability at least $1 - 3n^{-3}$,



∗





j :j ∈C

˜u

≥1 −40βT (1 + φ

)







≥

The proof is complete. 

B.4. Proof of Lemma 4.

PROOF. We present the proof in two steps. The first step is to find a lower bound of $|B|$ when $A$ is a deterministic set. The second step is to extend the result of the first step to the case when $A$ is a random set.

Step 1. Let $A$ be a nonrandom set with size $s \le (Tp_n)^{-1}$. For any $j \in \{0,\dots,n-1\}$,
\[
P(t_{ij} = 0 \text{ for all } i \in A) = (1-p_n)^{sT}.
\]
Set $y = 1 - (1-p_n)^{sT}$ and $\eta_j = I(j \in B)$. So $\eta_j = 1$ if $j$ has a comparison with someone in $A$; otherwise $\eta_j = 0$. We know that $P(\eta_j = 1) = y$.

Since M



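→ 0 as $n \to \infty$, $4\sqrt{\log n} < \sqrt{np_n}$ for all large $n$. Thus, by the Chernoff bound (Chernoff (1952)), for all large $n$,
\begin{align*}
P\Bigl(|B| \le \Bigl(1 - \frac{4\sqrt{\log n}}{\sqrt{np_n}}\Bigr)ny\Bigr)
&\le 2\exp\Bigl\{-\frac{16\,ny\log n}{2np_n}\Bigr\}\\
&= 2\exp\Bigl\{-\frac{8\bigl(1-\exp(sT\log(1-p_n))\bigr)\log n}{p_n}\Bigr\}\\
&\le 2\exp\Bigl\{-\frac{8\bigl(1-\exp(-sTp_n)\bigr)\log n}{p_n}\Bigr\}\\
&\le 2\exp\{-4sT\log n\}.
\end{align*}
Here, the second and third inequalities are based on $\log(1-x) \le -x$ and $x \le 2(1-\exp(-x))$, respectively, when $0 < x < 1$.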
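The two elementary inequalities used in Step 1, and the resulting ordering of the exponents in the chain of bounds, can be checked numerically. The Python sketch below does this on a grid and for one arbitrary parameter choice with $sTp_n \le 1$; the helper name `chernoff_exponents` and all numerical values are illustrative assumptions.

```python
import math

# Grid check of the two elementary inequalities on 0 < x < 1.
grid = [k / 1000 for k in range(1, 1000)]
assert all(math.log(1 - x) <= -x for x in grid)
assert all(x <= 2 * (1 - math.exp(-x)) for x in grid)


def chernoff_exponents(n, s, T, p):
    """Exponents appearing in the chain of bounds, in the order of the display."""
    y = 1 - (1 - p) ** (s * T)
    first = 16 * n * y * math.log(n) / (2 * n * p)
    second = 8 * (1 - math.exp(-s * T * p)) * math.log(n) / p
    third = 4 * s * T * math.log(n)
    return first, second, third


first, second, third = chernoff_exponents(n=10_000, s=20, T=2, p=0.01)
# Here s * T * p = 0.4 <= 1, so the exponents should be ordered first >= second >= third,
# which makes the corresponding probability bounds increase along the chain.
assert first >= second >= third
print(first, second, third)
```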

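Step 2. For any set $A \subset \{0,\dots,n-1\}$ with size $s$ and all large $n$, it follows that
\begin{align*}
&P\Bigl(\min_{|A|=s}\bigl|\{j : \text{there exists } i \in A \text{ such that } t_{ij} > 0\}\bigr| \le \Bigl(1 - \frac{4\sqrt{\log n}}{\sqrt{np_n}}\Bigr)ny\Bigr)\\
&\quad\le \sum_{|A|=s} P\Bigl(\bigl|\{j : \text{there exists } i \in A \text{ such that } t_{ij} > 0\}\bigr| \le \Bigl(1 - \frac{4\sqrt{\log n}}{\sqrt{np_n}}\Bigr)ny\Bigr)\\
&\quad\le 2\binom{n}{s}\exp\{-4sT\log n\}\\
&\quad\le 2n^{s}\exp\{-4sT\log n\}\\
&\quad\le 2n^{-3sT}.
\end{align*}
In summary,
\[
P\Bigl(\min_{|A|=s}\bigl|\{j : \text{there exists } i \in A \text{ such that } t_{ij} > 0\}\bigr| \le \Bigl(1 - \frac{4\sqrt{\log n}}{\sqrt{np_n}}\Bigr)ny\Bigr) \le 2n^{-3sT} \to 0 \quad\text{as } n \to \infty.
\]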

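Since, for any set $A$ with size $s$,
\[
|B| \ge \min_{|A|=s}\bigl|\{j : \text{there exists } i \in A \text{ such that } t_{ij} > 0\}\bigr|,
\]
it follows that, for all large $n$, with probability at least $1 - 2n^{-3sT}$,
\[
|B| > \Bigl(1 - \frac{4\sqrt{\log n}}{\sqrt{np_n}}\Bigr)ny.
\]
Recall that $y = 1 - (1-p_n)^{sT}$. For $sT < p_n^{-1}$,
\[
ny = n\bigl(1 - (1-p_n)^{sT}\bigr) \ge sTnp_n - s^2T^2np_n^2,
\]
while for $sT = p_n^{-1}$,
\[
ny = n\bigl(1 - (1-p_n)^{sT}\bigr) \ge n\bigl(1 - e^{-1}\bigr).
\]
The proof is complete.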
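As a sanity check on the size of the neighbor set $B$ studied in Lemma 4, the following Python sketch compares a Monte Carlo estimate of $E|B|$ for a fixed set $A$ with the value $ny = n(1-(1-p_n)^{sT})$. It is a rough illustration under stated assumptions: the function name `neighbour_count` and all parameter values are hypothetical, every subject $j$ is treated alike (as in the proof, where $P(\eta_j = 1) = y$ for any $j$), and the last printed quantity is the elementary second-order bound $1-(1-p)^m \ge mp - (mp)^2$, included only for comparison.

```python
import random


def neighbour_count(n, s, T, p, rng):
    """|B| for a fixed set A of size s: the number of subjects j with at least one
    comparison with some i in A, where t_ij ~ Binomial(T, p) independently."""
    count = 0
    for _ in range(n):  # loop over subjects j
        # j is in B iff at least one of the s * T potential comparisons takes place
        if any(rng.random() < p for _ in range(s * T)):
            count += 1
    return count


rng = random.Random(1)
n, s, T, p = 1000, 5, 2, 0.01  # illustrative values with s * T * p < 1
y = 1 - (1 - p) ** (s * T)     # P(eta_j = 1)
reps = 100
mean_B = sum(neighbour_count(n, s, T, p, rng) for _ in range(reps)) / reps
elementary_lower_bound = s * T * n * p - (s * T) ** 2 * n * p ** 2
print("n * y =", n * y)
print("simulated mean of |B| =", mean_B)
print("elementary lower bound on n * y =", elementary_lower_bound)
```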


B.5. Proof of Lemma 5.

PROOF. Our aim is to show that there exists a uniform constant $C$ such that when $n > C$, (B.7), (B.8) and (B.9) hold. $C$ is defined as the largest $n$ that violates at least one of the following inequalities,

(B.17)

−3

>nexp



−

Tnp



√

> 4



log n,

βe



≤

and q

√

−8



log n>2,

where $q_n$ is defined in (B.1) and $\beta$ is defined in (B.9). Condition (2.5) ensures the existence of $C$. Then every $n > C$ satisfies all the inequalities in (B.17).

We prove Lemma 5 by mathematical induction.

(1) For $k = 0$, it is obvious that
\[
|A_0| \ge |\{i_0\}| = (np_n)^{0} = 1.
\]
Therefore, we obtain
\[
P\bigl(|A_0| \ge (np_n)^{0}\bigr) \ge 1.
\]

(2) For $k < K-2$, consider the event that $|A_k| \ge (np_n)^{k/2}$. Assume that, when $n > C$, this event has probability at least $1 - 6kn^{-2}$. We then proceed under the condition that this event happens. Without loss of generality, let $|A_k| = (np_n)^{k/2}/T$; otherwise we consider any of its subsets with size $(np_n)^{k/2}/T$.

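For any $i \in A_k$, we have $C_i^* \subset A_{k+1}$. Hence $\bigcup_{i\in A_k} C_i^* \subset A_{k+1}$. Consequently,
\[
|A_{k+1}| \ge \Bigl|\bigcup_{i\in A_k} C_i^*\Bigr| \ge \Bigl|\bigcup_{i\in A_k} C_i\Bigr| - \Bigl|\bigcup_{i\in A_k}\bigl(C_i \setminus C_i^*\bigr)\Bigr|. \tag{B.18}
\]
Next we estimate $|\bigcup_{i\in A_k}(C_i \setminus C_i^*)|$ and $|\bigcup_{i\in A_k} C_i|$ by Lemma 3 and Lemma 4, respectively. Note that $\bigcup_{i\in A_k} C_i = \{j : j \text{ has a comparison with anyone in } A_k\}$ and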

(B.19) |A

(np

)

≤

(np

)

K−3

≤

(np

)

−

Based on Lemma 4, we know that when $n > C$, with probability at least $1 - 2n^{-3T|A_k|}$,

(B.20)





i∈A



≥



1 −

√

log n

√





(np

)

−(np

)



Meanwhile, from Lemma 3, we know that for $i \in A_k$, when $n > C$, with probability at least $1 - 3n^{-3}$,
\[
|C_i \setminus C_i^*| \le (1-q_n)d_i,
\]
where $q_n$ and $d_i$ are defined in (B.1). As a result, when $n > C$, with probability at least $1 - 3|A_k|n^{-3}$,

(B.21)



i∈A



∗



≤(1 −q

)



i∈A


Based on the Chernoff bound (Chernoff (1952)), the range of $d_i$ is given as



≥



1 +

√

log n

√



Tnp



≤ P



≥



1 +

√

log n

√



E(d

)



≤ exp



−

log n



where the ﬁrst inequality is from E(d

) =1 −(1 −p

)

≤Tp

. Consequently,





max

0≤i≤n−1



≥



1 +

√

log n

√



Tnp



≤n exp



−

log n



≤n

−4

which implies



max

0≤i≤n−1





1 +

√

log n

√



Tnp

with probability at least $1 - n^{-4}$. So we can rewrite (B.21) as follows: when $n > C$, with probability at least $1 - 3|A_k|n^{-3} - n^{-4}$,

(B.22)



i∈A



∗



≤(1 −q

)



1 +

√

log n

√



(np

)

Based on (B.18), (B.20) and (B.22), when $n > C$, with probability at least $1 - 2n^{-3T|A_k|} - 3|A_k|n^{-3} - n^{-4}$,

k+1

|≥



1 −

√

log n

√





(np

)

−(np

)



−(1 −q

)



1 +

√

log n

√



(np

)

≥(np

)



−

√

log n

√

−(np

)



Due to $1 \le |A_k| \le n$, $1 - 2n^{-3T|A_k|} - 3|A_k|n^{-3} - n^{-4} \ge 1 - 6n^{-2}$. Thus, when $n > C$, with probability at least $1 - 6n^{-2}$,

(B.23) |A

k+1

|≥(np

)



−

√

log n

√

−(np

)



According to (B.19), (np

)

k/2

≤(np

)

−1/2

. Meanwhile, when n>C, from (B.17), we have

√



−

√

log n

√

−(np

)



= q

√

−8



log n − (np

)

√

≥ 2 −(np

)

√

≥ 2 −





−

×p

√

=1.

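Hence, we can rewrite (B.23) as follows: when $n > C$, with probability at least $1 - 6n^{-2}$,
\[
|A_{k+1}| \ge (np_n)^{(k+1)/2}.
\]
That is, when $n > C$,
\[
P\Bigl(|A_{k+1}| \ge (np_n)^{(k+1)/2} \Bigm| |A_k| \ge (np_n)^{k/2}\Bigr) \ge 1 - 6n^{-2}.
\]
Given $P(|A_k| \ge (np_n)^{k/2}) \ge 1 - 6kn^{-2}$, we obtain, when $n > C$,
\[
P\bigl(|A_{k+1}| \ge (np_n)^{(k+1)/2}\bigr) \ge 1 - 6(k+1)n^{-2}.
\]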


(3) For k = K −2, we assume that when n>C,



K−2

|≥(np

)

K−2



≥1 −6(K −2)n

−2

Notice that (np

)

(K−2)/2

≥ (np

)

−1/2

. We choose a subset of A

K−2

with size (np

)

−1/2

/T .

Proceed similarly as (2), so when n>C, with probability at least 1 −6(K −1)n

−2

K−1

|≥



1 −

√

log n

√



√

−



−(1 −q

)



1 +

√

log n

√



√

≥

√



−

√

log n

√

−

√



≥

(4) For k = K −1, we assume that when n>C,



K−1

|≥



≥1 −6(K −1)n

−2

We choose a subset of A

K−1

with size (Tp

)

−1

and can complete the proof similar to (2) by

replacing q

with 19/20 and using the second case of Lemma 3 and Lemma 4. Hence, when

n>C, with probability at least 1 − 6Kn

−2

|≥



1 −

√

log n

√



n −



1 +

√

log n

√



n −

√

log n

√

≥

The proof is complete. 

APPENDIX C: PROOF OF ASYMPTOTIC NORMALITY

Now we will sketch the proof of Theorem 2.2. Similar to Simons and Yao (1999), we need

the following lemmas.

LEMMA 6. If

(C.1) δ

=32M



log n

(n −1)p

→0 as n →∞,

then max

i=0,...,n−1

|u

|=O

(δ

) →0 as n →∞.

LEMMA 7. If

n−1

:=V

−1

n−1

−S

n−1

then with probability approaching 1 as n →∞,

W

n−1

≤

256T

(n −1)

where $\|A\| = \max_{i,j}|a_{ij}|$ for the matrix $A = (a_{ij})$.

Lemma 7 evaluates the quality of the approximation $S_{n-1}$ for $V_{n-1}^{-1}$. This idea was first proposed by Simons and Yao (1998). We are able to establish analogous results under the sparser comparison probability.

Let $a = (a_1,\dots,a_{n-1})^\top$, where $a_i$ is defined in (2.2) for $i = 1,\dots,n-1$.

LEMMA 8. If $R_{n-1}$ denotes the covariance matrix of $W_{n-1}a$, then with probability approaching 1 as $n \to \infty$,

R

n−1

≤

256T

(n −1)

48TM

(n −1)

As a

is a sum of independent bounded random variables, if v

diverges, a

− E(a

) is

asymptotically normal with variance v

(Loève (1977), page 289) and the following lemma

is derived.

LEMMA 9. If M

= o(n) as n →∞, then, as n →∞, the components of (a

−

E(a

),...,a

− E(a

)) are asymptotically independent and normally distributed with vari-

ances v

,...,v

, respectively, for each fixed integer $r \ge 1$. Moreover, the first $r$ rows of $S_{n-1}(a - E(a))$ are asymptotically normal with covariance matrix given by the upper left $r \times r$ block of $S_{n-1}$, for fixed $r \ge 1$.

PROOF OF THEOREM 2.2. Recall that $E_n$ is the event that Condition A holds and let $G_n$ be the event that

max

0≤i≤n−1

|u

|≤32TM



log n

(n −1)p

It follows from Lemma 1 and Lemma 6 that $P(E_n \cap G_n) \to 1$ as $n \to \infty$. We proceed under the condition that the event $E_n \cap G_n$ happens. Let

(u

−u

)

,ξ

n−1



j=0,j =i

ξ =(ξ

,...,ξ

n−1

)



, η =(η

,...,η

n−1

)



=a −E(a) −ξ ,η

n−1



j=1

It follows that with probability approaching 1 as n →∞,

(C.2)

|η

|≤2v

max

0≤j≤n−1

|u

≤v

log n

(n −1)p

,i=1,...,n− 1,



n−1

η)



≤

|η

|≤

log n

(n −1)p



log n



where v

is deﬁned in (2.11). With the use of Chernoff bound (Chernoff (1952)), it is easy

to show that, with probability approaching 1 as n →∞,

(C.3)

Tnp

≤

+1)

min

0≤i≤n−1

≤v

≤

max

0≤i≤n−1

≤

3Tnp

According to Lemma 7 and (C.3),

(C.4)



n−1

η)



≤

256T

(n −1)

n−1



i=1

|η

|=O



log n




By (C.2) and (C.4),





−1

n−1





≤



n−1

η)



n−1

η)





log n





log n



Since ξ =V

n−1

u,whereu = (u

,...,u

n−1

)



, it can be obtained that

(C.5)

u =V

−1

n−1

−1

n−1



a −E(a)



−V

−1

n−1



a −E(a)



n−1



a −E(a)



−V

−1

n−1

η.

When (2.13) holds, |(V

−1

n−1

η)

|=o

−1/2

), and by Lemma 8, |(W

n−1

(a − E(a)))

−1/2

).So(C.5) is equivalent to

u



−1

n−1





n−1



a −E(a)





−1/2



Following Lemma 9, the proof is complete. 

Acknowledgments. The authors are thankful to the Editor, the Associate Editor and referees for their constructive comments. The research of Kani Chen is supported by Hong Kong Research Grant Council grants 16300714 and 16309816.

REFERENCES

AGRESTI, A. (1990). Categorical Data Analysis. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, New York. MR1044993

BRADLEY, R. A. and TERRY, M. E. (1952). Rank analysis of incomplete block designs. I. The method of paired comparisons. Biometrika 39 324–345. MR0070925 https://doi.org/10.2307/2334029

CHERNOFF, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 23 493–507. MR0057518 https://doi.org/10.1214/aoms/1177729330

COULOM, R. (2008). Whole-history rating: A Bayesian rating system for players of time-varying strength. In International Conference on Computers and Games 113–124. Springer, Berlin.

ERDŐS, P. and RÉNYI, A. (1959). On random graphs. I. Publ. Math. Debrecen 6 290–297. MR0120167

ERDŐS, P. and RÉNYI, A. (1960). On the evolution of random graphs. Magy. Tud. Akad. Mat. Kut. Intéz. Közl. 5 17–61. MR0125031

FORD, L. R. JR. (1957). Solution of a ranking problem from binary comparisons. Amer. Math. Monthly 64 28–33. MR0097876 https://doi.org/10.2307/2308513

HOEFFDING, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58 13–30. MR0144363

LOÈVE, M. (1977). Probability Theory. I, 4th ed. Graduate Texts in Mathematics 45. Springer, New York. MR0651017

LUCE, R. D. (1959). Individual Choice Behavior: A Theoretical Analysis. Wiley, New York. MR0108411

MAYSTRE, L. and GROSSGLAUSER, M. (2015). Fast and accurate inference of Plackett–Luce models. In Advances in Neural Information Processing Systems 172–180.

NEGAHBAN, S., OH, S. and SHAH, D. (2012). Iterative ranking from pair-wise comparisons. In Advances in Neural Information Processing Systems 2474–2482.

RAO, P. V. and KUPPER, L. L. (1967). Ties in paired-comparison experiments: A generalization of the Bradley–Terry model. J. Amer. Statist. Assoc. 62 194–204. MR0217963

SIMONS, G. and YAO, Y.-C. (1998). Approximating the inverse of a symmetric positive definite matrix. Linear Algebra Appl. 281 97–103. MR1645343 https://doi.org/10.1016/S0024-3795(98)10038-1

SIMONS, G. and YAO, Y.-C. (1999). Asymptotics when the number of parameters tends to infinity in the Bradley–Terry model for paired comparisons. Ann. Statist. 27 1041–1060. MR1724040 https://doi.org/10.1214/aos/1018031267

YAN, T., YANG, Y. and XU, J. (2012). Sparse paired comparisons in the Bradley–Terry model. Statist. Sinica 22 1305–1318. MR2987494 https://doi.org/10.5705/ss.2010.299

ZERMELO, E. (1929). Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Math. Z. 29 436–460. MR1545015 https://doi.org/10.1007/BF01180541