Eco311 Optional Reading: Standard Error and Confidence Interval
(Jing Li, Miami University)
1. This note uses Monte Carlo simulation to help students understand the concepts of standard error and confidence interval.
2. Statistics is about using samples to understand a population. An important fact is that there are many possible samples for a given population. For instance, the population can be all students at Miami University. Then the students taking a class in Laws Hall room 304 at 3pm on Monday can be a sample. Another sample can be the students taking a class in room FSB 0019, or the students eating at Chipotle. Most likely, we get different results from different samples. A key issue in statistics is accounting for the variation or uncertainty across those different estimates.
3. Recall the math we did in class: if you obtain many random or iid samples, and compute many sample means, then the variance of those many sample means is
\[
\mathrm{var}(\bar{y}) = \frac{\sigma^2}{n} \tag{1}
\]
where σ² = var(y) is the variance of the data. The square root of var(ȳ) is the standard error se, while the square root of var(y) is the standard deviation sd:
\[
se = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \tag{2}
\]
\[
sd = \sqrt{\sigma^2} = \sigma \tag{3}
\]
The former measures the variation in ȳ, while the latter measures the variation in y. Do not confuse them; the short sketch below illustrates the difference.
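As a quick illustration (this example is not from the class material; the sample and the numbers in it are made up), the following R lines draw one sample and compute both quantities, showing that they differ by the factor √n:
> set.seed(1)
> y = rnorm(400, mean = 10, sd = 2)     # one sample with n = 400 and sigma = 2
> sd(y)                                 # standard deviation: estimates sigma, roughly 2
> sd(y)/sqrt(length(y))                 # standard error: estimates sigma/sqrt(n), roughly 2/20 = 0.1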
4. From (2) it is obvious that as the sample size n rises, the standard error σ/√n falls. This fact implies that as samples get larger, the variation in ȳ across different samples gets smaller, or equivalently, ȳ becomes more precise. In light of this, the standard error can serve as a measure of the uncertainty or imprecision of sample estimates. We prefer precise estimates, i.e., small standard errors, and we can achieve that goal by using large samples; see the short calculation below.
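As a small sketch of this point (the value σ = 2 is assumed purely for illustration), the lines below evaluate σ/√n for several sample sizes; each time n is multiplied by 4, the standard error is cut in half:
> sigma = 2
> n.grid = c(25, 100, 400, 1600)
> data.frame(n = n.grid, se = sigma/sqrt(n.grid))   # se: 0.4, 0.2, 0.1, 0.05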
5. Next we run a Monte Carlo simulation and create a sample of 1000 observations of random values that follow a Bernoulli distribution whose true population mean is µ = P(y = 1) = 0.3. The sample mean ȳ for this particular sample is 0.313, close to µ. The standard error of ȳ is 0.01467127, and the 95% confidence interval for µ is (0.28421, 0.34179), which contains the true value µ = 0.3. (The sketch after the R output below recomputes this interval by hand.)
> n = 1000
> library(purrr)
> set.seed(12345)
> ptrue = 0.3
> data = as.numeric(rbernoulli(n, p = ptrue))
> mean(data)
[1] 0.313
> se = sd(data)/sqrt(length(data))
> se
[1] 0.01467127
> t.test(data, conf.level = 0.95)$conf.int
[1] 0.28421 0.34179
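As a sketch (reusing the objects data and se created above), the same interval can be recomputed by hand with the normal-approximation formula ȳ ± 1.96·se; it differs only negligibly from t.test(), which uses a t critical value instead of 1.96:
> mean(data) - 1.96*se   # lower bound, essentially the same as the t.test() interval above
> mean(data) + 1.96*se   # upper bound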
6. To figure out the meaning of the standard error 0.01467127, let's create another 10000 random samples (each with 1000 observations) and compute 10000 sample means:
> iter = 10000
> v.ybar = rep(NA, iter)
> v.sd = rep(NA, iter)
> count95 = 0
> for (i in 1:iter) {
+ data = as.numeric(rbernoulli(n, p = ptrue))   # draw a new random sample of size n
+ n = length(data)
+ v.ybar[i] = mean(data)                        # store this sample's mean
+ v.sd[i] = sd(data)                            # store this sample's standard deviation
+ # add 1 if this sample's 95% confidence interval contains the true mean ptrue
+ count95 = count95 + (ptrue>v.ybar[i]-1.96*v.sd[i]/sqrt(n))*(ptrue<v.ybar[i]+1.96*v.sd[i]/sqrt(n))
+ }
> sd(v.ybar)
[1] 0.01463623
> count95/iter
[1] 0.9462
(a) The vector v.ybar stores the 10000 sample means. The first five sample means
are
> v.ybar[1:5]
[1] 0.323 0.300 0.283 0.289 0.294
It is clear that those five sample means vary. This illustrates sampling variation.
Because samples are large here, the five sample means are all close to µ = 0.3.
(b) The standard deviation of all 10000 sample means is
> sd(v.ybar)
[1] 0.01463623
which is very close to the standard error 0.01467127 we obtained before.
(c) Lesson 1: the standard error measures the variation in the sample means. (The sketch after this list compares this simulated variation with the theoretical value from formula (2).)
(d) Next we compute the 95% confidence interval for each of the 10000 samples. The lower and upper bounds are
v.ybar[i]-1.96*v.sd[i]/sqrt(n), v.ybar[i]+1.96*v.sd[i]/sqrt(n)
In the end we get 10000 confidence intervals. 94.62 percent, or about 95 percent, of those intervals contain µ = 0.3.
(e) Lesson 2: if you were to take many random samples from the same population
and calculate a confidence interval from each sample, approximately 95 percent
of those intervals would contain the true population parameter. In other words,
you can be reasonably confident (95 percent confident) that the true parameter
lies within the given interval.
(f) Warning: it is important to emphasize that the interpretation of a confidence interval does not mean that there is a 95 percent probability that the true parameter lies within the interval; rather, it reflects the confidence in the estimation procedure used to construct the interval.
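As a final check (a sketch reusing the objects v.ybar, ptrue and n created by the loop in item 6), we can compare the simulated standard deviation of the sample means with the theoretical standard error of a Bernoulli mean, √(p(1 − p)/n), which is what formula (2) gives with σ² = p(1 − p), and plot the sampling distribution of ȳ:
> sqrt(ptrue*(1 - ptrue)/n)   # theoretical se: sqrt(0.3*0.7/1000), about 0.0145
> sd(v.ybar)                  # simulated counterpart, about 0.0146 as reported above
> hist(v.ybar, main = "Sampling distribution of the sample mean", xlab = "ybar")
> abline(v = ptrue, col = "red")   # mark the true mean 0.3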