Accelerating Greedy Coordinate Descent Methods
Haihao (Sean) Lu, Robert M. Freund, and Vahab Mirrokni
MIT and Google Research
ISMP Bordeaux, July 2018
Paper
Conference paper:
“Accelerating Greedy Coordinate Descent Methods”
to be presented at ICML, Stockholm, July 2018
Literature on Coordinate Descent
Lots of excellent papers, here are some:
Beck and Tetruashvili, On the convergence of block coordinate descent type
methods
Fercoq and Richtarik, Accelerated, parallel, and proximal coordinate descent
Gurbuzbalaban, Ozdaglar, Parrilo, and Vanli, When cyclic coordinate descent
outperforms randomized coordinate descent
Lee and Sidford, Efficient accelerated coordinate descent methods and faster
algorithms for solving linear systems
Lin, Mairal, and Harchaoui, A universal catalyst for first-order optimization
Locatello, Raj, Reddy, Rätsch, Schölkopf, Stich, and Jaggi, On matching pursuit
and coordinate descent
Lu and Xiao, On the complexity analysis of randomized block-coordinate
descent methods
Nesterov, Efficiency of coordinate descent methods on huge-scale optimization
problems
Nutini, Schmidt, Laradji, Friedlander, and Koepke, Coordinate descent
converges faster with the Gauss-Southwell rule than random selection
Richtarik and Takac, Iteration complexity of randomized block-coordinate
descent methods for minimizing a composite function
Wilson, Recht, and Jordan, A Lyapunov analysis of momentum methods in
optimization
Outline
Accelerated Coordinate Descent Framework
Accelerated Semi-Greedy Coordinate Descent (ASCD)
ASCD under Strong Convexity
Accelerated Greedy Coordinate Descent (AGCD)
Numerical Experiments
Problem of Interest, and Coordinate-wise L-smoothness
$$\mathrm{P}: \qquad f^* := \min_{x} \; f(x) \quad \text{s.t. } x \in \mathbb{R}^n,$$
where $f(\cdot)$ is a differentiable convex function.

Coordinate-wise L-smoothness
$f(\cdot)$ is coordinate-wise $L$-smooth for the vector of parameters $L := (L_1, L_2, \ldots, L_n)$ if for all $x \in \mathbb{R}^n$ and $h \in \mathbb{R}$ it holds that:
$$|\nabla_i f(x + h e_i) - \nabla_i f(x)| \le L_i |h|, \quad i = 1, \ldots, n,$$
where $\nabla_i f(\cdot)$ denotes the $i$-th coordinate of $\nabla f(\cdot)$ and $e_i$ is the $i$-th unit coordinate vector, for $i = 1, \ldots, n$.
Coordinate-wise L-smoothness and L notation
Coordinate-wise L-smoothness
$f(\cdot)$ is coordinate-wise $L$-smooth for the vector of parameters $L := (L_1, L_2, \ldots, L_n)$ if for all $x \in \mathbb{R}^n$ and $h \in \mathbb{R}$ it holds that:
$$|\nabla_i f(x + h e_i) - \nabla_i f(x)| \le L_i |h|, \quad i = 1, \ldots, n,$$
where $\nabla_i f(\cdot)$ denotes the $i$-th coordinate of $\nabla f(\cdot)$ and $e_i$ is the $i$-th unit coordinate vector, for $i = 1, \ldots, n$.

Define the norm $\|x\|_L := \sqrt{\sum_{i=1}^{n} L_i x_i^2}$ and the dual norm $\|v\|_{L^{-1}} := \sqrt{\sum_{i=1}^{n} L_i^{-1} v_i^2}$.
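As a concrete illustration (my own, not from the slides): for a quadratic $f(x) = \frac{1}{2}x^\top A x - b^\top x$ the coordinate-wise smoothness parameters are exactly $L_i = A_{ii}$, and both norms are cheap to evaluate. A minimal numpy sketch:

```python
import numpy as np

def coordinate_smoothness_quadratic(A):
    """For f(x) = 0.5 x^T A x - b^T x, the i-th partial derivative changes
    at rate A[i, i] as x moves along coordinate i, so L_i = A[i, i]."""
    return np.diag(A).copy()

def norm_L(x, L):
    """||x||_L = sqrt(sum_i L_i x_i^2)."""
    return np.sqrt(np.sum(L * x ** 2))

def dual_norm_L(v, L):
    """||v||_{L^{-1}} = sqrt(sum_i v_i^2 / L_i)."""
    return np.sqrt(np.sum(v ** 2 / L))
```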
Accelerated Coordinate Descent Framework
Accelerated Coordinate Descent Framework (without Strong Convexity)
Given $f(\cdot)$ with coordinate-wise smoothness parameter $L$, initial point $x^0$, and $z^0 := x^0$. Define step-size parameters $\theta_i \in (0, 1]$ recursively by $\theta_0 := 1$ and letting $\theta_{i+1}$ satisfy
$$\frac{1}{\theta_{i+1}^2} - \frac{1}{\theta_{i+1}} = \frac{1}{\theta_i^2}.$$
For $k = 0, 1, 2, \ldots$, do:
Define $y^k := (1 - \theta_k) x^k + \theta_k z^k$
Choose coordinate $j_k^1$ (by some rule)
Compute $x^{k+1} := y^k - \frac{1}{L_{j_k^1}} \nabla_{j_k^1} f(y^k) \, e_{j_k^1}$
Choose coordinate $j_k^2$ (by some rule)
Compute $z^{k+1} := z^k - \frac{1}{n L_{j_k^2} \theta_k} \nabla_{j_k^2} f(y^k) \, e_{j_k^2}$.
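For concreteness, here is a minimal Python sketch of this framework (my own illustration, not code from the paper). The two coordinate rules are passed in as functions, so the ARCD, AGCD, and ASCD specifications on the following slides are obtained just by swapping rules (see the snippet after the ASCD slide):

```python
import numpy as np

def accelerated_cd(grad, L, x0, rule1, rule2, iters=1000, rng=None):
    """Accelerated coordinate descent framework (non-strongly-convex case).

    grad(x) -> full gradient of f at x (length-n array)
    L       -> length-n array of coordinate-wise smoothness parameters
    rule1   -> (g, L, rng) -> coordinate index for the x-update
    rule2   -> (g, L, rng, j1) -> coordinate index for the z-update
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(L)
    x, z = np.array(x0, float), np.array(x0, float)
    theta = 1.0                                   # theta_0 := 1
    for _ in range(iters):
        y = (1.0 - theta) * x + theta * z
        g = grad(y)
        j1 = rule1(g, L, rng)
        x = y.copy()
        x[j1] -= g[j1] / L[j1]                    # x-update along e_{j1}
        j2 = rule2(g, L, rng, j1)
        z = z.copy()
        z[j2] -= g[j2] / (n * L[j2] * theta)      # z-update along e_{j2}
        # theta_{k+1} solves 1/t^2 - 1/t = 1/theta_k^2 (positive root)
        theta = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta ** 2))
    return x
```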
Accelerated Randomized Coordinate Descent (ARCD)
Accelerated Randomized Coordinate Descent (ARCD) is the specification:

Accelerated Randomized Coordinate Descent (ARCD) (without Strong Convexity)
Given $f(\cdot)$ with coordinate-wise smoothness parameter $L$, initial point $x^0$, and $z^0 := x^0$. Define step-size parameters $\theta_i \in (0, 1]$ recursively by $\theta_0 := 1$ and letting $\theta_{i+1}$ satisfy
$$\frac{1}{\theta_{i+1}^2} - \frac{1}{\theta_{i+1}} = \frac{1}{\theta_i^2}.$$
For $k = 0, 1, 2, \ldots$, do:
Define $y^k := (1 - \theta_k) x^k + \theta_k z^k$
Choose coordinate $j_k^1$ by $j_k^1 \sim U\{1, \ldots, n\}$
Compute $x^{k+1} := y^k - \frac{1}{L_{j_k^1}} \nabla_{j_k^1} f(y^k) \, e_{j_k^1}$
Choose coordinate $j_k^2$ by $j_k^2 := j_k^1$
Compute $z^{k+1} := z^k - \frac{1}{n L_{j_k^2} \theta_k} \nabla_{j_k^2} f(y^k) \, e_{j_k^2}$.
On Accelerated Randomized Coordinate Descent (ARCD)
ARCD is well-studied
ARCD updates one coordinate per iteration, hence $x^k$ is $k$-sparse
ARCD avoids computation of the full gradient, which can save computation (or not) depending on the application
randomization of the $x$-update slows objective function improvement in practice
Accelerated convergence guarantee (in expectation), for example [FR2015]:
$$\mathbb{E}\!\left[f(x^k)\right] - f(x^*) \le \frac{2n^2}{(k+1)^2} \, \|x^* - x^0\|_L^2,$$
where the expectation is over the random variables used to define the first $k$ iterations
Accelerated Greedy Coordinate Descent (AGCD)
Accelerated Greedy Coordinate Descent (AGCD) is the specification:

Accelerated Greedy Coordinate Descent (AGCD) (without Strong Convexity)
Given $f(\cdot)$ with coordinate-wise smoothness parameter $L$, initial point $x^0$, and $z^0 := x^0$. Define step-size parameters $\theta_i \in (0, 1]$ recursively by $\theta_0 := 1$ and letting $\theta_{i+1}$ satisfy
$$\frac{1}{\theta_{i+1}^2} - \frac{1}{\theta_{i+1}} = \frac{1}{\theta_i^2}.$$
For $k = 0, 1, 2, \ldots$, do:
Define $y^k := (1 - \theta_k) x^k + \theta_k z^k$
Choose coordinate $j_k^1$ by $j_k^1 := \arg\max_i \frac{1}{L_i} |\nabla_i f(y^k)|^2$
Compute $x^{k+1} := y^k - \frac{1}{L_{j_k^1}} \nabla_{j_k^1} f(y^k) \, e_{j_k^1}$
Choose coordinate $j_k^2$ by $j_k^2 := j_k^1$
Compute $z^{k+1} := z^k - \frac{1}{n L_{j_k^2} \theta_k} \nabla_{j_k^2} f(y^k) \, e_{j_k^2}$.
On Accelerated Greedy Coordinate Descent (AGCD)
AGCD has not been studied in the literature (that we are aware of)
AGCD updates one coordinate per iteration, hence $x^k$ is $k$-sparse
AGCD computes the full gradient at each iteration, which can be expensive (or not) depending on the application
the greedy nature of the $x$-update speeds convergence in practice
no convergence results are known for AGCD; in fact we suspect that there are examples where $O(1/k^2)$ convergence fails
we observe $O(1/k^2)$ for AGCD in practice
we will argue (later on) why $O(1/k^2)$ fails in theory
we will also argue why $O(1/k^2)$ is observed in practice
Accelerated Semi-greedy Coordinate Descent (ASCD)
Accelerated Semi-greedy Coordinate Descent (ASCD)
Accelerated Semi-greedy Coordinate Descent (ASCD) is the specification:

Accelerated Semi-greedy Coordinate Descent (ASCD) (without Strong Convexity)
Given $f(\cdot)$ with coordinate-wise smoothness parameter $L$, initial point $x^0$, and $z^0 := x^0$. Define step-size parameters $\theta_i \in (0, 1]$ recursively by $\theta_0 := 1$ and letting $\theta_{i+1}$ satisfy
$$\frac{1}{\theta_{i+1}^2} - \frac{1}{\theta_{i+1}} = \frac{1}{\theta_i^2}.$$
For $k = 0, 1, 2, \ldots$, do:
Define $y^k := (1 - \theta_k) x^k + \theta_k z^k$
Choose coordinate $j_k^1$ by $j_k^1 := \arg\max_i \frac{1}{L_i} |\nabla_i f(y^k)|^2$
Compute $x^{k+1} := y^k - \frac{1}{L_{j_k^1}} \nabla_{j_k^1} f(y^k) \, e_{j_k^1}$
Choose coordinate $j_k^2$ by $j_k^2 \sim U\{1, \ldots, n\}$
Compute $z^{k+1} := z^k - \frac{1}{n L_{j_k^2} \theta_k} \nabla_{j_k^2} f(y^k) \, e_{j_k^2}$.
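Continuing the Python sketch from the framework slide (again my own illustration), the three specifications are just different rule pairs; `greedy` below implements $\arg\max_i \frac{1}{L_i}|\nabla_i f(y^k)|^2$:

```python
import numpy as np

# Coordinate rules for the accelerated_cd sketch shown earlier:
greedy  = lambda g, L, rng: int(np.argmax(g ** 2 / L))   # arg max (1/L_i)|grad_i|^2
uniform = lambda g, L, rng: int(rng.integers(len(L)))    # j ~ U{1, ..., n}

# ARCD: random x-update; the z-update reuses the same coordinate
arcd = lambda grad, L, x0: accelerated_cd(grad, L, x0, uniform, lambda g, L, r, j1: j1)
# AGCD: greedy x-update; the z-update reuses the same coordinate
agcd = lambda grad, L, x0: accelerated_cd(grad, L, x0, greedy, lambda g, L, r, j1: j1)
# ASCD: greedy x-update; independent uniform-random z-update
ascd = lambda grad, L, x0: accelerated_cd(grad, L, x0, greedy,
                                          lambda g, L, r, j1: uniform(g, L, r))
```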
On Accelerated Semi-greedy Coordinate Descent (ASCD)
ASCD and its complexity analysis are the new theoretical contribution of this paper
ASCD updates two coordinates per iteration, hence $x^k$ is $2k$-sparse
ASCD computes the full gradient at each iteration, which can be expensive (or not) depending on the application
the greedy nature of the $x$-update speeds convergence in practice
Accelerated convergence guarantee on the next slide . . .
Computational Guarantee for Accelerated Semi-greedy
Coordinate Descent (ASCD)
At each iteration $k$ of ASCD the random variable $j_k^2$ is introduced, and therefore $x^k$ depends on the realization of the random variables
$$\xi_k := \{j_0^2, \ldots, j_{k-1}^2\}$$

Theorem: Convergence Bound for Accelerated Semi-greedy Coordinate Descent (ASCD)
Consider the Accelerated Semi-Greedy Coordinate Descent algorithm. If $f(\cdot)$ is coordinate-wise $L$-smooth, it holds for all $k \ge 1$ that:
$$\mathbb{E}_{\xi_k}\!\left[f(x^k)\right] - f(x^*) \le \frac{n^2 \theta_{k-1}^2}{2} \, \|x^* - x^0\|_L^2 \le \frac{2n^2}{(k+1)^2} \, \|x^* - x^0\|_L^2.$$
Accelerated Semi-greedy Coordinate Descent (ASCD)
under Strong Convexity
Accelerated Semi-greedy Coordinate Descent (ASCD)
under Strong Convexity
We begin with the definition of strong convexity with respect to $\|\cdot\|_L$, due to [LX2015]:

$\mu$-strong convexity with respect to $\|\cdot\|_L$
$f(\cdot)$ is $\mu$-strongly convex with respect to $\|\cdot\|_L$ if for all $x, y \in \mathbb{R}^n$ it holds that:
$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|_L^2.$$

Note that $\mu$ can be viewed as an extension of the condition number of $f(\cdot)$ in the traditional sense, since $\mu$ is defined relative to the coordinate smoothness coefficients through $\|\cdot\|_L$.
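For intuition (my own example, not from the slides): for a quadratic $f(x) = \frac{1}{2}x^\top A x$ with $L_i = A_{ii}$, the largest valid $\mu$ is the smallest eigenvalue of the rescaled Hessian $\mathrm{diag}(L)^{-1/2} A \, \mathrm{diag}(L)^{-1/2}$:

```python
import numpy as np

def strong_convexity_wrt_L(A):
    """For f(x) = 0.5 x^T A x with L_i = A[i, i], the largest mu with f
    mu-strongly convex w.r.t. ||.||_L is lambda_min(L^{-1/2} A L^{-1/2})."""
    d = 1.0 / np.sqrt(np.diag(A))
    return float(np.linalg.eigvalsh(A * np.outer(d, d)).min())
```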
Accelerated Coordinate Descent Framework under Strong
Convexity
Accelerated Coordinate Descent Framework ($\mu$-strongly convex case)
Given $f(\cdot)$ with coordinate-wise smoothness parameter $L$ and strong convexity parameter $\mu > 0$, initial point $x^0$ and $z^0 := x^0$. Define the parameters
$$a := \frac{\sqrt{\mu}}{n + \sqrt{\mu}} \quad \text{and} \quad b := \frac{\mu a}{n^2}.$$
For $k = 0, 1, 2, \ldots$, do:
Define $y^k := (1 - a) x^k + a z^k$
Choose coordinate $j_k^1$ (by some rule)
Compute $x^{k+1} := y^k - \frac{1}{L_{j_k^1}} \nabla_{j_k^1} f(y^k) \, e_{j_k^1}$
Compute $u^k := \frac{a^2}{a^2 + b} z^k + \frac{b}{a^2 + b} y^k$
Choose coordinate $j_k^2$ (by some rule)
Compute $z^{k+1} := u^k - \frac{a}{a^2 + b} \cdot \frac{1}{n L_{j_k^2}} \nabla_{j_k^2} f(y^k) \, e_{j_k^2}$.
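A sketch of this variant in the same style as before (my own illustration; in particular, I read the momentum parameter in the $y$-update as the constant $a$):

```python
import numpy as np

def accelerated_cd_sc(grad, L, mu, x0, rule1, rule2, iters=1000, rng=None):
    """Strongly convex variant of the framework: constant parameters a, b
    replace the theta_k recursion of the non-strongly-convex case."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(L)
    a = np.sqrt(mu) / (n + np.sqrt(mu))
    b = mu * a / n ** 2
    x, z = np.array(x0, float), np.array(x0, float)
    for _ in range(iters):
        y = (1.0 - a) * x + a * z
        g = grad(y)
        j1 = rule1(g, L, rng)
        x = y.copy()
        x[j1] -= g[j1] / L[j1]                      # x-update along e_{j1}
        u = (a ** 2 * z + b * y) / (a ** 2 + b)     # u^k: combination of z^k, y^k
        j2 = rule2(g, L, rng, j1)
        z = u                                       # u is a fresh array each iteration
        z[j2] -= (a / (a ** 2 + b)) * g[j2] / (n * L[j2])
    return x
```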
Accelerated Semi-Greedy Coordinate Descent under Strong
Convexity
Accelerated Semi-greedy Coordinate Descent ($\mu$-strongly convex case)
Given $f(\cdot)$ with coordinate-wise smoothness parameter $L$ and strong convexity parameter $\mu > 0$, initial point $x^0$ and $z^0 := x^0$. Define the parameters
$$a := \frac{\sqrt{\mu}}{n + \sqrt{\mu}} \quad \text{and} \quad b := \frac{\mu a}{n^2}.$$
For $k = 0, 1, 2, \ldots$, do:
Define $y^k := (1 - a) x^k + a z^k$
Choose coordinate $j_k^1$ by $j_k^1 := \arg\max_i \frac{1}{L_i} |\nabla_i f(y^k)|^2$
Compute $x^{k+1} := y^k - \frac{1}{L_{j_k^1}} \nabla_{j_k^1} f(y^k) \, e_{j_k^1}$
Compute $u^k := \frac{a^2}{a^2 + b} z^k + \frac{b}{a^2 + b} y^k$
Choose coordinate $j_k^2$ by $j_k^2 \sim U\{1, \ldots, n\}$
Compute $z^{k+1} := u^k - \frac{a}{a^2 + b} \cdot \frac{1}{n L_{j_k^2}} \nabla_{j_k^2} f(y^k) \, e_{j_k^2}$.
Computational Guarantee for Accelerated Semi-greedy
Coordinate Descent (ASCD) under Strong Convexity
Theorem: Convergence Bound for Accelerated Semi-greedy Coordinate Descent (ASCD) under Strong Convexity
Consider the Accelerated Semi-Greedy Coordinate Descent algorithm in the strongly convex case. If $f(\cdot)$ is coordinate-wise $L$-smooth and $\mu$-strongly convex, it holds for all $k \ge 1$ that:
$$\mathbb{E}_{\xi_k}\!\left[ f(x^k) - f^* + \frac{n^2}{2}(a^2 + b)\,\|z^k - x^*\|_L^2 \right] \le \left(1 - \frac{\sqrt{\mu}}{n + \sqrt{\mu}}\right)^{\!k} \left( f(x^0) - f^* + \frac{n^2}{2}(a^2 + b)\,\|x^0 - x^*\|_L^2 \right).$$
In particular, it holds that:
$$\mathbb{E}_{\xi_k}\!\left[ f(x^k) \right] - f^* \le \left(1 - \frac{\sqrt{\mu}}{n + \sqrt{\mu}}\right)^{\!k} \left( f(x^0) - f^* + \frac{n^2}{2}(a^2 + b)\,\|x^0 - x^*\|_L^2 \right).$$

Observe that this is an accelerated linear convergence rate: roughly $(1 - \sqrt{\mu}/n)$ per iteration.
Accelerated Greedy Coordinate Descent (AGCD)
Accelerated Greedy Coordinate Descent (AGCD) (without
Strong Convexity)
Accelerated Greedy Coordinate Descent (AGCD)
Given $f(\cdot)$ with coordinate-wise smoothness parameter $L$, initial point $x^0$, and $z^0 := x^0$. Define step-size parameters $\theta_i \in (0, 1]$ recursively by $\theta_0 := 1$ and letting $\theta_{i+1}$ satisfy
$$\frac{1}{\theta_{i+1}^2} - \frac{1}{\theta_{i+1}} = \frac{1}{\theta_i^2}.$$
For $k = 0, 1, 2, \ldots$, do:
Define $y^k := (1 - \theta_k) x^k + \theta_k z^k$
Choose coordinate $j_k^1$ by $j_k^1 := \arg\max_i \frac{1}{L_i} |\nabla_i f(y^k)|^2$
Compute $x^{k+1} := y^k - \frac{1}{L_{j_k^1}} \nabla_{j_k^1} f(y^k) \, e_{j_k^1}$
Choose coordinate $j_k^2$ by $j_k^2 := j_k^1$
Compute $z^{k+1} := z^k - \frac{1}{n L_{j_k^2} \theta_k} \nabla_{j_k^2} f(y^k) \, e_{j_k^2}$.
Why AGCD fails (in theory)
In the discrete-time setting, one can construct a Lyapunov energy function of the form:
$$E_k = A_k \left( f(x^k) - f^* \right) + \frac{1}{2} \|x^* - z^k\|_L^2,$$
where $A_k$ is a parameter sequence with $A_k \sim O(k^2)$.

Virtually all proof techniques for acceleration methods can be equivalently written as showing that $E_k$ is non-increasing in $k$, thereby yielding:
$$f(x^k) - f^* \le \frac{E_k}{A_k} \le \frac{E_0}{A_k} = O\!\left(1/k^2\right)$$
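On a toy instance where $x^*$ can be computed, one can track $E_k$ along the iterates of any of the three methods and watch for non-monotonicity; a minimal sketch (my own, taking $A_k := 1/\theta_{k-1}^2$ as one natural $O(k^2)$ choice consistent with the $\theta$ recursion, which the slides do not pin down):

```python
import numpy as np

def lyapunov_energy(A_k, f_xk, f_star, z_k, x_star, L):
    """E_k = A_k (f(x^k) - f*) + 0.5 ||x* - z^k||_L^2.  Requires knowing x*,
    so this is a diagnostic for toy problems, not something AGCD can use."""
    return A_k * (f_xk - f_star) + 0.5 * np.sum(L * (x_star - z_k) ** 2)
```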
Why AGCD fails (in theory), continued
$$E_k = A_k \left( f(x^k) - f^* \right) + \frac{1}{2} \|x^* - z^k\|_L^2$$

In AGCD the greedy coordinate $j_k^1$ is chosen to yield the greatest guaranteed decrease in $f(\cdot)$.
But one needs to prove a decrease in $E_k$, which is not the same as a decrease in $f(\cdot)$.
The coordinate $j_k^1$ is not necessarily the greedy coordinate for $E_k$, due to the presence of the second term $\|x^* - z^k\|_L^2$.
This explains why the greedy coordinate can fail to decrease $E_k$, at least in theory.
Because $x^*$ is not known when running AGCD, there does not seem to be any way to find the greedy descent coordinate for the energy function $E_k$.
Why AGCD fails (in theory), continued
$$E_k = A_k \left( f(x^k) - f^* \right) + \frac{1}{2} \|x^* - z^k\|_L^2$$

In ASCD:
we use the greedy coordinate to perform the $x$-update (which corresponds to the best coordinate decrease for $f(\cdot)$)
we choose a random coordinate to perform the $z$-update (which corresponds to the second term in the energy function)
This tackles the problem of dealing with the second term of the energy function.
A concurrent paper with similar notions: [LRRRSSJ2018]
Locatello, Raj, Reddy, Rätsch, Schölkopf, Stich, and Jaggi, On matching pursuit and coordinate descent, ICML 2018

Develops computational theory for matching pursuit algorithms, which can be viewed as a generalized version of greedy coordinate descent in which the directions need not be orthogonal

The paper also develops an accelerated version of the matching pursuit algorithms, which turns out to be equivalent to ASCD when the chosen directions are orthogonal

Both works use a decoupling of the coordinate update for the $\{x^k\}$ sequence (with a greedy rule) and the $\{z^k\}$ sequence (with a randomized rule)

[LRRRSSJ2018] is consistent with the argument here as to why one cannot accelerate greedy coordinate descent in general
How to make AGCD work (in theory)
Consider the following technical condition:

Technical Condition
There exist a positive constant $\gamma$ and an iteration number $K$ such that for all $k \ge K$ it holds that:
$$\frac{1}{n} \sum_{i=0}^{k} \frac{1}{\theta_i} \langle \nabla f(y^i), z^i - x^* \rangle \le \sum_{i=0}^{k} \frac{\gamma}{\theta_i} \nabla_{j_i} f(y^i) \left( z^i_{j_i} - x^*_{j_i} \right),$$
where $j_i = \arg\max_j \frac{1}{L_j} |\nabla_j f(y^i)|^2$ is the greedy coordinate at iteration $i$.

We will give some intuition on this in a couple of slides. But first . . .
Computational Guarantee for Accelerated Greedy
Coordinate Descent (AGCD) under the Technical Condition
Theorem: Convergence Bound for Accelerated Greedy Coordinate Descent (AGCD)
Consider the Accelerated Greedy Coordinate Descent algorithm. If $f(\cdot)$ is coordinate-wise $L$-smooth and satisfies the Technical Condition with constant $\gamma \le 1$, then it holds for all $k \ge K$ that:
$$f(x^k) - f(x^*) \le \frac{2n^2 \gamma}{(k+1)^2} \, \|x^* - x^0\|_L^2.$$

(The Technical Condition arises from a reverse engineering of the structure of the acceleration proof.)

Note that if $\gamma < 1$ (which we always observe in practice), then AGCD will have a better convergence guarantee than ASCD or ARCD.
Why the Technical Condition ought to hold in general
Technical Condition
There exist a positive constant $\gamma$ and an iteration number $K$ such that for all $k \ge K$ it holds that:
$$\frac{1}{n} \sum_{i=0}^{k} \frac{1}{\theta_i} \langle \nabla f(y^i), z^i - x^* \rangle \le \sum_{i=0}^{k} \frac{\gamma}{\theta_i} \nabla_{j_i} f(y^i) \left( z^i_{j_i} - x^*_{j_i} \right),$$
where $j_i = \arg\max_j \frac{1}{L_j} |\nabla_j f(y^i)|^2$ is the greedy coordinate at iteration $i$.

The three sequences $\{x^k\}$, $\{y^k\}$, and $\{z^k\}$ ought to all converge to $x^*$.
Thus we can instead consider the inner product $\langle \nabla f(y^i), y^i - x^* \rangle$.
For any $j$ we have $|y^i_j - x^*_j| \approx \frac{1}{L_j} |\nabla_j f(y^i)|$, and therefore
$$|\nabla_j f(y^i) \cdot (y^i_j - x^*_j)| \approx \frac{1}{L_j} |\nabla_j f(y^i)|^2.$$
The greedy coordinate is chosen by $j_i := \arg\max_j \frac{1}{L_j} |\nabla_j f(y^i)|^2$.
It is reasonably likely that in most cases the greedy coordinate will yield a better product than the average of the components of the inner product.
Numerical Experiments
Numerical Experiments
Linear Regression Problems
least squares minimization: $\min_{\beta} \frac{1}{2n} \|y - X\beta\|_2^2$
synthetic instances in order to control the condition number $\kappa(X^T X)$
$n = 200$, $p = 100$

Logistic Regression Problems
logistic loss minimization: $\min_{\beta} \frac{1}{n} \sum_{i=1}^{n} \ln\!\left(1 + \exp(-y_i \beta^T x_i)\right)$
real problem instances taken from LIBSVM
locally strongly convex with parameter $\mu$; we assigned the parameter $\bar{\mu}$ in the experiments
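A sketch of how such synthetic least-squares instances might be generated (my own construction; the slides do not specify theirs): pick the singular values of $X$ so that $\kappa(X^T X)$ hits a target value.

```python
import numpy as np

def synthetic_least_squares(n=200, p=100, kappa=1e4, seed=0):
    """Random X with condition number kappa(X^T X) == kappa, plus a noisy
    response y (hypothetical construction for illustration)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n, p)))   # n x p, orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((p, p)))
    # Geometrically spaced singular values: kappa(X^T X) = (s_max / s_min)^2.
    s = np.geomspace(1.0, kappa ** -0.5, p)
    X = U @ np.diag(s) @ V.T
    beta_true = rng.standard_normal(p)
    y = X @ beta_true + 0.1 * rng.standard_normal(n)
    return X, y
```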
Linear Regression Experiments
Prototypical Comparison of ARCD, ASCD, and AGCD on
Linear Regression Problems
Figure: Plot showing the optimality gap versus run-time (in seconds) for
a synthetic linear regression instance solved by ARCD, ASCD, AGCD.
Comparing the Methods on Linear Regression Problems
with Different Condition Numbers $\kappa(X^T X)$
[Grid of plots: columns are $\kappa = 10^2, 10^3, 10^4, \infty$; rows are Algorithm Framework 1 (non-strongly convex) and Algorithm Framework 2 (strongly convex).]
Plots showing the optimality gap versus run-time (in seconds) for
synthetic linear regression problems solved by ARCD, ASCD, AGCD.
Logistic Regression Experiments
Prototypical Comparison of ARCD, ASCD, and AGCD on
Logistic Regression Problems
Figure: Plot showing the optimality gap versus run-time (in seconds) for
the logistic regression instance a1a solved by ARCD, ASCD, AGCD.
Comparing the Methods on Logistic Regression Problems
with Different Assigned Strong Convexity Parameters $\bar{\mu}$
[Grid of plots: rows are the datasets w1a and a1a; columns are $\bar{\mu} = 10^{-3}, 10^{-5}, 10^{-7}, 0$.]
Plots showing the optimality gap versus run-time (in seconds) for the
logistic regression instances w1a and a1a in LIBSVM, solved by ARCD,
ASCD, AGCD.
Empirical Values of γ arising from the Technical Condition
Dataset    γ
w1a        0.25
a1a        0.17
heart      0.413
madelon    0.24
rcv1       0.016

Largest observed values of γ for five different datasets in LIBSVM, over iterations $k \le \bar{K} := 5000$.
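For completeness, a sketch of how such values could be measured (my own reading of the Technical Condition; the record format is hypothetical, and $x^*$ must be known, e.g. computed to high accuracy beforehand):

```python
import numpy as np

def observed_gamma(history, n):
    """Smallest gamma for which the Technical Condition holds at every
    recorded k (assumes the right-hand sums are positive).  Each record
    holds theta, g = grad f(y^i), z = z^i, x_star = x^*, and L."""
    lhs, rhs = [], []
    for rec in history:
        d = rec["z"] - rec["x_star"]
        j = int(np.argmax(rec["g"] ** 2 / rec["L"]))      # greedy coordinate j_i
        lhs.append((rec["g"] @ d) / (n * rec["theta"]))   # (1/n)(1/theta_i)<grad, z - x*>
        rhs.append(rec["g"][j] * d[j] / rec["theta"])     # (1/theta_i) grad_j (z - x*)_j
    lhs, rhs = np.cumsum(lhs), np.cumsum(rhs)
    return float(np.max(lhs / rhs))                       # largest gamma needed
```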
Comparing the Algorithms using Running Time and the
Number of Iterations
Plots showing the optimality gap versus run-time (in seconds) on the left
and versus the number of iterations on the right, for the logistic
regression instance madelon, solved by ARCD, ASCD, AGCD.
Conclusions/Remarks
AGCD:
the natural accelerated version of Greedy Coordinate Descent
unlikely that AGCD has an acceleration guarantee ($O(1/k^2)$) in general
exhibits acceleration in practice
extremely effective in practice
the Technical Condition "explains" the acceleration observed in practice

ASCD:
the new theoretical contribution of this paper
combines salient features of AGCD and ARCD
acceleration guarantee ($O(1/k^2)$)
accelerated linear convergence with rate $(1 - \sqrt{\mu}/n)$ in the strongly convex case
very effective in practice

We thank Martin Jaggi as well as three excellent anonymous referees.