Convergence of an accelerated distributed optimisation algorithm over time-varying directed networks

In this article, we study distributed optimisation over time-varying directed networks, where a group of agents aims to cooperatively minimise a sum of local objective functions.


INTRODUCTION
In this paper, we consider a network of m agents that collaboratively solve the optimisation problem min_{x∈ℝ^n} f(x) ≜ ∑_{i=1}^m f_i(x), where each local objective function f_i is privately known by agent i. Classical distributed gradient descent (DGD) relies on diminishing step-sizes and converges at the rate O(1/√k) for general convex objective functions and O(log k/k) for strongly convex objective functions, where k is the iteration number. Jadbabaie et al. [5] leverage a constant step-size to achieve a faster convergence rate, but only inexact convergence. In addition, some algorithms based on the Lagrangian dual [12-14] also achieve faster convergence, at the cost of a higher computational burden. In order to improve convergence while retaining computational simplicity, Jakovetic et al. [15] utilise a Nesterov method to reach a convergence rate of O(log k/k²) for smooth convex objective functions with bounded gradients. Noting that all the methods mentioned above achieve sublinear convergence rates, EXTRA [16] leverages the difference between two consecutive DGD iterates to achieve a rate of O(1/k) for general convex objective functions and a linear rate for strongly convex objective functions. The works [16-18] construct only doubly stochastic weight matrices for distributed optimisation over undirected networks. However, in practice, communication in networks may be directed, which means that the weight matrices cannot be doubly stochastic due to the imbalanced transmission of information among agents.
The existing methods tackling distributed optimisation problems over directed networks build on the approaches proposed in [16,19,20], which handle the imbalanced information transmission among agents by adopting average-consensus methods. Nevertheless, these algorithms, limited by diminishing step-sizes, converge slowly to the global optimal solution at a sublinear rate. In order to improve the convergence rate, DEXTRA [21] combines a push-sum technique with EXTRA to attain a linear convergence rate over directed networks when the step-size is chosen from a small interval. The elegant work Push-DIGing [22] extends ADD-OPT [23] to time-varying networks and achieves a linear convergence rate by applying the small gain theorem [24]. To eliminate the deficiencies caused by column-stochastic weight matrices, FROST [25] utilises only row-stochastic weights, the implementation of which does not require an agent to know its out-degree. More recently, Xin et al. [26] provide a novel method, AB, which simultaneously employs row- and column-stochastic weights to remove the limitations caused by doubly stochastic weights and to further simplify the computation and communication process. Then, TV-AB [27] extends AB to time-varying directed networks. To achieve accelerated linear convergence, the distributed heavy-ball method ABm [28] adds a momentum term to the AB algorithm. Note that ABm is based on a gradient-tracking method first used in [29], which is built on dynamic consensus and enables the agents to track the gradient of the global objective function step by step. Furthermore, ABm leverages the distributed heavy-ball method to effectively accelerate convergence. Qu et al. [30] propose a distributed Nesterov method over fixed undirected networks.
Motivated by the insightful work [27], we aim to incorporate both the distributed heavy-ball method [28] and the distributed Nesterov method [30,31] into TV-AB [27] to achieve double acceleration. Moreover, the analysis technique in [27] does not yield an explicit bound on the constant step-size, whereas explicit bounds on the step-size have been derived in many works [3,23,28,32-34], even in the time-varying setting of Push-DIGing [22]. Therefore, this paper aims at closing this gap. The main contributions of this paper are summarised as follows:

i. We incorporate both the distributed heavy-ball method [28] and the distributed Nesterov method [31] into the proposed algorithm. Both methods have been shown to accelerate convergence; thus, the proposed algorithm realises double acceleration in comparison with some existing algorithms [4, 12-16, 19-23, 25-27, 29].

ii. In contrast with the time-varying algorithms Push-DIGing [22] and TV-AB [27], the proposed algorithm employs uncoordinated step-sizes, which allows each agent in the network to locally decide its own constant step-size and thus increases the autonomy and flexibility of the agents. Based on the gradient-tracking method, the proposed algorithm is proved to converge linearly to the global optimal solution over time-varying directed networks. Furthermore, explicit upper bounds on the largest step-size and the largest momentum parameter are derived, which is significant for practical applications.

iii. The proposed algorithm simultaneously utilises row- and column-stochastic weight matrices, which eliminates the requirement of doubly stochastic weight matrices in [10,11,16].
Furthermore, compared with the recent insightful works FROST [25] and Push-DIGing [22], although we simultaneously utilise row- and column-stochastic weight matrices, the proposed algorithm avoids the eigenvector estimation performed at each iteration, which reduces the computational complexity and communication burden.

iv. By applying the proposed double accelerated distributed optimisation algorithm to fixed directed networks, we show that fixed directed networks [35] can be considered as a special case of time-varying directed networks. In the case of fixed directed networks, the proposed algorithm still accelerates the linear convergence in [28] owing to the distributed Nesterov method, as observed in Section 4.3.
The remainder of this paper is organised as follows. Section 2 presents some preliminaries and the algorithm development. Section 3 provides a rigorous convergence analysis. Some real-world data sets are utilised in Section 4 to conduct experiments. We draw conclusions and state our future work in Section 5. Some crucial proofs are placed in the Appendix to preserve the coherence of the paper.

Notations
All vectors in this paper are column vectors unless otherwise specified. The n × n identity matrix is denoted by I_n, and so forth. We denote by 1_m the m-dimensional vector of all ones. We use x^T and A^T to denote the transpose of a vector x and a matrix A, respectively, and A_ij denotes the (i, j)-th element of matrix A. For an arbitrary vector x, we denote its i-th element by [x]_i and its largest element by [x]_max, while diag{x} represents a diagonal matrix with the elements of x on its main diagonal. For two matrices X and Y with the same dimensions, X ≤ Y means that each element of Y − X is nonnegative; X ⊗ Y represents their Kronecker product; the spectral radius of a matrix X is denoted by ρ(X). The notation ‖⋅‖ denotes the Euclidean norm for vectors and the spectral norm for matrices. For two arbitrary vectors x, y ∈ ℝ^{mn}, we introduce two weighted inner products, ⟨x, y⟩_r ≜ x^T((diag{π_r}) ⊗ I_n)y and ⟨x, y⟩_c ≜ x^T((diag{π_c})^{-1} ⊗ I_n)y, which induce the two weighted norms ‖x‖_r ≜ √⟨x, x⟩_r and ‖x‖_c ≜ √⟨x, x⟩_c.

Problem reformulation
Assume that a system of m agents communicates over a time-varying directed network, 𝒢_k = (𝒱, ℰ_k), where 𝒱 = {1, …, m} is the set of agents and k is the iteration number. ℰ_k is the set of directed edges at time k, such that if (i, j) ∈ ℰ_k, agent i can transmit messages to agent j, that is, i → j, at iteration k. The common target of the agents is to solve the following consensus reformulation of problem (1):

min_{x∈ℝ^{mn}} F(x) ≜ ∑_{i=1}^m f_i(x_i), subject to x_1 = x_2 = ⋯ = x_m,    (2)

where x = [(x_1)^T, …, (x_m)^T]^T and each local objective function, f_i : ℝ^n → ℝ, is only known by agent i. Let x* = 1_m ⊗ x̃* denote the global optimal solution to problem (2), where x̃* is the optimal solution to problem (1). Some necessary assumptions and lemmas are presented in the following.

The heavy-ball method

The centralised gradient descent method proposed in [36] is

x̃_{k+1} = x̃_k − α̃∇f(x̃_k),

where x̃_k ∈ ℝ^n and α̃ serves as a positive constant step-size.
This method achieves the best convergence rate of O(((κ−1)/(κ+1))^k), where κ is the condition number of the global objective function f. Clearly, gradient descent is quite slow when κ is large, that is, when the objective function is ill-conditioned [28]. To address this issue, Polyak [37] proposes the centralised heavy-ball method:

x̃_{k+1} = x̃_k − α̃∇f(x̃_k) + β̃(x̃_k − x̃_{k−1}),

where β̃(x̃_k − x̃_{k−1}) is referred to as a momentum term with positive momentum parameter β̃, which is utilised to accelerate the convergence process. As shown in [31], this method achieves an accelerated rate of O(((√κ−1)/(√κ+1))^k) with specific choices of α̃ and β̃.
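As a concrete illustration, the heavy-ball recursion above can be sketched in a few lines. The quadratic test function, step-size, and momentum value below are hypothetical choices of ours for demonstration, not parameters taken from the paper.

```python
# Minimal sketch of the centralised heavy-ball iteration:
# x_{k+1} = x_k - a * grad_f(x_k) + b * (x_k - x_{k-1}).

def heavy_ball(grad, x0, a, b, iters):
    """Run the heavy-ball method from x0 with step-size a and momentum b."""
    x_prev, x = x0, x0
    for _ in range(iters):
        x_next = x - a * grad(x) + b * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Illustrative 1-D quadratic f(x) = 50 * x**2 with minimiser x* = 0.
grad = lambda x: 100.0 * x
x_hb = heavy_ball(grad, x0=1.0, a=0.009, b=0.5, iters=200)
# x_hb ends up very close to the minimiser 0
```

The momentum term damps the oscillations that plain gradient descent exhibits on ill-conditioned problems, which is the acceleration effect discussed in the text.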

The Nesterov method
Based on (5), Nesterov [38] proposes an efficient centralised method that accelerates the convergence of (5) through an added momentum term:

x̃_{k+1} = y_k − α̃∇f(y_k),
y_{k+1} = x̃_{k+1} + β̃(x̃_{k+1} − x̃_k),

where y_k ∈ ℝ^n is an auxiliary variable that is corrected in the next update. As discussed in [31], this method closes the gap to the lower oracle complexity bound of the function class. Furthermore, the Nesterov method corrects the gradient in every iteration and makes the update of the gradient more flexible, which can accelerate the convergence.
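The Nesterov recursion can be sketched analogously; note that, unlike heavy ball, the gradient is evaluated at the corrected point y rather than at x. The test function and parameters are again illustrative assumptions of ours.

```python
# Minimal sketch of the centralised Nesterov iteration:
# x_{k+1} = y_k - a * grad_f(y_k);  y_{k+1} = x_{k+1} + b * (x_{k+1} - x_k).

def nesterov(grad, x0, a, b, iters):
    """Run the Nesterov method from x0 with step-size a and momentum b."""
    x, y = x0, x0
    for _ in range(iters):
        x_next = y - a * grad(y)          # gradient taken at the corrected point
        y = x_next + b * (x_next - x)     # momentum correction for the next step
        x = x_next
    return x

# Same illustrative 1-D quadratic f(x) = 50 * x**2, minimiser x* = 0.
grad = lambda x: 100.0 * x
x_nes = nesterov(grad, x0=1.0, a=0.009, b=0.8, iters=200)
```

Evaluating the gradient at the "look-ahead" point y is exactly the flexibility the text refers to: the gradient is corrected in every iteration before being applied.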

The TV- algorithm
We now briefly introduce an existing method for solving problem (1). Fakhteh et al. [27] provide a novel method, TV-AB, whose distributed form can be summarised as follows:

x_i^{k+1} = ∑_{j=1}^m A_ij^k x_j^k − α y_i^k,    (8a)
y_i^{k+1} = ∑_{j=1}^m B_ij^k y_j^k + ∇f_i(x_i^{k+1}) − ∇f_i(x_i^k),    (8b)

where A_ij^k and B_ij^k are the elements of row- and column-stochastic matrices, respectively, and α is the constant step-size. At each iteration, agent i ∈ 𝒱 maintains two variables: x_i^k, y_i^k ∈ ℝ^n, where x_i^k is the local estimate of the optimal solution and y_i^k is an estimate of the global gradient. We can clearly see that the update (8a) is essentially gradient descent with the direction governed by y_i^k instead of the local gradient ∇f_i(x_i^k). Evidently, the update (8b) tracks the average of local gradients, (1/m)∑_{i=1}^m ∇f_i(x_i^k), as in [26-31, 39]. When dealing with the distributed optimisation problem over time-varying directed networks, the TV-AB algorithm converges linearly. However, an explicit range for its step-size has not been derived, and accelerated techniques are not considered.
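To make the gradient-tracking mechanism concrete, the following toy sketch runs an AB-type iteration (row-stochastic mixing of estimates, column-stochastic mixing of tracked gradients) on scalar quadratics. The 3-agent directed-ring weights, objectives, and step-size are hypothetical choices of ours, not the paper's setup.

```python
import numpy as np

# AB-type update with gradient tracking on f_i(x) = 0.5 * (x - c_i)**2;
# the global minimiser is mean(c). A is row-stochastic, B column-stochastic.

c = np.array([1.0, 2.0, 6.0])
grad = lambda x: x - c                 # stacked local gradients

A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])        # rows sum to 1
B = np.array([[0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])        # columns sum to 1

alpha = 0.1
x = np.zeros(3)
y = grad(x)                            # y_0 = local gradients
for _ in range(300):
    x_new = A @ x - alpha * y          # consensus step plus tracked descent
    y = B @ y + grad(x_new) - grad(x)  # gradient-tracking update
    x = x_new
# every x_i approaches the global minimiser mean(c) = 3.0
```

Because 1^T B = 1^T, the sum of the y_i always equals the sum of the current local gradients, which is why y_i ends up estimating the global gradient direction.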

Distributed algorithm development
Motivated by the TV-AB algorithm [27], we provide a novel algorithm that incorporates the distributed heavy-ball method and the distributed Nesterov method into the iteration. At iteration k, each agent i ∈ 𝒱 maintains three variables: x_i^k, s_i^k, y_i^k ∈ ℝ^n. The distributed form of the proposed algorithm is given as follows:

x_i^{k+1} = ∑_{j=1}^m A_ij^k x_j^k − α_i y_i^k + β_i(x_i^k − x_i^{k−1}),    (9a)
s_i^{k+1} = x_i^{k+1} + β_i(x_i^{k+1} − x_i^k),    (9b)
y_i^{k+1} = ∑_{j=1}^m B_ij^k y_j^k + ∇f_i(s_i^{k+1}) − ∇f_i(s_i^k),    (9c)

where the weights A_ij^k and B_ij^k satisfy Assumption 3. Both the uncoordinated step-size, α_i, and the momentum parameter, β_i, are locally decided by agent i. As shown above, we incorporate the heavy-ball method [28] into (9a), which can nearly be seen as a centralised gradient descent step. In (9b), we bring in another type of acceleration, namely the distributed Nesterov method [31]. Using the outcome of update (9b), (9c) performs the gradient-tracking update.
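Our reading of (9a)-(9c) can be sketched as follows. Since the displayed equations are reconstructed from the surrounding description, treat this as a sketch under our assumptions: the 3-agent weights, the uncoordinated step-sizes alpha_i, and the momentum parameters beta_i are all illustrative, not the paper's reference implementation.

```python
import numpy as np

# Sketch of the double-accelerated update as we read it: heavy-ball momentum
# in the x-update (9a), a Nesterov correction s (9b), and gradient tracking
# driven by gradients evaluated at s (9c).

c = np.array([1.0, 2.0, 6.0])
grad = lambda s: s - c                  # f_i(x) = 0.5 * (x - c_i)**2

A = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
B = np.array([[0.5, 0.0, 0.5], [0.5, 0.5, 0.0], [0.0, 0.5, 0.5]])
alpha = np.array([0.08, 0.10, 0.09])    # uncoordinated, locally chosen step-sizes
beta = np.array([0.20, 0.15, 0.10])     # uncoordinated momentum parameters

x = x_prev = np.zeros(3)
s = x.copy()
y = grad(s)                             # y_0 = grad F(s_0)
for _ in range(400):
    x_new = A @ x - alpha * y + beta * (x - x_prev)   # (9a) heavy-ball step
    s_new = x_new + beta * (x_new - x)                # (9b) Nesterov correction
    y = B @ y + grad(s_new) - grad(s)                 # (9c) gradient tracking
    x_prev, x, s = x, x_new, s_new
# every x_i approaches the global minimiser mean(c) = 3.0
```

Note that the tracked gradients are evaluated at the corrected points s_i, so both momentum mechanisms act in every iteration, which is the "double acceleration" the text describes.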
We now write the matrix-vector form of (9a)-(9c). The local variables x_i^k, s_i^k, y_i^k are stacked into x_k, s_k, y_k ∈ ℝ^{mn}, respectively; α ≜ [α_1, …, α_m]^T is the vector of local step-sizes and β ≜ [β_1, …, β_m]^T is the vector of momentum parameters. Then, initialised with arbitrary x_0, s_0 and y_0 = ∇F(s_0), the matrix-vector form of (9) is summarised in the following.

CONVERGENCE ANALYSIS
In what follows, we start with a state transformation, under which algorithm (10) can be equivalently rewritten as (11). One can verify that {R_k} is a sequence of row-stochastic matrices with absolute probability sequence {v_k}. The definition of an absolute probability sequence is presented next.

Definition 2 ([41]). An ergodic sequence of row-stochastic matrices, where U(c, n) = ∏_{i=n}^c Π_i is the backward product of {Π_k} and d_n^s is a constant independent of i.
We next analyse the convergence of algorithm (11). Our analysis depends on the following quantities: (i) the weighted average x̄_k^w = (π_k^T ⊗ I_n)x_k, where {π_k} is the absolute probability sequence of {A_k}; (ii) the networked weighted consensus error r_k = x_k − 1_m ⊗ x̄_k^w; (iii) the optimality gap corresponding to the weighted estimate, x̂_k^w = x̄_k^w − x̃*; (iv) d_k = x_k − x_{k−1}, which can be regarded as the state difference between consecutive updates; (v) an error term ĝ_k^w associated with the gradient estimation. Leveraging the above quantities, we define the vector t_k = [‖x̂_k^w‖, ‖r_k‖, ‖d_k‖, ‖ĝ_k^w‖]^T and prove that t_k → 0 as k → ∞. Clearly, x_k converges to x* if t_k → 0. In the following analysis, we aim at establishing a linear system inequality whose coefficients are the elements of a matrix M_ᾱ,β̄, where ᾱ = [α]_max and β̄ = [β]_max. Clearly, if ρ(M_ᾱ,β̄) < 1, then x_k converges to x* at least at the rate of O(ρ(M_ᾱ,β̄)^k). Note that the constant C̄ = max{C̄_A, C̄_B} guarantees concurrent multi-step contractions for x̂_k^w and ĝ_k^w, where C̄_A and C̄_B refer to Lemma 2 and Corollary 2, respectively. In the following, we present some necessary lemmas.

Auxiliary relations
where {π_s} is the unique absolute probability sequence for {D_s} and is uniformly bounded away from zero; that is, there exists δ ∈ (0, 1) such that [π_s]_i ≥ δ for all i and all s ≥ 0. The convergence rate is geometric; that is, for all t ≥ s ≥ 0, the backward products converge at the rate K q^{t−s}, where the constants K > 0 and q ∈ (0, 1) depend on w_1 and m in Assumption 3.
The next corollary derives the absolute probability sequence for the sequence {A_k} in terms of {π_k} from Lemma 1.

Corollary 1.
Suppose that Assumptions 2 and 3 hold. Then, the sequence {π_k} is an absolute probability sequence for the matrix sequence {A_k}, where π_k is expressed through the backward products of {A_k} from Lemma 1 for some s ≥ 0. Proof. From Lemma 1, we know that {π_k} is an absolute probability sequence, and the expression follows from the corresponding backward products. □ We next construct a C̄_A-step contraction for {π_k}, which is fundamental for the convergence analysis of the proposed algorithm.

Lemma 2 ([22]). Suppose that Assumptions 2 and 3 hold. Then, a C̄_A-step contraction holds with the constant Q_A = 2m^{m C̄_A + 1}. The next corollary constructs the multi-step contraction for the sequence {v_k}.

Corollary 2 ([27]). Suppose that Assumptions 2 and 3 hold, and recall Q_B = 2m^{m C̄_B + 1}, where C̄_B ≥ C is an integer. Then, for all k ≥ C̄_B − 1 and any vector b ∈ ℝ^{mn}, the stated contraction holds for the corresponding C̄_B-step product applied to b. The following lemma is standard in convex optimisation theory [26,28]; it shows a fixed contraction ratio towards the optimal solution.

Lemma 3 ([28]). Let f be L_f-smooth and μ-strongly convex. For any x̃ ∈ ℝ^n and 0 < α̃ < 2/L_f, it holds that

‖x̃ − α̃∇f(x̃) − x̃*‖ ≤ σ‖x̃ − x̃*‖,

where σ = max{|1 − α̃μ|, |1 − α̃L_f|} < 1.

To proceed, we start by seeking an upper bound on ‖y_k‖.

Lemma 4. Suppose that Assumption 1 holds. Then, for all k ≥ 0, the following inequality holds. Proof. See Appendix A. □

Lemma 5. Suppose that Assumptions 1-3 hold. Then, for all k ≥ 0, the following inequality holds. Proof. See Appendix B. □

Lemma 6. Suppose that Assumptions 1-3 hold. Then, for all k ≥ 0, the following inequality holds. Proof. See Appendix C. □

Lemma 7. Suppose that Assumptions 1-3 hold. Then, for all k ≥ 0, the following inequality holds. Proof. See Appendix E. □

Lemma 9 ([42]). For a nonnegative matrix X ∈ ℝ^{n×n} and an arbitrary positive vector x ∈ ℝ^n, if Xx < wx with w > 0, then ρ(X) < w.
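Lemma 9 is easy to check numerically; the nonnegative matrix, positive vector, and bound w below are arbitrary examples of ours.

```python
import numpy as np

# Numerical illustration of Lemma 9: if X is nonnegative, x > 0 elementwise,
# and X @ x < w * x elementwise, then the spectral radius rho(X) < w.

X = np.array([[0.2, 0.3],
              [0.1, 0.4]])
x = np.array([1.0, 1.0])
w = 0.6

assert np.all(X @ x < w * x)                   # X @ x = [0.5, 0.5] < 0.6
rho = float(max(abs(np.linalg.eigvals(X))))    # spectral radius of X
# rho = 0.5 here, strictly below w, as the lemma predicts
```

This is exactly how the lemma is used later: exhibiting one positive vector that the system matrix strictly contracts certifies ρ(M) < 1 and hence linear convergence.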
We next define some key matrices used in Theorem 1 as follows, where the definitions of the associated parameters are placed in Appendix F for clarity of exposition.

Lemma 10 ([42]). Suppose that there are two nonnegative matrices with the same dimensions, T and T̃. If every element of T̃ is no less than the corresponding element of T, that is, T_ij ≤ T̃_ij, then ρ(T) ≤ ρ(T̃).
Therefore, we redefine the aforementioned matrices as follows, where the parameters are defined in Appendix F. Furthermore, to simplify the calculations, we define some parameters used in Theorem 1. Then, for all k ≥ C̄ − 1, the following inequality holds.

Remark 2.
It is worth mentioning that the theoretical bounds on the largest step-size and the largest momentum parameter derived in Theorem 1 are conservative. In practice, one can manually choose more appropriate step-sizes and momentum parameters. We also note that the bounds on the largest step-size and the largest momentum parameter contain some global parameters, such as the global smoothness and strong-convexity constants. Therefore, we use the notion of sufficiently small but positive step-sizes and momentum parameters to theoretically guarantee that all communication and computation is conducted in a fully local way, which is common in the recent literature [23,28]. Besides, we emphasise that the theoretical analysis of the acceleration brought by the distributed heavy-ball method and the distributed Nesterov method remains an open problem; this paper demonstrates the acceleration through a series of simulations.
In the following, we apply the proposed algorithm to fixed directed networks, which can be considered as a special case of time-varying directed networks. Specifically, in fixed directed networks, we assume that A_k = A, B_k = B, 𝒢_k = 𝒢 and ℰ_k = ℰ for k = 0, 1, …. The next assumption serves as a special case of Assumption 2.

Assumption 4.
The directed network, 𝒢, is fixed and strongly connected.

Assume that π_r ∈ ℝ^m is the left eigenvector of the row-stochastic matrix, A, and π_c ∈ ℝ^m is the right eigenvector of the column-stochastic matrix, B. The next lemma is built on the above assumptions.
Lemma 11 ([32]). Suppose that the communication network, 𝒢, is fixed and strongly connected. Considering the augmented weight matrices, 𝒜 and ℬ, there exist vector norms, ‖⋅‖_r and ‖⋅‖_c, under which the corresponding contractions hold.

Before presenting Theorem 2, we first define an important matrix, W, whose entries involve the constants defined in Lemma 11; the definitions of the parameters indexed i = 8, 9, 10 and i = 17, …, 29 are placed in Appendix F. Due to the introduction of the vector norms, ‖⋅‖_r and ‖⋅‖_c, we define a new vector t_k = [‖x̂_k^w‖_r, ‖r_k‖, ‖d_k‖, ‖ĝ_k^w‖_c]^T. Moreover, we define some parameters used in Theorem 2.

Theorem 2. Suppose that Assumptions 1 and 4 hold and let 0 < ᾱ < 1/(mL_f). Then, for all k ≥ 0, the following inequality holds: t_{k+1} ≤ W t_k. When the largest step-size, ᾱ, and the largest momentum parameter, β̄, satisfy the stated bounds, then ρ(W) < 1 and thus ‖x_k − x*‖ converges linearly to zero at least at the rate of O(ρ(W)^k).
Proof. The proof follows the same lines as that of Theorem 1; the details are omitted for length considerations. □

SIMULATION RESULTS
In order to illustrate the performance and applications of the proposed algorithm, some numerical simulations are conducted in this section. We utilise a simple uniform weighting strategy to construct the row- and column-stochastic weights A_ij^k, B_ij^k: each agent i assigns A_ij^k = 1/(d_i^{k,in} + 1) to itself and each of its in-neighbours, and B_ij^k = 1/(d_j^{k,out} + 1) to itself and each of its out-neighbours, where d_i^{k,in} is the in-degree of agent i at iteration k and d_j^{k,out} is the out-degree of agent j at iteration k. The performance of all the tested algorithms is described by plotting the residual. In the following simulations, we assume that the samples are distributed equally over the agents, that is, q_i = N/m, i ∈ 𝒱, where N is the total number of samples and m represents the number of agents in the network. Recall that ᾱ and β̄ denote the largest constant step-size, α_i, and the largest momentum parameter, β_i, i = 1, …, m, respectively. We emphasise that in the following simulations, EXTRA works on a fixed undirected network generated by the Laplacian method [28].
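The uniform weighting strategy can be sketched as follows; since the displayed formula is reconstructed, the helper name and edge-set encoding are our own assumptions.

```python
import numpy as np

# Build a row-stochastic A_k from in-neighbour counts and a column-stochastic
# B_k from out-neighbour counts, with self-loops included, from the directed
# edge set at iteration k.

def uniform_weights(m, edges):
    """edges: set of (i, j) pairs meaning i -> j, agents numbered 0..m-1."""
    in_nbrs = {j: {j} for j in range(m)}     # in-neighbours incl. self
    out_nbrs = {i: {i} for i in range(m)}    # out-neighbours incl. self
    for i, j in edges:
        in_nbrs[j].add(i)
        out_nbrs[i].add(j)
    A = np.zeros((m, m))
    B = np.zeros((m, m))
    for j in range(m):
        for i in in_nbrs[j]:
            A[j, i] = 1.0 / len(in_nbrs[j])    # each row of A sums to 1
    for i in range(m):
        for j in out_nbrs[i]:
            B[j, i] = 1.0 / len(out_nbrs[i])   # each column of B sums to 1
    return A, B

# Directed 3-cycle 0 -> 1 -> 2 -> 0 as an example.
A, B = uniform_weights(3, {(0, 1), (1, 2), (2, 0)})
```

Note that building A needs only locally received messages (in-neighbours), whereas B requires each sender to know its out-degree, mirroring the discussion of row- versus column-stochastic weights in Section 1.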

Distributed least-squares
This simulation considers a least-squares problem from [3] that solves for an unknown signal x̃ ∈ ℝ^n, where a network of m = 5 agents cooperatively solves the following optimisation problem:

min_{x∈ℝ^n} ∑_{i=1}^5 (1/2)‖C_i x − d_i‖²,

where C_i ∈ ℝ^{q_i×n} is the sensing matrix, d_i = C_i x̃ + e_i is the measurement taken by agent i = 1, …, 5, and e_i ∈ ℝ^{q_i} is independent and identically distributed noise. We set q_i = 1000 and the dimension of all signals to n = 10, where each dimension represents one feature of the signals. As shown in Figure 1, the m = 5 agents communicate with each other over a time-varying network with C = 4. Figure 2 demonstrates that the proposed algorithm performs better than the other tested algorithms and that the convergence rates are linear.

Distributed quadratic programming
In this case study, we explore the impact of the condition number κ = L_f/μ on the acceleration. Assume that a network of m = 50 agents cooperatively solves

min_{x∈ℝ^n} ∑_{i=1}^{50} ((1/2)x^T G_i x + c_i^T x),

where the matrices G_i ∈ ℝ^{n×n} are diagonal and positive-definite, the vectors c_i ∈ ℝ^n are randomly generated, and the dimension is set to n = 100. Figure 3 depicts the time-varying directed network at a particular moment. The condition number, κ, of f is calculated as the ratio of the largest to the smallest eigenvalue of G = ∑_{i=1}^m G_i. It is worth mentioning that the proposed algorithm reduces to TV-AB with uncoordinated step-sizes when the momentum parameter β = 0. From Figures 4-6, one can clearly see the acceleration achieved by the proposed algorithm across different condition numbers.
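Generating the quadratic test problem and computing its condition number can be sketched as follows; the sizes are reduced from the paper's (m = 50, n = 100) and the eigenvalue range is an arbitrary choice of ours.

```python
import numpy as np

# f_i(x) = 0.5 * x^T G_i x + c_i^T x with diagonal positive-definite G_i;
# the condition number of f is the ratio of the largest to the smallest
# eigenvalue of G = sum_i G_i.

rng = np.random.default_rng(1)
m, n = 5, 4
G_list = [np.diag(rng.uniform(1.0, 10.0, size=n)) for _ in range(m)]
c_list = [rng.standard_normal(n) for _ in range(m)]

G = sum(G_list)
eigs = np.diag(G)                    # G is diagonal, so its eigenvalues sit on the diagonal
kappa = eigs.max() / eigs.min()      # condition number L_f / mu

# The minimiser of f(x) = 0.5 x^T G x + c^T x solves G x = -c.
c = sum(c_list)
x_star = -c / eigs
```

Rescaling the sampling range of the G_i entries changes kappa directly, which is how one would reproduce the sweep over condition numbers behind Figures 4-6.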

Distributed logistic regression
In this experiment, large-scale binary classification problems based on a real-world data set are solved by the proposed algorithm and other tested algorithms over a fixed directed network. The global objective function can be formulated as f(x) = ∑_{i=1}^m f_i(x), and the local objective function, f_i, can be formulated as

f_i(x) = ∑_{j=1}^{q_i} ln(1 + exp(−b_ij c_ij^T x)) + (λ/(2m))‖x‖²,

where c_ij ∈ ℝ^n is the j-th training sample and b_ij ∈ {+1, −1} is the corresponding label, both of which are only accessed by agent i, and (λ/(2m))‖x‖² is a regularisation term used to avoid overfitting of the data. We set m = 100 and λ = 0.1 in this simulation and utilise the MNIST handwritten digit database [43]. A subset of 12,000 samples of the digits 8 and 9, labelled −1 and +1 respectively, is chosen from MNIST, from which N = 8000 random samples are chosen as the training set, and the remaining samples are used for testing. Figure 7 shows a randomly selected part of the training samples, where each image is represented as a 784-dimensional feature vector. Figure 8 depicts a fixed strongly connected directed network with m = 100 agents. The performance of the tested algorithms shown in Figure 9 demonstrates that the proposed algorithm achieves faster linear convergence than the existing well-known algorithms over a fixed directed network. Figure 10 depicts the testing error rate of the tested algorithms.
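The local logistic-regression loss and its gradient, as we read the (reconstructed) formula, can be sketched as follows; the data below is synthetic rather than MNIST, and all sizes are illustrative.

```python
import numpy as np

# f_i(x) = sum_j log(1 + exp(-b_ij * c_ij^T x)) + (lam / (2 m)) * ||x||^2,
# with labels b_ij in {-1, +1}.

rng = np.random.default_rng(2)
m, n, samples, lam = 4, 6, 20, 0.1

def local_loss(x, Ci, bi):
    margins = bi * (Ci @ x)
    return np.sum(np.log1p(np.exp(-margins))) + lam / (2 * m) * (x @ x)

def local_grad(x, Ci, bi):
    margins = bi * (Ci @ x)
    sig = 1.0 / (1.0 + np.exp(margins))        # equals exp(-t) / (1 + exp(-t))
    return -(Ci.T @ (bi * sig)) + (lam / m) * x

# Synthetic stand-in for one agent's share of the data.
Ci = rng.standard_normal((samples, n))
bi = np.where(rng.standard_normal(samples) > 0, 1.0, -1.0)
x = rng.standard_normal(n)

eps = 1e-6
g = local_grad(x, Ci, bi)
# g can be validated against a central finite difference of local_loss
```

Since the loss is lam/m-strongly convex and smooth, it fits the assumptions under which the linear rates of Section 3 apply.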

CONCLUSIONS AND FUTURE WORK
In this paper, we concentrated on distributed optimisation over time-varying directed networks. After reviewing the distributed heavy-ball method, the distributed Nesterov method, and the TV-AB algorithm, a double accelerated distributed optimisation algorithm was proposed. Then, based on the analysis of matrix inequalities, we showed that the proposed algorithm converges linearly to the global optimal solution. Moreover, the proposed algorithm was applied to fixed directed networks and also achieves a linear convergence rate. We emphasise that the acceleration was not proved analytically but was observed via a series of simulations, which is left as our future work.

APPENDIX A: PROOF OF LEMMA 4

Proof. Due to the fact that (1_m^T ⊗ I_n)∇F(x*) = 0, the stated bound holds, where we apply the Lipschitz continuity of ∇F in the first inequality and the second inequality uses Jensen's inequality. In the next step, we continue to bound ‖s_k − x*‖. From (11c), we obtain the corresponding bound. Combining (A2) with (A3) yields the result, noting that ‖y_k‖ ≤ ‖V_k ⊗ I_n‖‖g_k‖.

APPENDIX B: PROOF OF LEMMA 5
Proof. According to (11b), the stated bound holds, where the relevant constants are defined in Lemma 2 and the second inequality applies Lemma 2. With the help of Lemma 4, one can directly obtain the next bound, where the first inequality leverages (11b) and (11c). Then, from [11, Lemma 4], we know that π_{k+1}^T v_k ≥ 1/n^{nC̄} and π_{k+1}^T v_k ≤ 1 for all k ≥ 0. Thus, applying Lemma 3 for 0 < ᾱ < 2/L_f and combining (C1) and (C2) with (C3), the upper bound on ‖r_{k+1}‖ is obtained.

APPENDIX D: PROOF OF LEMMA 7
Proof. From (11b), one can verify the claim. □

APPENDIX F: DEFINITION OF PARAMETERS
For clarity of exposition, we place the definitions of the parameters used throughout the analysis in this section.