A conditional gradient algorithm for distributed online optimization in networks

This paper addresses a network of computing nodes aiming to solve an online convex optimisation problem in a distributed manner, that is, by means of local estimation and communication, without any central coordinator. An online distributed conditional gradient algorithm is developed, which can effectively tackle the high time complexity of distributed online optimisation. The proposed algorithm allows the global objective function to be decomposed into the sum of local objective functions, and the nodes collectively minimise the sum of local time-varying objective functions while the communication pattern among nodes is captured by a connected undirected graph. By adding a regularisation term to the local objective function of each node, the proposed algorithm constructs a new time-varying objective function. The proposed algorithm also utilises a local linear optimisation oracle to replace the projection operation so that the regret bound of the algorithm can be effectively improved. By introducing the nominal regret and the global regret, the convergence properties of the proposed algorithm are theoretically analysed. It is shown that, if the objective function of each agent is strongly convex and smooth, both types of regret grow sublinearly with order O(log T), where T is the time horizon. Numerical experiments also demonstrate the advantages of the proposed algorithm over existing distributed optimisation algorithms.


INTRODUCTION
Distributed convex optimisation has attracted considerable attention from researchers in many fields [1][2][3][4][5][6][7]. Classical problems such as distributed tracking, estimation and detection are essentially optimisation problems [8]. In a distributed optimisation problem, a global optimisation task is divided among the nodes of a network. As each node has limited resources or only partial information about the task, the nodes collaborate to collect data and update their local estimations by sharing the collected information. Distributed optimisation imposes a lower computational burden on each node, and the networked system remains robust even if a node encounters a local failure; it can therefore effectively overcome the defect of relying on a single information processing unit in a centralised scenario. Distributed optimisation has been widely applied to the case of time-invariant cost functions [9,10]. In many practical applications, however, distributed network systems operate in dynamic and uncertain environments. For example, consider the issue of tracking moving objects, the purpose of which is to track the position, velocity and acceleration of the objects. Such issues have long been a main focus of online learning in the field of machine learning. Therefore, integrating online optimisation with distributed optimisation, and employing time-varying cost functions to represent the uncertainty of a multi-agent network system, can be effective for the real-time processing of dynamic data streams at network nodes.
With the rapid development of distributed online optimisation, many traditional optimisation algorithms have been extended to the distributed online setting. Traditional algorithms such as the gradient descent method and the dual averaging method have been widely applied to distributed online optimisation in recent years. For instance, [1, 13-15] introduce four classical methods: distributed online gradient descent (DOGD), the distributed online alternating direction method of multipliers, distributed online dual averaging (DODA) and distributed online conditional gradient (DOCG).
The conditional gradient method (also known as Frank-Wolfe, FW) proposed in [16] is essentially a first-order optimisation method, which achieves a lower per-iteration computational cost than other effective optimisation algorithms (such as interior-point polynomial algorithms [19]). Compared with the classic projected gradient algorithms [17,18], the FW algorithm is more attractive due to its projection-free property and its ability to handle structural constraints. To be specific, a gradient step can move the current iterate away from the feasible set of the optimisation problem and thus produce an infeasible point. In such a case, feasibility must be regained via a projection operation. The projection requires a feasible point at the shortest distance from the current infeasible point, which is essentially equivalent to solving a convex quadratic programming problem. Solving such a convex quadratic programme is computationally demanding, especially for high-dimensional constrained optimisation problems, which makes it attractive to carry out efficient linear optimisation instead. Moreover, the FW method has proved to be a powerful tool for solving large-scale optimisation problems since it avoids critical issues such as the difficulty of computing orthogonal projections in first-order optimisation methods.
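To illustrate why replacing the projection with a linear optimisation step is attractive, the following Python sketch (ours, not part of the original algorithm) contrasts the two ideas: over an ℓ1 ball, the linear minimisation in the Frank-Wolfe step reduces to selecting a single signed vertex, an O(d) operation, whereas a Euclidean projection amounts to solving a quadratic programme.

```python
import numpy as np

def l1_linear_oracle(g, r=1.0):
    """argmin_{||s||_1 <= r} <s, g>: a vertex of the l1 ball.
    Pick the coordinate of g with largest magnitude and move to -r*sign."""
    i = np.argmax(np.abs(g))
    s = np.zeros_like(g)
    s[i] = -r * np.sign(g[i])
    return s

def frank_wolfe_step(x, grad, t, r=1.0):
    """One conditional gradient update: x <- x + eta_t (s - x)."""
    eta = 2.0 / (t + 2.0)  # classic diminishing FW step size
    s = l1_linear_oracle(grad, r)
    return x + eta * (s - x)

# Minimise f(x) = 0.5 ||x - b||^2 over the unit l1 ball, projection-free.
b = np.array([0.5, -2.0, 0.3])
x = np.zeros(3)
for t in range(200):
    x = frank_wolfe_step(x, x - b, t, r=1.0)  # grad f(x) = x - b
```

Every iterate is a convex combination of feasible points, so feasibility is maintained for free and no quadratic programme is ever solved.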
Therefore, by introducing a local linear optimisation oracle into the original conditional gradient algorithm, we present an online distributed conditional gradient algorithm (ODCG for short), which extends the conditional gradient online optimisation algorithm to the distributed setting. The regret bound, first proposed in [20], is now a well-established tool for characterising the performance of an online learning algorithm, and the quality of an online algorithm mainly depends on the growth rate of its regret bound.
The main contributions of this paper are summarised briefly as follows: (1) firstly, in comparison with the algorithms in [1] and [14], our proposed algorithm constructs a new time-varying objective function by adding a regularisation term to the local cost function of each node; (2) secondly, different from the algorithm proposed in [15], our proposed algorithm finds the next iteration direction by using a local linear optimisation oracle, which improves the regret of the algorithm; (3) thirdly, we establish convergence properties of the proposed algorithm by introducing the nominal regret and the global regret, which not only effectively capture all the local regrets of the entire network but also characterise the quality of the local estimations from the perspective of the entire network.
The remainder of this paper is organised as follows. In Section 2, we describe the problem framework and propose an online distributed conditional gradient algorithm. In Section 3, we provide a theoretical analysis of the regret bound of the proposed algorithm in the distributed online convex programming. Empirically, our algorithm outperforms other existing algorithms on the L2 regularised linear support vector machine problem, as shown in Section 4. Finally, Section 5 concludes the paper.
Notations and Terminologies: Throughout the paper, we use ℝ^d to denote the d-dimensional real space, ℝ_+ to represent the set of positive real numbers, and ∇ to denote the gradient operator. We employ [n] to denote the set {1, 2, … , n} for any integer n and ‖ ⋅ ‖ to represent the Euclidean norm. Given a pair of vectors x, y ∈ ℝ^d, ⟨x, y⟩ denotes the standard Euclidean inner product. In addition, 𝒦 ⊆ ℝ^d denotes a closed convex subset of ℝ^d, and all vectors are in column format.

Distributed online convex optimisation
We begin by stating the problem of online convex optimisation for a single learner and then present the networked version, which is the focus of this paper. In each round t ∈ {1, 2, … , T}, a decision maker chooses an action x_t ∈ 𝒦 and, after committing to this decision, a convex objective function f_t : 𝒦 → ℝ is revealed. Consequently, the decision maker suffers a loss of f_t(x_t). In this scenario, we consider the centralised online optimisation problem

min_{x∈𝒦} Σ_{t=1}^T f_t(x),

where f_t : 𝒦 → ℝ is a time-varying global objective function.
We now review the setup for a distributed version of the online optimisation problem [11,12,14]. Consider a network of n agents communicating with each other over a connected undirected graph 𝒢 = (V, E), with vertex set V = [n] and edge set E ⊂ V × V. At each round t ∈ {1, … , T}, each agent i ∈ V chooses its state x_{i,t} ∈ 𝒦. After this, a local convex objective function f_{i,t} : 𝒦 → ℝ is revealed, and the agent suffers the loss f_{i,t}(x_{i,t}). In this scenario, the whole network at time t aims to minimise the objective function f_t(x) = Σ_{i=1}^n f_{i,t}(x), which is distributed among the agents and is revealed only after the agents have chosen their states. It is worth noting that each agent i only knows its own objective function f_{i,t}(x) and cannot access the global function f_t(x). Therefore, across the time period from t = 1 to T, the goal of the whole network is to solve the following distributed online optimisation problem

min_{x∈𝒦} Σ_{t=1}^T Σ_{i=1}^n f_{i,t}(x),    (1)

where each f_{i,t} : 𝒦 → ℝ is a time-varying and continuously differentiable convex objective function.

Regret
An online algorithm for solving (1) should mimic the performance of its offline counterpart, and the gap between them is called the regret. If the offline problem is to minimise Σ_{t=1}^T f_t(x) over 𝒦, then the regret is called a static regret [11]. The following two types of static regret bound [13] are adopted in this paper.
The first category is the nominal regret bound, that is,

R(T) = Σ_{t=1}^T Σ_{i=1}^n f_{i,t}(x_{i,t}) − min_{x∈𝒦} Σ_{t=1}^T Σ_{i=1}^n f_{i,t}(x).    (2)

The second type is the global regret bound, that is,

R_j(T) = Σ_{t=1}^T f_t(x_{j,t}) − min_{x∈𝒦} Σ_{t=1}^T f_t(x),    (3)

where f_t(x) = Σ_{i=1}^n f_{i,t}(x) and x_{j,t} denotes the estimation held by node j at time t.

Remark 1. Note that R(T) corresponds to all local regrets of the entire network. However, it may not effectively reflect the similarity among the local estimations: the differences among local estimations may be very large even if R(T) is small. Hence, it is necessary to introduce the global regret R_j(T), which characterises the quality of the local estimations from the perspective of the entire network. Moreover, in the case of centralised online learning (n = 1), the global regret and the nominal regret coincide, but they are utterly different in the distributed online learning case. Therefore, it is more significant to analyse the global regret than the nominal regret in the distributed online learning case.
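To make the two regret notions concrete, the following Python sketch (with hypothetical helper names of our own: `f(i, t, x)` stands for f_{i,t}(x), `X[t][i]` for x_{i,t} and `x_star` for the offline minimiser) computes the nominal regret and the global regret of node j for a finished run. For n = 1 the two quantities coincide, as noted in Remark 1.

```python
def nominal_regret(f, X, x_star):
    """R(T): each node i is charged its own loss f_{i,t}(x_{i,t})."""
    T, n = len(X), len(X[0])
    return sum(f(i, t, X[t][i]) - f(i, t, x_star)
               for t in range(T) for i in range(n))

def global_regret(f, X, x_star, j):
    """R_j(T): node j's trajectory is charged the whole network loss f_t."""
    T, n = len(X), len(X[0])
    return sum(f(i, t, X[t][j]) - f(i, t, x_star)
               for t in range(T) for i in range(n))

# Toy check with scalar quadratic losses and a single node (n = 1).
f = lambda i, t, x: (x - t) ** 2
X = [[0.5], [1.5], [2.0]]   # T = 3 rounds, one node
x_star = 1.0                # offline minimiser of sum_t (x - t)^2
```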

An online distributed conditional gradient algorithm
We are now in a position to state the proposed ODCG in Algorithm 1 to solve the optimisation problem (1).
Before stating the algorithm, we introduce the following definition, which is a primary ingredient of our work.

Definition 1. (Local Linear Optimisation Oracle [22]). A procedure LLOO(x, r, g), where x ∈ 𝒦, r ∈ ℝ_+ and g ∈ ℝ^d, is a local linear optimisation oracle with parameter ρ ≥ 1 for the set 𝒦 if it returns a point p ∈ 𝒦 such that ⟨p, g⟩ ≤ ⟨y, g⟩ for every y ∈ 𝔹(x, r) ∩ 𝒦 and ‖x − p‖ ≤ ρr, where 𝔹(x, r) denotes the Euclidean ball of radius r centred at x.

ALGORITHM 1 Online Distributed Conditional Gradient Algorithm (ODCG)

1: Input: convex set 𝒦 and its diameter D, maximum round number T, upper bound of the gradient G, parameter σ, step-sizes {η_t}_{t=1}^T and the weighted adjacency matrix W.
⋮
5: Compute subgradients g_{i,t} = ∇f_{i,t}(x_{i,t}).
⋮

The extension of the local linear optimisation oracle to the distributed case is shown in detail in Algorithm 1.
The implementation of the ODCG algorithm is simple and proceeds as follows. Each node i first observes its loss function f_{i,t} and suffers the loss f_{i,t}(x_{i,t}) at time t. Subsequently, each node i computes its subgradient ∇f_{i,t}(x_{i,t}) (cf. Step 5 in Algorithm 1). To make full use of the information held by the nodes, information exchange among the nodes is carried out by weighted averaging. Here, we design an update expression that produces an estimation of the gradient (ĝ_{i,t}) instead of directly performing a weighted average of the original gradient information (g_{i,t}) (cf. Steps 7 and 8 in Algorithm 1), which helps preserve the security of the original information. Moreover, compared with the original standard online conditional gradient algorithm [21] and the algorithms in [1,14], we define a new time-varying objective function F_{i,t}(x) by adding a new time-varying regularisation term to the time-varying cost function (cf. Step 9 in Algorithm 1), and find the next iteration direction by the local linear optimisation oracle step given in Step 10 of Algorithm 1. Finally, each node computes its new update after interacting with its neighbouring nodes (cf. Steps 11 and 12 in Algorithm 1).
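The round structure described above can be sketched in code. Since the exact update formulas of Steps 7-12 are not reproduced here, the following Python fragment is only our schematic reading of the description, with gossip-averaged gradient estimates, a σ-regularised surrogate gradient, and the LLOO step replaced, for concreteness, by exact linear minimisation over an ℓ1 ball; it is not the paper's verbatim algorithm.

```python
import numpy as np

def odcg_round(X, g_est, grads, W, eta, sigma, x_anchor, r=1.0):
    """One schematic ODCG round (our reading of Steps 5-12, not verbatim).
    X: (n, d) node states; g_est: (n, d) previous gradient estimates;
    grads: (n, d) fresh local subgradients; W: (n, n) doubly stochastic."""
    # Steps 7-8: gossip the gradient *estimates* rather than raw gradients
    # (hypothetical recursion standing in for the paper's exact update).
    g_hat = W @ g_est + grads
    # Step 9: gradient of the sigma-regularised surrogate F_{i,t}.
    F_grad = g_hat + sigma * (X - x_anchor)
    # Step 10: linear optimisation in place of projection; here an exact
    # linear oracle over the l1 ball of radius r is used for concreteness.
    P = np.zeros_like(X)
    for i in range(X.shape[0]):
        k = np.argmax(np.abs(F_grad[i]))
        P[i, k] = -r * np.sign(F_grad[i, k])
    # Steps 11-12: move toward the oracle point, then average with neighbours.
    X_new = W @ (X + eta * (P - X))
    return X_new, g_hat

# A few rounds on a 3-node fully mixed network with random subgradients.
rng = np.random.default_rng(0)
W = np.full((3, 3), 1.0 / 3.0)
X = np.zeros((3, 2))
g_est = np.zeros((3, 2))
for t in range(1, 20):
    g = rng.normal(size=(3, 2))
    X, g_est = odcg_round(X, g_est, g, W, eta=1.0 / (t + 1),
                          sigma=0.1, x_anchor=np.zeros(2))
```

Because W is row-stochastic and each update is a convex combination of points in the ℓ1 ball, every iterate remains feasible without any projection.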
Remark 2. In Algorithm 1, the learning rate η_t is positive and non-increasing. For the centralised case, a local linear optimisation oracle is employed in [22] to find a minimiser of a linear problem over the feasible set 𝒦. Similarly, in the distributed case, our algorithm also introduces a local linear optimisation oracle (cf. Step 10 in Algorithm 1); replacing the projection with this efficient linear optimisation step, as in [22], allows the proposed algorithm to achieve a better regret bound.

REGRET ANALYSIS
In this section, we aim to establish convergence properties of the proposed ODCG algorithm under the following necessary assumptions.

Assumption 1. (a) Each function f_{i,t} : 𝒦 → ℝ is β-smooth and σ-strongly convex on the convex set 𝒦, that is, for all x, y ∈ 𝒦,

f_{i,t}(x) + ⟨∇f_{i,t}(x), y − x⟩ + (σ/2)‖y − x‖² ≤ f_{i,t}(y) ≤ f_{i,t}(x) + ⟨∇f_{i,t}(x), y − x⟩ + (β/2)‖y − x‖².

(b) Each function f_{i,t} is G-Lipschitz on 𝒦, that is, ‖∇f_{i,t}(x)‖ ≤ G for all x ∈ 𝒦.

Assumption 2. The convex set 𝒦 is compact with diameter D, that is, ‖x − y‖ ≤ D for all x, y ∈ 𝒦.

Assumptions 1 and 2 are common in distributed optimisation (e.g. [1,15,24]). In addition, many objective functions f_i satisfy this type of Lipschitz condition. For instance, Assumption 1(b) holds for any convex function on a compact domain, and for a polyhedral function on an arbitrary domain [27].

Assumption 3. The weighted adjacency matrix W of graph 𝒢 is symmetric and doubly stochastic, that is, W = Wᵀ and the entries of each row and each column of W sum to one.

Remark 3. Assumption 3 about the weighted adjacency matrix W is a common assumption in the distributed setting [1,14,15]. Such a weight matrix W has the simple eigenvalue one, and all of its other eigenvalues are strictly less than one in magnitude. Moreover, W should be designed such that it is symmetric and satisfies W_{ij} = W_{ji} > 0 if (i, j) ∈ E, and W_{ij} = W_{ji} = 0 otherwise.
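As an illustration of Assumption 3, the following Python sketch builds the Metropolis constant edge weight matrix used later in the experiments (the construction follows the standard Metropolis rule; the function name is ours) and lets one verify the symmetry, double stochasticity and eigenvalue properties mentioned in Remark 3.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis constant edge weight matrix of an undirected graph:
    W_ij = 1 / (1 + max(deg_i, deg_j)) on edges; diagonal absorbs the rest.
    The result is symmetric and doubly stochastic by construction."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j] > 0:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

# 5-node cycle graph: a connected graph with low connectivity.
n = 5
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1.0
W = metropolis_weights(adj)
eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]  # descending order
```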

Auxiliary lemmas
In this part, we present some lemmas that are useful for the regret analysis in the sequel. Step 9 of Algorithm 1 defines a new regularised function F_{i,t}(x); denote its minimiser over 𝒦 by x*_{i,t} = arg min_{x∈𝒦} F_{i,t}(x). Under Assumption 1, the following Lemma 1 illustrates that the function F_{i,t}(x) is Lipschitz continuous on the convex set 𝒦. Proof. For all y, z ∈ 𝒦, we have the displayed chain of inequalities, where the first inequality follows from the convexity of F_{i,t}(x), the second inequality holds due to the property of the norm, and the last inequality uses Assumption 2. Lemma 1 shows that F_{i,t}(x) is Lipschitz continuous with constant G + 2σD, which provides a mathematical condition for the analysis of the regret.
The above inequality also shows that the function f̂_{i,t}(x) is Lipschitz continuous, which plays an important role in the subsequent proof of Theorem 1.
Under the result of Lemma 1, we will provide the upper bound of ‖x i,t − x * i,t ‖ in Lemma 2.

Lemma 2. Suppose Assumptions 1 and 2 hold, and let the local estimations {x_{1,t}, x_{2,t}, … , x_{n,t}}_{t=1}^T be the sequence generated by Algorithm 1. Then ‖x_{i,t} − x*_{i,t}‖ admits an explicit upper bound. Proof. Since F_{i,t}(x) is Lipschitz continuous by Lemma 1, combining Lemma 1 with (4), together with the fact that ‖x_{i,t} − x*_{i,t}‖ ≥ 0, yields the claimed bound. Lemma 2 gives the upper bound of ‖x_{i,t} − x*_{i,t}‖, which provides a supporting condition for the derivation of the network error in the sequel.
Lemma 3 provides an upper bound on the deviation of the local estimation of each node at each iteration from their average value and thus reveals the network error.

Lemma 3. (Network Error). Suppose Assumption 3 holds. Let the local estimations {x_{1,t}, x_{2,t}, … , x_{n,t}}_{t=1}^T be the sequence generated by Algorithm 1 with the learning rate η_t = c/√t, c > 0. Then the stated bound holds for all i ∈ V, where λ_2(W) denotes the second largest eigenvalue of the matrix W and 1 − λ_2(W) denotes the corresponding spectral gap.
Therefore, the upper bound of the corresponding error term can be given as displayed, where the first inequality holds because p_{i,t} minimises the linear objective over the feasible set and the second inequality follows from Lemma 2.
Using W^τ to denote the τth power of W, and [W^τ]_{ij} to denote the entry in the ith row and jth column of W^τ, we can obtain the stated recursive relation by simple algebraic operations. Hence, the bound of the network error can be derived from (7), where the first inequality holds due to the property of the norm, the second inequality follows from the standard property of mixing matrices [25], and the last inequality holds by the choice of the learning rate. This completes the proof of the network error bound. Lemma 3 shows that the error bound depends on the network parameter λ_2(W). From the proof of Lemma 3, we can clearly identify the network error of the proposed algorithm and further characterise its robustness.

Regret bound
In this part, we will analyse the nominal regret and global regret of the ODCG algorithm under Lemmas 1-3, which are shown in Theorems 1 and 2.
Proof. Noting that F_0 = (σT_0/2)‖x − x_{1,1}‖² and using the definition of f̂_{i,t}(x), we obtain (14). Now, by the definitions of F_{i,t}(x) and f̂_{i,t}(x), together with (14), we get (15). Next, we bound the first term: summing (16) over time yields (17). We then turn our attention to the second term: since f̂_{i,t}(x) is Lipschitz continuous, Lemma 2 yields (18), and summing (18) over time yields (19). Plugging (17) and (19) into (13) gives the claimed bound. Remark 4. Theorem 1 implies that, under the strong convexity condition, the nominal regret of Algorithm 1 grows sublinearly with order O(log T). The result in Theorem 1 also represents the sum of all local regrets of the entire network. However, it does not effectively reflect the similarity among the local parameter estimations, because R(T) may be small while the differences among the local parameter estimations are very large. On the other hand, the nominal regret fails to incentivise collaboration among agents, since the agents are effectively independent and only need to learn strategies related to their local costs. Hence, it is necessary to introduce the global regret R_j(T), which considers settings where agents have an incentive to collaborate and characterises the quality of the local estimations from the perspective of the entire network. Theorem 2 reveals the global regret bound.
Proof. The difference between the two types of regret defined in (2) and (3) can be expanded as in (21). Using Lemma 3, the last inequality in (21) can be further bounded as in (22). Now, substituting (20) and (22) into (21) yields the claimed bound, which completes the proof of Theorem 2.
Remark 5. Theorem 2 indicates that the global regret for strongly convex functions also grows sublinearly with order O(log T), which measures the quality of the estimations over the entire network. It can also be seen that, in the centralised online learning case, the global regret and the nominal regret coincide, but they are utterly different in the distributed online learning case. In addition, 1 − λ_2(W) in Theorem 2 represents the spectral gap, which, for many families of undirected graphs, measures how fast information can propagate within the network (the larger the spectral gap, the faster the convergence). Our algorithm also extends the delicate local linear optimisation oracle in the centralised FW algorithm [22] to the distributed setting.

NUMERICAL EXPERIMENTS
In this section, we numerically illustrate the performance of the ODCG algorithm on the L2 regularised linear support vector machine (linear SVM) problem, using a synthetic data set and the Mushroom data set, both available from the SGDLibrary [26].
The objective is to solve an online optimisation problem with the L2 regularised hinge loss, where n is the number of nodes, y_{i,t} represents the label held by the ith node at time t, x_{i,t} ∈ ℝ^d denotes a data example of node i at time t, w = [w_1ᵀ; w_2ᵀ; ⋯ ; w_lᵀ] ∈ ℝ^{l×d} (l is the number of classes) is the weighting matrix to be optimised, and λ is a regularisation parameter (λ is set to 0.1).
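For reference, a subgradient of a binary ℓ2-regularised hinge loss can be sketched as follows; the paper's multiclass form is not fully reproduced above, so this binary version (with names of our own) is only an illustration of the per-node loss structure that each node differentiates at Step 5.

```python
import numpy as np

def svm_subgradient(w, x, y, lam=0.1):
    """Subgradient of f(w) = max(0, 1 - y <w, x>) + (lam / 2) ||w||^2,
    a binary l2-regularised hinge loss (an illustrative stand-in for the
    paper's multiclass objective)."""
    g = lam * w                    # gradient of the regulariser
    if y * np.dot(w, x) < 1.0:     # margin violated: hinge term is active
        g = g - y * x
    return g
```

For instance, at w = 0 every example violates the margin, so the subgradient is simply λw − yx = −yx.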
In our experiments, we consider the ℓ1 ball 𝒦_1 = {x ∈ ℝ^d : ‖x‖_1 ≤ r}, where ‖x‖_1 is the ℓ1 norm of x, and use the Metropolis constant edge weight matrix W [28]. The other parameters are set as follows: the diminishing step-sizes η_t = 1/√t and the parameter σ = 0.1.

Comparison with classical algorithms
To evaluate the performance of ODCG over its competing algorithms, we compare the ODCG with two classical algorithms over 10 nodes, that is, DOGD [1] and DODA [14]. The parameters we set for both DOGD and DODA are the same. The communication graph is set as described in Assumption 3.
To test the performance of the proposed algorithm on different data sets against its competing algorithms, we conduct experiments on the synthetic data set and the Mushroom data set, respectively. The results are shown in Figures 1 and 2. It can be clearly observed from Figures 1 and 2 that, on both the synthetic and the real data set, the proposed algorithm performs better than the other two algorithms, which illustrates the necessity and usefulness of the proposed algorithm in distributed online learning frameworks.

Comparison of different nodes
We further compare the performance of the ODCG algorithm with different numbers of nodes; the ODCG is tested with n = 6, 26, 54. Each node computes its new update by running lines 7-12 in Algorithm 1. Figure 3 depicts the variation of the costs f_t(w) of the proposed algorithm for the different network sizes. In Figure 3, we can clearly observe that the costs decrease more slowly on larger graphs than on smaller graphs, which confirms our theoretical results.

Comparison of different network topologies
Finally, we compare the performance of the ODCG algorithm under different choices of network topologies. The graphs are set as follows, representing different levels of connectivity.
• Complete graph [15]. This graph has the highest level of connectivity: all nodes are connected to each other.
• Cycle graph [15]. This graph has the lowest level of connectivity: each node has only two close neighbours.
• Watts-Strogatz (WS) graph. The generation of this graph depends on two tunable parameters: the average degree of the network K and the rewiring probability p [29]. Usually, the higher the rewiring probability, the better the connectivity of the graph [30]. We set K = 3 and p = 0.4 to achieve an intermediate level of connectivity in our simulation.
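The connectivity levels above can be compared quantitatively through the spectral gap 1 − λ_2(W) of the corresponding Metropolis weight matrix. The following Python sketch (function names are ours) builds the complete and cycle graphs and confirms that the complete graph has the larger spectral gap, consistent with its faster convergence in Figure 4.

```python
import numpy as np

def metropolis_weights(adj):
    """Metropolis weight matrix (symmetric, doubly stochastic)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j] > 0:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def spectral_gap(adj):
    """1 - lambda_2(W): larger means faster information propagation."""
    ev = np.sort(np.linalg.eigvalsh(metropolis_weights(adj)))[::-1]
    return 1.0 - ev[1]

def complete_graph(n):
    return np.ones((n, n)) - np.eye(n)

def cycle_graph(n):
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
    return A

gap_complete = spectral_gap(complete_graph(10))
gap_cycle = spectral_gap(cycle_graph(10))
```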
We run experiments on the aforementioned three types of graphs with fixed 10 nodes. Figure 4 depicts the performance of the proposed algorithm under different network topologies.
It is interesting to notice that graphs with better connectivity result in faster convergence.

CONCLUSION
This paper investigated the online optimisation problem over decentralised networks and proposed an online distributed conditional gradient algorithm. We then analysed the convergence properties of the proposed algorithm and showed that both the global regret and the nominal regret grow sublinearly with order O(log T). Finally, the effectiveness of the proposed algorithm was demonstrated on an L2 regularised linear support vector machine problem, with the advantages over existing algorithms confirmed through numerical experiments. It is worth noting that the regret analysis and the main results were established by assuming that the objective function of each agent is strongly convex and smooth (Assumption 1). However, in many practical applications, the objective function may be non-smooth. For the case of non-smooth objective functions, the authors of [31] proposed a distributed subgradient-free stochastic optimisation algorithm over time-varying networks and proved that the consensus of estimations and the global minimisation can be achieved with probability one. The authors of [32] presented a distributed quasi-monotone subgradient algorithm for non-smooth convex optimisation over directed graphs. For distributed non-smooth convex optimisation problems in second-order multi-agent systems, the authors of [33] recently developed a novel continuous-time distributed proximal-gradient algorithm with derivative feedback. Motivated by these excellent efforts [31-33], it would be of interest to investigate distributed online learning algorithms that attain dynamic regret bounds for non-smooth or non-convex objective functions over time-varying networks in future work.