On differential privacy for federated learning in wireless systems with multiple base stations
Abstract
In this work, we consider a federated learning model in a wireless system with multiple base stations and inter-cell interference. We apply a differentially private scheme to transmit information from users to their corresponding base station during the learning phase. We characterize the convergence behavior of the learning process by deriving an upper bound on its optimality gap. Furthermore, we define an optimization problem to reduce this upper bound and the total privacy leakage. To find locally optimal solutions of this problem, we first propose an algorithm that schedules the resource blocks and users. We then extend this scheme to reduce the total privacy leakage by optimizing the differential privacy artificial noise. We apply the solutions of these two procedures as parameters of a federated learning system in which each user is equipped with a classifier and communication cells typically have fewer resource blocks than users. The simulation results show that our proposed scheduler improves the average prediction accuracy compared with a random scheduler; in particular, the results show an improvement of over 6%. Furthermore, its extended version with the noise optimizer significantly reduces the amount of privacy leakage.
1 INTRODUCTION
Machine learning (ML) systems are expected to play an important role in future mobile communication standards [1]. With increasing applications of ML schemes in wireless systems, new technologies are emerging to enhance the performance of such systems. On the other hand, wireless technology itself can also be deployed to enhance ML procedures [2]. Among possible candidates, federated learning (FL) has been shown to have considerable promise [3-5] and has the potential to benefit from wireless communication.
FL solves several issues of centralized ML systems by distributing the learning task among several edge devices. One advantage of using an FL system, which makes it a good fit for a wireless setting, is that edge devices do not need to transmit their local datasets to the server. This reduces the amount of wireless resources required for accomplishing the given ML task. Apart from this, the privacy of each edge device is not completely compromised since the server does not have direct access to the data [6].
FL schemes operating over wireless networks have been extensively researched in recent years; see, for example, references [7-13]. In reference [8], the authors studied the effects of wireless parameters on the FL process. They derived an upper bound on the optimality gap of the convergence terms and proposed an optimization problem to minimize this upper bound by considering wireless parameters such as resource allocation, user scheduling, and packet error rate. Other works that studied the communication aspects of FL are references [14-17]. The work in reference [17] considers a wireless FL system and applies client scheduling to minimize the training latency for a given training loss. Moreover, FL with several layers of aggregation, or with hierarchy, has been studied in references [18-23].
Although the training data of each device in FL is not transmitted to the server, a function of the local model (a query) is still sent to the server. It has been shown that this local model might leak some information about the training data [24]. To mitigate this drawback, FL has been extensively studied together with a privacy preserving scheme called differential privacy (DP) [25].
DP-based schemes follow the principle that no one should be adversely affected much by having their data used in any analysis [26]. This powerful notion is well established and is applied in industry. To realize a DP-based FL system, each edge device adds some artificial noise to the information it transmits. This noise provides a certain amount of privacy depending on the noise power and the sensitivity of the query function.
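As an illustration of this principle, the following minimal sketch (generic Python with placeholder values for the query, its sensitivity, and the noise multiplier; it is not the transmission scheme used later in the paper) perturbs a query output with Gaussian noise whose standard deviation scales with the query's sensitivity.

```python
import numpy as np

def gaussian_mechanism(query_output, sensitivity, noise_multiplier, rng=None):
    """Perturb a vector-valued query output with i.i.d. Gaussian noise.

    The noise standard deviation is expressed as a multiple of the query's
    L2 sensitivity; a larger multiplier gives stronger privacy at the cost
    of a noisier (less useful) answer.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, noise_multiplier * sensitivity,
                       size=np.shape(query_output))
    return np.asarray(query_output) + noise

# Example: a mean-gradient query with L2 sensitivity 1.0 and noise multiplier 2.0.
true_answer = np.array([0.3, -1.2, 0.7])
private_answer = gaussian_mechanism(true_answer, sensitivity=1.0, noise_multiplier=2.0)
```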
DP-based FL and its convergence behavior have been extensively studied; see, for example, the works in references [27-34]. In this regard, the work in reference [27] addresses the privacy implementation challenges through a combination of zero-concentrated differential privacy, local gradient perturbation, and secure aggregation. Meanwhile, the work in reference [31] considers resource allocation and user scheduling to minimize the FL training delay under performance and DP requirements. In reference [33], decentralized FL algorithms with DP in the wireless Internet of Things (IoT) have been studied. Finally, reference [34] presents closed-form expressions for the global loss and the privacy leakage of a DP-based FL system and then minimizes both.
However, none of these works consider a joint learning and resource allocation scheme for DP-based FL that considers the effects of inter-cell interference. Here, we adopt the framework in reference [8] and consider a wireless FL system in a multiple base stations scenario. Additionally, we consider DP noise added to the average of gradients [27] and combine this approach with resource scheduling.
The goal of the FL system here is to train a global model for a given predictor. We introduce an iterative DP-based FL scheme with two levels of aggregation (Algorithm 1) and then derive an upper bound on the optimality gap of its convergence terms (Theorem 1).
We then propose an optimization problem whose goal is to improve the convergence of the upper bound on the optimality gap of Algorithm 1 and simultaneously reduce the total privacy leakage. In this regard, the optimization problem is with respect to certain variables like user and resource scheduling, uplink transmit powers, and the amount of DP noise that is applied by each user to its transmitting information.
Since the proposed optimization problem is a non-linear multi-variable mixed integer program, we divide it into two simpler schemes.
First, we present a sub-optimal approach that minimizes the objective function only with respect to the resource scheduling variables in a sequential manner, that is, cell by cell. This reduces the original problem to a linear integer programming task and substantially simplifies the implementation. We refer to this scheme as the optimal scheduler (OptSched). Since this approach proceeds sequentially from one cell to the next, the optimal transmit powers must be adjusted carefully due to the effects of inter-cell interference. To tackle this problem, we introduce a procedure to determine the users' optimal transmit powers by solving a simple optimization problem.
Next, we enhance the OptSched scheme by further minimizing the objective function of the proposed optimization problem with respect to the DP noise. This leads us to a convex optimization problem with respect to the DP noise standard deviations. We call this extended scheme the optimal scheduler with DP optimizer (OptSched+DP).
We present all the numerical optimizations and benchmarking results. In this regard, we apply Python optimization packages like CVXPY, CVXOPT, GLPK, and ECOS [35-39]. The numerical results show that our proposed schemes (OptSched and OptSched+DP) reduce the objective function of the optimization task substantially compared with the case in which we randomly allocate the resources and apply the DP noise.
Next, we apply these (sub-)optimal parameters to our iterative learning scheme (Algorithm 1). In this regard, each user is equipped with a fully connected multi-layer neural network as a classifier. Furthermore, we assume that communication cells typically have more users than available resource blocks, which is a reasonable assumption given the limited bandwidth. We then perform simulations to measure the accuracy, loss, and amount of privacy leakage in such a system for the proposed algorithms. To realize the simulations, we apply the TensorFlow, NumPy, and Matplotlib packages [40-42].
The simulations show that the OptSched scheme improves the classification accuracy mainly by scheduling the users who have larger data chunks and better uplink channels. In our simulation settings, an average accuracy improvement of over 6% is achieved compared with a random scheduler. The OptSched+DP scheme, on the other hand, achieves a significant reduction in the privacy leakage of individual users by systematically adjusting the DP noise power while moderately sacrificing accuracy.
Notation. We denote vectors by lowercase bold letters and matrices by uppercase bold letters; the identity matrix is written with its numbers of rows and columns indicated. Sets are denoted by calligraphic fonts. Random mechanisms, as a special kind of function, are represented by Fraktur fonts. The transpose of a vector is indicated by a superscript. Logarithms are assumed to be to base 2. The set of real numbers is represented by $\mathbb{R}$, and the usual shorthand is used for sets of consecutive integers.
2 SYSTEM MODEL
We begin this section by reviewing some preliminary notions on DP that are required here. The complete list of definitions can be found in references [25, 26, 43]. Table 1 provides a list of major notation that is used throughout this work.
| Notation | Description |
| --- | --- |
|  | Vector of all in the cell |
|  | Scheduling of the user |
|  | Communication rate of the user |
|  | Length (dimension) of models or |
|  | Distance of the user to its base station |
|  | Global loss function at round |
|  | Channel between the user and its station |
|  | Uplink interference power from the cell at |
|  | Total number of all samples |
|  | Total number of scheduled samples |
|  | Number of data samples (rows) of |
|  | Local loss function at the user |
|  | Upper bound on the gradients of local losses |
|  | Mechanism at the user at round |
|  | DP noise at the user at round |
|  | Vector of all in the cell |
|  | Transmit power at the user |
|  | Query function at the user at round |
|  | Number of available resource blocks in each cell |
|  | Matrix of all in the cell |
|  | Scheduling of resource block for |
|  | Number of learning rounds (iterations) |
|  | Set of users in the cell |
|  | Global model at round |
|  | Local model of the user at round |
|  | Database belonging to the user |
|  | th data sample (row) of |
|  | Optimization constant |
|  | Learning step size |
|  | Privacy leakage at the user |
|  | Vector of all in the cell |
|  | DP noise standard deviation at the user |
2.1 Differential privacy model
Let a data universe and a distribution on it be given. Assume that a database is represented by a matrix whose rows are independent and identically distributed (i.i.d.) samples (row vectors) drawn from this distribution. Two databases are called adjacent if they differ in only one row.
A query (mechanism) is a function that takes a database as input and produces a vector-valued output. If the output of the query contains randomness, then it is called a randomized mechanism.
In the following, we introduce the notion of privacy for randomized mechanisms defined on a given set of databases.
Definition 1. A randomized mechanism $\mathfrak{M}$ is said to be $\epsilon$-differentially private, or $\epsilon$-DP for short, if for every pair of adjacent databases $D, D'$ and every measurable subset $\mathcal{S}$ of its output space, we have that
$$\Pr[\mathfrak{M}(D) \in \mathcal{S}] \;\le\; e^{\epsilon}\,\Pr[\mathfrak{M}(D') \in \mathcal{S}].$$
Here, we apply a relaxed version of this privacy notion that is better suited to Gaussian mechanisms.
Definition 2. A randomized mechanism $\mathfrak{M}$ is said to be $\rho$-zero-concentrated differentially private (CDP), or $\rho$-zCDP for short, if for every pair of adjacent databases $D, D'$ and every $\alpha > 1$,
$$D_{\alpha}\big(\mathfrak{M}(D)\,\big\|\,\mathfrak{M}(D')\big) \;\le\; \rho\,\alpha,$$
where $D_{\alpha}(\cdot\,\|\,\cdot)$ denotes the Rényi divergence of order $\alpha$.
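For orientation, we recall the standard connection between Gaussian perturbation and zCDP; this is a textbook fact stated in generic notation, not one of the paper's numbered equations.

```latex
% Standard zCDP guarantee of the Gaussian mechanism; \Delta_2 is the L2
% sensitivity of the query q and \sigma the noise standard deviation.
\[
  \mathfrak{M}(D) \;=\; q(D) + \mathcal{N}\!\left(\mathbf{0}, \sigma^{2}\mathbf{I}\right)
  \quad\Longrightarrow\quad
  \mathfrak{M} \text{ is } \rho\text{-zCDP with } \rho = \frac{\Delta_2^{2}}{2\sigma^{2}}.
\]
```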
2.2 Federated learning model
Based on the notions from the previous section, we introduce our privacy preserving FL model for a system with multiple base stations. Let a collection of base stations be given such that they can communicate with each other through a main server. Assume that each base station serves a set of edge devices (users), where the users in this set follow some arbitrary but fixed order, and the number of users in a cell is the size of this set.
Figure 1 shows an example of a model with multiple base stations. In this example, all users of the cell depicted at the bottom of the figure are scheduled to participate in the FL process and receive the broadcast model vector.
[Figure 1: A federated learning model with multiple base stations.]
ALGORITHM 1. Privacy preserving federated learning with multiple stations.
1: | The main server broadcasts the quantities given by (3) and (6) to all base stations and their users. |
2: | The main server initializes the global model . |
3: | for do |
4: | The main server broadcasts to all base stations. |
5: | for base stations in parallel do |
6: | Base station broadcasts to all its users. |
7: | for users in parallel do |
8: | if then |
9: | The user updates its model as in (7). |
10: | The user then sends back to the base station . |
11: | end if |
12: | end for |
13: | The base station aggregates the received models as in (8). |
14: | The base station then sends back to the main server. |
15: | end for |
16: | The main server aggregates all models as in (9). |
17: | end for |
Next, the main server broadcasts the new global model to the base stations, which then forward it to their corresponding users. This process continues for a given number of iterations. Algorithm 1 summarizes these steps, where the quantities given by (3) and (6) are assumed to be shared with all participants at the beginning of the learning process.
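To make the two-level structure of Algorithm 1 concrete, the following schematic sketch mimics one global round in plain Python; the function names, the clipping bound, the weighting by sample counts, and the placement of the DP noise are illustrative assumptions and do not reproduce the exact update rules (7)–(9).

```python
import numpy as np

def clip_by_norm(v, clip_norm):
    """Scale v so that its L2 norm is at most clip_norm (cf. Assumption 1)."""
    return v * min(1.0, clip_norm / (np.linalg.norm(v) + 1e-12))

def one_global_round(global_model, cells, grad_fn, lr, clip_norm, rng):
    """One round of the two-level aggregation.

    cells   : one list per base station, each containing user dicts with keys
              'data', 'n_samples', 'scheduled', and 'sigma' (DP noise std).
    grad_fn : callable (model, data) -> gradient of that user's local loss.
    """
    cell_models, cell_weights = [], []
    for users in cells:                                      # base stations in parallel
        models, weights = [], []
        for u in users:                                      # users in parallel
            if u['scheduled']:
                g = clip_by_norm(grad_fn(global_model, u['data']), clip_norm)
                g = g + rng.normal(0.0, u['sigma'], size=g.shape)  # per-user DP noise
                models.append(global_model - lr * g)               # local model update
                weights.append(u['n_samples'])
        if models:                                            # first level: per-cell aggregation
            cell_models.append(np.average(models, axis=0, weights=weights))
            cell_weights.append(sum(weights))
    # second level: aggregation across base stations at the main server
    return np.average(cell_models, axis=0, weights=cell_weights)
```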
One important difference between Algorithm 1 and other approaches, for example, the FL schemes in references [8, 27], is that here the aggregation is done in two steps. Additionally, the DP noise standard deviations at users are not necessarily identical here and a joint optimal user scheduling and DP noise adjustment is possible.
To characterize the DP noise, we need the following assumption, which can be satisfied in practice by weight clipping [27, 45].
Assumption 1. The gradients of the local loss functions are always upper bounded, that is, there exists a finite constant that the norm of every local gradient never exceeds.
3 CONVERGENCE ANALYSIS
Here, we define the global loss as a function of local losses. We then derive an upper bound on the optimality gap that appears in each round of Algorithm 1.
The following assumptions are necessary to analyse the global loss function and have been used before in the literature [46].
Assumption 2. The loss function has a minimum value, that is, there exists an input vector at which this minimum is attained.
Assumption 3. The gradient $\nabla F(\mathbf{w})$ of the loss function is uniformly $L$-Lipschitz continuous with respect to the model $\mathbf{w}$, that is,
$$\|\nabla F(\mathbf{w}) - \nabla F(\mathbf{w}')\| \;\le\; L\,\|\mathbf{w} - \mathbf{w}'\| \quad \text{for all } \mathbf{w}, \mathbf{w}'.$$
Assumption 4. The loss function $F$ is $\mu$-strongly convex, that is,
$$F(\mathbf{w}) \;\ge\; F(\mathbf{w}') + (\mathbf{w} - \mathbf{w}')^{\mathsf{T}} \nabla F(\mathbf{w}') + \frac{\mu}{2}\,\|\mathbf{w} - \mathbf{w}'\|^{2} \quad \text{for all } \mathbf{w}, \mathbf{w}'.$$
Assumption 5. The loss function $F$ is twice continuously differentiable. Under this assumption, Assumptions 3 and 4 are equivalent to the following:
$$\mu\,\mathbf{I} \;\preceq\; \nabla^{2} F(\mathbf{w}) \;\preceq\; L\,\mathbf{I}.$$
Assumption 6. There exist constants $\zeta_1 \ge 0$ and $\zeta_2 \ge 0$ such that, for any training sample $\mathbf{x}$ and model $\mathbf{w}$, the following inequality holds:
$$\|\nabla f(\mathbf{w}; \mathbf{x})\|^{2} \;\le\; \zeta_1 + \zeta_2\,\|\nabla F(\mathbf{w})\|^{2}.$$
Several widely used loss functions have been provided in reference [46] that satisfy these assumptions, for example, mean squared error or cross-entropy loss functions.
Now, we are ready to derive an upper bound on the optimality gap of Algorithm 1.
Theorem 1. Let Assumptions 2–6 hold. Then, the following upper bound on the optimality gap for Algorithm 1 holds:
Proof. The proof is provided in the Appendix.
Theorem 1 shows that the expected difference between the global loss and the optimal value per iteration is upper bounded by expressions that depend on a few key terms. Hence, by lowering the values of these terms, the convergence of Algorithm 1 should be improved. Two of the terms are influenced by the scheduling variable, since the other quantities are considered to be constant. Moreover, a further term depends on both the scheduling variable and the DP noise standard deviations.
Furthermore, we observe that the upper bound converges only if its contraction coefficient is strictly smaller than one. This is because, if we apply the upper bound in Theorem 1 recursively over the learning rounds, we obtain a geometric coefficient as part of the final upper bound, which converges only under this condition. In Section 4, we design a scheduler and a DP optimizer based on these variables and their effect on this upper bound. To this end, we jointly minimize the relevant terms of the bound and the total privacy leakage given by (12).
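The role of this condition can be illustrated by the generic recursion that such a per-round bound induces; writing the bound schematically with a contraction factor and an offset (the actual constants are those of Theorem 1 and are not restated here), unrolling over the learning rounds gives the following.

```latex
% Schematic unrolling of a per-round bound of the form  G_{t+1} <= A G_t + B,
% where G_t denotes the optimality gap at round t and A, B > 0 are the
% constants induced by the bound of Theorem 1.
\[
  G_{T} \;\le\; A^{T} G_{0} + B \sum_{t=0}^{T-1} A^{t}
        \;=\; A^{T} G_{0} + B\,\frac{1 - A^{T}}{1 - A},
\]
% which remains bounded (approaching B / (1 - A)) as T grows if and only if A < 1.
```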
4 LEARNING OVER WIRELESS CHANNELS WITH INTER-CELL INTERFERENCE
Here, we consider other wireless parameters of the communication system and connect them to the notion of learning. These wireless parameters include resource allocation, transmit power consumption, fading channels, inter-cell interference, and communication rate.
We assume that the users apply an orthogonal frequency-division multiple access (OFDMA) technique in the uplink channel to transmit data to their corresponding base station [47]. Each scheduled edge device is assigned one of the available uplink transmission resource blocks in its cell. For this reason, we assume that the intra-cell interference is mitigated by the scheduler.
The downside of allocating different frequency bands to users of the same cell is that not all edge devices can participate in the learning process. This is because the number of resource blocks and the available bandwidth are limited. However, we show here that we can achieve a sub-optimal learning result by selecting only those users that contribute the most to the learning process.
Figure 2 illustrates an example of a wireless communication system with three base stations in the uplink stage. Here, the signals received on a given resource block at a base station are affected by the interference power originating from the users that occupy the same resource block in the neighbouring cells.
[Figure 2: Uplink inter-cell interference in a system with three base stations.]
We note that the thermal noise is added to the received signal, which is already channel encoded. In contrast, the DP noise is added to the source information before any channel coding is performed. As a result, only the thermal noise appears in (19).
The scheduling and DP noise parameters play a critical role in improving the convergence rate of Algorithm 1 (cf. Theorem 1) and in reducing the total privacy leakage. Furthermore, the uplink transmit power is critical in establishing reliable communication. By minimizing the relevant terms of Theorem 1 with respect to these parameters, the upper bound on the optimality gap decreases and thus the convergence rate of the FL procedure should improve. To this end, it is sufficient to minimize only the expressions inside the squared term of the bound.
Therefore, we propose an optimization problem over these variables to jointly minimize the relevant terms from Theorem 1 and the total privacy leakage given by (12). In this combined formulation, we assume that the remaining FL parameters are constant. The main server can then solve this optimization problem and broadcast the results to all base stations before Algorithm 1 starts.
Minimizing the first term in the objective function in (20) improves the convergence of Algorithm 1; this term is obtained by applying (15) to the summation term of Theorem 1. Minimizing the second term, on the other hand, reduces the total privacy leakage at all users and is given by (12).
In the optimization problem (20), edge devices that have a larger number of samples and a better uplink communication channel generally have a higher chance of being scheduled and of being assigned less DP noise power.
Conditions (23) and (24) provide the resource allocation constraints, whereas (25) and (26) restrict the transmit power to a maximum amount and ensure a minimum communication rate for each user in each cell, respectively. Finally, constraint (22) guarantees an upper bound on the privacy leakage of the users individually due to (11). Here, the constant controls the minimum amount of DP noise at each user.
We note that two of the variables, given by (3) and (14), are related through (15). Therefore, only one of them appears as a minimization variable.
The optimization problem in (20) is not easy to solve. However, we can subdivide it into simpler problems and search for (sub-)optimal solutions. The main server can then compute and broadcast these (sub-)optimal parameters to all base stations, where they can be forwarded to the users. These computations and this initialization should be done prior to the beginning of Algorithm 1.
5 ALGORITHM DESIGN
Here, we propose two sub-optimal sequential algorithms to solve the optimization problem in (20). First, for fixed DP noise the objective function in (20) is minimized with respect to users' transmit powers and resource block allocation in a cell-by-cell manner. In the second part, with given transmit power and resource block allocation, the optimization problem in (20) becomes convex with respect to the DP noise standard deviations.
5.1 Optimal scheduler
To solve (30), we assume that the remaining variables are known and satisfy (23)–(25). We then solve this problem with respect to the resource block allocation of a given cell while treating the allocations of the other cells as constants. By solving this optimization problem for each cell, we obtain a (sub-)optimal scheduling solution for the whole system.
After finding the optimal powers from (35), we can compute the uplink communication rates by using (19). We then unschedule those users whose rates do not satisfy (26) and set their transmit power to zero.
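The following minimal sketch illustrates this post-processing step; the Shannon-type rate with inter-cell interference used below is an assumed stand-in for the rate expression (19), and the input powers are those obtained from (35).

```python
import numpy as np

def enforce_min_rate(p, h, interference, bandwidth, noise_psd, r_min, scheduled):
    """Drop scheduled users whose uplink rate misses the minimum requirement (26).

    p, h, interference, scheduled are per-user arrays (scheduled is boolean);
    the rate model bandwidth * log2(1 + p|h|^2 / (I + N0*bandwidth)) is an
    illustrative stand-in for the rate expression (19).
    """
    noise_power = noise_psd * bandwidth
    sinr = p * np.abs(h) ** 2 / (interference + noise_power)
    rate = bandwidth * np.log2(1.0 + sinr)
    keep = scheduled & (rate >= r_min)
    return np.where(keep, p, 0.0), keep, rate   # powers of dropped users set to zero
```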
ALGORITHM 2. Random scheduler with random differential privacy noise (RndSched).
ALGORITHM 3. Optimal scheduler with random differential privacy noise (OptSched).
1: | Initialize the values of randomly such that they satisfy (21)-(25). |
2: | for do |
3: | For fixed and , obtain a (sub-)optimal resource block allocation matrix by solving the optimization problem in (30). |
4: | end for |
5: | Compute the transmit powers by solving (35) and unschedule those users whose communication rates do not meet (26). |
6: | Output the resulting parameters as a (sub-)optimal solution. |
ALGORITHM 4. Optimal scheduler with differential privacy noise optimizer (OptSched+DP).
Based on these solutions, we propose two procedures for user scheduling and DP noise adjustment. Algorithm 2 presents a random scheduler (RndSched). Algorithm 3 provides an OptSched based on (30). Both algorithms benefit from the power allocation procedure based on (35) and both apply random DP noise to achieve privacy.
We note that one advantage of the OptSched is that it is linear and therefore efficient from a practical point of view compared with (28). Nevertheless, the drawback of this approach is that it is performed sequentially and cell by cell. As a result, there is no guarantee that this approach always provides us with an optimal solution. However, as we will see in Section 6.1, it delivers very good results compared with the randomized scheduler. In the next subsection, we extend this algorithm to include a DP optimizer.
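A minimal sketch of one such per-cell assignment step is given below; it assumes a generic per-user, per-resource-block utility matrix as a stand-in for the cell's contribution to the objective in (30), and it relies on CVXPY with the GLPK mixed-integer backend, in line with the packages cited in Section 6.

```python
import cvxpy as cp
import numpy as np

def schedule_one_cell(utility):
    """Assign at most one resource block per user and one user per block.

    utility[u, r]: benefit (e.g. dependent on data size and channel quality) of
    giving block r to user u in this cell; a stand-in for the cell's term in (30).
    """
    n_users, n_blocks = utility.shape
    q = cp.Variable((n_users, n_blocks), boolean=True)        # scheduling matrix
    constraints = [cp.sum(q, axis=1) <= 1,                    # each user: at most one block
                   cp.sum(q, axis=0) <= 1]                    # each block: at most one user
    problem = cp.Problem(cp.Maximize(cp.sum(cp.multiply(utility, q))), constraints)
    problem.solve(solver=cp.GLPK_MI)                          # linear integer programming backend
    return np.rint(q.value).astype(int)

# Example: six users competing for three resource blocks in one cell.
rng = np.random.default_rng(0)
assignment = schedule_one_cell(rng.uniform(0.0, 1.0, size=(6, 3)))
```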
5.2 DP optimizer
After the solution of (40) is found, the optimal DP noise standard deviations can be computed from (39). We then combine this scheme with the procedure in Section 5.1. A summary of the combined scheme is provided in Algorithm 4 (OptSched+DP).
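A schematic of this convex step is sketched below; the objective is an assumed stand-in for the relevant terms of (20) with fixed scheduling, trading a convergence penalty that grows with the noise variances against a zCDP-style leakage term that decays with them, and the weights, sensitivities, and caps are placeholders rather than the exact expressions in (39) and (40).

```python
import cvxpy as cp
import numpy as np

def optimize_noise_variances(weights, sens_sq, lam, rho_max, v_min):
    """Choose per-user DP noise variances v_i = sigma_i^2 for the scheduled users.

    Stand-in objective: a convergence penalty growing linearly in v_i plus lam
    times a zCDP-style leakage term proportional to sens_sq_i / (2 v_i),
    subject to a per-user leakage cap rho_i <= rho_max and a minimum noise level.
    """
    n = len(weights)
    v = cp.Variable(n, nonneg=True)
    leakage = cp.multiply(sens_sq / 2.0, cp.inv_pos(v))       # rho_i = Delta_i^2 / (2 sigma_i^2)
    objective = cp.Minimize(weights @ v + lam * cp.sum(leakage))
    constraints = [v >= v_min,                                 # minimum DP noise per user
                   v >= sens_sq / (2.0 * rho_max)]             # per-user leakage cap
    cp.Problem(objective, constraints).solve(solver=cp.ECOS)
    return np.sqrt(v.value)                                    # optimal standard deviations

# Example: five scheduled users with unit sensitivity.
sigmas = optimize_noise_variances(np.ones(5), np.ones(5), lam=10.0, rho_max=0.5, v_min=1.0)
```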
6 SIMULATIONS AND NUMERICAL SOLUTIONS
6.1 Optimization problems
Here, we present the numerical solutions of the algorithms that were presented in Section 5. In this regard, we apply the Python optimization packages CVXPY, CVXOPT, GLPK, and ECOS [35-39].
Since Algorithms 3 and 4 are heuristic, their solutions depend on the initial values of the optimizing variables as well as the wireless channels and the number of training samples at each user. As a result, we repeat the computations for several random initial values, channels, and data distributions among the users and then compute the average.
To this end, the scheduling variables are first initialized based on a shuffled round-robin scheme, while the remaining variables are set uniformly at random such that the corresponding constraints hold. Second, the users are positioned in a square area consisting of seven hexagonal cells according to a uniform distribution. The edge devices are then assigned to their nearest base stations according to their random positions. Based on their distances to the base stations, their fading channels are then computed by applying (18).
An example of channel initialization, which is generated by our simulator in Python, is shown in Figure 3. Here, the channels between one of the users and the base stations are depicted as dashed lines. We notice that cells 1–6 in this setting can also cover users outside their own hexagon, while the central cell only covers devices inside the central hexagon. As a result, the effects of both boundary and central cells are taken into account in our simulations.
[Figure 3: Example of user placement and channel initialization with seven hexagonal cells.]
After the users are assigned to their corresponding base stations, the training data is randomly distributed among all users. Inspired by reference [51], the numbers of samples at the users are determined by a lognormal distribution. Algorithms 3 and 4 should then provide us with (sub-)optimal values of their respective parameters.
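The initialization described in this subsection can be sketched as follows; the base station coordinates, the path-loss model standing in for (18), and the lognormal parameters are illustrative placeholders rather than the exact simulation values.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_cells, radius = 100, 7, 500.0            # cf. Table 2

# Hexagonal-like layout: one central base station plus six surrounding ones.
angles = np.linspace(0.0, 2.0 * np.pi, 7)[:-1]
bs_xy = np.vstack([[0.0, 0.0],
                   np.sqrt(3.0) * radius * np.column_stack([np.cos(angles), np.sin(angles)])])

# Users uniform in the covering square, assigned to their nearest base station.
user_xy = rng.uniform(-2.0 * radius, 2.0 * radius, size=(n_users, 2))
dist = np.linalg.norm(user_xy[:, None, :] - bs_xy[None, :, :], axis=2)
serving_bs = dist.argmin(axis=1)
d_serving = dist[np.arange(n_users), serving_bs]

# Illustrative distance-based path loss with Rayleigh fading (stand-in for (18)).
path_loss_db = 128.1 + 37.6 * np.log10(np.maximum(d_serving, 1.0) / 1000.0)
rayleigh = rng.rayleigh(scale=1.0, size=n_users)
channel_gain = rayleigh ** 2 * 10.0 ** (-path_loss_db / 10.0)

# Lognormal split of the 60,000 MNIST training samples among the users (inspired by [51]).
raw = rng.lognormal(mean=0.0, sigma=1.0, size=n_users)
n_samples = np.maximum(1, (60000 * raw / raw.sum()).astype(int))
```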
The system parameters used in the computations are listed in Table 2. Figure 4 shows the results of all algorithms in the form of an empirical cumulative distribution function (CDF) of the normalized objective value in (20). The normalization is done by dividing the value of the objective function by the total number of samples (scheduled or unscheduled). The CDF is computed for two values of the number of available resource blocks and of the optimization constant from (20), and the results are averaged over random channels and initial values. As seen in Figure 4, the OptSched (Algorithm 3) outperforms the RndSched (Algorithm 2) in terms of minimizing the objective value in (20). Moreover, the OptSched+DP (Algorithm 4) further improves on the OptSched by reducing the total privacy leakage. Furthermore, the OptSched+DP achieves lower objective values for the larger choice of the optimization constant than for the smaller one. This is because a larger constant in (20) gives more weight to the DP noise optimization.
System parameters | Values |
---|---|
Number of cells or base stations | 7 |
Total number of users | 100 |
Cell radius | 500 m |
Uplink center frequency | 2450 MHz |
Channels' Rayleigh distribution scale parameter | 1 |
Uplink resource block bandwidth (B) | |
Thermal noise power spectral density | |
Maximum transmit power | |
Minimum communication rate | |
DP noise error upper bound | 12 |
Minimum total DP noise at each user | 100 |
[Figure 4: Empirical CDF of the normalized objective value in (20) for the considered algorithms.]
We also notice that, by increasing the number of available resource blocks from 5 to 8, the normalized objective values of the RndSched get slightly closer to those of the OptSched algorithm. This is because increasing the number of resource blocks while keeping the total number of users constant raises the chances that all users are successfully scheduled by the RndSched. For sufficiently many resource blocks, the RndSched might eventually achieve the same performance as the OptSched. However, selecting a large number of resource blocks for a low number of users is not desirable due to the limited amount of available bandwidth.
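For completeness, the empirical CDF curves reported here are of the usual kind; a minimal sketch of how such a curve can be produced from the collected normalized objective values (the data below is a random placeholder, not a simulation result) is:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data standing in for the collected normalized objective values of one scheme.
values = np.random.default_rng(0).normal(loc=1.0, scale=0.2, size=200)

x = np.sort(values)
cdf = np.arange(1, len(x) + 1) / len(x)   # empirical CDF: fraction of runs with value <= x
plt.step(x, cdf, where='post', label='OptSched (placeholder data)')
plt.xlabel('Normalized objective value')
plt.ylabel('Empirical CDF')
plt.legend()
plt.show()
```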
6.2 Federated learning simulations
Here, we apply the random parameters and as well as (sub-)optimal and from Section 6.1 to an FL system as described in Algorithm 1. We assume that the main server and all users each maintain a fully connected neural network in the form of a multi-label classifier. The networks consist of two hidden layers, each with 256 nodes. To implement the simulations, we apply the TensorFlow, NumPy, and Matplotlib packages [40-42]. Furthermore, we use the Modified National Institute of Standards and Technology (MNIST) image dataset [52] to train and test the multi-label classifier.
We train the local models over a number of communication rounds between the users and the main server. To follow our mathematical model in Section 2, we perform no additional local iterations and use the full-batch gradient descent scheme. We do not apply any decay and use a fixed learning rate.
Furthermore, to guarantee that Assumption 1 holds, the gradients of all weights are clipped so that their global norm does not exceed the prescribed upper bound. This directly affects the amount of privacy leakage as given by (11).
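A condensed sketch of the local model and of one local training step of this kind might look as follows; the architecture matches the description above (two hidden layers with 256 nodes, MNIST classifier), while the learning rate, clipping bound, and DP noise scale shown are placeholders rather than the exact simulation values.

```python
import tensorflow as tf

def build_classifier():
    """Fully connected classifier with two hidden layers of 256 nodes (MNIST)."""
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(10),
    ])

def local_batch_step(model, x, y, lr=0.01, clip_norm=12.0, sigma=1.0):
    """One full-batch gradient step with global-norm clipping and Gaussian DP noise."""
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    grads, _ = tf.clip_by_global_norm(grads, clip_norm)        # enforce Assumption 1
    noisy = [g + tf.random.normal(tf.shape(g), stddev=sigma) for g in grads]
    for var, g in zip(model.trainable_variables, noisy):
        var.assign_sub(lr * g)                                 # plain gradient descent update
    return float(loss)
```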
We perform the simulations over 100 channels and initial values and then average the resulting accuracy and loss. Furthermore, we generate the empirical CDF of the privacy leakage of all users. Figure 5 shows the accuracy, loss, and privacy leakage CDF of this learning system for different numbers of available resource blocks and different values of the optimization constant.
[Figure 5: Accuracy, loss, and empirical CDF of the privacy leakage for the considered algorithms.]
As seen in Figure 5, the OptSched outperforms the RndSched algorithm in terms of accuracy and loss for both considered numbers of resource blocks. Here, the OptSched systematically selects the users that hold large chunks of data, have a better channel, and suffer less from inter-cell interference. The RndSched algorithm, however, fails in this scenario since it applies a random scheduling scheme.
The OptSched+DP, on the other hand, slightly degrades the performance of the OptSched by increasing and optimizing the DP noise. Yet the OptSched+DP provides similar or even better performance than the RndSched scheme for a small number of resource blocks (see Figures 5(a) and 5(c)). This degradation is the price paid for improved privacy. Figures 5(e) and 5(f) show the empirical CDF of the privacy leakage. Here, the privacy leakage at each user and cell is computed by using (11), and the results over all simulation iterations are collected to compute the CDF. The simulations show that the OptSched+DP scheme substantially reduces the amount of privacy leakage at each user. In particular, thanks to the DP optimizer, its maximum privacy leakage is roughly a factor of eight lower than that of the RndSched scheme.
Adjusting the optimization constant is also crucial. In this regard, by choosing a larger value of this constant, the OptSched achieves lower privacy leakage than with a smaller one (see Figures 5(e) and 5(f)). This is because a larger constant gives less weight to the scheduling term in (20), so users with higher DP noise power are preferred in scheduling.
7 CONCLUSION
Here, a privacy preserving FL procedure in a scenario with multiple base stations and inter-cell interference has been considered. An upper bound on the optimality gap of the convergence term of this learning scheme has been derived, and an optimization problem to reduce this upper bound has been provided. We have proposed two sequential algorithms to obtain (sub-)optimal solutions for this optimization task, namely the OptSched in Algorithm 3 and its extended version with a DP optimizer (OptSched+DP) in Algorithm 4. In designing these schemes, we avoid non-linearity in the integer programming problems. The outputs of these algorithms are then applied to an FL system.
Simulation results have shown that the OptSched increases the accuracy of the classification FL system and reduces the loss compared with the RndSched when the number of available resource blocks is small. Here, with a total of 100 users and a small number of resource blocks per cell, the OptSched shows an accuracy improvement of over 6%. Simulations have further shown that the OptSched not only improves the accuracy but also can reduce the privacy leakage compared with the RndSched if the optimization constant is set properly.
The OptSched+DP, on the other hand, further optimizes the DP noise and substantially reduces the privacy leakage compared with both the RndSched and the OptSched. Here, simulations have shown that the OptSched+DP reduces the maximum privacy leakage for both considered numbers of resource blocks by a factor of 8. It is worth mentioning that when the number of resource blocks is small, this improvement is achieved while the OptSched+DP shows similar or even better performance in terms of accuracy and loss compared with the RndSched.
AUTHOR CONTRIBUTIONS
Nima Tavangaran: Conceptualization; formal analysis; methodology; project administration; software; visualization; writing—original draft. Mingzhe Chen: Conceptualization; formal analysis; methodology; resources; validation. Zhaohui Yang: Conceptualization; resources; validation. José Mairton B. Da Silva Jr.: Conceptualization; formal analysis; resources; software; validation. H. Vincent Poor: Funding acquisition; resources; supervision; validation.
ACKNOWLEDGEMENTS
The work of Nima Tavangaran was partly supported by the German Research Foundation (DFG) under Grant TA 1431/1-1. The work of Mingzhe Chen was partly supported by the U.S. National Science Foundation under grant CNS-2312139. The work of José Mairton B. Da Silva Jr. was jointly supported by the European Union's Horizon Europe research and innovation program under the Marie Skłodowska-Curie project FLASH, with grant agreement No. 101067652; the Ericsson Research Foundation; and the Hans Werthén Foundation. The work of H. Vincent Poor was supported by the U.S. National Science Foundation under grants CCF-1908308 and CNS-2128448.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
APPENDIX
Proof of Theorem 1
Proof. It follows by using Assumption 5 and applying the Taylor expansion to the global loss function that
Next, we compute the local updates at each user by combining (4), (5), and (7) as follows
We then obtain the global update at the main server by inserting the value of from (A2) into (A3) as follows
To simplify the rest of the calculations, we define a new random variable that reflects the difference between the global update and the global gradient as follows:
Now by inserting the term (the global update) from (A5) into (A1), we have that
Furthermore, the following identity always holds:
Choosing the learning step size appropriately and applying the identity (A7) to (A6), it follows that
Inspired by reference [8], we first obtain an upper bound on the expectation of the term on the right-hand side of (A8). We have by combining (A4) and (A5) that
Next, by applying Assumption 6 to (A10), we have, for the constants introduced there, that
On the other hand, since the global loss function has a uniformly Lipschitz continuous gradient and is strongly convex (Assumptions 3 and 4), we have [46] that
Next, we insert (A11) into (A8). The proof then follows by using (A12) and (A13) together with the bound on the Hessian from Assumption 5.
Open Research
DATA AVAILABILITY STATEMENT
The dataset used in this study is available at: Y. LeCun, ‘The MNIST database of handwritten digits’, 1998, http://yann.lecun.com/exdb/mnist