Volume 18, Issue 20 p. 1853-1867
ORIGINAL RESEARCH
Open Access

On differential privacy for federated learning in wireless systems with multiple base stations

Nima Tavangaran
Department of Electrical and Computer Engineering, Princeton University, Princeton, New Jersey, USA
Contribution: Conceptualization, Formal analysis, Methodology, Project administration, Software, Visualization, Writing - original draft

Mingzhe Chen
Department of Electrical and Computer Engineering and Institute for Data Science and Computing, University of Miami, Coral Gables, Florida, USA
Contribution: Conceptualization, Formal analysis, Methodology, Resources, Validation

Zhaohui Yang
College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China
Contribution: Conceptualization, Resources, Validation

José Mairton B. Da Silva Jr.
Department of Information Technology, Uppsala University, Uppsala, Sweden
Contribution: Conceptualization, Formal analysis, Resources, Software, Validation

H. Vincent Poor (Corresponding Author)
Department of Electrical and Computer Engineering, Princeton University, Princeton, New Jersey, USA
Correspondence: H. Vincent Poor, Department of Electrical and Computer Engineering, Princeton University, Princeton, New Jersey, USA. Email: [email protected]
Contribution: Funding acquisition, Resources, Supervision, Validation
First published: 17 January 2024
Citations: 2

Abstract

In this work, we consider a federated learning model in a wireless system with multiple base stations and inter-cell interference. We apply a differentially private scheme to transmit information from users to their corresponding base station during the learning phase. We show the convergence behavior of the learning process by deriving an upper bound on its optimality gap. Furthermore, we define an optimization problem to reduce this upper bound and the total privacy leakage. To find locally optimal solutions of this problem, we first propose an algorithm that schedules the resource blocks and users. We then extend this scheme to reduce the total privacy leakage by optimizing the differential privacy artificial noise. We apply the solutions of these two procedures as parameters of a federated learning system in which each user is equipped with a classifier and communication cells typically have fewer resource blocks than users. The simulation results show that our proposed scheduler improves the average accuracy of the predictions compared with a random scheduler. In particular, the results show an improvement of over 6%. Furthermore, its extended version with the noise optimizer significantly reduces the amount of privacy leakage.

1 INTRODUCTION

Machine learning (ML) systems are expected to play an important role in future mobile communication standards [1]. With increasing applications of ML schemes in wireless systems, new technologies are emerging to enhance the performance of such systems. On the other hand, wireless technology itself can also be deployed to enhance ML procedures [2]. Among possible candidates, federated learning (FL) has been shown to have considerable promise [3-5] and has the potential to benefit from wireless communication.

FL solves several issues of centralized ML systems by distributing the learning task among several edge devices. One advantage of using an FL system, which makes it a good fit in a wireless setting, is that edge devices do not need to transmit their local datasets to the server. This reduces the amount of wireless resources required for accomplishing the given ML task. Apart from this, the privacy of each edge device is not completely compromised since the server does not have direct access to the data [6].

FL schemes operating over wireless networks have been extensively researched in recent years; see, for example, references [7-13]. In reference [8], the authors studied the effects of wireless parameters on the FL process. They derived an upper bound on the optimality gap of the convergence terms and proposed an optimization problem to minimize the upper bound by considering wireless parameters like resource allocation, user scheduling, and packet error rate. Some other works that studied the communication aspects of FL are references [14-17]. The work in reference [17] considers a wireless FL and applies client scheduling to minimize the training latency for a given training loss. Moreover, FL with several layers of aggregation or with hierarchy has been studied in references [18-23].

Although the training data of each device in FL are not transmitted to the server, a function of the local model (a query) is still sent to the server. It has been shown that this local model might leak some information about the training data [24]. To mitigate this drawback, FL has been extensively studied together with a privacy preserving scheme called differential privacy (DP) [25].

DP-based schemes follow the principle that one should not be adversely affected much by having one's data used in any analysis [26]. This powerful notion is well established and is applied in industry. To realize a DP-based FL system, each edge device adds artificial noise to the information it transmits. This noise provides a certain amount of privacy depending on the noise power and the sensitivity of the query function.

DP-based FL and its convergence behavior have been extensively studied; see, for example, the works in references [27-34]. In this regard, the work in reference [27] addresses the privacy implementation challenges through a combination of zero-concentrated differential privacy, local gradient perturbation, and secure aggregation. Meanwhile, the work in reference [31] considers resource allocation and user scheduling to minimize the FL training delay under performance and DP constraints. In reference [33], decentralized FL algorithms with DP in the wireless Internet of Things (IoT) have been studied. Finally, reference [34] presents a closed-form global loss and privacy leakage of a DP-based FL system and then minimizes the loss and privacy leakage.

However, none of these works consider a joint learning and resource allocation scheme for DP-based FL that accounts for the effects of inter-cell interference. Here, we adopt the framework in reference [8] and consider a wireless FL system in a multiple-base-station scenario. Additionally, we consider DP noise added to the average of gradients [27] and combine this approach with resource scheduling.

The goal of the FL system here is to train a global model for a given predictor. We introduce an iterative DP-based FL scheme with two levels of aggregation (Algorithm 1) and then derive an upper bound on the optimality gap of its convergence terms (Theorem 1).

We then propose an optimization problem whose goal is to improve the convergence of the upper bound on the optimality gap of Algorithm 1 and simultaneously reduce the total privacy leakage. In this regard, the optimization problem is with respect to certain variables like user and resource scheduling, uplink transmit powers, and the amount of DP noise that is applied by each user to its transmitting information.

Since the proposed optimization problem is a non-linear multi-variable mixed integer programming, we divide it into two simpler schemes.

First, we present a sub-optimal approach to minimize the objective function only with respect to the resource scheduling variables in a sequential manner, that is, cell by cell. This reduces the original problem to a linear integer programming task and substantially simplifies the implementation. We refer to this scheme as the optimal scheduler (OptSched). Since this approach proceeds sequentially from one cell to the next, the optimal transmit powers must be adjusted carefully owing to the effects of inter-cell interference. To tackle this problem, we introduce a procedure to determine the users' optimal transmit powers by solving a simple optimization problem.

Next, we enhance the OptSched scheme by further minimizing the objective function of the proposed optimization problem with respect to the DP noise. This leads us to a convex optimization problem with respect to the DP noise standard deviations. We call this extended scheme the optimal scheduler with DP optimizer (OptSched+DP).

We present all the numerical optimizations and benchmarking results. In this regard, we apply Python optimization packages like CVXPY, CVXOPT, GLPK, and ECOS [35-39]. The numerical results show that our proposed schemes (OptSched and OptSched+DP) reduce the objective function of the optimization task substantially compared with the case in which we randomly allocate the resources and apply the DP noise.

Next, we apply these (sub-)optimal parameters to our iterative learning scheme (Algorithm 1). In this regard, each user is equipped with a fully connected multi-layer neural network as a classifier. Furthermore, we assume that communication cells typically have more users than available resource blocks, which is a reasonable assumption given the limited bandwidth. We then perform simulations to measure the accuracy, loss, and amount of privacy leakage in such a system for the proposed algorithms. To realize the simulations, we apply the TensorFlow, NumPy, and Matplotlib packages [40-42].

The simulations show that the OptSched scheme predominantly improves the classification accuracy by scheduling the users who have larger data chunks and better uplink channels. In our simulation settings, an average accuracy improvement of over 6% is achieved compared with a random scheduler. The OptSched+DP scheme, on the other hand, achieves a significant reduction in the privacy leakage of individual users by systematically adjusting the DP noise power while moderately sacrificing accuracy.

Notation.We denote vectors by lowercase bold letters, for example, w $\bm {w}$ . Matrices are represented by uppercase bold letters like X $\bm {X}$ , or the identity matrix I d $\bm {I}_d$ with d $d$ rows and d $d$ columns. Sets are denoted by calligraphic fonts like X $\mathcal {X}$ . Random mechanisms, as a special kind of function, are represented by Fraktur fonts, for example, M $\mathfrak {M}$ . The transpose of a vector x $\bm {x}$ is denoted by x $\bm {x}^\intercal$ . Logarithms are taken to base 2. The set of real numbers is represented by R $\mathbb {R}$ , and [ R ] $[R]$ denotes the set { 1 , 2 , , R } $\lbrace 1,2,\ldots,R\rbrace$ .

2 SYSTEM MODEL

We begin this section by reviewing some preliminary notions on DP that are required here. The complete list of definitions can be found in references [25, 26, 43]. Table 1 provides a list of major notation that is used throughout this work.

TABLE 1. List of notations.
Notation Description
a s $\bm {a}_s$ Vector of all a s , i $a_{s,i}$ in the cell s $s$
a s , i $a_{s,i}$ Scheduling of the user i U s $i\in \mathcal {U}_s$
c s , i $c_{s,i}$ Communication rate of the user i U s $i\in \mathcal {U}_s$
d $d$ Length (dimension) of models w s , i ( t ) $\bm {w}_{s,i}^{(t)}$ or w ( t ) $\bm {w}^{(t)}$
d s , i $d_{s,i}$ Distance of the user i U s $i\in \mathcal {U}_s$ to its base station
f ( w ( t ) ) $f(\bm {w}^{(t)})$ Global loss function at round t $t$
h s , i $h_{s,i}$ Channel between the user i U s $i\in \mathcal {U}_s$ and its station s $s$
I s ( n ) ( s ) $I_{s}^{(n)}(\tilde{s})$ Uplink interference power from the cell s $\tilde{s}$ at s $s$
K $K$ Total number of all samples
K a $K_{\mathrm{a}}$ Total number of scheduled samples
K s , i $K_{s,i}$ Number of data samples (rows) of X s , i $\bm {X}_{s,i}$
l ( w ( t ) , x s , i ( k ) ) $l(\bm {w}^{(t)}, \bm {x}_{s,i}^{(k)})$ Local loss function at the user i U s $i\in \mathcal {U}_s$
M $M$ Upper bound on the gradients of local losses
M s , i ( t ) ( X s , i ) $\mathfrak {M}_{s,i}^{(t)}(\bm {X}_{s,i})$ Mechanism at the user i U s $i\in \mathcal {U}_s$ at round t $t$
n s , i ( t ) $\bm {n}^{(t)}_{s,i}$ DP noise at the user i U s $i\in \mathcal {U}_s$ at round t $t$
p s $\bm {p}_s$ Vector of all p s , i $p_{s,i}$ in the cell s $s$
p s , i $p_{s,i}$ Transmit power at the user i U s $i\in \mathcal {U}_s$
q s , i ( t ) ( X s , i ) $q_{s,i}^{(t)}(\bm {X}_{s,i})$ Query function at the user i U s $i\in \mathcal {U}_s$ at round t $t$
R $R$ Number of available resource blocks in each cell
R s $\bm {R}_s$ Matrix of all r s , i ( n ) $r_{s,i}^{(n)}$ in the cell s $s$
r s , i ( n ) $r_{s,i}^{(n)}$ Scheduling of resource block n $n$ for i U s $i\in \mathcal {U}_s$
T $T$ Number of learning rounds (iterations)
U s $\mathcal {U}_s$ Set of users in the cell s S $s\in \mathcal {S}$
w ( t ) $\bm {w}^{(t)}$ Global model at round t $t$
w s , i ( t ) $\bm {w}_{s,i}^{(t)}$ Local model of the user i U s $i\in \mathcal {U}_s$ at round t $t$
X s , i $\bm {X}_{s,i}$ Database belonging to the user i U s $i\in \mathcal {U}_s$
x s , i ( k ) $\bm {x}_{s,i}^{(k)}$ k $k$ th data sample (row) of X s , i $\bm {X}_{s,i}$
γ $\gamma$ Optimization constant
λ $\lambda$ Learning step size
ρ s , i $\rho _{s,i}$ Privacy leakage at the user i U s $i\in \mathcal {U}_s$
σ s $\bm {\sigma }_s$ Vector of all σ s , i $\sigma _{s,i}$ in the cell s $s$
σ s , i $\sigma _{s,i}$ DP noise standard deviation at the user i U s $i\in \mathcal {U}_s$

2.1 Differential privacy model

Let a data universe X $\mathcal {X}$ and the distribution P X $P_X$ on it be given. Assume that a database is denoted by a matrix X X K × m $\bm {X}\in \mathcal {X}^{K\times m}$ and contains K $K$ rows of independent and identically distributed (i.i.d.) m $m$ -dimensional samples (row vectors). Two databases X , X X K × m $\bm {X},\tilde{\bm {X}}\in \mathcal {X}^{K\times m}$ are called adjacent if they differ only in one row.

A query (mechanism) q : X K × m R d $q:\mathcal {X}^{K\times m}\rightarrow \mathbb {R}^d$ is a function which takes a database X X K × m $\bm {X}\in \mathcal {X}^{K\times m}$ as input and gives a d $d$ -dimensional output. If the output of the query contains randomness then it is called a randomized mechanism.

In the following, we introduce the notion of privacy for randomized mechanisms, which are defined on a given set of databases X K × m $\mathcal {X}^{K\times m}$ .

Definition 1.A randomized mechanism M : X K × m R d $\mathfrak {M}:\mathcal {X}^{K\times m}\rightarrow \mathbb {R}^d$ is said to be ( ε , δ ) $(\epsilon,\delta)$ -differentially private, or for short ( ε , δ ) $(\epsilon,\delta)$ -DP, if for every adjacent X , X X K × m $\bm {X},\tilde{\bm {X}}\in \mathcal {X}^{K\times m}$ , we have that

Pr M ( X ) W e ε Pr M ( X ) W + δ $$\begin{align} \mathrm{Pr}{\left(\mathfrak {M}(\bm {X})\in \mathcal {W}\right)}\le \mathrm{e}^{\epsilon }\,\mathrm{Pr}{\left(\mathfrak {M}(\tilde{\bm {X}})\in \mathcal {W}\right)}+\delta \end{align}$$ (1)
holds for any W R d $\mathcal {W}\subset \mathbb {R}^d$ .

Here, we apply a relaxed version of the ( ε , δ ) $(\epsilon,\delta)$ -DP that is more suitable for Gaussian mechanisms.

Definition 2.A randomized mechanism M : X K × m R d $\mathfrak {M}:\mathcal {X}^{K\times m}\rightarrow \mathbb {R}^d$ is said to be ρ $\rho$ -zero-concentrated differentially private (CDP), or for short ρ $\rho$ -zCDP, if

D α M ( X ) M ( X ) ρ α $$\begin{align} \mathrm{D}_{\alpha }{\left(\mathfrak {M}(\bm {X})\,\Vert \,\mathfrak {M}(\tilde{\bm {X}})\right)}\le \rho \alpha \end{align}$$ (2)
holds for every adjacent X , X X K × m $\bm {X},\tilde{\bm {X}}\in \mathcal {X}^{K\times m}$ and all α ( 1 , ) $\alpha \in (1,\infty)$ , where D α $\mathrm{D}_{\alpha }$ is the α $\alpha$ -Rényi divergence [43].

2.2 Federated learning model

Based on the notions from the previous section, we introduce our privacy preserving FL model for a system with multiple base stations. Let a collection of base stations denoted by the set S $\mathcal {S}$ be given such that they can communicate with each other through a main server. Assume that each base station s S $s\in \mathcal {S}$ serves a set of edge devices (users) denoted by U s $\mathcal {U}_s$ , where the users in U s $\mathcal {U}_s$ have some arbitrary order. Let U s $U_s$ denote the size of this set.

We assume that each user i U s $i\in \mathcal {U}_s$ assigned to the base station s $s$ has access to a database
X s , i : = x s , i ( 1 ) , x s , i ( 2 ) , , x s , i ( K s , i ) R K s , i × m , $$\begin{align*} \bm {X}_{s,i}&:= {\left(\bm {x}_{s,i}^{(1)},\bm {x}_{s,i}^{(2)},\ldots,\bm {x}_{s,i}^{(K_{s,i})}\right)}^\intercal \in \mathbb {R}^{K_{s,i}\times m}, \end{align*}$$
where K s , i $K_{s,i}$ is the number of samples (row vectors) in the database X s , i $\bm {X}_{s,i}$ . Each row of the above matrix, say x s , i ( k ) $\bm {x}_{s,i}^{(k)}$ , is an m $m$ -dimensional data sample given by
x s , i ( k ) : = x s , i ( k ) ( 1 ) , x s , i ( k ) ( 2 ) , , x s , i ( k ) ( m 1 ) , y s , i ( k ) , $$\begin{align*} \bm {x}_{s,i}^{(k)}:= {\left(x_{s,i}^{(k)}(1),x_{s,i}^{(k)}(2),\ldots,x_{s,i}^{(k)}(m-1),y_{s,i}^{(k)}\right)}, \end{align*}$$
where the first m 1 $m-1$ elements are the inputs and the last entry y s , i ( k ) $y_{s,i}^{(k)}$ is the output of the training data.
In the first step of the FL scheme at round t = 1 $t=1$ , the main server broadcasts a weight vector w ( t ) R d $\bm {w}^{(t)}\in \mathbb {R}^d$ to all base stations. This vector is called the global model and can be initialized randomly. Then, each base station s $s$ transmits this model to all of its edge devices. Let only a subset of the users in each cell be active and participate in the learning process. We denote the active users in cell s $s$ by:
a s : = ( a s , i ) i U s { 0 , 1 } U s , $$\begin{align} \bm {a}_s:= (a_{s,i})_{i\in \mathcal {U}_{s}}\in \lbrace 0,1\rbrace ^{U_{s}}, \end{align}$$ (3)
where a s , i = 1 $a_{s,i}=1$ indicates that user i U s $i\in \mathcal {U}_s$ is scheduled [8] to participate in the learning process and a s , i = 0 $a_{s,i}=0$ , otherwise.

Figure 1 shows an example of a model with multiple base stations. In this example, all users of cell s $s$ (depicted on the bottom of the figure), are scheduled to participate in the FL process and receive the vector w ( t ) $\bm {w}^{(t)}$ .

FIGURE 1. Federated learning model with multiple base stations.
Each scheduled user computes a local loss function depending on the ML algorithm that is applied in the system. We denote the loss function of a user i U s $i\in \mathcal {U}_s$ by l ( w ( t ) , x s , i ( k ) ) $l(\bm {w}^{(t)}, \bm {x}_{s,i}^{(k)})$ , which is a function of the global model and its training sample. Next, this user computes the gradient [4, 44] of its loss function over all given samples as a query function
q s , i ( t ) ( X s , i ) : = 1 K s , i k = 1 K s , i l ( w ( t ) , x s , i ( k ) ) , $$\begin{align} q_{s,i}^{(t)}(\bm {X}_{s,i}):= \frac{1}{K_{s,i}}\sum _{k=1}^{K_{s,i}}\nabla l(\bm {w}^{(t)}, \bm {x}_{s,i}^{(k)}), \end{align}$$ (4)
where the gradients are with respect to w ( t ) $\bm {w}^{(t)}$ .
We consider a privacy model similar to that in reference [27], in which the user applies Gaussian noise n s , i ( t ) N ( 0 , σ s , i 2 I d ) $\bm {n}^{(t)}_{s,i}\sim \mathcal {N}(\bm {0},\sigma _{s,i}^2\bm {I}_d)$ to the outcome of the query to implement the randomized mechanism M s , i ( t ) $\mathfrak {M}_{s,i}^{(t)}$ as follows
M s , i ( t ) ( X s , i ) : = q s , i ( t ) ( X s , i ) + n s , i ( t ) . $$\begin{align} \mathfrak {M}_{s,i}^{(t)}(\bm {X}_{s,i}):= q_{s,i}^{(t)}(\bm {X}_{s,i})+\bm {n}^{(t)}_{s,i}. \end{align}$$ (5)
In this context, n s , i ( t ) $\bm {n}^{(t)}_{s,i}$ is assumed to be independent of all other random variables in our model, including the DP noise applied in previous iterations. The main reason for applying Gaussian noise is that it yields tight bounds when used with zCDP [43]. The following vector denotes the noise standard deviations of the users in the cell s $s$ :
σ s : = ( σ s , i ) i U s . $$\begin{align} \bm{\sigma }_s:= (\sigma _{s,i})_{i\in \mathcal {U}_s}. \end{align}$$ (6)
Next, the edge device i U s $i\in \mathcal {U}_s$ updates its local model by
w s , i ( t + 1 ) : = w ( t ) λ M s , i ( t ) ( X s , i ) , $$\begin{align} \bm {w}_{s,i}^{(t+1)}:= \bm {w}^{(t)}-\lambda \mathfrak {M}_{s,i}^{(t)}(\bm {X}_{s,i}), \end{align}$$ (7)
where λ > 0 $\lambda >0$ is the learning step size. It then transmits its updated model w s , i ( t + 1 ) $\bm {w}_{s,i}^{(t+1)}$ to its corresponding base station s $s$ .
In the next step, base station s $s$ aggregates all received updated models w s , i ( t + 1 ) $\bm {w}_{s,i}^{(t+1)}$ as given below:
w s ( t + 1 ) : = 1 i U s K s , i a s , i i U s K s , i a s , i w s , i ( t + 1 ) , $$\begin{align} \bm {w}_s^{(t+1)}:= \frac{1}{\sum _{i\in \mathcal {U}_s}K_{s,i}a_{s,i}}\sum _{i\in \mathcal {U}_s}K_{s,i}a_{s,i}\bm {w}_{s,i}^{(t+1)}, \end{align}$$ (8)
where a s , i $a_{s,i}$ is the scheduling parameter and was defined as an element of the vector a s $\bm {a}_s$ in Equation (3).
Consequently, all base stations send their aggregated models to the main server. There, the global model at round t + 1 $t+1$ is computed as follows
w ( t + 1 ) : = 1 K a s S i U s K s , i a s , i w s ( t + 1 ) , $$\begin{align} \bm {w}^{(t+1)}:= \frac{1}{K_{\mathrm{a}}}\sum _{s\in \mathcal {S}}{\left(\sum _{i\in \mathcal {U}_s}K_{s,i}a_{s,i}\right)}\bm {w}_s^{(t+1)}, \end{align}$$ (9)
where
K a : = s S i U s K s , i a s , i $$\begin{align} K_{\mathrm{a}}:= \sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}K_{s,i}a_{s,i} \end{align}$$ (10)
is the total number of training samples of all scheduled users.

ALGORITHM 1. Privacy preserving federated learning with multiple stations.

1: The main server broadcasts ( a s , σ s ) s S $(\bm {a}_s,\bm{\sigma }_s)_{s\in \mathcal {S}}$ , which are given by (3) and (6), to all base stations and their users.
2: The main server initializes the global model w ( 0 ) $\bm {w}^{(0)}$ .
3: for t = 0 : T $t=0:T$ do
4: The main server broadcasts w ( t ) $\bm {w}^{(t)}$ to all base stations.
5: for base stations s S $s\in \mathcal {S}$ in parallel do
6: Base station s $s$ broadcasts w ( t ) $\bm {w}^{(t)}$ to all its users.
7: for users i U s $i\in \mathcal {U}_s$ in parallel do
8: if a s , i = 1 $a_{s,i}=1$ then
9: The user i U s $i\in \mathcal {U}_s$ updates its model as in (7).
10: The user i U s $i\in \mathcal {U}_s$ then sends w s , i ( t + 1 ) $\bm {w}_{s,i}^{(t+1)}$ back to the base station s $s$ .
11: end if
12: end for
13: The base station s $s$ aggregates the received models as in (8).
14: The base station s $s$ then sends w s ( t + 1 ) $\bm {w}_s^{(t+1)}$ back to the main server.
15: end for
16: The main server aggregates all models as in (9).
17: end for

Next, the main server broadcasts the new global model w ( t + 1 ) $\bm {w}^{(t+1)}$ to the base stations where it is then forwarded further to their corresponding users. This process continues for a given number of T $T$ iterations. Algorithm 1 summarizes these steps, where ( a s , σ s ) s S $(\bm {a}_s,\bm{\sigma }_s)_{s\in \mathcal {S}}$ are assumed to be shared with all participants at the beginning of the learning process.
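For illustration, the following is a minimal NumPy sketch of one round of Algorithm 1, that is, the noisy local update in (5) and (7), the per-cell aggregation (8), and the global aggregation (9). The toy quadratic losses, the user dictionaries, and all numerical values are hypothetical and only indicate the data flow; gradient clipping and the wireless transmission are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(w, grad_avg, sigma, lam):
    """Noisy local step, cf. (4), (5), and (7): w_{s,i} = w - lam * (q + n)."""
    noise = rng.normal(0.0, sigma, size=w.shape)
    return w - lam * (grad_avg + noise)

def fl_round(w, cells, lam):
    """One round of Algorithm 1: local updates, cell aggregation (8), global model (9)."""
    cell_sums, cell_K = [], []
    for users in cells:                              # one list of user dicts per cell
        w_sum, K_sum = np.zeros_like(w), 0
        for u in users:
            if u["a"] == 1:                          # only scheduled users participate
                q = u["grad_fn"](w)                  # query (4): averaged local gradient
                w_sum += u["K"] * local_update(w, q, u["sigma"], lam)
                K_sum += u["K"]
        if K_sum > 0:
            cell_sums.append(w_sum)                  # equals K_s * w_s^{(t+1)}
            cell_K.append(K_sum)
    return sum(cell_sums) / sum(cell_K)              # global aggregation (9)

def make_user(K, sigma, target):
    """Hypothetical user whose local loss is 0.5 * ||w - target||^2."""
    return {"K": K, "sigma": sigma, "a": 1, "grad_fn": lambda w: w - target}

# Toy run: d = 3, two cells, three users in total.
d, lam = 3, 0.1
cells = [[make_user(100, 0.05, np.ones(d)), make_user(50, 0.10, np.zeros(d))],
         [make_user(200, 0.05, 2.0 * np.ones(d))]]
w = np.zeros(d)
for t in range(50):
    w = fl_round(w, cells, lam)
print(w)
```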

One important difference between Algorithm 1 and other approaches, for example, the FL schemes in references [8, 27], is that here the aggregation is done in two steps. Additionally, the DP noise standard deviations σ s , i $\sigma _{s,i}$ at users are not necessarily identical here and a joint optimal user scheduling and DP noise adjustment is possible.

To characterize the DP noise, we need to make the following assumption, which can be achieved in practice by weight clipping [27, 45].

Assumption 1.The gradients of the local loss functions are always upper bounded, that is,

l ( w , x ) 2 M $$\begin{align*} \Vert \nabla l(\bm {w}, \bm {x})\Vert _2 \le M \end{align*}$$
holds for any training sample x $\bm {x}$ and model w R d $\bm {w}\in \mathbb {R}^d$ .

In reference [27], it was shown that if Assumption 1 holds, then after T $T$ iterations, a mechanism like M s , i ( t ) ( X s , i ) $\mathfrak {M}_{s,i}^{(t)}(\bm {X}_{s,i})$ is ρ s , i $\rho _{s,i}$ -zCDP where
ρ s , i = 2 T M K s , i σ s , i 2 $$\begin{align} \rho _{s,i} = 2T{\left(\frac{M}{K_{s,i}\sigma _{s,i}}\right)}^2 \end{align}$$ (11)
is the privacy leakage.
Next, we define the total privacy leakage of the whole system as follows:
2 T M 2 s S i U s 1 K s , i σ s , i 2 a s , i . $$\begin{align} 2TM^2\sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}{\left(\frac{1}{K_{s,i}\sigma _{s,i}}\right)}^2 a_{s,i}. \end{align}$$ (12)
In this definition, the summands in (12) are computed by multiplying the privacy leakage ρ s , i $\rho _{s,i}$ of each user, given by (11), by its scheduling variable a s , i $a_{s,i}$ . As a result, we consider the privacy leakage of only the scheduled users and ignore non-scheduled edge devices.
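As a brief illustration of (11) and (12), the following is a minimal Python (NumPy) sketch; the number of rounds, the gradient bound, and the per-user values are hypothetical.

```python
import numpy as np

def per_user_leakage(T, M, K, sigma):
    """zCDP leakage of one user, cf. (11): rho = 2*T*(M / (K * sigma))^2."""
    return 2.0 * T * (M / (K * sigma)) ** 2

def total_leakage(T, M, K, sigma, a):
    """Total leakage, cf. (12): per-user leakages summed over scheduled users only."""
    return np.sum(per_user_leakage(T, M, K, sigma) * a)

# Hypothetical example with four users (arrays run over all users of all cells).
T, M = 60, 1.0                              # learning rounds, gradient-norm bound
K = np.array([120, 300, 80, 500])           # samples per user, K_{s,i}
sigma = np.array([0.9, 0.4, 1.2, 0.3])      # DP noise standard deviations sigma_{s,i}
a = np.array([1, 1, 0, 1])                  # scheduling variables a_{s,i}
print(total_leakage(T, M, K, sigma, a))
```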

3 CONVERGENCE ANALYSIS

Here, we define the global loss as a function of local losses. We then derive an upper bound on the optimality gap that appears in each round of Algorithm 1.

The global loss function is computed over all base stations and is given by
f ( w ( t ) ) : = 1 K s S i U s k = 1 K s , i l ( w ( t ) , x s , i ( k ) ) , $$\begin{align} f(\bm {w}^{(t)}):= \frac{1}{K}\sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s} \sum _{k=1}^{K_{s,i}}l(\bm {w}^{(t)}, \bm {x}_{s,i}^{(k)}), \end{align}$$ (13)
where K = s S i U s K s , i $K=\sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}K_{s,i}$ is the total number of samples (scheduled and non-scheduled).

The following assumptions are necessary to analyse the global loss function f $f$ and have been used before in the literature [46].

Assumption 2.The loss function f : R d R $f:\mathbb {R}^d\rightarrow \mathbb {R}$ has a minimum value, that is, there exists an input vector w = arg min w R d ( f ( w ) ) $\bm {w}^*=\arg \min _{\bm {w}\in \mathbb {R}^d}(f(\bm {w}))$ .

Assumption 3.The gradient f ( w ) $\nabla f (\bm {w})$ is uniformly L $L$ -Lipschitz continuous with respect to the model w $\bm {w}$ , that is,

f ( w ) f ( w ) 2 L w w 2 , for all w , w R d . $$\begin{align*} \Vert \nabla f(\bm {w})-\nabla f(\bm {w}^{\prime })\Vert _2\le L\Vert \bm {w}-\bm {w}^{\prime }\Vert _2,\quad \text{ for all }\bm {w},\bm {w}^{\prime }\in \mathbb {R}^d. \end{align*}$$

Assumption 4.The loss function f : R d R $f:\mathbb {R}^d\rightarrow \mathbb {R}$ is μ $\mu$ -strongly convex, that is,

f ( w ) f ( w ) + ( w w ) f ( w ) + 1 2 μ w w 2 2 $$\begin{align*} f(\bm {w})\ge f(\bm {w}^{\prime })+(\bm {w}-\bm {w}^{\prime })^\intercal \nabla f(\bm {w}^{\prime })+\frac{1}{2}\mu \Vert \bm {w}-\bm {w}^{\prime }\Vert _2^2 \end{align*}$$
holds for all w , w R d $\bm {w},\bm {w}^{\prime }\in \mathbb {R}^d$ .

Assumption 5.The loss function f : R d R $f:\mathbb {R}^d\rightarrow \mathbb {R}$ is twice continuously differentiable. Then, Assumptions 3 and 4 are equivalent to the following:

μ I d 2 f ( w ) L I d , for all w R d . $$\begin{align*} \mu \bm {I}_d\preceq \nabla ^2 f(\bm {w})\preceq L\bm {I}_d,\qquad \text{ for all }\bm {w}\in \mathbb {R}^d. \end{align*}$$

Assumption 6.There exist constants ξ 1 0 $\xi _1\ge 0$ and ξ 2 1 $\xi _2\ge 1$ , such that for any training sample x $\bm {x}$ and model w R d $\bm {w}\in \mathbb {R}^d$ , the following inequality holds

l ( w , x ) 2 2 ξ 1 + ξ 2 f ( w ) 2 2 . $$\begin{align*} \Vert \nabla l(\bm {w}, \bm {x})\Vert _2^2\le \xi _1+\xi _2\Vert \nabla f(\bm {w})\Vert _2^2. \end{align*}$$

Several widely used loss functions have been provided in reference [46] that satisfy these assumptions, for example, mean squared error or cross-entropy loss functions.

Now, we are ready to derive an upper bound on the optimality gap of Algorithm 1.

Theorem 1.Let Assumptions 2–6 hold. Then, the following upper bound on the optimality gap for Algorithm 1 holds:

E f ( w ( t + 1 ) ) f ( w ) C 1 E f ( w ( t ) ) f ( w ) + C 2 + C 3 , $$\begin{align*} \mathbb {E}{\left[f(\bm {w}^{(t+1)})-f(\bm {w}^*)\right]}\le C_1\mathbb {E}{\left[f(\bm {w}^{(t)})-f(\bm {w}^*)\right]}+C_2 +C_3, \end{align*}$$
where the expectation is taken over the DP noise and
C 1 = 1 μ L + 4 ξ 2 K 2 s S i U s K s , i 1 a s , i 2 , C 2 = 2 ξ 1 L K 2 s S i U s K s , i 1 a s , i 2 , C 3 = d 2 L s S i U s K s , i a s , i K a σ s , i 2 . $$\begin{align*} C_1&=1-\frac{\mu }{L}+\frac{4\xi _2}{K^2}{\left(\sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}K_{s,i} {\left(1-a_{s,i}\right)}\right)}^2,\nonumber \\ C_2&=\frac{2\xi _1}{LK^2} {\left(\sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}K_{s,i} {\left(1-a_{s,i}\right)}\right)}^2,\nonumber \\ C_3&=\frac{d}{2L}\sum _{s\in \mathcal {S}} \sum _{i\in \mathcal {U}_s}{\left(\frac{K_{s,i}a_{s,i}}{K_{\mathrm{a}}}\sigma _{s,i}\right)}^2. \end{align*}$$

Proof.The proof is provided in the Appendix. $\Box$

Theorem 1 shows that the expected difference between the global loss and the optimal value f ( w ) $f(\bm {w}^*)$ per iteration is upper bounded by expressions that depend on C 1 $C_1$ , C 2 $C_2$ , and C 3 $C_3$ . Hence, by lowering the values of C 1 $C_1$ , C 2 $C_2$ , and C 3 $C_3$ , the convergence of Algorithm 1 should be improved. The terms C 1 $C_1$ and C 2 $C_2$ are influenced by the scheduling variable a s , i $a_{s,i}$ since other terms are considered to be constant. Moreover, the term C 3 $C_3$ depends on both the scheduling variable a s , i $a_{s,i}$ and the DP noise standard deviations σ s , i $\sigma _{s,i}$ .

Furthermore, we observe that the upper bound converges only if C 1 < 1 $C_1<1$ . This is because, if we apply the upper bound in Theorem 1 recursively for t $t$ rounds, then we obtain a coefficient C 1 t $C_1^t$ as part of the final upper bound, which converges if C 1 < 1 $C_1<1$ . In Section 4, we design a scheduler and a DP optimizer based on these variables and their effect on this upper bound. To this end, we jointly minimize the values of C 1 $C_1$ and the total privacy leakage given by (12).
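For illustration, a short sketch (with hypothetical values for µ, L, ξ1, ξ2, and d) that evaluates the constants C1, C2, and C3 of Theorem 1 for a given schedule and noise assignment is:

```python
import numpy as np

def theorem1_constants(mu, L, xi1, xi2, d, K_si, a_si, sigma_si):
    """C1, C2, C3 of Theorem 1; K_si, a_si, sigma_si are flat arrays over all users."""
    K = K_si.sum()                                   # total number of samples
    K_a = (K_si * a_si).sum()                        # scheduled samples, cf. (10)
    unsched = (K_si * (1 - a_si)).sum()              # samples of unscheduled users
    C1 = 1 - mu / L + 4 * xi2 / K**2 * unsched**2
    C2 = 2 * xi1 / (L * K**2) * unsched**2
    C3 = d / (2 * L) * np.sum((K_si * a_si / K_a * sigma_si) ** 2)
    return C1, C2, C3

# Hypothetical values; C1 < 1 is required for the recursive bound to contract.
K_si = np.array([120, 300, 80, 500])
a_si = np.array([1, 1, 0, 1])
sigma_si = np.array([0.9, 0.4, 1.2, 0.3])
print(theorem1_constants(mu=0.5, L=2.0, xi1=0.1, xi2=1.0, d=10,
                         K_si=K_si, a_si=a_si, sigma_si=sigma_si))
```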

4 LEARNING OVER WIRELESS CHANNELS WITH INTER-CELL INTERFERENCE

Here, we consider other wireless parameters of the communication system and connect them to the notion of learning. These wireless parameters include resource allocation, transmit power consumption, fading channels, inter-cell interference, and communication rate.

We assume that the users apply an orthogonal frequency-division multiple access (OFDMA) technique in the uplink channel to transmit data to their corresponding base station [47]. Each edge device i U s $i\in \mathcal {U}_s$ is assigned a resource block indexed by n [ R ] $n\in [R]$ , where R $R$ is the total number of available uplink transmission resource blocks in each cell. For this reason, we assume that the intra-cell interference is mitigated by the scheduler.

The downside of allocating different frequency bands to users of the same cell is that not all edge devices can participate in the learning process. This is because the number of resource blocks and the available bandwidth are limited. However, we show here that we can achieve a sub-optimal learning result by selecting only those users that contribute the most to the learning process.

We define the uplink resource allocation matrix R s $\bm {R}_s$ in a given cell s $s$ as
R s : = r s , i ( 1 ) , r s , i ( 2 ) , , r s , i ( R ) i U s , $$\begin{align} \bm {R}_s:= {\left(r_{s,i}^{(1)}, r_{s,i}^{(2)}, \dots, r_{s,i}^{(R)}\right)}_{i\in \mathcal {U}_{s}}, \end{align}$$ (14)
where r s , i ( n ) { 0 , 1 } $r_{s,i}^{(n)}\in \lbrace 0,1\rbrace$ . Each row of this matrix represents the resource allocation for a user i U s $i\in \mathcal {U}_s$ . Here, r s , i ( n ) = 1 $r_{s,i}^{(n)}=1$ indicates that the edge device i U s $i\in \mathcal {U}_s$ uses resource block n $n$ in the uplink transmission and r s , i ( n ) = 0 $r_{s,i}^{(n)}=0$ , otherwise. Moreover, we assume that each active user ( a s , i = 1 $a_{s,i}=1$ ) is assigned only one resource block and inactive users ( a s , i = 0 $a_{s,i}=0$ ) are not assigned any resource block at all, that is,
n = 1 R r s , i ( n ) = a s , i , i U s , s S . $$\begin{align} \sum _{n=1}^{R} r_{s,i}^{(n)}=a_{s,i}, \forall i\in \mathcal {U}_s,s\in \mathcal {S}. \end{align}$$ (15)
In addition, edge devices in a given cell s $s$ do not interfere with each other, that is,
i U s r s , i ( n ) 1 , n [ R ] , s S . $$\begin{align} \sum _{i\in \mathcal {U}_s}r_{s,i}^{(n)} \le 1,\forall n\in [R],s\in \mathcal {S}. \end{align}$$ (16)
To formulate the communication rate, we first need to define the transmit powers of the users. Let the uplink transmit power vector of all edge devices at a given cell s S $s\in \mathcal {S}$ be denoted by
p s : = ( p s , i ) i U s , $$\begin{align*} \bm {p}_s:= (p_{s,i})_{i\in \mathcal {U}_s}, \end{align*}$$
where p s , i $p_{s,i}$ denotes the transmit power of the user i U s $i\in \mathcal {U}_s$ . Moreover, the maximum transmit power of each user in any cell is denoted by P max $P_{\max }$ .
Another wireless parameter, which is of great importance in the considered system with multiple base stations, is the inter-cell interference. Let I s ( n ) ( s ) $I_{s}^{(n)}(\tilde{s})$ denote the interference signal power [48] from the cell s S { s } $\tilde{s}\in \mathcal {S}\backslash \lbrace s\rbrace$ that affects the uplink signal received by the base station s $s$ on the resource block n $n$ . Here, inequality (16) implies that I s ( n ) ( s ) $I_{s}^{(n)}(\tilde{s})$ depends on the transmit power of at most one user in the cell s $\tilde{s}$ , namely the one transmitting on the resource block n $n$ . In other words, the received interference signal power can be formulated as
I s ( n ) ( s ) = i U s h s , i r s , i ( n ) p s , i . $$\begin{align} I_{s}^{(n)}(\tilde{s}) = \sum _{i\in \mathcal {U}_{\tilde{s}}} h_{s,i} r_{\tilde{s},i}^{(n)}p_{\tilde{s},i}. \end{align}$$ (17)
The term h s , i $h_{s,i}$ in (17) is the channel gain between the user i U s $i\in \mathcal {U}_{\tilde{s}}$ and the base station s $s$ and can be computed by determining the path loss [49]. The channel gain is given by
h s , i = l s , i 2 c 4 π f c 2 1 d s , i 3 , $$\begin{align} h_{s,i} =l_{s,i}^2{\left(\frac{c}{4\pi f_{\mathrm{c}}}\right)}^2{\left(\frac{1}{d_{s,i}}\right)}^3, \end{align}$$ (18)
where f c $f_{\mathrm{c}}$ is the uplink center frequency, d s , i $d_{s,i}$ is the distance between the user i U s $i\in \mathcal {U}_{\tilde{s}}$ and base station s $s$ , l s , i $l_{s,i}$ is a Rayleigh-distributed fading coefficient with unit scale parameter, and c $c$ is the speed of light.

Figure 2 illustrates an example of a wireless communication system with three base stations s , s $s,s^{\prime }$ , and s $s^{\prime \prime }$ in the uplink stage. Here, the received signals on the resource block n $n$ at the base station s $s^{\prime }$ are affected by the interference signal power I s ( n ) ( s ) $I_{s^{\prime }}^{(n)}(s)$ from cell s $s$ . Furthermore, base station s $s^{\prime \prime }$ is affected by the interference signal power I s ( n ) ( s ) $I_{s^{\prime \prime }}^{(n)}(s^{\prime })$ from cell s $s^{\prime }$ .

FIGURE 2. Uplink stage of the wireless federated learning model with multiple base stations and inter-cell interference.
Let the uplink fading channel between each user i U s $i\in \mathcal {U}_s$ and its corresponding base station s $s$ be fixed and equal to h s , i $h_{s,i}$ . Also assume that the uplink bandwidth is denoted by B $B$ . Furthermore, all participants are assumed to have perfect channel knowledge. It is known (see, e.g. references [8, 48]) that the maximum uplink communication rate between each user i U s $i\in \mathcal {U}_s$ and its corresponding base station s $s$ can be formulated as
c s , i U : = n = 1 R r s , i ( n ) B log 1 + p s , i h s , i s S { s } I s ( n ) ( s ) + B N 0 , $$\begin{align} c_{s,i}^\mathrm{U}&:= \sum _{n=1}^{R}r_{s,i}^{(n)}B\log {\left(1+\frac{p_{s,i}h_{s,i}}{\sum _{\tilde{s}\in \mathcal {S}\backslash \lbrace s\rbrace }I_{s}^{(n)}(\tilde{s})+BN_0}\right)}, \end{align}$$ (19)
where N 0 $N_0$ is the thermal noise power spectral density. Furthermore, we denote the minimum required uplink communication rate between each user and its base station by a constant R min $R_{\min }$ .

We note that the thermal noise is added to the received signal, which is already channel encoded. In contrast, the DP noise is added to the source information before any channel coding is performed. As a result, only the thermal noise appears in (19).
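The following is a minimal sketch of how the channel gain (18), the interference term (17), and the uplink rate (19) can be evaluated; the distances, powers, and the dBm-to-watt conversion for the noise level are hypothetical example values.

```python
import numpy as np

C_LIGHT = 3.0e8
rng = np.random.default_rng(1)

def channel_gain(dist, fc=2.45e9):
    """Channel gain (18): Rayleigh fading, free-space factor, and d^-3 path loss."""
    l = rng.rayleigh(scale=1.0)
    return l**2 * (C_LIGHT / (4 * np.pi * fc)) ** 2 * (1.0 / dist) ** 3

def interference(h_cross, r_other, p_other):
    """Interference (17) on one block: at most one user of the other cell transmits."""
    return np.sum(h_cross * r_other * p_other)

def uplink_rate(r_user, p_user, h_user, I_blocks,
                B=180e3, N0=10 ** (-17 / 10) * 1e-3):   # N0: -17 dBm (Table 2) in watts
    """Uplink rate (19), summed over the (at most one) block the user occupies."""
    sinr = p_user * h_user / (I_blocks + B * N0)
    return np.sum(r_user * B * np.log2(1 + sinr))

# Hypothetical single-block example.
h = channel_gain(dist=300.0)                         # user 300 m from its base station
I = interference(np.array([1e-12]), np.array([1]), np.array([0.01]))
print(uplink_rate(np.array([1]), 0.01, h, np.array([I])))
```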

The parameters R s $\bm {R}_s$ and σ s $\bm{\sigma }_s$ play a critical role in improving the convergence rate of Algorithm 1 (cf. Theorem 1) and reducing the total privacy leakage. Furthermore, the parameter p s $\bm {p}_s$ is critical in establishing a reliable communication. By minimizing C 1 $C_1$ and C 2 $C_2$ with respect to these parameters, the upper bound on the optimality gap in Theorem 1 reduces and thus the convergence rate of the FL procedure should improve. To this end, it is sufficient to minimize only the expressions inside the squared term in C 1 $C_1$ .

Therefore, we propose an optimization problem over the variables ( R s , p s , σ s ) s S $(\bm {R}_s,\bm {p}_s,\bm{\sigma }_{s})_{s\in \mathcal {S}}$ and minimize the values of C 1 $C_1$ and C 2 $C_2$ from Theorem 1 and the total privacy leakage given by (12). In this combined formulation, we assume that other FL parameters, such as L , μ , ξ 1 , ξ 2 , d $L,\mu,\xi _1,\xi _2,d$ , and T $T$ , are constant. The main server can solve this optimization problem and broadcast the results to all base stations before Algorithm 1 starts.

Since it is hard to directly solve a multi-objective optimization problem for both scheduling and total privacy leakage, we formulate the problem as a single-objective optimization task as follows:
minimize ( R s , p s , σ s ) s S s S i U s K s , i 1 n = 1 R r s , i ( n ) + γ s S i U s 1 K s , i σ s , i 2 n = 1 R r s , i ( n ) $$\begin{align} &\underset{(\bm {R}_s,\bm {p}_s,\bm{\sigma }_{s})_{s\in \mathcal {S}}}{\text{minimize}} \quad \sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}K_{s,i} {\left[1-\sum _{n=1}^{R} r_{s,i}^{(n)}\right]} \nonumber \\ & \qquad \qquad \qquad \qquad +\gamma \sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}{\left(\frac{1}{K_{s,i}\sigma _{s,i}}\right)}^2\sum _{n=1}^{R} r_{s,i}^{(n)}\end{align}$$ (20)
subject to s S i U s K s , i σ s , i 2 n = 1 R r s , i ( n ) V max s S i U s K s , i n = 1 R r s , i ( n ) , $$\begin{align} & \text{subject to}\nonumber \\ & \sum _{s\in \mathcal {S}} \sum _{i\in \mathcal {U}_s}K_{s,i}\sigma _{s,i}^2\sum _{n=1}^{R} r_{s,i}^{(n)}\le V_{\max }\!\sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}K_{s,i}\sum _{n=1}^{R} r_{s,i}^{(n)},\end{align}$$ (21)
K s , i σ s , i N min n = 1 R r s , i ( n ) , s S , i U s , $$\begin{align} & K_{s,i}\sigma _{s,i}\ge N_{\min }\sum _{n=1}^{R} r_{s,i}^{(n)},&& \forall s\in \mathcal {S},i\in \mathcal {U}_s,\end{align}$$ (22)
i U s r s , i ( n ) 1 , s S , n [ R ] , $$\begin{align} & \sum _{i\in \mathcal {U}_s}r_{s,i}^{(n)} \le 1,&& \forall s\in \mathcal {S}, n\in [R],\end{align}$$ (23)
n = 1 R r s , i ( n ) 1 and r s , i ( n ) { 0 , 1 } , s S , i U s , $$\begin{align} &\sum _{n=1}^{R} r_{s,i}^{(n)} \le 1\; \text{ and }\; r_{s,i}^{(n)}\in \lbrace 0,1\rbrace,&& \forall s\in \mathcal {S},i\in \mathcal {U}_s,\end{align}$$ (24)
0 p s , i P max , s S , i U s , $$\begin{align} & 0\le p_{s,i}\le P_{\max },&& \forall s\in \mathcal {S},i\in \mathcal {U}_s,\end{align}$$ (25)
c s , i U R min n = 1 R r s , i ( n ) , s S , i U s , $$\begin{align} & c_{s,i}^\mathrm{U}\ge R_{\min }\sum _{n=1}^{R} r_{s,i}^{(n)},&& \forall s\in \mathcal {S},i\in \mathcal {U}_s, \end{align}$$ (26)
where γ > 0 $\gamma >0$ is a constant and is used to balance the optimization of the scheduling and the total privacy leakage. Typically, the value of the constant γ $\gamma$ can be obtained by hyperparameter tuning and simulations. The choice of γ $\gamma$ depends on several factors, such as the sizes of the training data chunks, namely K s , i $K_{s,i}$ , and our emphasis on minimizing either the scheduling term or the total privacy leakage.

Minimizing the first term in the objective function in (20) improves the convergence of Algorithm 1 and is computed by applying (15) to the summation term in C 1 $C_1$ of Theorem 1. Minimizing the second term, on the other hand, reduces the total privacy leakage at all users and is given by (12).

In the optimization problem (20), edge devices that have a larger number of samples K s , i $K_{s,i}$ and a better uplink communication channel generally have a higher chance of being scheduled and of being assigned less DP noise power.

Constraint (21) guarantees that the DP noise error, which is characterized by the term C 3 $C_3$ of Theorem 1, is less than a given constant V max $V_{\max }$ . To derive (21), we first consider the following upper bound on the squared term in C 3 $C_3$
K s , i a s , i K a σ s , i 2 K s , i a s , i K a σ s , i 2 , $$\begin{align} {\left(\frac{K_{s,i}a_{s,i}}{K_{\mathrm{a}}}\sigma _{s,i}\right)}^2\le \frac{K_{s,i}a_{s,i}}{K_{\mathrm{a}}}\sigma _{s,i}^2, \end{align}$$ (27)
which follows by (10) and K s , i a s , i K a $K_{s,i}a_{s,i}\le K_{\mathrm{a}}$ . Constraint (21) then follows by applying (10) and (15) to this upper bound and setting it to be smaller than V max $V_{\max }$ . We can then control the amount of DP noise variance and its error by adjusting the constant V max $V_{\max }$ .

Conditions (23) and (24) provide the resource allocation constraints, whereas (25) and (26) restrict the transmit power to a maximum amount P max $P_{\mathrm{max}}$ and ensure a minimum communication rate R min $R_{\mathrm{min}}$ for each user in each cell, respectively. Finally, constraint (22) guarantees an upper bound on the privacy leakage of the users individually due to (11). Here, the constant N min $N_{\min }$ controls the minimum amount of DP noise at each user.

We notice that the variables a s $\bm {a}_s$ and R s $\bm {R}_s$ , which are given by (3) and (14), are related through (15). Therefore, a s $\bm {a}_s$ does not appear as a minimization variable.

The optimization problem in (20) is not easy to solve. However, we can subdivide it into simpler problems and search for (sub-)optimal solutions. The main server can then compute and broadcast these (sub-)optimal ( R s , p s , σ s ) s S $(\bm {R}_s^*,\bm {p}_s^*,\bm{\sigma }_{s}^*)_{s\in \mathcal {S}}$ to all base stations where they can be forwarded to the users. These computations and initialization should be done prior to the beginning of Algorithm 1.

5 ALGORITHM DESIGN

Here, we propose two sub-optimal sequential algorithms to solve the optimization problem in (20). First, for fixed DP noise the objective function in (20) is minimized with respect to users' transmit powers and resource block allocation in a cell-by-cell manner. In the second part, with given transmit power and resource block allocation, the optimization problem in (20) becomes convex with respect to the DP noise standard deviations.

5.1 Optimal scheduler

Let the DP noise standard deviations be given such that (22) is always satisfied. In the following, we consider the joint transmit power and resource block allocation problem, which is a simplified version of (20).
minimize ( R s , p s ) s S s S i U s K s , i 1 n = 1 R r s , i ( n ) + γ s S i U s 1 K s , i σ s , i 2 n = 1 R r s , i ( n ) $$\begin{align} & \underset{(\bm {R}_s,\bm {p}_s)_{s\in \mathcal {S}}}{\text{minimize}} \quad \sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}K_{s,i} {\left[1-\sum _{n=1}^{R}r_{s,i}^{(n)}\right]} \nonumber \\ & \qquad \qquad \quad \quad +\gamma \sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}{\left(\frac{1}{K_{s,i}\sigma _{s,i}}\right)}^2\sum _{n=1}^{R} r_{s,i}^{(n)} \end{align}$$ (28)
subject to Equations (21) and (23) to (26).
The optimization problem in (28) is non-linear with respect to ( R s ) s S $(\bm {R}_s)_{s\in \mathcal {S}}$ due to (26) and (19). To further simplify it, we first compute the optimal transmit powers while guaranteeing the minimum communication rate constraint. Here, setting (26) to equality, combining it with (19), and using the fact that r s , i ( n ) { 0 , 1 } $r_{s,i}^{(n)}\in \lbrace 0,1\rbrace$ , the optimal transmit powers can be obtained as
p s , i = n = 1 R r s , i ( n ) 2 R min B 1 s S { s } I s ( n ) ( s ) + B N 0 h s , i . $$\begin{align} &p_{s,i}^*=\sum _{n=1}^R r_{s,i}^{(n)}{\left(2^{\frac{R_{\min }}{B}}-1\right)}\frac{\sum _{{\tilde{s}}\in \mathcal {S}\backslash \lbrace s\rbrace }I_{s}^{(n)}(\tilde{s})+BN_0}{h_{s,i}}. \end{align}$$ (29)
We then consider only one cell at each optimization step in an alternating strategy. Based on this approach and applying (29) to (25), the optimization task in (28) reduces to the following linear integer programming problem for a single cell s $s$ :
minimize R s i U s K s , i 1 n = 1 R r s , i ( n ) + γ i U s 1 K s , i σ s , i 2 n = 1 R r s , i ( n ) $$\begin{align} &\underset{\bm {R}_s}{\text{minimize}} \quad \sum _{i\in \mathcal {U}_s}K_{s,i} {\left[1-\sum _{n=1}^{R}r_{s,i}^{(n)}\right]}\nonumber \\ & \qquad \qquad \qquad \qquad \quad +\gamma \sum _{i\in \mathcal {U}_s}{\left(\frac{1}{K_{s,i}\sigma _{s,i}}\right)}^2\sum _{n=1}^{R} r_{s,i}^{(n)}\end{align}$$ (30)
subject to i U s K s , i σ s , i 2 n = 1 R r s , i ( n ) + s S { s } i U s K s , i σ s , i 2 n = 1 R r s , i ( n ) V max i U s K s , i n = 1 R r s , i ( n ) + V max s S { s } i U s K s , i n = 1 R r s , i ( n ) , $$\begin{align} & \text{subject to}\nonumber \\ &\sum _{i\in \mathcal {U}_s}K_{s,i}\sigma _{s,i}^2\sum _{n=1}^{R}r_{s,i}^{(n)}+\sum _{{\tilde{s}}\in \mathcal {S}\backslash \lbrace s\rbrace } \sum _{i\in \mathcal {U}_{\tilde{s}}}K_{\tilde{s},i}\sigma _{\tilde{s},i}^2\sum _{n=1}^{R}r_{\tilde{s},i}^{(n)}\nonumber \\ &\;\;\le V_{\max } \sum _{i\in \mathcal {U}_s}K_{s,i}\sum _{n=1}^{R}r_{s,i}^{(n)}+V_{\max }\!\!\!\sum _{{\tilde{s}}\in \mathcal {S}\backslash \lbrace s\rbrace } \sum _{i\in \mathcal {U}_{\tilde{s}}}K_{{\tilde{s}},i}\sum _{n=1}^{R}r_{{\tilde{s}},i}^{(n)},\end{align}$$ (31)
i U s r s , i ( n ) 1 , n [ R ] , $$\begin{align} & \sum _{i\in \mathcal {U}_s}r_{s,i}^{(n)} \le 1, \forall n\in [R],\end{align}$$ (32)
n = 1 R r s , i ( n ) 1 and r s , i ( n ) { 0 , 1 } , i U s , $$\begin{align} &\sum _{n=1}^{R} r_{s,i}^{(n)} \le 1\; \text{ and }\; r_{s,i}^{(n)}\in \lbrace 0,1\rbrace, \forall i\in \mathcal {U}_s,\end{align}$$ (33)
0 n = 1 R r s , i ( n ) 2 R min B 1 s S { s } I s ( n ) ( s ) + B N 0 h s , i P max , i U s . $$\begin{align} &0\le \sum _{n=1}^R r_{s,i}^{(n)}{\left(2^{\frac{R_{\min }}{B}}- 1\right)}\frac{\sum _{\tilde{s}\in \mathcal {S}\backslash \lbrace s\rbrace }I_{s}^{(n)}(\tilde{s})+BN_0}{h_{s,i}} \le P_{\max },\nonumber \\ &\forall i \in \mathcal {U}_s. \end{align}$$ (34)

To solve (30), we assume that ( R s , p s ) s S { s } $(\bm {R}_{\tilde{s}},\bm {p}_{\tilde{s}})_{\tilde{s}\in \mathcal {S}\backslash \lbrace s\rbrace }$ are known and satisfy (23)–(25). We then solve this problem with respect to R s $\bm {R}_s$ while taking R s $\bm {R}_{\tilde{s}}$ with s S { s } $\tilde{s}\in \mathcal {S}\backslash \lbrace s\rbrace$ as constants. By solving this optimization problem for each cell, we obtain a (sub-)optimal scheduling solution ( R s ) s S $(\bm {R}^*_s)_{s\in \mathcal {S}}$ for the whole system.
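A minimal CVXPY sketch of the per-cell integer program (30)-(34) is given below; the helper name, the encoding of the power constraint (34) as a per-(user, block) feasibility mask, and the choice of the GLPK_MI solver are illustrative assumptions, and the other cells' contributions to (31) enter as fixed constants.

```python
import cvxpy as cp
import numpy as np

def schedule_cell(K, sigma, p_feasible, R, gamma, V_max, other_lhs=0.0, other_rhs=0.0):
    """Per-cell linear integer program (30)-(34) for one cell s.

    K, sigma: arrays of K_{s,i} and sigma_{s,i} for the users of this cell;
    p_feasible[i, n] is 1 if the power (29) required by user i on block n
    satisfies (34); other_lhs / other_rhs are the fixed contributions of the
    remaining cells to the left- and right-hand sides of (31).
    """
    U = len(K)
    r = cp.Variable((U, R), boolean=True)
    sched = cp.sum(r, axis=1)                                  # sum_n r_{s,i}^{(n)}
    objective = cp.Minimize(K @ (1 - sched)
                            + gamma * ((1.0 / (K * sigma) ** 2) @ sched))
    constraints = [
        (K * sigma ** 2) @ sched + other_lhs <= V_max * (K @ sched) + other_rhs,  # (31)
        cp.sum(r, axis=0) <= 1,                                # (32): one user per block
        sched <= 1,                                            # (33)
        r <= p_feasible,                                       # (34) via feasibility mask
    ]
    cp.Problem(objective, constraints).solve(solver=cp.GLPK_MI)
    return np.rint(r.value).astype(int)

# Hypothetical single-cell example: six users, three resource blocks.
K = np.array([120.0, 300.0, 80.0, 500.0, 60.0, 200.0])
sigma = np.full(6, 0.5)
print(schedule_cell(K, sigma, p_feasible=np.ones((6, 3)), R=3, gamma=1e6, V_max=12.0))
```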

In the next step, the optimal transmit powers ( p s ) s S $(\bm {p}_s)_{s\in \mathcal {S}}$ should be accordingly computed by using (29). Yet, the term s S { s } I s ( n ) ( s ) $\sum _{\tilde{s}\in \mathcal {S}\backslash \lbrace s\rbrace }I_{s}^{(n)}(\tilde{s})$ in (29) is itself a linear function of transmit powers of other users due to (17). In fact, (29) can be written as a linear equation system Ap = b $\bm {Ap}=\bm {b}$ with unknown variables p $\bm {p}$ . Here, p $\bm {p}$ is a vector consisting of all transmit powers p s , i $p_{s,i}$ and A $\bm {A}$ and b $\bm {b}$ are the coefficients of the linear equation system given by (29). To compute the optimal transmit powers, we solve the following simple optimization:
minimize p Ap b 1 subject to 0 p s , i P max , s S , i U s . $$\begin{align} &\underset{\bm {p}}{\text{minimize}} \quad \Vert \bm {Ap}-\bm {b}\Vert _1\\ & \text{subject to}\quad 0\le p_{s,i}\le P_{\max },\quad \forall s\in \mathcal {S},i\in \mathcal {U}_s.\nonumber\end{align}$$ (35)

After finding the optimal powers from (35), we can compute the uplink communication rates by using (19). We then unschedule those users whose rates do not satisfy (26) and set their transmit power to zero.
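A minimal CVXPY sketch of the power problem (35) follows; the matrix A and vector b are assumed to have been assembled from (29) and (17) for the scheduled users, and the small three-user system shown here is hypothetical.

```python
import cvxpy as cp
import numpy as np

def solve_powers(A, b, P_max):
    """Solve (35): minimize ||A p - b||_1 subject to 0 <= p <= P_max."""
    p = cp.Variable(A.shape[1])
    cp.Problem(cp.Minimize(cp.norm1(A @ p - b)), [p >= 0, p <= P_max]).solve()
    return p.value

# Hypothetical 3-user system: each row of A encodes (29) for one scheduled user,
# with off-diagonal entries coming from the interference term (17).
A = np.array([[1.0, -0.2,  0.0],
              [-0.1, 1.0, -0.3],
              [0.0, -0.2,  1.0]])
b = np.array([0.004, 0.006, 0.005])
print(solve_powers(A, b, P_max=10 ** (10 / 10) * 1e-3))        # P_max = 10 dBm in watts
```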

ALGORITHM 2. Random scheduler with random differential privacy noise (RndSched).

1: Initialize the values of ( R s , p s , σ s ) s S $(\bm {R}_s,\bm {p}_s,\bm{\sigma }_{s})_{s\in \mathcal {S}}$ randomly such that they satisfy (21)-(25).
2: Compute ( p s ) s S $(\bm {p}^*_s)_{s\in \mathcal {S}}$ by solving (35) and unschedule those users whose communication rates do not meet (26).
3: Output the resulting parameters as a (sub-)optimal solution ( R s , p s , σ s ) s S $(\bm {R}_s,\bm {p}^*_s,\bm{\sigma }_{s})_{s\in \mathcal {S}}$ .

ALGORITHM 3. Optimal scheduler with random differential privacy noise (OptSched).

1: Initialize the values of ( R s , p s , σ s ) s S $(\bm {R}_s,\bm {p}_s,\bm{\sigma }_{s})_{s\in \mathcal {S}}$ randomly such that they satisfy (21)-(25).
2: for s S $s\in \mathcal {S}$ do
3: For fixed ( R s , p s ) s S { s } $(\bm {R}_{\tilde{s}},\bm {p}_{\tilde{s}})_{{\tilde{s}}\in \mathcal {S}\backslash \lbrace s\rbrace }$ and ( σ s ) s S $(\bm{\sigma }_{s})_{s\in \mathcal {S}}$ , obtain a (sub-)optimal resource block allocation matrix R s $\bm {R}^*_s$ by solving the optimization problem in (30).
4: end for
5: Compute ( p s ) s S $(\bm {p}^*_s)_{s\in \mathcal {S}}$ by solving (35) and unschedule those users whose communication rates do not meet (26).
6: Output the resulting parameters as a (sub-)optimal solution ( R s , p s , σ s ) s S $(\bm {R}^*_s,\bm {p}^*_s,\bm{\sigma }_{s})_{s\in \mathcal {S}}$ .

ALGORITHM 4. Optimal scheduler with differential privacy noise optimizer (OptSched+DP).

1: Perform Algorithm 3.
2: For the given ( R s , p s ) s S $(\bm {R}^*_s,\bm {p}^*_s)_{s\in \mathcal {S}}$ from Algorithm 3, obtain the optimal ( σ s ) s S $(\bm{\sigma }^*_{s})_{s\in \mathcal {S}}$ by solving (36).
3: Output the resulting parameters as a (sub-)optimal solution ( R s , p s , σ s ) s S $(\bm {R}^*_s,\bm {p}^*_s,\bm{\sigma }^*_{s})_{s\in \mathcal {S}}$ .

Based on these solutions, we propose two procedures for user scheduling and DP noise adjustment. Algorithm 2 presents the random scheduler (RndSched). Algorithm 3 presents the optimal scheduler (OptSched) based on (30). Both algorithms benefit from the power allocation procedure based on (35), and both apply random DP noise to achieve privacy.

We note that one advantage of OptSched is that each per-cell problem (30) is a linear integer program and is therefore efficient from a practical point of view compared with (28). Nevertheless, the drawback of this approach is that it is performed sequentially, cell by cell. As a result, there is no guarantee that this approach always provides an optimal solution. However, as we will see in Section 6.1, it delivers very good results compared with the random scheduler. In the next subsection, we extend this algorithm to include a DP optimizer.

5.2 DP optimizer

Let the transmit powers and resource block allocation values from the OptSched ( R s , p s ) s S $(\bm {R}^*_s,\bm {p}^*_s)_{s\in \mathcal {S}}$ be given. The DP noise optimization problem is then given by
minimize ( σ s ) s S s S i U s n = 1 R r s , i ( n ) K s , i 2 σ s , i 2 subject to ( 21 ) , ( 22 ) . $$\begin{align} & \underset{(\bm{\sigma }_{s})_{s\in \mathcal {S}}}{\text{minimize}} \quad \sum _{s\in \mathcal {S}}\sum _{i\in \mathcal {U}_s}\frac{\sum _{n=1}^{R} r_{s,i}^{(n)}}{K_{s,i}^2\sigma _{s,i}^2} \\ & \text{subject to }(21), (22).\nonumber\end{align}$$ (36)
We note that the Hessian matrix of the objective function in (36) is given by
$$\begin{align} \begin{pmatrix} \frac{6\sum _{n=1}^{R}r_{1,1}^{(n)}}{K_{1,1}^2\sigma _{1,1}^4} & 0 & \cdots & 0\\ 0 & \frac{6\sum _{n=1}^{R}r_{1,2}^{(n)}}{K_{1,2}^2\sigma _{1,2}^4} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \frac{6\sum _{n=1}^{R}r_{S,U_S}^{(n)}}{K_{S,U_S}^2\sigma _{S,U_S}^4} \end{pmatrix}, \end{align}$$ (37)
which is positive semi-definite on the feasible set. Furthermore, the Hessian matrix of constraint (21) is positive semi-definite as well. Therefore, the optimization problem in (36) is convex and the global optimal solution can be obtained by solving the Karush–Kuhn–Tucker (KKT) [50] conditions. The Lagrange function can be formulated as:
L ( σ s ) s S = s S i U s n = 1 R r s , i ( n ) K s , i 2 σ s , i 2 + κ s S i U s K s , i σ s , i 2 n = 1 R r s , i ( n ) V max s S i U s K s , i n = 1 R r s , i ( n ) , $$\begin{align*} \mathcal {L}{\left((\bm{\sigma }_{s})_{s\in \mathcal {S}}\right)} &=\sum_{s\in \mathcal{S}}\sum_{i\in \mathcal{U}_s}\frac{\sum_{n=1}^{R} r_{s,i}^{(n)}}{K_{s,i}^2\sigma_{s,i}^2}\\ &\quad +\kappa {\left(\sum _{s\in \mathcal{S}} \sum _{i\in \mathcal{U}_s}K_{s,i}\sigma _{s,i}^2\sum _{n=1}^{R}r_{s,i}^{(n)}\right.}\\ &{\left.\qquad \qquad - V_{\max }\sum _{s\in \mathcal{S}} \sum _{i\in \mathcal{U}_s}K_{s,i}\sum _{n=1}^{R}r_{s,i}^{(n)}\right)}, \end{align*}$$
where κ 0 $\kappa \ge 0$ is a Lagrange multiplier.
Setting the derivative of L $\mathcal {L}$ with respect to σ s , i $\sigma _{s,i}$ to zero yields
2 n = 1 R r s , i ( n ) K s , i 2 σ s , i 3 + 2 κ K s , i n = 1 R r s , i ( n ) σ s , i = 0 . $$\begin{align} -\frac{2\sum _{n=1}^{R}r_{s,i}^{(n)}}{K_{s,i}^2\sigma _{s,i}^3}+2\kappa K_{s,i}\sum _{n=1}^{R}r_{s,i}^{(n)}\sigma _{s,i}=0. \end{align}$$ (38)
Let n = 1 R r s , i ( n ) = 1 $\sum _{n=1}^{R}r_{s,i}^{(n)}=1$ hold. It then follows by combining (38) with (22) that
σ s , i = K s , i 3 κ 1 4 N min K s , i , $$\begin{align} \sigma _{s,i}={\left.{\left(K_{s,i}^3 \kappa \right)}^{-\frac{1}{4}}\right|}_{ \frac{N_{\min }}{K_{s,i}} }, \end{align}$$ (39)
where a | b = max { a , b } $a|_b=\max \lbrace a,b\rbrace$ , s S $s\in \mathcal {S}$ , and i U s $i\in \mathcal {U}_s$ . If n = 1 R r s , i ( n ) = 0 $\sum _{n=1}^{R}r_{s,i}^{(n)}=0$ holds, then we have σ s , i = 0 $\sigma _{s,i}=0$ .
For the optimal solution, constraint (21) always holds with equality since the objective function monotonically decreases with increasing σ s , i $\sigma _{s,i}$ . Consequently, substituting (39) into (21) with equality implies that
s S i U s K s , i n = 1 R r s , i ( n ) K s , i 3 κ 1 2 N min 2 K s , i 2 = V max s S i U s K s , i n = 1 R r s , i ( n ) . $$\begin{align} \sum _{s\in \mathcal {S}} \sum _{i\in \mathcal {U}_s}K_{s,i}&\sum _{n=1}^{R}r_{s,i}^{(n)} {\left.{\left(K_{s,i}^3 \kappa \right)}^{-\frac{1}{2}}\right|}_{ \frac{N_{\min }^2}{K_{s,i}^2} } \nonumber \\ &=V_{\max }\sum _{s\in \mathcal {S}} \sum _{i\in \mathcal {U}_s}K_{s,i}\sum _{n=1}^{R}r_{s,i}^{(n)}. \end{align}$$ (40)

After the value of κ $\kappa$ is found from (40), the optimal σ s , i $\sigma ^*_{s,i}$ can be computed from (39). We then combine this scheme with the procedure in Section 5.1. A summary of this scheme is provided in Algorithm 4 (OptSched+DP).
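A sketch of this computation is given below; the bisection bracket for κ and the example values are hypothetical, and the code assumes that constraint (21) can be met with equality within that bracket.

```python
import numpy as np

def dp_noise_optimizer(K, scheduled, N_min, V_max, iters=200):
    """Optimal sigma_{s,i} via (39), with kappa obtained by bisection on (40).

    K, scheduled: arrays of K_{s,i} and sum_n r_{s,i}^{(n)} over all users.
    """
    K = K.astype(float)
    rhs = V_max * np.sum(K * scheduled)              # right-hand side of (40)

    def lhs(kappa):                                  # left-hand side of (40)
        term = np.maximum((K ** 3 * kappa) ** -0.5, (N_min / K) ** 2)
        return np.sum(K * scheduled * term)

    lo, hi = 1e-12, 1e12                             # lhs is decreasing in kappa
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if lhs(mid) > rhs else (lo, mid)
    kappa = 0.5 * (lo + hi)

    sigma = np.maximum((K ** 3 * kappa) ** -0.25, N_min / K)   # (39)
    return np.where(scheduled > 0, sigma, 0.0), kappa

# Hypothetical example: four users, three of them scheduled.
K = np.array([120, 300, 80, 500])
scheduled = np.array([1, 1, 0, 1])
print(dp_noise_optimizer(K, scheduled, N_min=100, V_max=12.0))
```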

6 SIMULATIONS AND NUMERICAL SOLUTIONS

6.1 Optimization problems

Here, we present the numerical solutions of the algorithms that were presented in Section 5. In this regard, we apply the Python optimization packages CVXPY, CVXOPT, GLPK, and ECOS [35-39].

Since Algorithms 3 and 4 are heuristic, their solutions depend on the initial values of the optimizing variables ( R s , p s , σ s ) s S $(\bm {R}_s,\bm {p}_s,\bm{\sigma }_{s})_{s\in \mathcal {S}}$ as well as the wireless channels and the number of training samples at each user. As a result, we repeat the computations for several random initial values, channels, and data distributions among the users and then compute the average.

To this end, the variables ( R s ) s S $(\bm {R}_s)_{s\in \mathcal {S}}$ are first initialized based on a shuffled round-robin scheme and ( p s , σ s ) s S $(\bm {p}_s,\bm{\sigma }_{s})_{s\in \mathcal {S}}$ are set uniformly at random such that 0 p s , i P max $0\le p_{s,i}\le P_{\max }$ and N min / K s , i σ s , i 6 N min / K s , i $N_{\min }/K_{s,i}\le \sigma _{s,i}\le 6N_{\min }/K_{s,i}$ hold. Second, the users are positioned in a square area consisting of seven hexagonal cells according to a uniform distribution. The edge devices are then assigned to their nearest base stations according to their random positions. Based on their distances to the base stations, their fading channels are then computed by applying (18).
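A sketch of this initialization is shown below; the base-station coordinates, the area size, and the lognormal parameters for the sample counts K_{s,i} are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)

def initialize(bs_positions, num_users, R, P_max, N_min, area=3000.0):
    """Random initialization used before Algorithms 2-4: uniform user positions,
    nearest-base-station assignment, lognormal sample counts, uniform transmit
    powers and DP noise levels, and a shuffled round-robin resource block map."""
    pos = rng.uniform(0.0, area, size=(num_users, 2))
    cell = np.argmin(np.linalg.norm(pos[:, None] - bs_positions[None], axis=2), axis=1)
    dist = np.linalg.norm(pos - bs_positions[cell], axis=1)

    K_si = np.maximum(rng.lognormal(mean=5.0, sigma=1.0, size=num_users), 1.0).astype(int)
    p = rng.uniform(0.0, P_max, size=num_users)
    sigma = rng.uniform(N_min / K_si, 6.0 * N_min / K_si)

    block = np.full(num_users, -1)                   # -1: user not scheduled
    for s in range(len(bs_positions)):               # shuffled round-robin per cell
        users = rng.permutation(np.where(cell == s)[0])
        block[users[:R]] = np.arange(min(R, len(users)))
    return cell, dist, K_si, p, sigma, block

# Hypothetical layout with 7 base stations and 100 users.
bs = rng.uniform(0.0, 3000.0, size=(7, 2))
cell, dist, K_si, p, sigma, block = initialize(bs, num_users=100, R=5,
                                               P_max=0.01, N_min=100)
```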

An example of channel initialization, generated by our Python simulator, is shown in Figure 3. Here, the channels between one of the users and the base stations are depicted as dashed lines. We notice that cells 1–6 in this setting can also cover users outside their own areas, while the central cell covers only devices inside the central hexagon. As a result, the effects of both boundary and central cells are taken into account in our simulations.

FIGURE 3. User and channel initialization with 100 users.

After the users are assigned to their corresponding base stations, the training data is randomly distributed among all users. Inspired by reference [51], the numbers of samples $K_{s,i}$ are drawn from a lognormal distribution. Algorithms 3 and 4 should then provide us with the (sub-)optimal values $(\bm{R}^*_s,\bm{p}^*_s)_{s\in\mathcal{S}}$ and $(\bm{\sigma}^*_s)_{s\in\mathcal{S}}$, respectively.
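Continuing the sketch above, the sample counts and the DP noise initialization range can be drawn as follows; the lognormal parameters are illustrative and are not those of reference [51].

```python
# Lognormal sample counts and uniform DP-noise initialization within the stated bounds.
N_min = 100.0
K_si = np.maximum(rng.lognormal(mean=4.0, sigma=1.0, size=K_total), 1.0).astype(int)
sigma = rng.uniform(N_min / K_si, 6 * N_min / K_si)   # N_min/K <= sigma <= 6 N_min/K
```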

The system parameters that are used in the computations are listed in Table 2. Figure 4 shows the results of all algorithms in the form of an empirical cumulative distribution function (CDF) of the normalized objective value in (20). The normalization is done by dividing the value of the objective function by the total number of samples (scheduled or unscheduled). The CDF is computed for two values of the number of available resource blocks $R$ and of the optimization constant $\gamma$ from (20). The results are averaged over $10^3$ random channels and initial values. As seen in Figure 4, OptSched (Algorithm 3) outperforms RndSched (Algorithm 2) in terms of minimizing the objective value in (20). Moreover, OptSched+DP (Algorithm 4) further improves on OptSched by reducing the total privacy leakage. Furthermore, OptSched+DP achieves lower values for $\gamma=10^7$ than for $\gamma=10^6$. This is because larger $\gamma$ in (20) gives more weight to the DP noise optimization.
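For reference, a small sketch of how such an empirical CDF can be produced from the collected per-trial values is given below; the evaluation of the objective (20) itself is abstracted into a dummy placeholder.

```python
# Empirical CDF of the normalized objective over many random initializations.
import numpy as np
import matplotlib.pyplot as plt

def normalized_objective(trial_seed):
    """Placeholder: run one trial (initialize, optimize, evaluate (20)) and
    return the objective divided by the total number of samples."""
    rng = np.random.default_rng(trial_seed)
    return rng.gamma(2.0, 1.0)              # dummy value standing in for the real pipeline

values = np.sort([normalized_objective(t) for t in range(1000)])
cdf = np.arange(1, len(values) + 1) / len(values)
plt.step(values, cdf, where="post", label="OptSched (example)")
plt.xlabel("Normalized objective value")
plt.ylabel("Empirical CDF")
plt.legend()
plt.show()
```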

TABLE 2. Simulation parameters.

System parameter | Value
Number of cells or base stations ($S$) | 7
Total number of users | 100
Cell radius | 500 m
Uplink center frequency ($f_{\mathrm{c}}$) | 2450 MHz
Channels' Rayleigh distribution scale parameter | 1
Uplink resource block bandwidth ($B$) | 180 kHz
Thermal noise power spectral density ($N_0$) | $-17$ dBm
Maximum transmit power ($P_{\max}$) | 10 dBm
Minimum communication rate ($R_{\min}$) | 100 kb/s
DP noise error upper bound ($V_{\max}$) | 12
Minimum total DP noise at each user ($N_{\min}$) | 100
FIGURE 4. Numerical results of the optimization problem in (20) for varying numbers of resource blocks and balancing constants (panels 4a and 4b).

We also notice that by increasing $R$ from 5 to 8, the normalized objective values of RndSched get slightly closer to those of the OptSched algorithm. This is due to the fact that by increasing $R$ while keeping the total number of users constant, the chance that all users are successfully scheduled by RndSched becomes higher. For large values of $R$, RndSched might eventually achieve the same performance as OptSched. However, allocating a large number of resource blocks to a small number of users is not desirable due to the limited amount of available bandwidth.

6.2 Federated learning simulations

Here, we apply the random parameters $\bm{a}_s$ and $\bm{\sigma}_{s}$ as well as the (sub-)optimal $\bm{a}^*_s$ and $\bm{\sigma}^*_{s}$ from Section 6.1 to an FL system as described in Algorithm 1. We assume that the main server and all users each maintain a fully connected neural network in the form of a multi-label classifier. The networks consist of two hidden layers, each with 256 nodes. To implement the simulations, we use the TensorFlow, NumPy, and Matplotlib packages [40-42]. Furthermore, we use the Modified National Institute of Standards and Technology (MNIST) image dataset [52] to train and test the multi-label classifier.
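A minimal sketch of such a classifier in TensorFlow/Keras is shown below; the layer widths are as stated above, while the ReLU activations and the flattened $28\times 28$ input are our own assumptions.

```python
# Fully connected MNIST classifier with two hidden layers of 256 nodes each.
import tensorflow as tf

def build_classifier():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),   # 784-dimensional input vector
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10),                       # logits for the 10 digit classes
    ])
    return model

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0        # scale pixels to [0, 1]
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # used in the training loop
```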

We train the local models over $T=200$ communication rounds between the users and the main server. To follow our mathematical model in Section 2, each user performs only a single local update per round (no further local iterations) and uses the batch gradient descent scheme. We do not apply any learning rate decay and use a fixed learning rate $\lambda=0.05$.

Furthermore, to guarantee that Assumption 1 holds, the gradients of all weights are clipped so that their global norm is at most $M=10$. This directly affects the amount of privacy leakage as given by (11).
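A simplified sketch of the resulting per-round computation is given below; this is our own NumPy illustration of the local update and the weighted aggregation (see (A2) and (A3) in the Appendix), and here the clipping is applied to the averaged gradient, which is an implementation choice on our part.

```python
# Simplified sketch of one communication round; function names are ours.
import numpy as np

def local_update(w, grad_fn, sigma_i, lam=0.05, M=10.0, rng=None):
    """Full-batch gradient step with global-norm clipping and Gaussian DP noise."""
    if rng is None:
        rng = np.random.default_rng()
    g = grad_fn(w)                                   # averaged local gradient (1/K_{s,i}) * sum_k grad l
    norm = np.linalg.norm(g)
    if norm > M:
        g = g * (M / norm)                           # clip so the global norm is at most M
    noise = rng.normal(0.0, sigma_i, size=w.shape)   # DP noise n_{s,i}^{(t)} with std sigma_{s,i}
    return w - lam * (g + noise)

def global_update(local_weights, sample_counts):
    """Weighted aggregation of the scheduled users' models."""
    K_a = float(sum(sample_counts))
    return sum(K_i * w_i for K_i, w_i in zip(sample_counts, local_weights)) / K_a
```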

We perform the simulations over 100 random channel realizations and initial values and then average the resulting accuracy and loss. Furthermore, we generate the empirical CDF of the privacy leakage over all users. Figure 5 shows the accuracy, loss, and privacy leakage CDF of this learning system for different values of the number of available resource blocks $R$ and of the optimization constant $\gamma$.

FIGURE 5. Federated learning system with a learning rate of $\lambda=0.05$ and a maximum gradient global norm of $M=10$ for varying numbers of resource blocks and balancing constants. The average global accuracy is shown in 5(a)–5(b), the average global loss in 5(c)–5(d), and the CDF of the privacy leakage in 5(e)–5(f).

As seen in Figure 5, OptSched outperforms the RndSched algorithm in terms of accuracy and loss for both $R=5$ and $R=8$. Here, OptSched systematically selects the users that hold large amounts of data, have better channels, and suffer less from inter-cell interference. The RndSched algorithm, in contrast, falls behind in this scenario since it schedules users at random.

OptSched+DP, on the other hand, slightly degrades the performance of OptSched by increasing and optimizing the DP noise. Yet OptSched+DP provides a similar or even better performance compared with the RndSched scheme for small values of $R$ (see Figures 5(a) and 5(c)). This degradation is the price paid for improved privacy. Figures 5(e) and 5(f) show the empirical CDF of the privacy leakage. Here, the value of $\rho_{s,i}$ at each user $i\in\mathcal{U}_s$ and cell $s\in\mathcal{S}$ is computed by using (11), and the results over all simulation iterations are collected to compute the CDF. The simulations show that the OptSched+DP scheme substantially reduces the amount of privacy leakage at each user. In particular, it achieves a maximum privacy leakage of around $\rho=0.5$ thanks to the DP noise optimizer. This is a significant improvement compared with the RndSched scheme, whose maximum leakage is around $\rho=4$.

Adjusting the optimization constant $\gamma$ is also crucial. In this regard, by choosing $\gamma=10^7$, OptSched achieves lower privacy leakage than with $\gamma=10^6$ (see Figures 5(e) and 5(f)). This is because larger $\gamma$ gives less weight to the scheduling term in (20), so users with higher DP noise power are preferred in scheduling.

7 CONCLUSION

In this paper, a privacy-preserving FL procedure in a scenario with multiple base stations and inter-cell interference has been considered. An upper bound on the optimality gap of the convergence term of this learning scheme has been derived, and an optimization problem to reduce this upper bound has been provided. We have proposed two sequential algorithms to obtain (sub-)optimal solutions for this optimization task, namely OptSched (Algorithm 3) and its extended version with a DP noise optimizer, OptSched+DP (Algorithm 4). In designing these schemes, we avoid non-linearity in the integer programming problems. The outputs of these algorithms are then applied to an FL system.

Simulation results have shown that OptSched increases the accuracy of the classification FL system and reduces the loss compared with RndSched when the number of available resource blocks $R$ is small. Here, when the total number of users is $K=100$ and $R=5$, OptSched shows an accuracy improvement of over $6\%$. Simulations have further shown that OptSched not only improves the accuracy but can also reduce the privacy leakage compared with RndSched if the parameter $\gamma$ is set properly.

OptSched+DP, on the other hand, further optimizes the DP noise and substantially reduces the privacy leakage compared with both RndSched and OptSched. Here, simulations have shown that OptSched+DP reduces the maximum privacy leakage for both $R=5$ and $R=8$ by a factor of 8 (from $\rho=4$ to $\rho=0.5$). It is worth mentioning that when $R$ is small (e.g. $R=5$), this improvement is achieved while OptSched+DP shows a similar or even better performance in terms of accuracy and loss compared with RndSched.

AUTHOR CONTRIBUTIONS

Nima Tavangaran: Conceptualization; formal analysis; methodology; project administration; software; visualization; writing—original draft. Mingzhe Chen: Conceptualization; formal analysis; methodology; resources; validation. Zhaohui Yang: Conceptualization; resources; validation. José Mairton B. Da Silva Jr.: Conceptualization; formal analysis; resources; software; validation. H. Vincent Poor: Funding acquisition; resources; supervision; validation.

ACKNOWLEDGEMENTS

The work of Nima Tavangaran was partly supported by the German Research Foundation (DFG) under Grant TA 1431/1-1. The work of Mingzhe Chen was partly supported by the U.S. National Science Foundation under grant CNS-2312139. The work of José Mairton B. Da Silva Jr. was jointly supported by the European Union's Horizon Europe research and innovation program under the Marie Skłodowska-Curie project FLASH, with grant agreement No. 101067652; the Ericsson Research Foundation; and the Hans Werthén Foundation. The work of H. Vincent Poor was supported by the U.S. National Science Foundation under grants CCF-1908308 and CNS-2128448.

CONFLICT OF INTEREST STATEMENT

The authors declare no conflicts of interest.

APPENDIX

Proof of Theorem 1

Proof. It follows by using Assumption 5 and applying a Taylor expansion to the global loss function $f$ that

$$\begin{align} f(\bm{w}^{(t+1)})\le f(\bm{w}^{(t)})+{\left(\bm{w}^{(t+1)}-\bm{w}^{(t)}\right)}^{\intercal}\nabla f(\bm{w}^{(t)})+\frac{L}{2}\big\Vert\bm{w}^{(t+1)}-\bm{w}^{(t)}\big\Vert_2^2. \end{align}$$ (A1)

Next, we compute the local updates at each user $i\in\mathcal{U}_s$ by combining (4), (5), and (7) as follows:

$$\begin{align} \bm{w}_{s,i}^{(t+1)}=\bm{w}^{(t)}-\frac{\lambda}{K_{s,i}}\sum_{k=1}^{K_{s,i}}\nabla l(\bm{w}^{(t)}, \bm{x}_{s,i}^{(k)})-\lambda\bm{n}^{(t)}_{s,i}. \end{align}$$ (A2)
Furthermore, combining (8) and (9) implies that
$$\begin{align} \bm{w}^{(t+1)}=\frac{1}{K_{\mathrm{a}}}\sum_{s\in\mathcal{S}}\sum_{i\in\mathcal{U}_s}K_{s,i}a_{s,i}\bm{w}_{s,i}^{(t+1)}. \end{align}$$ (A3)

We then obtain the global update at the main server by inserting the value of $\bm{w}_{s,i}^{(t+1)}$ from (A2) into (A3) as follows:

$$\begin{align} \bm{w}^{(t+1)}-\bm{w}^{(t)}=-\frac{\lambda}{K_{\mathrm{a}}}\sum_{s\in\mathcal{S}}\sum_{i\in\mathcal{U}_s}a_{s,i}\sum_{k=1}^{K_{s,i}}\nabla l(\bm{w}^{(t)}, \bm{x}_{s,i}^{(k)})-\frac{\lambda}{K_{\mathrm{a}}}\sum_{s\in\mathcal{S}}\sum_{i\in\mathcal{U}_s}K_{s,i}a_{s,i}\bm{n}^{(t)}_{s,i}. \end{align}$$ (A4)

To simplify the rest of the calculations, we define a new random variable that captures the difference between the global update and the global gradient:

$$\begin{align} \Delta^{(t)}:=\nabla f(\bm{w}^{(t)})+\frac{1}{\lambda}{\left(\bm{w}^{(t+1)}-\bm{w}^{(t)}\right)}. \end{align}$$ (A5)

Now, by inserting the term $\bm{w}^{(t+1)}-\bm{w}^{(t)}$ (the global update) from (A5) into (A1), we have that

$$\begin{align} f(\bm{w}^{(t+1)})\le f(\bm{w}^{(t)})+\lambda{\left(\Delta^{(t)}-\nabla f(\bm{w}^{(t)})\right)}^\intercal\nabla f(\bm{w}^{(t)})+\frac{\lambda^2 L}{2}\big\Vert\Delta^{(t)}-\nabla f(\bm{w}^{(t)})\big\Vert_2^2. \end{align}$$ (A6)

Furthermore, the following identity always holds:

$$\begin{align} \Vert\bm{u}-\bm{v}\Vert_2^2=\Vert\bm{u}\Vert_2^2+\Vert\bm{v}\Vert_2^2-2\bm{u}^\intercal\bm{v}. \end{align}$$ (A7)

Setting the learning step size to $\lambda=1/L$ and applying the identity (A7) to (A6), it follows that

$$\begin{align} f(\bm{w}^{(t+1)})-f(\bm{w}^*)\le f(\bm{w}^{(t)})-f(\bm{w}^*)+\frac{1}{2L}{\left[\Vert\Delta^{(t)}\Vert_2^2-\Vert\nabla f(\bm{w}^{(t)})\Vert_2^2\right]}, \end{align}$$ (A8)
where $f(\bm{w}^*)$ is the optimal loss function (Assumption 2).
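In more detail, substituting $\lambda=1/L$ into (A6) and expanding the last term with (A7) gives

$$\begin{align*} f(\bm{w}^{(t+1)}) &\le f(\bm{w}^{(t)})+\frac{1}{L}\left[{\Delta^{(t)}}^{\intercal}\nabla f(\bm{w}^{(t)})-\Vert\nabla f(\bm{w}^{(t)})\Vert_2^2\right]\\ &\quad+\frac{1}{2L}\left[\Vert\Delta^{(t)}\Vert_2^2+\Vert\nabla f(\bm{w}^{(t)})\Vert_2^2-2\,{\Delta^{(t)}}^{\intercal}\nabla f(\bm{w}^{(t)})\right]\\ &=f(\bm{w}^{(t)})+\frac{1}{2L}\left[\Vert\Delta^{(t)}\Vert_2^2-\Vert\nabla f(\bm{w}^{(t)})\Vert_2^2\right], \end{align*}$$

and subtracting $f(\bm{w}^*)$ from both sides yields (A8).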

Inspired by reference [8], we first obtain an upper bound on the expectation of the term $\Vert\Delta^{(t)}\Vert_2^2$ on the right-hand side of (A8). We have by combining (A4) and (A5) that

$$\begin{align} &\mathbb{E}{\left[\Vert\Delta^{(t)}\Vert_2^2\right]}\nonumber\\ &=\mathbb{E}\left[\bigg\Vert\nabla f(\bm{w}^{(t)})-\frac{1}{K_{\mathrm{a}}}\sum_{s\in\mathcal{S}}\sum_{i\in\mathcal{U}_s}a_{s,i}\sum_{k=1}^{K_{s,i}}\nabla l(\bm{w}^{(t)}, \bm{x}_{s,i}^{(k)})-\frac{1}{K_{\mathrm{a}}}\sum_{s\in\mathcal{S}}\sum_{i\in\mathcal{U}_s}K_{s,i}a_{s,i}\bm{n}^{(t)}_{s,i}\bigg\Vert_2^2\right]\nonumber\\ &=\mathbb{E}\left[\bigg\Vert-\frac{K-K_{\mathrm{a}}}{K K_{\mathrm{a}}}\sum_{\substack{(s,i):\\ a_{s,i}=1}}\sum_{k=1}^{K_{s,i}}\nabla l(\bm{w}^{(t)}, \bm{x}_{s,i}^{(k)})+\frac{1}{K}\sum_{\substack{(s,i):\\ a_{s,i}=0}}\sum_{k=1}^{K_{s,i}}\nabla l(\bm{w}^{(t)}, \bm{x}_{s,i}^{(k)})\bigg\Vert_2^2\right]+\mathbb{E}{\left[\bigg\Vert\frac{1}{K_{\mathrm{a}}}\sum_{s\in\mathcal{S}}\sum_{i\in\mathcal{U}_s}K_{s,i}a_{s,i}\bm{n}^{(t)}_{s,i}\bigg\Vert_2^2\right]}, \end{align}$$ (A9)
where (A9) follows by applying (13) and (A7) and the fact that the DP noise is independent of the other random variables with $\mathbb{E}[\bm{n}^{(t)}_{s,i}]=\bm{0}$. It follows that
$$\begin{align} \mathbb{E}{\left[\Vert\Delta^{(t)}\Vert_2^2\right]}\le\mathbb{E}{\left[{\left(\frac{K-K_{\mathrm{a}}}{K K_{\mathrm{a}}}\sum_{\substack{(s,i):\\ a_{s,i}=1}}\sum_{k=1}^{K_{s,i}}\Vert\nabla l(\bm{w}^{(t)}, \bm{x}_{s,i}^{(k)})\Vert_2+\frac{1}{K}\sum_{\substack{(s,i):\\ a_{s,i}=0}}\sum_{k=1}^{K_{s,i}}\Vert\nabla l(\bm{w}^{(t)}, \bm{x}_{s,i}^{(k)})\Vert_2\right)}^2\right]}+\sum_{e=1}^{d}\mathbb{E}{\left[{\left(\sum_{s\in\mathcal{S}}\sum_{i\in\mathcal{U}_s}\frac{K_{s,i}a_{s,i}}{K_{\mathrm{a}}}n_{s,i,e}^{(t)}\right)}^2\right]}, \end{align}$$ (A10)
where inequality (A10) is due to the triangle inequality and the fact that the vectors $\bm{n}_{s,i}=(n_{s,i,e}^{(t)})_{e\in[d]}$ are $d$-dimensional.

Next, by applying Assumption 6 to (A10), we have for some $\xi_1\ge 0$ and $\xi_2\ge 1$ that

$$\begin{align} \mathbb{E}{\left[\Vert\Delta^{(t)}\Vert_2^2\right]}\le{\left(\sum_{s\in\mathcal{S}}\sum_{i\in\mathcal{U}_s}\frac{2K_{s,i}}{K}{\left(1-a_{s,i}\right)}\right)}^2\mathbb{E}{\left[\xi_1+\xi_2\Vert\nabla f(\bm{w}^{(t)})\Vert_2^2\right]}+d\sum_{s\in\mathcal{S}}\sum_{i\in\mathcal{U}_s}{\left(\frac{K_{s,i}a_{s,i}}{K_{\mathrm{a}}}\sigma_{s,i}\right)}^2, \end{align}$$ (A11)
where the last term in (A11) follows from the fact that the random variables $n_{s,i,e}^{(t)}$ in (A10) are independent of each other.

On the other hand, since $f$ is both uniformly $L$-Lipschitz and $\mu$-strongly convex (Assumptions 3 and 4) we have [46] that

$$\begin{align} \Vert\nabla f(\bm{w}^{(t)})\Vert_2^2\ge 2\mu{\left[f(\bm{w}^{(t)})-f(\bm{w}^*)\right]}, \end{align}$$ (A12)
$$\begin{align} \Vert\nabla f(\bm{w}^{(t)})\Vert_2^2\le 2L{\left[f(\bm{w}^{(t)})-f(\bm{w}^*)\right]}. \end{align}$$ (A13)

Next, we insert (A11) into (A8). The proof then follows by using (A12) and (A13) and the fact that $\mu\le L$ (Assumption 5). $\Box$

DATA AVAILABILITY STATEMENT

The dataset used in this study is available at: Y. LeCun, ‘The MNIST database of handwritten digits’, 1998, http://yann.lecun.com/exdb/mnist

1 For the sake of simplicity, we consider only a single local update at each edge device in each round and apply batch gradient descent.