Federated learning framework for mobile edge computing networks

Abstract: The continuous growth of smart devices needing processing has led to moving storage and computation from the cloud to the network edges, giving rise to the edge computing paradigm. Owing to the limited capacity of edge computing nodes, the presence of popular applications on the edge nodes results in significant improvements in users' satisfaction and service accomplishment. However, the high variability of the content requests makes demand prediction non-trivial and, typically, the majority of the classical prediction approaches require the gathering of personal users' information at a central unit, giving rise to many users' privacy issues. In this context, federated learning has gained attention as a solution to perform learning procedures on data disseminated across multiple users, while keeping the sensitive data protected. This study applies federated learning to the demand prediction problem, to accurately forecast the most popular application types in the network. The proposed framework reaches high accuracy levels on the predicted applications demand, aggregating into a global, weighted model the feedback received from users after their local training. The validity of the proposed approach is verified by simulating the deployment of virtual machine replica copies and by comparison with alternative forecasting approaches based on chaos theory and deep learning.


Introduction
Recently, the ever increasing dissemination in our daily life of intelligent devices such as wearable devices, smartphones, smart cards, sensors, and so on, has triggered the proliferation of numerous distributed network devices generating massive quantities of heterogeneous data to be processed and interpreted [1,2]. Owing to such an unprecedented amount of data with an exponential growth trend [3] and the typically private nature of these data, sending all the data to a remote cloud becomes impracticable, unnecessary, and full of privacy concerns [2]. Therefore, all of these factors have contributed to the emergence of the new mobile edge computing (MEC) paradigm [4][5][6], which exploits the advancement in the storage and computation capacity of modern devices for pushing processing and storing procedures locally onto the devices themselves. The MEC approach involves the cooperation of edge nodes with the remote cloud in order to give rise to a computing system able to support large-scale task processing and environment management [1]. Within this context, the efficient and effective handling of big data brings out information and statistical features hidden in the datasets, useful for many application areas such as resource planning, system condition forecasting, classification, and so on. In this regard, machine learning (ML) [7] techniques have gained momentum to properly catch and interpret data behaviour, by providing a wide range of solutions to analyse dataset trends at the cloud site. Although dataset characteristics represent an invaluable source of information to be properly exploited, on the other hand the manipulation of users' sensitive data implies significant responsibilities and risks when keeping them in a centralised site such as the cloud [8].
In order to manipulate big data while respecting users' privacy, the federated learning (FL) [1,2,8,9] approach has emerged as a set of ML techniques to train statistical and mathematical models directly on devices. The FL framework involves ML models locally trained at the device level, hereafter named clients, and the subsequent aggregation of these results in a central server, e.g. a base station.
Generally speaking, FL has countless issues to deal with, among which [10]:
† Non-independent identically distributed data. The clients' training datasets differ from each other and a given local training dataset is not representative of the population distribution.
† Unbalanced datasets. The amount of local training data is different for each client. This implies different reliability of the trained values for different clients, since too-short training procedures may occur.
† Large-scale distribution. The number of clients involved in the FL framework is significantly greater than the average number of data samples processed locally per client.
† Limited communication. Mobile devices are not always available to train data and often they may be slow or in poor communication conditions.
The limited communication capacity is outside the scope of this study, which considers a hybrid cloud-MEC network scenario in which the deployment of virtual machine replica copies (VRCs) of the most requested applications on network elements (NEs), located at the network edges close to the edge devices (EDs), has been taken into account. Therefore, this study focuses on a hybrid cloud-MEC architecture and addresses the design and implementation of the FL framework to properly predict the users' applications demand in order to perform a proper VRCs allocation in terms of applications hit percentage (AHP). The AHP expresses the percentage of hits in finding the application requested by a device on a NE in its proximity.
The main contributions of this study are:
† The selection and application of basic methods to perform decentralised data training without draining the hardware resources of the EDs.
† The extensive numerical simulations and comparison with the chaos theory (CT) approach, performed to validate the remarkable behaviour resulting from applying the proposed approach to the VRCs' deployment problem.
The rest of the paper is organised as follows. Section 2 proposes an in-depth review of the related literature. In Section 3, we propose the problem statement, and in Section 4, the FL framework is presented. In Section 5, the experimental results are shown. Finally, the conclusions are drawn in Section 6.

Related works
ML techniques constitute a wide branch of the big data manipulation literature in MEC networks. Subramaniam and Kaur [11] investigated the application of various ML techniques in order to report the impact of different ML methods on the MEC network. Furthermore, Subramaniam and Kaur [11] analysed the effectiveness of ML algorithms in detecting the presence of malicious attacks in a MEC network. Yu et al. [12] proposed a deep supervised learning method aiming at minimising the overall network cost in performing computational offloading. Differently, a MEC blockchain network has been studied in [13], in which an auction solution based on deep learning is formulated to perform edge resource allocation in order to maximise the edge computing service provider's profit. More in depth, Luong et al. [13] built a multi-layer neural network based on the optimal auction solution. In [14], a multi-hidden multi-layer convolutional neural network is adopted to perform data authentication in a robust mobile crowd sensing problem, aiming at improving sensing reliability and reducing the overall latency. By taking into account a real-time industrial application environment, Sangaiah et al. [15] addressed the position-based confidentiality problem in MEC systems by exploiting the k-nearest neighbour and the decision tree approaches. Chang et al. [16] examined the main classes of ML solutions to measure the benefits derived from edge-caching mechanisms, especially in terms of user satisfaction and energy efficiency evaluation.
In contrast, distributed ML is adopted in [17][18][19][20]. Tuor et al. [17] used a distributed version of the support vector machine method within an internet of things (IoT) context to evaluate the system performance of distributed ML. The distributed stochastic variance reduced gradient is applied in [18], in which the authors aim at optimising the number of collection points to perform data analysis, considering a fixed target accuracy, in order to minimise the amount of network traffic spent sending all the data towards the collection points. In [19], the crowd-sensing problem in an edge-computing scenario is treated by proposing a distributed deep learning approach, in order to lower the traffic congestion at the cloud site and balance the traffic. In particular, Li et al. [19] involve the human-in-the-loop methodology to give a hierarchical structure to the crowd-sensing problem, aiming at controlling the whole crowd-sensing process. Furthermore, the distributed Q-learning algorithm is applied in [20], where the minimisation of the users' outage is performed by the users themselves, selecting the most critical cell on which to run the minimisation and considering a heterogeneous networks context.
Finally, recently, FL has gained momentum and the authors of [2,8,[21][22][23][24] provided the main examples of such a branch of literature. Yang et al. [21] proposed a novel data aggregation framework for over-the-air computation, by exploiting the signal superposition property of wireless channels. The aim of the authors of [2,21] is the maximisation of the number of devices involved in the aggregation process, by minimising the aggregation error. Wang et al. [2] adopted FL in a MEC system, in which the distributed gradient descent method is applied to determine the best trade-off between local updates and global aggregations, taking into account the minimisation of the loss function subject to some resource constraints. In the same way, Yu et al. [22] consider the MEC environment as a case study, by proposing the application of hybrid filtering on stacked encoders to predict the fluctuation of files' popularity in the contents caching problem. Then, McMahan et al. [8] combined the proposed federated averaging algorithm with the stochastic gradient descent algorithm, in order to train data in a distributed fashion while avoiding high communication costs. Smith et al. [1] addressed the multi-task learning problem by resorting to the FL framework based on the novel Mocha context-aware optimisation algorithm. A block-chained FL architecture is designed in [23], on the basis of which a distributed consensus strategy is provided, by analysing the block-chain end-to-end delay. Finally, FL is proposed in [24] to face the optimisation of the transmission and computation costs in a mixed IoT-MEC network, through the application of multiple deep reinforcement learning agents. With regard to the general problem of data forecasting, an artificial neural network strategy is proposed in [25], which focuses on a multilayer perceptron neural network to forecast lightning occurrences. Xu et al. [26] iteratively applied recurrent neural networks in order to forecast in real time the taxi demand in the city of New York. Differently, Chandramitasari et al. [27] addressed the electricity consumption forecasting problem by combining both the feed forward neural network and the long short-term memory strategies. Furthermore, Li et al. [28] focused on urban traffic passenger flows through the spatio-temporal extrapolation of the information regarding the analysed sample series, by applying convolutional neural networks and a graph representing the traffic data. Then, the bicycle sharing demand in the city of Beijing has been predicted in [29], by proposing an improved version of the Xgboost method combined with the sliding window. The VRC placement has been addressed in [8], considering a MEC environment with heterogeneous application latency requirements, aiming at minimising the hardware resources consumption for the deployment of the VRCs. Li et al. [19] focused on the optimal virtual machine (VM) replica placement problem for the minimisation of the average system response time and the service provision costs in a Cloud-edge computing (Cloud-EC) scenario. Furthermore, the objective of [30] is the minimisation of the system power consumption, considering the VRCs problem to improve the behaviour of the system in the event of failure. Differently, Zhao et al. [31] proposed a placement strategy based on the divide-and-conquer approach to obtain a near-optimal solution minimising the data traffic for distributing the replicas on the infrastructure.
Similar to [2,8,[21][22][23][24], our study proposes the application of FL by using straightforward methods belonging to the gradient descent algorithms family. This conservative choice is due to the fact that more complex methods may result in prohibitive consumption of the ED resources, which represents a crucial point in the decentralised data training research field. Different from the previous literature, this study focuses on the application of the FL framework to the VRCs deployment problem, by exploiting FL to predict the individual ED demand in order to perform proper VRCs planning. Furthermore, to the best of our knowledge, this is the first paper to contextualise FL to the VRCs allocation problem. Finally, the effectiveness of the proposed approach has been tested by resorting to extensive numerical simulations and by comparison with other predictive disciplines.

Reference scenario
The reference system scenario consists of the cloud network architecture mixed with MEC, as depicted in Fig. 1, where the cloud is located in the remote area of the network, and there is a set of NEs N = {1, ..., i, ..., n} situated close to the EDs, hereafter represented by the set D = {1, ..., j, ..., m}. Each ED requires the computation of one and only one task belonging to the set T, for which we have that both D and T have the same number of elements. Each NE is equipped with a central processing unit (CPU), homogeneous in frequency for all the NEs. Differently, we suppose the cloud to be equipped with a higher CPU frequency. Then, each task, in order to be computed, requires a specific application that has to be installed in advance on the computation site. In this regard, loading the applications on the NEs requires the presence of available storage resource blocks (SRBs), since each application needs a fixed number of SRBs. Accordingly, each NE disposes of s_i SRBs.
Each ED requiring task computation primarily looks for a VRC of the required application on a close NE. (We have assumed that each ED sends the task to the nearest NE which, if it does not contain the application requested by the ED, forwards the task to the nearest NE containing that application. Finally, we have assumed that each NE has knowledge of the VRCs contained by the other NEs and that each NE stores a routing table in which, for each pair of NEs, the shortest path between those NEs is saved.) In the event that no NE owns a VRC of the required application, the task is sent to the cloud, on which all the application types are present. Furthermore, we considered the transmission cost among the NEs negligible, while we assumed a fixed data rate for the wireless link between the EDs and their nearest NE. Hence, the overall computation cost (OCC) experienced by ED j in performing computation is given by

OCC_j = c_{j,i*} + \sum_{i \in N} x_{j,i} t_{j,i} + (1 - \sum_{i \in N} x_{j,i}) t_{j,C}    (1)

where t_{j,i} and t_{j,C} represent the time spent by task j on NE i and on the cloud, respectively. It is important to note that both t_{j,i} and t_{j,C} are expressed as the sum of the task execution time spent in the CPU of the NE or the cloud and the queuing time experienced by the task waiting for its execution on these sites. (The CPU queue has been assumed to follow the first-in-first-out service policy.) Moreover, c_{j,i*} represents the transmission cost of sending the task from ED j to its nearest NE i*. Since the transmission time among the NEs and between the NEs and the cloud has been supposed negligible, we have taken into account only the c_{j,i*} cost. Finally, x_{j,i} is a binary variable equal to 1 if the task j is computed on NE i, 0 otherwise. It is important to highlight that the OCC in (1) strongly depends on the queuing time experienced by the task on the designated computation site. In fact, a proper deployment of VRCs on the NE network may drastically reduce the OCC task time.
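The cost model above can be sketched in a few lines of Python; this is a minimal illustration with hypothetical variable names (the paper provides no code), where the x values select at most one NE for the task:

```python
def overall_computation_cost(x_ji, t_edge, t_cloud, c_tx):
    """Overall computation cost (OCC) for a single task j.

    x_ji    : dict {NE id: 0/1}, 1 iff task j is computed on NE i
    t_edge  : dict {NE id: execution + queuing time of task j on NE i}
    t_cloud : execution + queuing time of task j on the cloud
    c_tx    : transmission cost from ED j to its nearest NE i*
    """
    on_edge = sum(x_ji.values())          # 1 if some NE computes the task, else 0
    edge_time = sum(x_ji[i] * t_edge[i] for i in x_ji)
    # The transmission cost to the nearest NE is always paid; the cloud time
    # is incurred only when no NE hosts a VRC of the required application.
    return c_tx + edge_time + (1 - on_edge) * t_cloud
```

With this sketch, a task served on NE 1 pays only its edge time plus the uplink cost, while a full cache miss falls back to the cloud time.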

Problem formulation
The main objective of this study is the maximisation of the hit rate in finding the VRCs of the requested applications on the NEs. This metric has been chosen because the deployment of VRCs, provided on the basis of the forecast ED applications demand, lowers the tasks' OCC. Therefore, in formal terms, the AHP can be expressed as

P(X) = (H(X) / m) \cdot 100    (2)

where H(X) is the function which, given the VRCs allocation matrix X, whose generic element is x_{j,i}, returns the corresponding number of hits occurring in finding the VRCs loaded on the NEs, and m is the total number of tasks. Hence, the main goal of this study is given by

max_X P(X)    (3)

subject to

\sum_{j \in D} \sigma_j x_{j,i} \le S, \quad \forall i \in N    (4)

where \sigma_j denotes the number of SRBs required by the application of task j and S expresses the maximum number of SRBs available on an NE. Hence, constraint (4) expresses that each NE i has a maximum limitation on the number of SRBs. It is important to note that P(X) depends on the allocation matrix X, hence on the deployment of the VRCs on the NEs.

Learning problem
Generally speaking, ML aims at learning model parameters on the basis of some training data. In this regard, an ML model is typically characterised by a loss function depending on the data sample z and a parameter vector w, i.e. f_z(w), which captures the error introduced by the model in relation to the training data [2]. By assuming the presence of m EDs, each of which holds local data D_j, j = 1, ..., m, the local loss function can be expressed as [2,8]

F_j(w) = (1 / |D_j|) \sum_{z \in D_j} f_z(w)    (5)

where |D_j| indicates the cardinality of D_j, i.e. the number of elements in D_j. Similarly, from (5) it follows that the global loss function computed on all the distributed local datasets D_j is given by [2,8]

F(w) = \sum_{j=1}^{m} (|D_j| / |D|) F_j(w), \quad |D| = \sum_{j=1}^{m} |D_j|    (6)

As well explained in [2], the direct consequence of (5) and (6) is the search for the parameter vector w* such that

w* = arg min_w F(w)    (7)

Therefore, as in several previous state-of-the-art works [2,8], in order to optimise (7) with low computational complexity, the gradient descent method is applied.
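The relation between the local losses (5) and the |D_j|-weighted global loss (6) can be illustrated with a short Python sketch (function names and the per-sample loss f are our own illustrative choices, not the paper's code):

```python
def local_loss(w, D_j, f):
    """F_j(w): average per-sample loss over the local dataset D_j, as in Eq. (5)."""
    return sum(f(z, w) for z in D_j) / len(D_j)

def global_loss(w, datasets, f):
    """F(w): |D_j|-weighted average of the local losses, as in Eq. (6)."""
    total = sum(len(D_j) for D_j in datasets)  # |D|, the total number of samples
    return sum(len(D_j) * local_loss(w, D_j, f) for D_j in datasets) / total
```

Note that the weighting makes the global loss equal to the plain average over all samples, which is why minimising (6) via gradient descent recovers the centralised objective without ever pooling the raw data.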

FL scheme
As depicted in Fig. 2, the proposed FL framework is composed of the clients' side, responsible for the local data training, and the server side, typically a base station, represented by a central server intended for improving the global learning model through the merging and aggregation of the EDs' updated local models. This approach is based on the interaction between the clients and server sides and, during each algorithm iteration round u, the EDs involved in the training procedure are a subset of the whole EDs set, whose number of elements is equal to y. The algorithm acts as follows:
† In parallel, each ED j among the y EDs involved in the training procedure updates its local parameter vector w_j(u) as

w_j(u) = \hat{w}_j(u-1) - \alpha \nabla F_j(\hat{w}_j(u-1))    (8)

where \alpha is the learning rate and \hat{w}_j(u-1) represents the term w_j(u-1) after global aggregation.
† The server side provides the weighted average as proposed in [8], expressed by

\hat{w}(u) = \sum_{j=1}^{y} (|D_j| / \sum_{k=1}^{y} |D_k|) w_j(u)    (9)

Distributed data training performed with the algorithm presented above implies several advantages in terms of clients' privacy, preservation of the computational resources of the EDs, and message exchange. In fact, the data training performed locally on the clients' side allows users to keep their sensitive information protected. In addition, roughly speaking, the uploading of the ED j parameter vector w_j does not expose the client to any sort of privacy issue, since retrieving D_j from w_j is not trivial.
Furthermore, for each algorithm iteration round, the involvement of only a part of the EDs set reduces the message passing between the clients and server sides. Finally, last but not least, it is important to highlight that the gradient descent algorithm performs the optimisation without implying excessive resource consumption from the EDs' perspective.
The pseudo-code corresponding to the clients and server sides is reported in Algorithms 1 and 2 (see Figs. 3 and 4), respectively.
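Since Algorithms 1 and 2 are only available as figures, the following is a minimal Python sketch of one client/server round under the weighted-averaging scheme of [8]; the function names, the toy gradient interface, and the default learning rate are our own assumptions, not the authors' implementation:

```python
import random

def client_update(w_global, D_j, grad, lr=0.01, epochs=1):
    """Client side: local gradient-descent step(s) from the aggregated model (Algorithm 1)."""
    w = list(w_global)
    for _ in range(epochs):
        g = grad(w, D_j)                          # gradient of F_j at w
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def server_round(w_global, all_clients, grad, y=2):
    """Server side: sample y clients, train locally, aggregate by |D_j| weights (Algorithm 2)."""
    sampled = random.sample(all_clients, y)
    updates = [(client_update(w_global, D_j, grad), len(D_j)) for D_j in sampled]
    total = sum(n for _, n in updates)
    dim = len(w_global)
    # Weighted average of the returned parameter vectors, as in the aggregation step.
    return [sum(n * w[k] for w, n in updates) / total for k in range(dim)]
```

Repeating `server_round` over many communication rounds, each time broadcasting the aggregated vector back to the sampled clients, reproduces the interaction loop described above.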

VRCs allocation
As previously detailed, this study uses FL to provide an accurate distributed prediction of the future EDs' applications demand, by considering the historical EDs' application requests combined with the correlation with the other EDs involved in the training procedure. Once the prediction has been achieved, the main goal of this study is the exploitation of this information in order to provide a proper VRCs allocation in terms of the AHP metric.
Given the predicted application demand, practically expressed in terms of predicted application popularity, the VRCs allocation strategy, given the applications popularity vector p sorted in descending order, consists of the following steps:
1. Start with the NEs empty, hence with all the SRBs available.
2. Deploy one VRC of each application, starting from the most requested in p, on all the NEs in the network with a number of available SRBs able to host the considered application.
3. If no NE can host the considered application, the VRCs allocation algorithm terminates; otherwise,
4. Consider the next application in p and repeat steps 2-4 until there exists at least one NE able to host the processed application.
A very relevant point is that the proposed VRCs allocation algorithm provides an unbalanced VRCs deployment, according to the different popularity levels of each application. In particular, step 3 guarantees that the outcome VRCs allocation is not a uniform distribution of the applications. In fact, such a point ensures that no VRCs allocation is provided if the most requested application cannot be stored.
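The greedy allocation steps above can be rendered as a short Python sketch (a hypothetical reading of the textual description, not the authors' Algorithm 3 verbatim):

```python
def allocate_vrcs(popularity, app_srbs, ne_capacity):
    """Greedy VRC allocation following the popularity order.

    popularity  : list of application ids sorted by predicted demand (descending)
    app_srbs    : dict {app id: number of SRBs one VRC occupies}
    ne_capacity : dict {NE id: initially free SRBs}
    Returns dict {NE id: list of applications whose VRC is deployed there}.
    """
    placement = {i: [] for i in ne_capacity}
    free = dict(ne_capacity)                 # step 1: all SRBs available
    for app in popularity:                   # step 4: next application in p
        placed = False
        for i in free:                       # step 2: one VRC on every NE that fits
            if free[i] >= app_srbs[app]:
                placement[i].append(app)
                free[i] -= app_srbs[app]
                placed = True
        if not placed:                       # step 3: no NE can host it -> stop
            break
    return placement
```

Because every application is replicated on all NEs that can still host it before moving down the popularity vector, the most requested applications naturally receive more replicas, which is exactly the unbalanced deployment discussed above.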
Furthermore, as regards the EDs' computation strategy, as previously anticipated, each ED j requesting task computation with application r sends the task to its nearest NE, i.e. i*, which acts in one of the following three ways:
† The NE i* computes the task if it contains at least one VRC of the application r.
† The NE i* does not contain any VRC of application r, but at least one NE, i.e. v, has loaded at least one VRC of application r. Therefore, i* forwards the task to the NE v.
† No NE contains a VRC of application r and i* forwards the task to the far cloud.
The pseudo-code corresponding to the VRCs allocation procedure is reported in Algorithm 3 (see Fig. 5), while Algorithm 4 (see Fig. 6) shows the execution flow of the procedure for the selection of the task computation site.
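The three-way routing rule can be sketched in Python as follows (a hypothetical rendering with a simplified hop-count routing table; the paper's Algorithm 4 is only available as a figure):

```python
def computation_site(r, nearest_ne, placement, hops):
    """Select the computation site for a task requesting application r.

    nearest_ne : the NE i* closest to the requesting ED
    placement  : dict {NE id: set of applications whose VRCs it hosts}
    hops       : dict {(i, v): path length between NEs i and v} (routing table)
    Returns the chosen NE id, or 'cloud' if no NE hosts application r.
    """
    if r in placement[nearest_ne]:
        return nearest_ne                                    # hit on the nearest NE
    hosts = [v for v, apps in placement.items() if r in apps]
    if hosts:                                                # forward to the nearest host
        return min(hosts, key=lambda v: hops[(nearest_ne, v)])
    return 'cloud'                                           # miss: offload to the cloud
```

Counting how often this function returns an NE rather than 'cloud' over all task requests directly yields the AHP metric defined earlier.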

Numerical results
In order to extensively test the proposed FL-based framework, we have resorted to numerical simulations in TensorFlow. We have considered a simulation scenario constituted by n = 6 NEs, equipped with a CPU frequency equal to 2.4 GHz, and a number of SRBs s_i uniformly distributed in [50, 300]. We supposed that each application occupies a number of SRBs uniformly distributed within the interval [15, 40]. Differently, the cloud has been modelled by supposing a CPU frequency equal to 4.6 GHz. The application requests have been simulated as in [22,32,33], by using the MovieLens 1M dataset [34], hereafter referred to as Dataset 1, and the MovieLens 100K dataset [34], named in this paper Dataset 2. Each task has been supposed composed of a number of 64-bit instructions uniformly distributed within the integer interval [250, 800], requiring 8 CPU cycles per instruction. Furthermore, we assumed the connection link between the EDs and the nearest NE equal to 100 Mbit/s. As regards the loss function, we used the mean squared error (MSE) which, for each data sample q_l in D_j, is defined as

MSE = (1/M) \sum_{l=1}^{M} (q_l - \hat{q}_l)^2

where M represents the number of samples in the test data and \hat{q}_l is the corresponding predicted value. Then, in order to test the effectiveness of the proposed approach, we compared the accuracy of the predicted values with that obtained by the CT approach, through the phase space reconstruction method [35,36], and by the deep learning-based approach proposed in [27]. The corresponding values adopted for the phase space reconstruction procedure are reported in Table 1. The CT approach has been performed by applying the k-neighbours method widely discussed in [37]. The higher accuracy levels of the prediction procedure are clearly evident in Figs. 7 and 8, which depict the MSE behaviour for increasing time prediction horizons. In Figs. 7 and 8, the accuracy metric trend is shown by varying the prediction horizons in order to test the different approaches.
In fact, the accuracy analysis has been performed by taking into account prediction horizons greater than that on which the VRCs placement is actually pursued, i.e. one hour. As is straightforward to note, the MSE values grow as the prediction horizon increases. This is due to the intrinsic difficulty in predicting the long-term behaviour of the series.
Figs. 9 and 10 highlight system performance in terms of the AHP by varying the algorithm communication rounds, for different numbers of EDs involved in the learning process. Both figures confirm that the greater the number of EDs participating in the learning process, the higher the AHP: more EDs involved in the process means more significant and accurate information on which the VRCs allocation strategy can act. Furthermore, by increasing the number of algorithm rounds, by which the models are updated, the AHP reaches higher values, improving system performance.
Finally, Figs. 11 and 12 make evident the system improvement in mean task OCC reached by involving a high number of clients in the learning process. These results confirm those previously presented in Figs. 9 and 10 and, similarly, better performance, i.e. lower values of mean OCC, is obtained by increasing the number of communication rounds of the considered two-sided framework. All these results validate the effectiveness of the proposed approach for the VRCs allocation problem and highlight the strict correlation between a valuable prediction model and remarkable system performance. Finally, the resulting system performance makes clear the suitability of FL to our problem.

Conclusions
This study addressed the VRCs allocation problem in a hybrid cloud-MEC network, aiming at maximising the AHP, i.e. the probability of finding a VRC of the application requested by the EDs on a NE at the edge of the network, instead of in the cloud, typically located in the remote area of the network. The problem has been addressed by applying the FL framework with the gradient descent algorithms family, to avoid the excessive exploitation of the EDs' hardware resources such as battery lifetime or computational components. Finally, the validity of the proposed framework has been shown through extensive empirical evaluation of system performance in comparison with the CT-based predictive approach.

Acknowledgments
This work was partially supported by the Project 'GAUChO-A Green Adaptive Fog Computing and Networking Architecture' funded by the MIUR Progetti di Ricerca di Rilevante Interesse Nazionale Bando 2015 under grant no. 2015YPXH4W_004.