Learning near-optimal broadcasting intervals in decentralized multi-agent systems using online least-square policy iteration

Here, agents learn how often to exchange information with neighbours in cooperative multi-agent systems (MASs) so that their linear quadratic regulator (LQR)-like performance indices are minimized. The investigated LQR-like cost functions capture trade-offs between the energy consumption of each agent and the local MAS control performance in the presence of exogenous disturbances and delayed and noisy data. Agent energy consumption is critical for prolonging the MAS mission and comprises both control (e.g. acceleration, velocity) and communication efforts. Taking provably stabilizing upper bounds on broadcasting intervals as optimization constraints, we employ an online off-policy model-free learning algorithm based on least-square policy iteration (LSPI) to minimize the cost function of each agent. Consequently, the obtained broadcasting intervals adapt to the most recent information (e.g. delayed and noisy agents' inputs and/or outputs) received from neighbours while provably stabilizing the MAS. Chebyshev polynomials serve as the approximator in the LSPI, whereas Kalman filtering handles sampled, corrupted and delayed data.


INTRODUCTION
Characterized by local interactions among neighbours, decentralized control of multi-agent systems (MASs) has been one of the most active research areas of the last two decades [1][2][3][4][5][6][7]. As discussed in the aforementioned references, information exchange among neighbours is instrumental for MAS coordination, especially in the presence of exogenous disturbances and delayed and noisy measurements. However, even for a single control loop (i.e. for a single agent), it is well known that information exchange at "slow enough" [8] or "fast but not too fast" [9, Remark 5] rates is required for stabilization in the presence of measurement noise. This data-exchange-rate dependency is also evident from the ℒ_2-gain and bias terms in [10,11]. Besides attaining mere closed-loop stability, one is usually interested in achieving satisfactory control system performance [12][13][14][15][16][17][18]. In these performance-oriented works, intermittent feedback [19,20] is employed to account for data acquisition, processing and/or communication costs. Recall that the classical LQR problem involves merely the control (e.g. acceleration, velocity) cost. LQR-like performance indices that address information-related costs next to the control cost in MAS settings are investigated in [10,21] and the present work. Moreover, intermittent information/feedback and networked control systems (NCSs) are of interest in light of real-life control system implementations [12,19]. Namely, owing to digital technology and limited communication channel capacities (i.e. bandwidth), continuous or periodic data flows among neighbours are often not achievable; instead, the data among agents are exchanged intermittently. In addition, realistic communication channels introduce delays, distort the transmitted information and give rise to packet dropouts. As a consequence of these network-induced phenomena, the MAS control performance deteriorates and the MAS can even become unstable.
At the same time, agents' sensing and broadcasting units consume energy while in operation. Since agents typically have scarce energy supplies at their disposal (e.g. batteries), it may erroneously seem that sensing and broadcasting should be executed as sparsely as possible in order to prolong the MAS mission (i.e. the agents' lifetimes), which is the sole driving force in some of the related works (see the discussions in [10,14,22] and the references therein). For instance, compared to scenarios with "frequent" sampling and broadcasting, "infrequent" data exchange results in greater settling times, that is, in slower convergence to a vicinity of MAS equilibria, impairing disturbance rejection [22]. In turn, "infrequent" data exchange may exhaust the energy supplies more than its "frequent" counterpart due to greater control efforts in the long run. This observation underlies the present work and its predecessors [10,14,22].
Given some low-level and group cooperation controllers, this article quantifies the repercussions of intermittent feedback in the group cooperation controllers on the MAS control performance and MAS lifetime. To that end, a continuous-time LQR-like optimization problem is posed for each agent involving only locally available data. Each cost function captures the trade-offs between local control performance and agent lifetime. In theory, dynamic programming (DP) solves a wide spectrum of optimization problems providing an optimal solution [23,24], including the LQR-like optimization problems considered herein. In practice, straightforward implementations of DP algorithms are deemed computationally intractable for most applications due to the infamous curses of dimensionality [25]. Hence the need for efficient model-based approximate dynamic programming (ADP) or model-free reinforcement learning (RL) methods (refer to [25,26] and [27][28][29], respectively). The RL method utilized herein is the online off-policy model-free algorithm based on least-square policy iteration (LSPI) reported in [30,31]. The off-policy feature of the algorithm utilized herein is inherited from [32]. The aforementioned local DP problems are coupled, which in turn yields non-autonomous underlying dynamics so that the corresponding costs-to-go are non-stationary [33]. In addition, often only imprecise mathematical models of agent dynamics are available, the distribution of exogenous disturbances is usually unknown or unpredictable (e.g. winds, sea currents, waves), and so is the behaviour of neighbouring agents due to the influence of other agents, disturbances and noise. Hence the need for an online model-free RL method. Moreover, online RL methods do not require a training phase (unlike offline RL methods), which may be a prohibitive factor in practice. Our LSPI approximation architecture for both continuous states and actions is based on Chebyshev polynomials.
In addition, we use Kalman filtering (KF) to alleviate the problem of delayed, sampled and noisy data [34], [23,Chapter 4].
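To illustrate how Kalman filtering copes with sampled (i.e. intermittently arriving) measurements, the following is a minimal discrete-time predict/update sketch, assuming known discretized LTI matrices and noise covariances; all names are illustrative and intermittent arrivals are handled simply by skipping the update step.

```python
import numpy as np

def kf_step(x_est, P, z, A, C, Q, R):
    """One predict/update cycle of a discrete-time Kalman filter.

    x_est, P : prior state estimate and its covariance
    z        : received (possibly delayed) measurement, or None if no
               sample arrived during this step
    A, C     : discretized state and output matrices
    Q, R     : process and measurement noise covariances
    """
    # Predict: propagate the estimate through the model.
    x_pred = A @ x_est
    P_pred = A @ P @ A.T + Q
    if z is None:  # no broadcast received this step: prediction only
        return x_pred, P_pred
    # Update: correct the prediction with the received sample.
    S = C @ P_pred @ C.T + R          # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z - C @ x_pred)
    P_new = (np.eye(len(x_est)) - K @ C) @ P_pred
    return x_new, P_new
```

Between two broadcasts the filter runs in prediction-only mode, which is one way to realize the model-based estimation (rather than ZOH) mentioned in Section 2.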
In comparison with its precursor [10], the present article provides sufficient convergence and near-optimality guarantees as well as experimental verification of the online LSPI. In our framework, we first compute stabilizing upper bounds for the intervals between two consecutive broadcasting instants of each individual agent (utilizing the agents' approximate linear time-invariant [LTI] dynamical models), thereby giving rise to asynchronous communication. Afterwards, these stabilizing upper bounds are exploited as optimization constraints in the online LSPI aiming for optimality. To the best of our knowledge, the optimal MAS intermittent feedback problem is still open. For example, in comparison with this work, the authors in [33] and [35] employ different RL approaches and aim for different optimization goals. The work in [36] considers a cooperative performance index, rather than individual cost functions as we do, and imposes restrictions on graph edge weights to ensure that all agents play the same game and that learning converges, whereas [37] seeks the LQR control law for a single agent in deterministic settings with transmission instants given by an external event-triggered mechanism. Likewise, [38] designs/learns control laws, whereas we design/learn broadcasting intervals for already designed controllers. Furthermore, [39] requires at least one pinning node, which serves as the command generator, and considers identical agents with state feedback in the absence of delays, noise and disturbances. Altogether, the main contributions of this article are threefold: (a) RL algorithm convergence conditions; (b) near-optimality bounds; and (c) experimental verification employing an affordable off-the-shelf localization system and nano quadrotors as MAS agents.
The remainder of the article is organized as follows. We finish off this section with basic notation and terminology. Section 2 formulates the optimal LQR-like intermittent feedback problem for MASs. In Section 3, we bring together LSPI and ℒ_p-stability with bias of MASs with delay dynamics in order to solve the LQR-like problem of interest. Section 4 is reserved for convergence and suboptimality analyses. A low-cost experimental validation is delineated in Section 5. Conclusions and future research avenues are in Section 6.
Notation: We use (x, y) := [x ⊤ y ⊤ ] ⊤ . The dimension of a vector x is n x whereas ‖ ⋅ ‖ denotes the Euclidean norm of a vector. If the argument of ‖ ⋅ ‖ is a matrix, then it denotes the induced matrix 2-norm. The kernel of a matrix A is Ker(A). The set cardinality is denoted | ⋅ |. An n-dimensional vector with all entries 0 is denoted 0 n . The n × n identity matrix is I n .
For the sake of brevity, we write "w.r.t." instead of "with respect to". A directed graph is a pair 𝒢 = (𝒱, ℰ), where 𝒱 = {v_1, …, v_N} is a non-empty set of nodes and ℰ ⊂ 𝒱 × 𝒱 is the set of edges. When the edge (i, j), i ≠ j, belongs to ℰ, there is an information flow from the node i to the node j. The set of neighbours of the node v_i is 𝒩_i = {v_j ∈ 𝒱 : (v_j, v_i) ∈ ℰ}. The graph Laplacian matrix L ∈ ℝ^{|𝒱|×|𝒱|} is defined entry-wise as [L]_ii = |𝒩_i| and, for i ≠ j, [L]_ij = −1 if v_j ∈ 𝒩_i and [L]_ij = 0 otherwise.

PROBLEM STATEMENT
Consider N heterogeneous linear agents given by

χ̇_i = A_i χ_i + B_i u_i + ω_i,  y_i = C_i χ_i,  (1)

where χ_i ∈ ℝ^{n_i} is the state, u_i ∈ ℝ^{n_{u_i}} is the input and y_i ∈ ℝ^n is the output of the ith agent, i ∈ {1, 2, …, N}, while ω_i ∈ ℝ^{n_i} reflects (often unknown and possibly stochastic) exogenous disturbances and/or modelling uncertainties. Matrices A_i, B_i and C_i are of appropriate dimensions. A common decentralized policy is

u_i(t) = K_i Σ_{j∈𝒩_i} [y_j(t) − y_i(t)],  (2)

where K_i is an n_{u_i} × n gain matrix [1,2]. In real-life applications, the transmitted information about the signals y_i(t) and y_j(t) in (2) arrives at the receivers with some delay. For simplicity, suppose that all communication links introduce the same propagation delay d ≥ 0 so that y_i(t) and y_j(t) in (2) can be replaced with y_i(t − d) and y_j(t − d).

Remark 1. Delay constancy, and hence delay uniformity (i.e. the delays in all communication channels are equal), can be enforced via time-stamping of data or via protocols with guaranteed and known upper bounds on the time delay and appropriate buffering at the receiver ends (refer to [40] and the references therein). In case time-varying and/or non-uniform delays are inevitable, the present approach can readily be augmented with the ideas from [11] owing to the linear agent dynamics (at the expense of more complex notation and mathematics).
Next to being delayed, the signals y_i(t) and y_j(t) forwarded to the controllers (2) are typically sampled and noisy versions of the neighbours' outputs. We denote these signals ŷ_i(t) and ŷ_j(t), respectively. We point out that our approach does not require zero-order-hold (ZOH) sampling (i.e. holding ŷ_i and ŷ_j constant between consecutive transmissions); instead, model-based estimators may be employed in order to extend the stabilizing broadcasting intervals [11], as demonstrated in Section 5. In addition to broadcasting the sampled versions of the y_i's at time instants belonging to 𝒯 (see Figure 1), the agents may broadcast the sampled versions of the u_i's as well so that state estimates built upon both the agents' inputs and outputs can be provided to LSPI (see Section 3.2). If agents' inputs are not available (e.g. non-cooperative MASs), one can readily employ input-free estimation schemes at the expense of increased estimation errors [41]. The noise related to y_i takes values in ℝ^n, whilst the noise pertaining to u_i takes values in ℝ^{n_{u_i}}; both are assumed bounded for each i ∈ {1, …, N}.
In what follows, we need to differentiate among the broadcasting instants in 𝒯 pertaining to each agent. Consequently, let t_i^j ∈ 𝒯, i ∈ ℕ, label the broadcasting instants of the jth agent. The broadcasting intervals τ_i^j are defined by τ_i^j := t_i^j − t_{i−1}^j, with the convention t_0^j := t_0 for each j ∈ {1, …, N}. We allow broadcasting instants of different agents to coincide. Also, the agents' broadcasting instants are not orchestrated by some scheduling protocol. In other words, communication between agents is asynchronous. Furthermore, let us define local discrepancy vectors x_i, i ∈ {1, …, N}, which collect the disagreements between the ith agent's (delayed, sampled and noisy) output information and that of its neighbours. According to [1][2][3][4][5][6][7], a significant portion of MASs (1) and (2) is designed to steer towards 𝒜 := Ker(A_cl + A_cld) in an effort to attain ‖x_i(t)‖ → 0 as t → ∞ for all i ∈ {1, …, N}. In this article, we use the x_i's to measure the local MAS performance, while the energy effort comprises both control and broadcasting/communication efforts. We are now ready to state the main problem solved herein.
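The interval bookkeeping above can be sketched in a few lines; `instants` and `tau_bar` are illustrative names for one agent's broadcast timestamps and its stabilizing upper bound (to be computed in Section 3.1).

```python
def broadcasting_intervals(instants, t0=0.0):
    """Intervals tau_i = t_i - t_{i-1} between consecutive broadcasts,
    with the convention t_0 := t0 (cf. the definition above)."""
    out, prev = [], t0
    for t in instants:
        out.append(t - prev)
        prev = t
    return out

def all_stabilizing(instants, tau_bar, t0=0.0):
    """True iff every broadcasting interval respects the stabilizing
    upper bound tau_bar used as an optimization constraint later on."""
    return all(tau <= tau_bar for tau in broadcasting_intervals(instants, t0))
```

Note that intervals of different agents are kept entirely separate, reflecting the asynchronous, unscheduled communication described above.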
Problem 1. Design broadcasting intervals τ_i^j that minimize the LQR-like cost

V_j(x_j(t_0)) = E_ω { Σ_{i=1}^∞ γ_j^i r_j(x_j, u_j, τ_i^j) },  (5)

with rewards r_j(x_j, u_j, τ_i^j) = ∫_{t_{i−1}^j}^{t_i^j} [x_j^⊤(t) P_j x_j(t) + u_j^⊤(t) R_j u_j(t)] dt + S_j, for the jth agent of MAS (1) and (2) over all sampling policies τ_i^j and for all initial conditions x_j(t_0) ∈ ℝ^{n_{x_j}}.
In Problem 1, the discount factor γ_j ∈ (0, 1) makes the sum (5) finite provided that the reward r_j(x_j, u_j, τ_i^j) is uniformly bounded over all τ_i^j, which follows from Section 3.1. Matrices P_j and R_j are positive semi-definite, whereas the non-negative constant S_j represents the information cost incurred for sampling and transmitting y_j (and u_j). In (5), the expectation w.r.t. the disturbance signal ω is denoted E_ω.
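To make the structure of the cost concrete, here is a minimal Python sketch; all names are illustrative, and the quadratic-plus-constant per-broadcast reward is a simplification in which the integral between broadcasts is collapsed to a single quadratic evaluation.

```python
import numpy as np

def lqr_like_reward(x, u, P, R, S):
    """Per-broadcast reward: quadratic local-performance and control
    terms weighted by P and R, plus a constant information cost S
    charged for each sampling/transmission event."""
    return float(x @ P @ x + u @ R @ u + S)

def discounted_return(rewards, gamma):
    """Finite-horizon estimate of the discounted sum of rewards;
    the i-th reward is weighted by gamma**i (indexing from 0 here)."""
    return sum(gamma**i * r for i, r in enumerate(rewards))
```

A larger S relative to P and R penalizes broadcasting more heavily, which is precisely the communication-versus-performance trade-off Problem 1 encodes.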
The classical LQR problem is noise and disturbance free, assumes proportional state-feedback control laws and does not consider the information cost S_j [20,23]. Hence, the states (and, consequently, the inputs) converge to zero exponentially in the classical LQR problem, allowing for γ_j = 1 in (5) (see [36,42,43] for more). We do not seek a stabilizing controller per se (as the related works do), but rather seek when to feed a given control law with up-to-date information. Even though we consider continuous-time dynamics (1) and (2), the rewards are obtained at discrete time instants belonging to 𝒯 (as seen from (5)), which renders our optimization problem a combination of the continuous- and discrete-time LQR problems. These characteristics impede straightforward comparisons of our results with the existing ones.

METHODOLOGY
Our framework is intentionally set up such that the actual approaches in Sections 3.1 and 3.2 are replaceable. In other words, the LSPI pursuit of optimal τ_i^j's in Section 3.2 and the computation of stabilizing upper bounds on the τ_i^j's in Section 3.1 can be replaced with more appropriate/advanced methods (once available) to meet one's needs. Owing to this observation, and in order to focus on the novelties of this work, we deliberately leave out the actual method from Section 3.1 of this article and replace it with a discussion of the challenges that a successful method needs to overcome in light of Problem 1. Our method of choice, reported in [10, Section 4.1], explicitly utilizes the (approximate) LTI dynamics (1) and (2). A potential extension of [10, Section 4.1] towards non-uniform or distributed delays and less conservative stabilizing upper bounds on the τ_i^j's is found in [44]. However, [44] considers neither disturbances nor noise.
The goal of Section 3.1 is the offline computation of upper bounds τ̄^j on the broadcasting intervals τ_i^j such that MAS (1) and (2) is stable in an appropriate sense. In particular, the approach in [10, Section 4.1] provides ℒ_p-stability from ω to x w.r.t. the set 𝒜 and with bias (see [10, Definition 1]) for sample paths of ω with finite ℒ_p-norm. We point out that we are not interested in stochastic differential equations or stochastic stability [45], but rather in ℒ_p-stability [10,11]. Hence, even though we allow for stochastic disturbances ω, our stability result in Section 3.1 holds solely for disturbance sample paths (i.e. trajectories or realizations) with finite ℒ_p-norms. Following [25, Section 8.2] and [28, Chapter 5], this sample-path stability approach is congruent with our online LSPI method. Refer to Remark 3 for more regarding the relation between ℒ_p-stability and the stochastic signals considered herein.
Subsequently, the upper bounds τ̄^j on the τ_i^j's, which can be thought of as robustness margins, are utilized as optimization constraints in the online LSPI procedure of Section 3.2 while obtaining near-optimal τ_i^j's in the sense of Problem 1. In other words, we trade off the optimality of Section 3.2 for the stability of Section 3.1, as discussed in [29, Section 4.1]. Accordingly, our solutions of the optimization problems (5) provably yield control system stability, which is a perennial problem when employing RL or ADP (see [29] or any RL/ADP reference herein for more), in addition to (sub)optimality of the broadcasting instants. Furthermore, we employ KF in Section 3.2 to handle partially observable DP problems. It is to be noted that the (approximate) models (1) and (2) are employed in Section 3.1 and, in Section 3.2, only for the KF. In other words, the RL part of our approach is completely model free, as motivated in Sections 1 and 5.

Stabilizing broadcasting intervals
The method utilized in this section needs to be applicable to MASs with general heterogeneous continuous-time linear agents (not merely to single- or double-integrator dynamics) with exogenous disturbances, asynchronous data exchange, output feedback and directed communication topologies without pinning nodes. Furthermore, we favour approaches that allow for communication delays greater than the sampling period (the so-called large delays). At the moment, only [10, Section 4.1] takes into account the above problem settings along with the concurrent adverse effects of realistic data exchange phenomena such as sampled, corrupted and delayed data as well as lossy communication channels. Consequently, even though a number of works investigate time- or self-triggered control strategies for MASs (refer to [46] for a list of such approaches), no other approach is applicable to settings as general as those considered herein. However, when some of the settings are relaxed or as novel approaches become available, alternatives for computing stabilizing upper bounds on τ_i^j, j ∈ {1, …, N}, i ∈ ℕ, may be employed as well.
The sufficient conditions in [10, Section 4.1] typically entail the existence of a directed spanning tree (see [2,22]), but one can envision less challenging/appealing MAS scenarios in which this directed spanning tree requirement is superfluous. In addition, since we are not interested exclusively in a single multi-agent differential graphical game, [10, Section 4.1] imposes no requirements on the graph edge weights found in [36]. Accordingly, subsets of our agents can effectively be involved in distinct multi-agent differential graphical games.
Lastly, [10, Section 4.1] allows for model-based estimation in order to increase the stabilizing upper bounds on the τ_i^j's, as exemplified in Section 5. These model-based estimators are not to be confused with the stochastic filters/estimators employed in Section 3.2. The present subsection does not utilize stochastic estimates (nor predictions based upon these estimates) for feeding the controllers owing to the lack of explicit guarantees on estimation error upper bounds. Effectively, the estimation errors pertaining to the ith agent's outputs or inputs, i ∈ {1, …, N}, may cease to be bounded by a pre-specified constant. In practice, stochastic filtering can safely be used in control laws after the initial estimation error overshoot vanishes.

Learning near-optimal broadcasting intervals
For notational convenience, let us drop the agent identifier j while solving (5). Recall that N LSPI algorithms are executed in parallel (one on each agent) asynchronously.
Adaptive optimal algorithms from the literature are typically based on solving the Bellman equation [47]. In the present work, we use an online off-policy LSPI, which interacts with the control system at time instants t_i, where i ∈ ℕ is the iteration step, through the following three variables found in (5): x(t_i), τ(t_i) and r(t_i). The first variable (i.e. the LSPI state) is the local discrepancy vector x(t_i) ∈ 𝒳, which represents the local agent knowledge at time t_i and, in the present work, is obtained via estimation (e.g. one KF at each agent for all its neighbours). The second variable is the LSPI action τ(t_i) ∈ 𝒯. The third variable is the LSPI reward function r(t_i) defined in (5), which measures the local performance after taking the action τ(t_i) in the state x(t_i). In other words, we are solving the sequential decision-making problem (5), and LSPI aims to provide optimal actions τ(t_i) for each state x(t_i).
Using policy iteration, policies h(x(t_i)) are evaluated by iteratively constructing an estimate of the LSPI state and action value function called a Q-function. The use of Q-functions alleviates the problem of computing the expectation in (5), as discussed in [25, Section 8.2]. Following [30], the LSPI state-action approximate value function is defined by

Q̂(x(t_i), τ(t_i)) = θ^⊤ Φ(x(t_i), τ(t_i)),  (6)

where Φ(x(t_i), τ(t_i)) = ψ(τ(t_i)) ⊗ φ(x(t_i)) is the Kronecker product of the basis function vectors ψ(τ(t_i)) and φ(x(t_i)) formed with Chebyshev polynomials, while θ is the approximation parameter vector that is being learned.
Observe that the Kronecker product renders the basis Φ orthonormal provided that the bases ψ and φ are orthonormal. Clearly, since Chebyshev polynomials are defined on the interval [−1, 1], the state space 𝒳 needs to be compact. The set 𝒯 is also compact and its upper bound τ̄ is computed in Section 3.1. The lower bound of 𝒯 is a small positive constant ε > 0 stemming from hardware limitations (see Section 5 for more). Hence, 𝒯 := [ε, τ̄]. Owing to the compactness of 𝒳 and 𝒯, finiteness of r(t_i) is obtained so that the cost function in Problem 1 is well defined. The control action τ(t_i) ∈ 𝒯 is given by (7) as either the current policy h(x(t_i)) or, periodically, a "u.r.a.", which stands for "uniformly chosen random action" and yields the algorithm's exploration phase, while h(x(t_i)) is the greedy policy obtained according to (9). Since LSPI tackles optimization problems, the algorithm needs an exploration phase in order to avoid local minima and to cope with non-stationary cost-to-go functions [28,29,33]. In Section 5, owing to excursions from their perceived stationary distributions, the disturbances and measurement noise also serve as probing signals (or exploration noise) that ensure the persistency of excitation aiding exploration [20], [23, Chapter 11]. The parameter vector θ is updated every fixed number (at least one) of steps from the projected Bellman equation for model-free policy iteration that, according to [32], can be written as (10), where γ originates from (5) and where Γ_i, Λ_i and z_i represent the policy evaluation mapping and projection operator of the Bellman equation [28] and are updated at every iteration step i. The vector θ, updated via (10), is used to improve the Q-function estimate (6). Afterwards, new and improved policies (in the sense of Problem 1) are obtained from (9). For more details regarding the LSPI algorithm employed herein, refer to [30] and [32].
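A minimal sketch of the Chebyshev-based approximator described above, for a scalar state and a scalar broadcasting interval; the function names, the affine scaling to [−1, 1] and the orders `Mb`, `Nb` are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander

def scale(v, lo, hi):
    """Affinely map [lo, hi] onto [-1, 1], the Chebyshev domain."""
    return 2.0 * (v - lo) / (hi - lo) - 1.0

def basis(x, tau, x_lo, x_hi, tau_lo, tau_hi, Mb, Nb):
    """Phi(x, tau) = psi(tau) kron phi(x): Kronecker product of
    Chebyshev basis vectors for the action and the (scalar) state."""
    phi = chebvander(np.atleast_1d(scale(x, x_lo, x_hi)), Mb - 1).ravel()
    psi = chebvander(np.atleast_1d(scale(tau, tau_lo, tau_hi)), Nb - 1).ravel()
    return np.kron(psi, phi)

def q_hat(theta, x, tau, *bounds_and_orders):
    """Approximate Q-value: inner product of the learned parameter
    vector theta with the basis vector Phi(x, tau), cf. (6)."""
    return float(theta @ basis(x, tau, *bounds_and_orders))
```

The resulting feature vector has length Mb·Nb, so the parameter vector θ grows multiplicatively with the two polynomial orders, which is why low-order bases are attractive (cf. the complexity discussion in Remark 5).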
Remark 3 (ℒ_p-stability and stochastic inputs). KF requires the disturbances ω_i to be Gaussian white processes. As Gaussian white processes are not Lebesgue integrable, one cannot invoke [10, Theorem 1], needed in Section 3.1, at once. However, the Gaussian white process is an idealization convenient for a number of analytical purposes and does not exist as a physical phenomenon (see [48, Chapter 6], [49, Chapter 10] and [50]). Instead, band-limited approximations/realizations are encountered in practice. Therefore, all sample paths of the ω_i's in Section 5 possess finite ℒ_p-norms and [10, Theorem 1] readily applies.
Remark 4. The investigation regarding the suboptimality of the obtained broadcasting intervals for each agent is found in Section 4. The suboptimality repercussions of the KF-LSPI and LSPI-LSPI interplay in the MAS setting are still unclear. The generality of our MAS settings (e.g. partially observable and coupled (individual, not collective) DP problems with external disturbances) hinders a fruitful comprehensive investigation at this point in time.
Remark 5. The benefits of LSPI include implementation simplicity due to linear-in-parameter approximators, which in turn strips the algorithm down to a set of linear matrix equations (10). Accordingly, the principal numerical concern lies in the need for matrix inversion, which is a well-understood nuisance circumvented, among others, via the LU factorization [51]. The complexity of solving (10) for θ is at most O(n³) but can be reduced to O(n²) by using computationally efficient methods, as discussed in [28, Chapter 5]. Clearly, one can always decrease the numerical complexity of LSPI by employing lower-order Chebyshev polynomials at the expense of greater suboptimality in Section 4.4 (while system stability is still guaranteed in light of Section 3.1). Furthermore, the approximation in LSPI focuses on the value-function representation, thereby eliminating the "actor" part in the actor-critic framework of related approximate policy-iteration algorithms (see [33] and the references therein). Consequently, LSPI eliminates one potential source of inaccuracy and decreases computational complexity with respect to the actor-critic framework. Since LSPI is a model-free algorithm, it does not need a model for policy iteration, nor does it need to learn one. Additionally, LSPI is able to learn (sub)optimal policies using a small number of samples when compared to conventional learning approaches such as classical Q-learning. For a comprehensive discussion, refer to [32].
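As a small illustration of the linear-equation core of LSPI, the following sketch solves a system of the form Γθ = z with an LU-based solver; the Tikhonov regularization term is our own implementation choice for ill-conditioned sample matrices, not part of the algorithm in [30,32].

```python
import numpy as np

def lspi_update(Gamma, z, reg=1e-8):
    """Solve Gamma @ theta = z for the parameter vector theta.

    np.linalg.solve relies on an LU factorization internally, avoiding
    an explicit (and numerically riskier) matrix inverse; 'reg' adds a
    small diagonal shift so poorly conditioned sample matrices remain
    solvable (an assumption of this sketch)."""
    n = Gamma.shape[0]
    return np.linalg.solve(Gamma + reg * np.eye(n), z)
```

In an online setting, Γ and z are the running sample accumulators, and this solve is performed each time the parameter vector is refreshed.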

ONLINE LSPI: CONVERGENCE AND SUBOPTIMALITY
In this section, we discuss online LSPI convergence and suboptimality for each individual agent, assuming the external disturbance properties change more slowly than the learning takes place. Recall that we are not interested in the MAS collective behaviour but in the agents' individual behaviours, as visible from the optimization problem (5) assigned to every agent. For notational simplicity, this section uses ⋅_k instead of ⋅(t_k) found in other sections. In addition, this section treats more general problems than Problem 1, so its results do not apply merely to Problem 1. For example, even though the actions τ_k in other sections are positive scalars, this section allows for non-scalar actions for generality's sake. To avoid potential confusion, the actions in this section are denoted a_k instead of τ_k. Also, in order to make the notation of this section more in line with the RL literature, we slightly abuse the notation found in other sections.

Reinforcement learning preliminaries
The goal of RL is to solve a stochastic discrete-time optimal control problem typically formalized as a Markov decision process (MDP) (𝒳, 𝒜, f, ρ) [52], where 𝒳 ⊆ ℝ^{n_x} is the state space of the process, 𝒜 ⊆ ℝ^{n_a} is the action space, f : 𝒳 × 𝒜 × 𝒳 → [0, ∞) is the transition probability (density) function of the process, that is, it characterizes the probability of making a transition to the state x_{k+1} when taking the action a_k in the state x_k, and ρ : 𝒳 × 𝒜 × 𝒳 → ℝ is the reward function, r_{k+1} = ρ(x_k, a_k, x_{k+1}).
An important property of MDPs is that, under appropriate conditions, there exists a deterministic optimal policy despite the fact that the transitions are stochastic [20,23]. Hence, the agent chooses actions according to its (deterministic) policy h : 𝒳 → 𝒜 (cf. (7)), that is, a_k = h(x_k). In particular, the RL objective is to minimize the infinite-horizon expected discounted return (cf. (5))

V^h(x_0) = E { Σ_{k=0}^∞ γ^k r_{k+1} },

where γ ∈ [0, 1] is the discount factor and x_{k+1} is drawn from the density f(x_k, a_k, ⋅) at each step k. In other words, one wants to find an optimal policy h* that procures the minimal value function from any initial state x_0:

h* ∈ arg min_h V^h(x_0) for all x_0 ∈ 𝒳.

Any policy that attains the minimum in this equation is optimal. When solving undiscounted problems (i.e. γ = 1) with unbounded rewards, one must impose stability conditions to ensure that the values are bounded and the problem is well posed. Thus, the discount factor is usually taken subunitary and the rewards are assumed to be bounded, which makes the value functions well behaved irrespective of stability concerns. Herein, we first ensure stability in Section 3.1 so that the reward boundedness assumption is superfluous for the well-posedness of Problem 1. Furthermore, instead of directly using value functions, LSPI uses Q-functions Q : 𝒳 × 𝒜 → ℝ that fix the initial action. The intuitive meaning of the Q-value Q^h(x, a) is that of the expected return when starting from the state x, applying the first action a and following h thereafter. The reason for preferring Q-functions is simple: once Q* is available, an optimal (greedy) policy can be computed easily by selecting at each state an action with the smallest optimal Q*-value (cf. (9)):

h*(x) ∈ arg min_{a∈𝒜} Q*(x, a).

When there are multiple minimizing actions, any of them is optimal. In contrast, the formula to compute h* from V* is more complicated and involves a model, which in our setting is unavailable. The state value functions can be expressed in terms of Q-functions as

V^h(x) = Q^h(x, h(x)),  V*(x) = min_{a∈𝒜} Q*(x, a).
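The greedy-policy computation from a Q-function can be sketched as follows, assuming (for illustration) a finite grid of candidate actions; minimization is used because the return here is a discounted cost.

```python
import numpy as np

def greedy_action(Q, x, actions):
    """Greedy policy h*(x): pick the action minimizing Q(x, a) over
    a finite grid of candidate actions.

    Q       : callable Q(x, a) returning a scalar value
    x       : current state (passed through to Q)
    actions : iterable of candidate actions
    """
    values = [Q(x, a) for a in actions]
    return actions[int(np.argmin(values))]
```

Note that no model appears anywhere in this computation, which is exactly why Q-functions are preferred over state value functions in the model-free setting above.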

Q-iteration
The aforestated Q-functions Q^h and Q* are characterized by the well-known Bellman equations

Q^h(x, a) = E_{x′ ∼ f(x,a,⋅)} { ρ(x, a, x′) + γ Q^h(x′, h(x′)) },  (13)

Q*(x, a) = E_{x′ ∼ f(x,a,⋅)} { ρ(x, a, x′) + γ min_{a′∈𝒜} Q*(x′, a′) }.  (14)

The right-hand sides of the Bellman equations can be interpreted as mappings T, T^h : 𝒬 → 𝒬 applied to the corresponding Q-functions, where 𝒬 denotes the space of Q-functions. Thus, the Bellman equations (13) and (14) can be written as the following fixed-point relations:

Q^h = T^h Q^h,  Q* = T Q*.  (15)

It is well known that T, as well as T^h, is a contraction with factor γ < 1 in the L_∞-norm [25,28], that is,

‖T Q − T Q′‖_∞ ≤ γ ‖Q − Q′‖_∞.  (16)

Thus, T has the unique fixed point Q* given by (15) [53,54]. Accordingly, an arbitrary initial Q-function Q_0 can be iterated to eventually reach Q*:

Q_{l+1} = T Q_l, l ∈ ℕ, with Q_l → Q* as l → ∞,  (17)

which is known as the Q-iteration [23,24,28,29]. Lastly, (15) and (16) yield

‖Q_l − Q*‖_∞ ≤ γ^l ‖Q_0 − Q*‖_∞.
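For a finite MDP, the Q-iteration (17) can be sketched in a few lines; the transition and reward containers are illustrative, and minimization is used since the returns here are discounted costs. By the contraction property (16), the loop converges geometrically to the unique fixed point Q*.

```python
import numpy as np

def q_iteration(P, R, gamma, tol=1e-10):
    """Q-iteration Q_{l+1} = T Q_l on a finite MDP.

    P[a] : |X| x |X| transition probability matrix under action a
    R[a] : length-|X| expected one-step cost under action a
    Returns the fixed point Q* as an |X| x |A| array.
    """
    nA, nX = len(P), P[0].shape[0]
    Q = np.zeros((nX, nA))
    while True:
        V = Q.min(axis=1)  # greedy (minimizing) state values
        # Apply the Bellman mapping T to every (state, action) pair.
        Q_new = np.array([R[a] + gamma * P[a] @ V for a in range(nA)]).T
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
```

Because T is a γ-contraction, the stopping tolerance directly bounds the distance to Q*, mirroring the geometric error bound stated above.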

Policy iteration convergence
Instead of iterating (17) at once, policy iteration first evaluates policies by constructing an estimate of the state-action value function Q^{h_k} of the current policy h_k using (13), that is, it solves

Q^{h_k} = T^{h_k} Q^{h_k}.  (18)

This Q-function is then used to construct a new, improved policy via (9), which is greedy with respect to the state-action value function. As seen from (18), policy evaluation is linear in the Q-values, which is an important advantage of policy iteration over the Q-iteration (17). For example, assuming the policy evaluation is exact, each policy improvement is guaranteed to find a strictly better policy unless the current one is already optimal [20, Chapter 11]. When finite state-action spaces are considered, the algorithm converges in a finite number of iterations since there is a finite number of possible policies.
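On a finite MDP, exact policy iteration with linear-system policy evaluation can be sketched as follows; all names are illustrative and minimization is used since the returns are discounted costs. The linearity of the evaluation step (18) shows up concretely as a single `np.linalg.solve` call.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Exact policy iteration on a finite MDP.

    P[a] : |X| x |X| transition matrix under action a
    R[a] : length-|X| one-step cost under action a
    Returns the optimal policy (as action indices) and its Q-function.
    """
    nA, nX = len(P), P[0].shape[0]
    h = np.zeros(nX, dtype=int)  # initial policy: action 0 everywhere
    while True:
        # Evaluate: V^h solves the linear system (I - gamma P_h) V = R_h.
        P_h = np.array([P[h[s]][s] for s in range(nX)])
        R_h = np.array([R[h[s]][s] for s in range(nX)])
        V = np.linalg.solve(np.eye(nX) - gamma * P_h, R_h)
        # Improve: greedy policy w.r.t. Q^h(s, a) = R[a] + gamma P[a] V.
        Q = np.array([R[a] + gamma * P[a] @ V for a in range(nA)]).T
        h_new = Q.argmin(axis=1)
        if np.array_equal(h_new, h):  # finitely many policies: terminates
            return h, Q
        h = h_new
```

In continuous state-action spaces, this exact evaluation is unavailable, which is precisely what motivates the approximate, projected evaluation discussed next.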
Alas, policy evaluation typically cannot be solved exactly in continuous state-action spaces, and the Q-functions have to be approximated. The approximation (6) utilized herein is linear in the parameters θ, which proves useful when solving the projected Bellman equation (10) and retains linearity. Following [28, Chapter 3], approximate policy iteration converts (17) into (cf. (10))

θ_{l+1} = (P ∘ T ∘ F)(θ_l),  (19)

where F(θ) equals the right-hand side of (6), while the projection P(Q) is obtained via the Chebyshev coefficients of Q when orthonormal bases (e.g. Chebyshev polynomials) are employed. Let us analyze the expansiveness coefficients of F and P. Using the induced operator norm of Φ^⊤(x, u), we reach

‖F(θ) − F(θ′)‖_∞ ≤ √(M_b N_b) ‖θ − θ′‖,

where M_b and N_b are the numbers of basis functions, that is, of Chebyshev polynomials, for approximating x and u, respectively. Furthermore, upper bounding the integrals for obtaining the Chebyshev coefficients in each component of P(Q) − P(Q′) renders a corresponding expansiveness coefficient for P. Clearly, the resulting upper bound on the composite mapping can be kept below unity by lowering γ in Problem 1 as the approximation capabilities increase, which leads us to the following sufficient condition for algorithm convergence:

Theorem 1. If the product of γ and the expansiveness coefficients of F and P is strictly smaller than 1, then the composite mapping P ∘ T ∘ F built upon Chebyshev polynomials is a contraction, that is, (19) converges to a unique fixed point.
The same procedure is applicable to general basis functions, even though orthonormal bases are more convenient, leading to straightforward and elegant expressions for the contraction condition, as illustrated herein. Moreover, this convergence result is somewhat unrealistic as it tacitly assumes ideal conditions (e.g. the availability of an infinite number of data samples gathered with adequate exploration), which motivates the following subsection. Also, notice that Theorem 1 supports the well-known guidelines against excessive, overfitting-prone approximator capabilities [23,24,28]. Theorem 1 corresponds to contribution (a) stated in the introduction.

Near-optimality bounds
Even though (18) can be solved iteratively (similarly to (17), since T^h is a contraction), LSPI solves (18) at once by constructing the projected Bellman equation and employing the fact that it boils down to a system of linear equations (10). All details of the steps behind the transition from (6) and (18) to (10) are found in [32], [28, Chapter 3] and [24, Section 6.3]. Using law-of-large-numbers arguments, [32] and [24, Section 6.3] show that (10) asymptotically converges to (18), with Q-functions projected onto the space spanned by (6), as the number of samples (x_i, τ_i, r_i, x_{i+1}) in (10)-(11) approaches infinity. Hence, as the experiments in Section 5 and the simulations in [10] progress, the approximation Q̂ approaches Q* because more and more process data (x_i, τ_i, r_i, x_{i+1}) are collected. Alas, not much can be said when merely a finite number of samples is available in (10), which is labelled the optimistic version of LSPI. This is the first of the two principal future research avenues on our agenda. However, as discussed in [24, Chapter 6], the optimistic version tends to explore better because, with rapid changes of policy, there is less tendency of bias towards particular states favoured by a single policy. This subsection corresponds to contribution (b) stated in the introduction. Provided that enough samples have been collected, the second research avenue is related to the order of the Chebyshev polynomials, that is, to their approximation accuracy [55]. Namely, it is well known [24] that if the approximate policy evaluation is accurate to within ε in the ℒ∞ sense, that is, if ‖Q̂^{h_k} − Q^{h_k}‖_∞ ≤ ε for all k as in (20), then in the limit as k → ∞ the following near-optimality bound holds: lim sup_{k→∞} ‖Q^{h_k} − Q*‖_∞ ≤ 2γε∕(1 − γ)². Moreover, if the sequence of obtained policies converges to some h̄, then the following tighter bound holds: ‖Q^{h̄} − Q*‖_∞ ≤ 2γε∕(1 − γ). In other words, our second research avenue is to a priori select the order of Chebyshev polynomials that are ε-accurate, in the sense of (20), for the sequence of unknown functions Q^{h_k}, k ∈ {1, 2, …}.
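The "solve (18) at once" step is the LSTDQ core of LSPI: accumulate a linear system A θ = b over the collected transition samples and solve it once per policy. The sketch below is illustrative, not the paper's implementation; the feature map `phi`, the policy `h`, and the small ridge term are our assumptions.

```python
import numpy as np

def lstdq(samples, phi, h, gamma=0.9, n_feat=4, reg=1e-8):
    # Projected Bellman equation for fixed policy h as a linear system:
    # A = sum_i phi_i (phi_i - gamma * phi'_i)^T,  b = sum_i r_i phi_i.
    A = reg * np.eye(n_feat)              # small ridge term for invertibility
    b = np.zeros(n_feat)
    for (x, u, r, x_next) in samples:
        f = phi(x, u)
        f_next = phi(x_next, h(x_next))   # next action chosen by policy h
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```

With one-hot features over a tiny deterministic MDP, the solved weights coincide with the exact Q-values of the evaluated policy, which is a convenient sanity check of the construction.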
Clearly, since this is an online algorithm, computation-convergence trade-offs are inevitable. Fortunately, Section 3.1 ensures closed-loop stability along the way whilst the Chebyshev approximators ensure fast convergence [56].

EXPERIMENTAL VERIFICATION
This section corresponds to the contribution (c) stated in Section 1.

Flying area and positioning system
The experimental area of the Laboratory for Intelligent Autonomous Systems (LARIAT) is a 4 × 3 × 2 m cuboid whose conceptual model is depicted in Figure 2. For localization, two HTC Vive Lighthouse stations [57] and Bitcraze Lighthouse positioning decks mounted on each quadrotor (refer to Figure 3) are used. The Bitcraze Lighthouse positioning decks are produced by Bitcraze AB [58] and, at the moment, are still running in an early access program. Together with the HTC Vive Lighthouse stations, they proved to be an affordable high-precision alternative to costly motion capture systems, with positioning errors within 5 cm for our experimental setup. Both lighthouse stations sweep the space with laser light in the horizontal and vertical directions at 120 Hz in an alternating fashion. To determine the relative angle of the positioning deck with respect to the stations, a synchronization pulse is emitted before every sweep. When the positioning deck is reached by the horizontal and vertical sweeps of both stations, the quadrotor position is calculated onboard as the intersection of the two resulting vectors.
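In practice the two rays from the base stations rarely intersect exactly, so a common triangulation choice is the midpoint of the closest points between the two lines. The sketch below illustrates that geometric step under made-up station positions and directions; it is not the onboard Bitcraze implementation.

```python
import numpy as np

def triangulate(p1, d1, p2, d2):
    # Closest-point midpoint between two rays p1 + t1*d1 and p2 + t2*d2.
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    w = p1 - p2
    denom = a * c - b * b                  # ~0 only for (near-)parallel rays
    t1 = (b * (d2 @ w) - c * (d1 @ w)) / denom
    t2 = (a * (d2 @ w) - b * (d1 @ w)) / denom
    return 0.5 * ((p1 + t1 * d1) + (p2 + t2 * d2))
```

When the rays do intersect, the midpoint coincides with the intersection point, matching the description above.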

Hardware
The flying vehicles used in the experiments are Crazyflie 2.1 nano quadrotors, shown in Figure 3. These are open-source hardware and software development platforms manufactured by Bitcraze AB. With a 27 g take-off weight and a diagonal dimension of 92 mm, they are well suited for indoor MAS experiments. Each Crazyflie is equipped with a three-axis inertial measurement unit (IMU), that is, an accelerometer, gyroscope and magnetometer, together with a high-precision pressure sensor. This hardware was further extended with additional sensors, the aforementioned Lighthouse positioning decks, using the Crazyflie expansion port. Sensor readings and onboard computations are processed in a 32-bit, 168 MHz ARM microcontroller unit.

Firmware
The central part of the Crazyflie firmware is the stabilization loop. It coordinates the state estimation and control algorithms at 1000 Hz. State estimation is provided by an EKF running onboard [59,60]. This EKF obtains the IMU measurements at 500 Hz and the aforementioned position measurements from the Bitcraze Lighthouse positioning deck at 30 Hz. The quadrotor position, velocity and attitude estimates are updated at 100 Hz. For position and attitude control, we use the proprietary PID controllers. For more details regarding the Crazyflie firmware, the reader is referred to [61].

Base station
The relevant hardware specifications of our base station PC are: Intel ® Xeon ® Silver 4110 @ 2.

System identification
Identification of agents' dynamics was carried out in two steps.
In the first step, the velocity reference from Figure 5 was sent to each agent and the positions of all agents were recorded. Subsequently, transfer functions in both the x- and y-axis directions were obtained for all agents using the Matlab System Identification Toolbox. The parameters of the transfer functions were estimated through an iterative numerical search minimizing a weighted sum of squares of the errors between the model output and the measured responses [63]. Since there are no major differences among the obtained transfer functions or axes, we settle for the transfer function with K_p = 0.95312, T_p = 1.1266 · 10^{-5} and T_d = 0.45 s for all agents in both the x- and y-axis directions. Clearly, T_p is negligible, so we set it to zero in what follows. It is to be pointed out that higher-order transfer functions did not show significant improvements in data fitting, that is, the improvements were below 0.5%. The above transfer function was validated in the second step. Namely, every agent was given the velocity reference individually and its position was recorded. The gathered data in the x-axis direction are shown in Figure 5. A minimal mismatch between the model and the measurements can be noticed, owing primarily to the initial positions not being exactly at zero. From these experiments, we conclude that the obtained transfer function captures the system dynamics quite well and can be used for model (1).
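With T_p set to zero, the identified velocity channel reduces to a pure gain plus transport delay, v(t) = K_p · v_ref(t − T_d), integrated to obtain position. The following discrete-time simulation is our own sketch of that reduced model (the step reference and the 5 ms step size are illustrative assumptions), useful for reproducing responses like those in Figure 5.

```python
import numpy as np

# Identified parameters from the text; T_p ~ 0 is neglected.
K_p, T_d, dt = 0.95312, 0.45, 0.005

def simulate(v_ref, dt=dt):
    # v(t) = K_p * v_ref(t - T_d); p(t) = integral of v(t) (forward Euler).
    delay = int(round(T_d / dt))
    p, pos = 0.0, []
    for k in range(len(v_ref)):
        v = K_p * (v_ref[k - delay] if k >= delay else 0.0)
        p += v * dt
        pos.append(p)
    return np.array(pos)

# Hypothetical 1 m/s step reference applied for 2 s.
v_ref = np.ones(int(2.0 / dt))
pos = simulate(v_ref)
```

After a 2 s step, the model position equals K_p · (2 − T_d) m, since the output sits idle during the 0.45 s delay and then ramps with slope K_p.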

Consensus control
Altogether, we have a team of five agents with dynamics (cf. (1)). Clearly, the first agent is the leader. In (2), we select K_1 = ⋯ = K_5 = 0.35. Using Section 3.1, the resulting MATI for ZOH sampling is 5 ms. However, one cycle of the main Matlab script, which implements Section 3.2 in the x-axis and sampled-data periodic consensus in the y-axis, logs all experimental data and utilizes four Crazyradio PAs for communication between the base station and the agents, takes 6.46 ms to execute. In addition, the four Crazyradio PAs lead to a worst-case communication period of 15 ms for a single agent, which corresponds to the minimal broadcasting interval in our experiments and is the user-selected data exchange period in the y-axis. It is to be noted that the x-axis data is exchanged according to our LSPI at the upcoming multiple of 6.46 ms, which is yet another reason for a model-free approach since the actions τ_i effectively experience jitter (because 15 ms is not a multiple of 6.46 ms).
Clearly, a MATI of 5 ms with ZOH is not feasible (since it is smaller than 15 ms). One can mitigate this issue by investing in more advanced hardware (in order to decrease the 6.46 and 15 ms figures), which is typically a costly and time-consuming option, or by making better use of the hardware at hand, which is the direction pursued herein. Hence, in an effort to obtain MATIs greater than 15 ms, we introduce the model-based estimator (24), which results in a MATI of 45 ms. Apparently, the delay d is greater than the broadcasting interval (i.e. the so-called large delay). The impact of the values of the K_i's on the MATI in the delay-free setting is discussed in [22]. It is worth pointing out that [44] yields a MATI of 48 ms whereas the experimentally obtained sampled-data MATI is 50 ms (i.e. all agents exchange information at the same time every 50 ms). Clearly, the approach adopted herein for reaching the MATI is quite accurate. The price paid for obtaining this relatively large MATI of 45 ms is seen in (24). Namely, straightforward implementations of the model-based estimator (24) are at odds with the decentralized nature of our framework. Essentially, since C_cl A_cld is not block diagonal, (24) requires each agent to estimate the state of the entire group (not merely of itself and its neighbours, as stated in the previous sections). To circumvent this nuisance in our experiments, the agents also exchange their local state estimates and forward the state estimates of their neighbours (in addition to exchanging their own inputs and outputs, as discussed in earlier sections). Surely, a more thought-out and formal approach is in order for larger groups and more complex topologies, but that is out of the scope of the present work.
For completeness, let us note that the greatest MATI obtained via a purely decentralized model-based estimator, obtained after setting all non-diagonal entries of C_cl A_cld in (24) to zero, is merely 9 ms, which is an improvement upon 5 ms but still smaller than the aforementioned hardware limitation of 15 ms.
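The decentralization step above amounts to keeping only each agent's own diagonal block of the relevant matrix product and discarding the coupling blocks. A minimal sketch of that truncation (matrix sizes and values are illustrative stand-ins, not the actual C_cl A_cld of the experiments):

```python
import numpy as np

def block_diagonal_part(M, block):
    # Keep the square diagonal blocks of size `block`, zero everything else.
    out = np.zeros_like(M)
    for i in range(0, M.shape[0], block):
        out[i:i + block, i:i + block] = M[i:i + block, i:i + block]
    return out

M = np.arange(36.0).reshape(6, 6)      # stand-in for C_cl A_cld
M_dec = block_diagonal_part(M, 2)      # e.g. 3 agents with 2 states each
```

The discarded off-diagonal blocks are exactly the inter-agent couplings whose loss shrinks the achievable MATI from 45 ms to 9 ms in the experiments above.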
From [64, Corollary 3.4], we infer that broadcasting intervals τ_i^j, j ∈ {1, …, N}, i ∈ ℕ, bounded away from zero, say by an arbitrary ε > 0, suffice to fulfill the detectability requirement for KF convergence. Basically, one can find K_k for any τ_i^j ∈ 𝒯 := [ε, τ̄] such that each state matrix norm in [64, Corollary 3.4] is upper bounded by the same constant strictly less than unity. Owing to the duality between detectability and stabilizability, the stabilizability requirement for KF convergence is obtained via [64, Corollary 3.4] as well.
Two LSPI tuning parameters are set to 30 and 0, respectively, and nine Chebyshev polynomials are used for both the state and action spaces of all agents. In addition, each component of x_j, j ∈ {1, …, N}, belongs to the interval [−30, 30], which in turn defines the state space. The cost function parameters are γ_1 = ⋯ = γ_5 = 0.9, P_2 = 5, P_3 = P_4 = P_5 = 5I_2, R_1 = ⋯ = R_5 = 5, S_1 = S_3 = S_4 = 50 and S_2 = S_5 = 0. Notice that, since the first agent is the leader, its local discrepancy vector x_1 is not defined; hence, P_1 does not exist. The broadcasting intervals are initialized to . Figure 6 provides the obtained numerical results, suggesting that Theorem 1 is rather conservative in the sense of the upper bounds that guarantee algorithm convergence.
A remark is in order regarding the choice of P_j, R_j and S_j. On the one hand, as we decrease S_j and keep P_j and R_j fixed, the obtained broadcasting intervals τ_i^j approach the minimal interval. On the other hand, as S_j becomes greater, the τ_i^j approach the MATI, as seen from Figure 7a. Notice that this remark is expected and underlines the principal motivation for this work: overly sparse information exchange is not preferable a priori and needs to be applied thoughtfully, especially when the relative cost of communication with respect to the cost of control performance (e.g. stabilization or tracking performance) and control effort (e.g. movement) is low. However, the related literature does not take this cost viewpoint into account, but solely focuses on minimizing the communication effort.
A major source of disturbances in our experiments is nonlinear agents performing motions in close vicinity to each other. Namely, the downwash of each drone is quite strong and can destabilize other drones flying nearby or below it. Similarly, coupling effects between the x- and y-axis of each nano quadrotor also act as disturbances.
Each agent learns its own approximation parameter vector θ_j, which in turn renders the corresponding policy h_j(x_j(t_i^j)) and actions τ_i^j. Convergence of the θ_j's indicates a successful completion of the learning process (see Figure 7b), while the τ_i^j's do not necessarily display smooth behavior (see Figure 7a). Furthermore, observe that the θ_j's are different for each agent even though homogeneous agents are employed (which is the motivation behind utilizing homogeneous agents in this example). Figure 8 shows the enclosed experiment and the full video is available at [65]. Finally, note that the somewhat periodic broadcasting instants stem from the stationary distributions and, to a certain extent, justify the traditional and widespread periodic feedback.

CONCLUSION
This article studies MAS optimal intermittent feedback problems with LQR-like goal functions that capture local MAS performance versus agent lifetime trade-offs. Starting off with provably stabilizing upper bounds on the agents' broadcasting intervals, we bring together KF and an online off-policy model-free LSPI method to tackle coupled partially observable DP problems. Subsequently, the convergence and suboptimality of the proposed approach are addressed. Our framework is applicable to general heterogeneous linear agents with output feedback in the presence of disturbances, to realistic communication environments (e.g. intermittency, noise, delays and dropouts) as well as to directed and unbalanced communication topologies. Lastly, an off-the-shelf testbed is put together to demonstrate the benefits of the proposed methodology.
In the future, our main goals are to investigate the impact that a finite number of samples and the order of the Chebyshev polynomials have on LSPI convergence and suboptimality. In this light, exploration-enhanced LSTDQ(λ)-based PI algorithms will be considered owing to their better understood convergence and noise robustness [24, Section 6.6]. In addition, it is of interest to tackle time-varying topologies and nonlinear agent dynamics.