Friendship prediction model based on factor graphs integrating geographical location

With the development of network services and location-based systems, many mobile applications have begun to use users' geographical locations to provide better services. In social networks, geographical location is actively shared by users. In some applications with recommendation services, the user's permission has to be obtained before a location-based recommendation is provided. A social network integrated with geographical location information is called a location-based social network (LBSN). In an LBSN, each user has location information recorded when he or she checks in at hotels or feature spots. Based on this information, the user's movement trajectory and activity patterns can be identified. In general, if two users are friends, their real-world trajectories are likely to be similar. In this study, based on users' geographical location information over a period of time, the authors explore whether a friendship exists between two users using trajectory similarity and the structure theory of graphs. In particular, they propose a new factor function and a factor graph model based on users' geographical locations to predict the friendship between two users in real LBSNs.


Introduction
The rapid development of the Internet in recent years has promoted the emergence of various location-based services, which provide users with more personalised recommendations based on their geographical location information, such as food delivery, taxi hailing and travel. More and more location-based services request users' location information directly or indirectly to improve their experience. Private data protection [1] should not be neglected and has become an important problem of personal privacy security. There is some research work [2,3] on private data protection, and some of these techniques have been applied to real-world systems. The privacy security of mobile devices is related to the interests and habits of each end user, and the approaches proposed in [4,5] for securing private information have received more attention in recent years. The information shared by these mobile devices facilitates plenty of social research. In social networks, users often share logs or photos containing location information in their social communities, and friends who share their everyday activities are more likely to be in the same position [6][7][8]. That is to say, everyday interactions between friends make their activity areas intersect, which partially reflects the correlation of their locations, i.e. trajectory similarity [9,10]. If we can discover the connections between friends from their location information, then we can improve the accuracy of existing link prediction algorithms [11] and thereby enhance the performance of recommendation systems.
Currently, several studies based on location-based social networks (LBSNs) have been applied to recommendation systems. Among them, research on friend suggestion systems [7,12] often clusters users' homes, workplaces, restaurants and other central locations according to their location information and check-in records, with the aim of calculating the similarity of check-in locations between two users. In addition, the types of locations have been described via information entropy [13], with the intersection of the locations of two users treated as the similarity between them. Li and Chen [14] employed multi-layer network combination to incorporate more information into the network and build a friend model. Other state-of-the-art recommendation models [15][16][17] do not explore the connection between users' location information. For example, the model proposed by Bagci and Karagoz [15] combined users' historical and current locations for recommendation, which helps to improve the user's experience. Most link prediction methods focus on the importance of a location to its visitors, ignoring the strength [18,19] of the relationships between those visitors. The drawbacks of these approaches are that they lack extensibility and that each approach only works in a specific area. In addition, relevant research generally extracts features [12,20,21] from geographical location information without taking into consideration the correlation between location information. Since different networks have different characteristics, we need to find connections between users' geographical locations that are applicable to most LBSNs, so that a model established on these connections is scalable across LBSNs. The factor graph model is a probabilistic graphical model, which plays a very important role in link prediction. Tang et al. [22] and Cen et al. [23] proposed a partially labelled pairwise factor graph model, where the relation prediction method not only obtains good performance but also has good scalability. However, for LBSNs, the geographical location information reveals the similar behaviour of users. In this study, the relationship between users is extracted to build a factor function, and we design a factor graph model to predict whether a friendship exists between two users.
Original contributions: The main contribution of this study is a friendship learning and prediction model for LBSNs based on geographical location information and a factor graph model. In the proposed model, the geographical location information contained in these social networks is retrieved from users' trajectory data, and the factor function is established on the similarity of trajectories to learn these features. In addition, we use two real data sets in experiments, i.e. Brightkite and Gowalla, and the results show that our proposed model outperforms the state-of-the-art classification methods.
The rest of the paper is organised as follows. Section 2 introduces the problem statement and graph theory. Section 3 presents the calculation method, the definition of trajectory similarity and the analysis of trajectory similarity in the factor graph with multiple correlation. Section 4 gives the theoretical fundamentals of the factor graph and the learning and prediction phases in the factor graph. Section 5 shows the experimental results of the proposed model by comparing it with other methods. Lastly, Section 6 concludes this paper and discusses the relationship prediction approach in machine learning.

Problem formulations
Generally speaking, we define a user in the social network as a node $v$ in the graph, and the relationship between users as an edge $e$ in the graph, where $e \in V \times V$. Therefore, a social network is described by $G = (V, E)$, where $V$ and $E$ represent the sets of nodes and edges in the network, respectively. In addition to these two basic components, different heterogeneous networks include other additional information. For example, there are many unlabelled relationships $E^U$ in social networks, and each node $v$ has a different attribute $x$. Based on the aforementioned concepts, we give the definition of the networks studied here.

Definition 1: [Partially labelled attribute location-based networks]
In this network, only some of the nodes are labelled, and the network is described by a five-tuple of attribute information, where $E^L$ represents the labelled edge set associated with the labels $R^L$, $E = E^L \cup E^U$, $C$ represents the location information retrieved from users' check-ins, and $X$ is an attribute matrix associated with the set of users $V$, in which each row corresponds to a user and each column to an attribute, so that the element $x_{id}$ of $X$ denotes the $d$th attribute of user $v_i$.
From the above definition of a graph, we can further formulate the problem. For predicting the friendship of users in LBSNs, given a partially labelled attribute network, the prediction of friendship is defined as a function that maps the network to an output set $Y$, where $Y$ is the set of friendship labels predicted by the proposed model, so that the label $y_i$ of every edge in $E^U$ can be predicted. At present, most research on relationship mining in LBSNs aims to collect more features and improve the classification accuracy by showing that this information is effective and valid. However, most of the existing approaches do not extend well and cannot be applied across LBSNs.
The check-in information of users is uploaded over a period of time, and the location information of many users is very limited. For example, 1145 users uploaded fewer than five locations in Brightkite, while the number with complete information is 221. Therefore, to balance the number of trajectories between users, all uploaded location information is grouped by day and then partitioned into time slices. Each day, from 00:00 to 24:00, is divided into $h$ time periods for location merging.
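The grouping step described above can be sketched as follows. This is a minimal illustration, assuming check-ins are available as `(timestamp, lat, lon)` tuples and that the $h$ time periods have equal width; the function names `slot_index` and `group_checkins` are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from datetime import datetime


def slot_index(ts: datetime, h: int) -> int:
    """Map a timestamp to one of h equal-width slots covering 00:00-24:00."""
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    return seconds * h // 86400


def group_checkins(checkins, h):
    """Group (timestamp, lat, lon) check-ins by calendar day and time slot.

    Equal-width slots are an assumption; the paper only states that a day
    is divided into h time periods.
    """
    grouped = defaultdict(list)
    for ts, lat, lon in checkins:
        grouped[(ts.date(), slot_index(ts, h))].append((lat, lon))
    return dict(grouped)
```

Each key of the returned dictionary identifies one day and one time slice, so the positions stored under a key are the candidates for merging into a single trajectory point.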
Definition 2: [User's daily activity trajectory] To distinguish between weekdays and weekends, these two kinds of trajectories are collected separately.

Trajectory similarity measurement and multivariate correlation analysis
There are many geographical correlations between users, such as the distance [24] between their homes, workplaces, restaurants and so on. However, trajectory similarity [25] best reflects the relationship between users, because the activity trajectories of users with a close relationship affect each other, and their activities, including working, entertainment and eating, are similar. The similarity of social activity trajectories is high between two users who are friends. In the following, we show how trajectory similarity is measured, and then explore the distribution of binary and ternary similarity.

Trajectory similarity measurement
Each person has his or her own activity trajectory every day, and there are certain similarities between people who are close to each other [26]. Therefore, a measure of similarity is of great help in determining the relationship between two persons. Trajectory measurement approaches can be divided into several categories. Common point-based methods include edit distance on real sequence (EDR) [27], LCSS and DTW, while shape-based methods often apply the Frechet distance [28]. Among the point-based methods, EDR accounts not only for the influence of noise but also for the common subsequence. For the activity trajectories of two users, when the distance between their points is less than a threshold, we regard the point as part of a mutual sub-trajectory, i.e. a similar trajectory point.
With regard to the LBSN data, Gowalla and Brightkite have thousands of check-in records of users, and calculating the similarity of trajectories is time consuming. In terms of trajectory modelling, Mazumdar et al. [29] proposed a method that uses an entropy matrix to model the user's historical data. Generally speaking, a user's weekday activity trajectories are mostly the same, while weekend trajectories often differ. Therefore, before measuring similarity, we need to extract the user's trajectories. The weekday trajectory is the general activity track, denoted by $Tr^{work}$, and the weekend track is denoted by $Tr^{week}$. In addition, noise [30] may appear in the user's trajectory as a large deviation in latitude and longitude. So, in this study, the data points satisfying $d(x_i, x_{mean}) > \omega$ are removed. In the phase of trajectory sampling, the mean position within a certain interval is taken as the representative point for that period. It is worth noting that, within a given period of time, the user's behaviour is mostly the same. For example, before 8:00 the user is likely to be at home; from 8:00 to 12:00 and from 14:00 to 18:00 the user is likely to be at work; and from 12:00 to 14:00 and from 19:00 to 24:00 the user is likely to be in a restaurant or outdoors. These factors should be taken fully into account in the phase of trajectory sampling. After the trajectories of two users are obtained, their similarity is calculated with the EDR (edit distance on real sequence) similarity algorithm given as follows.
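The noise removal and sampling steps above can be illustrated with a short sketch. It assumes points are `(lat, lon)` pairs, uses plain Euclidean distance in coordinate space as a stand-in for the distance $d(\cdot)$ (a geodesic distance could be passed instead), and treats Saturday and Sunday as the weekend; all function names are hypothetical.

```python
import math


def mean_point(points):
    """Arithmetic mean of a non-empty list of (lat, lon) points."""
    lat = sum(p[0] for p in points) / len(points)
    lon = sum(p[1] for p in points) / len(points)
    return (lat, lon)


def remove_noise(points, threshold, dist=math.dist):
    """Drop points whose distance to the mean position exceeds the threshold."""
    centre = mean_point(points)
    return [p for p in points if dist(p, centre) <= threshold]


def representative_points(slots):
    """Turn {slot index: [points]} into a trajectory of per-slot mean points,
    ordered by slot index. Empty slots are skipped (an assumption)."""
    return [mean_point(slots[k]) for k in sorted(slots) if slots[k]]


def is_weekend(ts):
    """True for Saturday or Sunday; ts is a datetime.date or datetime."""
    return ts.weekday() >= 5
```

Applying `is_weekend` before sampling separates the check-ins that feed the weekday track from those that feed the weekend track.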
Definition 3: [Edit distance on real sequence (EDR)] Given two trajectory sequences of moving objects $Q = \{q_1, q_2, \ldots, q_m\}$ and $R = \{r_1, r_2, \ldots, r_n\}$, $Sim(Q, R)$ recursively compares the points of the two sequences and is defined as follows:

$$Sim(Q, R) = \begin{cases} n, & m = 0 \\ m, & n = 0 \\ \min\{Sim(Rest(Q), Rest(R)) + subcost,\; Sim(Rest(Q), R) + 1,\; Sim(Q, Rest(R)) + 1\}, & \text{otherwise} \end{cases}$$

where $m$ and $n$ represent the lengths of the sequences $Q$ and $R$, respectively, $Rest(Q)$ and $Rest(R)$ denote the sequences $Q$ and $R$ with the pointer moved forward by one position, i.e. $Rest(Q) = \{q_2, q_3, \ldots, q_m\}$, and $subcost$ is formalised as follows:

$$subcost = \begin{cases} 0, & Dist(Head(Q), Head(R)) < \varepsilon \\ 1, & \text{otherwise} \end{cases}$$

where $Dist(Head(Q), Head(R))$ is the actual distance between the first points of $Q$ and $R$. If $Dist(\cdot)$ is less than $\varepsilon$, we view it as 0.
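A compact way to evaluate Definition 3 is dynamic programming. The sketch below is an iterative prefix-based formulation, which yields the same value as the recursive Head/Rest form; it assumes points are coordinate tuples and uses Euclidean distance as a placeholder for $Dist(\cdot)$. The function name `edr` and its parameters are illustrative only.

```python
import math


def edr(q, r, eps, dist=math.dist):
    """Edit distance on real sequence between point sequences q and r.

    Two points match (substitution cost 0) when their distance is below eps;
    otherwise substitution, insertion and deletion each cost 1.
    """
    m, n = len(q), len(r)
    prev = list(range(n + 1))  # empty prefix of q against r[:j] costs j
    for i in range(1, m + 1):
        cur = [i] + [0] * n  # q[:i] against an empty prefix of r costs i
        for j in range(1, n + 1):
            subcost = 0 if dist(q[i - 1], r[j - 1]) < eps else 1
            cur[j] = min(prev[j - 1] + subcost,  # match or substitute
                         prev[j] + 1,            # skip a point of q
                         cur[j - 1] + 1)         # skip a point of r
        prev = cur
    return prev[n]
```

With weekday and weekend tracks for two users, the combined value used in the paper would then be the minimum of the two `edr` results.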
When we calculate the trajectory similarity of users, we combine the weekday and weekend trajectories by the following equation: $Sim(Tr_i, Tr_j) = \min(Sim(Tr_i^{work}, Tr_j^{work}), Sim(Tr_i^{week}, Tr_j^{week}))$.
EDR reduces the effect of noise points by quantising distances to 0 and 1, and the edit distance tolerates local time shifting as long as the shifts are small. The EDR results may be biased when local time trends are large. To make the result more accurate, we can calculate the similarity after normalising the trajectory. As shown in Fig. 1, in both LBSNs, as the trajectory similarity increases, the probability of friendship between two users also increases. However, in practice, the proportion of user pairs whose estimated trajectory similarity is greater than 4 is very small.

Multivariate correlation analysis
Here, we will introduce the binary and ternary associations [31] based on the trajectory similarity algorithm in Section 3.1, and analyse the similarity distribution under different relationship combinations.
In the network, we call two edges that share the same user a binary relationship [31]. Another special structure is formed when three users make up a triangle, which is regarded as a basic ring. Because it contains three relationships, it is represented by a single factor in the factor graph and is viewed as a ternary relationship. Different edges in these combinations may have different similarities, so we can statistically analyse the distribution under different relationships and different combinations of similarities. From the distribution of similarity and friendship probability shown in Fig. 1, the probability of friendship increases significantly as the similarity increases. The binary and ternary relationships in a factor graph are shown in Fig. 2. We use the functions $h(\cdot)$ and $g(\cdot)$ to represent the factor functions of binary and ternary correlations, and we treat the trajectory similarity as the measurement used to establish the features under different relationship combinations.
As shown in Fig. 3, the similarity distribution of the two edges attached to a random node is different from the distribution for friend nodes. For the Brightkite data, the similarity distribution of edges with a friendship concentrates mostly around 3, while the similarity distribution of randomly combined edges is mostly around 1, a difference of 2. For the Gowalla data, the similarity of friends is clearly higher, and the random edges are also concentrated, with a gap of 3. For a binary relation, the difference between the similarities of the two edges is calculated and used as the feature. For a ternary relation, the differences between the three pairs of similarities are calculated, and their mean value is used to represent the feature.
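The two difference features described above can be sketched as follows. This is an illustrative reading of the text, assuming each edge already carries a scalar trajectory similarity; the function names are hypothetical.

```python
from itertools import combinations


def binary_feature(s_a, s_b):
    """Absolute difference between the similarities of two edges sharing a user."""
    return abs(s_a - s_b)


def ternary_feature(s_a, s_b, s_c):
    """Mean of the three pairwise absolute differences in a triangle."""
    diffs = [abs(x - y) for x, y in combinations((s_a, s_b, s_c), 2)]
    return sum(diffs) / len(diffs)
```

For instance, two edges with similarities 3 and 1 give a binary feature of 2, matching the gap reported for Brightkite between friend edges and random edges.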

New trajectory similarity factor graph model
The proposed model is based on the trajectory similarity relationship between users mined from the LBSN, and factor functions built on geographical location characteristics are added to the factor graph model. Before we input the original network into the model, we need to transform the original node-oriented network into an edge-oriented network. The nodes of a binary relationship in the original network are represented by a binary factor node. In addition, we need to add a ternary factor function for each ternary relationship. In the proposed model, the factor functions used by binary and ternary factor nodes use the trajectory information of the adjacent nodes, while node $v_i$ contains the label information and attribute feature vectors. The global probability distribution of the factor graph model is then the normalised product of three kinds of factor functions. The first, $f(y_i, x_i)$, is the factor function associated with edges in the network. In a factor graph, each node is connected with an independent factor node; $y_i$ represents the type of the label and $x_i$ is the attribute corresponding to the node, so this factor function represents the functional relation between the node feature and the label. The function $h(y_i|S_i, y_j|S_j)$ represents the functional relation between the trajectories of the three users in a binary relationship, where only the similarities of two pairs of users are compared. The trajectories of users can be partitioned into weekdays and weekends. Additionally, the function $g(y_i|S_i, y_j|S_j, y_k|S_k)$ represents the relation between the trajectories of three users in a ternary relation, for which the trajectories of three pairs of users need to be compared. In the factor graph, the total probability distribution is obtained as the product of the factor functions.
In (6), $Z$ represents the normalising constant. Equation (7) is used to calculate this normalising factor of the global distribution in a factor graph, which can be derived from the normalising factor of each function in the global distribution. These normalising factors are used to express the calculation results as probabilities in the phase of probability calculation. The definition of the factor functions in the factor graph is very important. We define two different factor functions based on the similarity of trajectories, which are described in detail below.
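To make the role of the normalising constant concrete, the sketch below computes a global distribution as a normalised product of factors by brute-force enumeration on a tiny graph. It is a toy illustration only: the label names `"F"` and `"S"` follow the friendship and stranger labels used later in the text, the `(scope, function)` representation of factors is an assumption, and enumeration is feasible only for a handful of variables (the paper relies on belief propagation instead).

```python
from itertools import product

LABELS = ("S", "F")  # stranger, friend


def joint_distribution(n, factors):
    """Normalised product of factors over n label variables.

    factors is a list of (scope, fn) pairs: scope is a tuple of variable
    indices and fn maps the labels of those variables to a positive value.
    Unary, binary and ternary factors are all expressed this way.
    """
    weights = {}
    for ys in product(LABELS, repeat=n):
        w = 1.0
        for scope, fn in factors:
            w *= fn(*(ys[i] for i in scope))
        weights[ys] = w
    z = sum(weights.values())  # the normalising constant
    return {ys: w / z for ys, w in weights.items()}
```

Dividing every unnormalised weight by the same constant is what turns the product of factor values into a probability distribution over label assignments.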
The factor functions are defined as follows. First, the factor function $f(\cdot)$ is defined on each node independently and represents the relation between the node attributes and the relation label. In its definition, $Z$ is again a normalising constant and $\lambda^T$ represents the parameter vector with the same dimension as $x_i$. The function $\phi(y_i, x_i)$ is an attribute vector function associated with the label $y_i$. In (9), F represents the friendship label and S represents the stranger label (not friends). Equation (9) implies that the basic feature of a node is represented by a vector, which is used in the dot product with the parameter vector.

The factor function $h(\cdot)$ in the binary relation represents the relation between two adjacent nodes through real values of the trajectory similarity. There are three users in a binary relation, so there are three trajectories. Here, only the relation between the two similarity conditions and the label $y$ of the node is considered, namely $y_i|S_i$ and $y_j|S_j$. According to the aforementioned trajectory similarity measurement function, the factor function is formulated with a parameter vector $\alpha^T$ and a vector-valued function $h(\cdot)$ that produces a new vector from the node labels and trajectory similarities. The two vectors have equal dimensions; after they are multiplied, the exponential function is applied to form the new distribution. The function $h(\cdot)$ is given in detail as $h(y_i|S(i), y_j|S(j)) = w(y_i, y_j) \cdot H(S(i), S(j))^T$, where the function $w(\cdot)$ generates a vector for the combination of labels, so $\dim w(\cdot) = \dim Y^2$. The notation $abs(S(i) - S(j))$ denotes the absolute value of the difference. In the definition of $w(\cdot)$, $a$ and $b$ index the labels of the two nodes: when the nodes are labelled $Y_a$ and $Y_b$, the vector takes a valid value at the corresponding position. Equation (12) thus represents the characteristics of the node label combination.
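One possible reading of the product $w(y_i, y_j) \cdot H(S(i), S(j))^T$ is sketched below: the label pair selects one position of a vector of length $|Y|^2$, and the absolute similarity difference is placed there. The position ordering and the use of a scalar $H$ are assumptions made for illustration, not details confirmed by the text.

```python
LABELS = ("F", "S")  # friendship label, stranger label


def label_pair_onehot(y_i, y_j):
    """One-hot vector of length |Y|^2 marking the label combination (y_i, y_j)."""
    k = len(LABELS)
    vec = [0] * (k * k)
    vec[LABELS.index(y_i) * k + LABELS.index(y_j)] = 1
    return vec


def binary_feature_vector(y_i, y_j, s_i, s_j):
    """Feature vector for a binary factor: the similarity difference abs(s_i - s_j)
    placed at the position of the label combination, zeros elsewhere."""
    h_value = abs(s_i - s_j)
    return [w * h_value for w in label_pair_onehot(y_i, y_j)]
```

Taking the dot product of this vector with a parameter vector of the same length then lets each label combination have its own learned weight on the similarity difference.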
$S(\cdot)$ in (11) and the previous equation represents the trajectory similarity of the users on both sides. In its definition, $Sim(Tr_a, Tr_b)$ is used to calculate the similarity of trajectories $Tr_a$ and $Tr_b$. We define a threshold value 1: when the similarity is greater than 1, we consider the trajectories to be similar and assign a valid value. The setting of this threshold affects the experimental results, and an appropriate value can be found by analysing different algorithms through experiments.
Similar to the factor function of the binary relation, the definition for the ternary relation takes into account the third user's trajectory and the label of the newly added edge, so the dimension of its parameter vector differs from that of the binary relation. In the detailed definition, $v \in \{i, j, k\}$, and the function $G(\cdot)$ is defined in (15) and (16), which indicate the generation of features based on the similarity of trajectories: we set the constant value 1 at the corresponding position in the vector. These two factor functions give a non-linear feature representation of the input similarities. In practice, the number of parameters defined in a factor graph is directly related to the number of labels and the range of similarity.
Model learning: In the factor graph, we define a parameter vector for each factor function, that is, $(\lambda, \alpha, \beta)$. In the phase of model learning, we need to learn these parameters, so we maximise the log-likelihood function and calculate the gradient of the parameters.
For relationship nodes with labels, to facilitate understanding, we collect the parameters as $\theta = \{\lambda, \alpha, \beta\}$ and the factor functions as $s(y_i) = (f(y_i, x_i), h(y_i|S(i), y_j|S(j)), g(y_i|S(i), y_j|S(j), y_k|S(k)))^T$, so that the joint probability of (6) can be rewritten as (19). Substituting (19) into (17) gives (20), which can be solved with the gradient descent method. First, we take the partial derivative of this log-likelihood objective function with respect to the parameter $\theta$, which gives (21), where $E_{p_\theta(Y|Y^L, G)} S$ is the expectation when the graph is labelled and $E_{p_\theta(Y, G)} S$ is the expectation when the labels are unknown. So, we need to calculate the global distribution of the factor graph with and without labels. The expectation given in (21) is the key step in calculating the gradient of each parameter in the learning process, so we need to calculate the probability distribution of each node with and without labels in order to obtain the expectation.
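As a hedged sketch, the learning objective can be written in the standard form used for partially labelled factor graph models. The equations below are an assumption based on that standard form and on the symbols named in the text ($\theta$, $s(y_i)$, $Y^L$); they are not a reproduction of the paper's numbered equations, and the step size $\eta$ is a hypothetical symbol.

```latex
\begin{align}
\theta &= (\lambda, \alpha, \beta), \qquad S = \sum_i s(y_i), \\
p_\theta(Y \mid G) &= \frac{1}{Z}\exp\{\theta^{T} S\}, \\
O(\theta) &= \log \sum_{Y \mid Y^{L}} \exp\{\theta^{T} S\} - \log \sum_{Y} \exp\{\theta^{T} S\}, \\
\frac{\partial O(\theta)}{\partial \theta} &= E_{p_\theta(Y \mid Y^{L}, G)}[S] - E_{p_\theta(Y \mid G)}[S], \\
\theta &\leftarrow \theta + \eta\, \frac{\partial O(\theta)}{\partial \theta}.
\end{align}
```

The update is written as an ascent step because $O(\theta)$ is maximised; this is equivalent to gradient descent on $-O(\theta)$. The gradient is the difference between two expectations of the same feature sum, one with the known labels clamped and one with all labels free.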
An efficient method for calculating the probability distributions in factor graphs is loopy belief propagation (LBP) [31]. In the phase of learning, LBP is used to calculate the probability distribution and marginal probabilities of $(Y|Y^L, G)$ with labels, and then of $(Y, G)$ without labels. The first propagation of messages differs between these two cases. In the first round of calculation, the gradient can be rough and the parameters can be uniformly initialised. After a finite number of iterations of message propagation in LBP, the probability distribution tends to be stable. When the change of the gradient falls below a threshold, the algorithm has converged, and we can then calculate the marginal distribution of each node.
Inferring unlabelled friendly relationships: After a certain number of iterations the learning algorithm converges, and the labels of the unlabelled nodes can then be predicted from the parameters $\theta$ obtained in the training phase by applying the max-sum propagation algorithm.

Datasets
In this study, we use two real location-based social network data sets, i.e. Brightkite and Gowalla. Besides the basic edge and node information, these two data sets include a large amount of check-in data. They are described as follows:
Gowalla: the data set contains 196,591 nodes, 950,327 edges and 6,442,890 check-in records from its users.
Brightkite: the data set contains 58,228 nodes, 214,078 edges and 4,491,143 check-in records from its users.
We compared the predicted results with the edges provided in the data, where two connected users are ground-truth friends. The negative samples in the data set are generated by random sampling, while the actual connections established in the network are labelled as positive. In the phase of sampling, we try to balance the number of positive and negative samples, although in a real network negative samples are not labelled.
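Balanced negative sampling as described above can be sketched as follows. This is a minimal illustration that treats the friendship graph as undirected and draws node pairs uniformly at random until enough non-edges are found; the function name, the `seed` parameter and the rejection-sampling strategy are assumptions rather than the authors' procedure.

```python
import random


def sample_negative_edges(nodes, positive_edges, k, seed=0):
    """Draw k distinct node pairs that are not in positive_edges.

    Assumes positive_edges are distinct undirected edges between members of
    nodes, with no self-loops.
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    positives = {frozenset(e) for e in positive_edges}
    max_negatives = len(nodes) * (len(nodes) - 1) // 2 - len(positives)
    if k > max_negatives:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < k:
        u, v = rng.sample(nodes, 2)  # two distinct nodes
        pair = frozenset((u, v))
        if pair not in positives:
            negatives.add(pair)
    return [tuple(sorted(p)) for p in negatives]
```

Setting `k` to the number of positive edges yields a balanced training set; in sparse networks such as these, almost every random pair is a non-edge, so few draws are rejected.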

Comparison methods
Inferring the friendship relation can be regarded as a classification problem, so we use commonly used classification methods, such as SVM and LP [32], for comparison. In experiments, we extracted many topological features and geographical location features for classification. Topological features include common neighbours (CNs), Degree, JC, PA etc. Geographical location features mainly include distance and trajectory similarity of three representative locations (home, work and restaurant). As for the effectiveness of these features, Bayrak and Polat [33] give a detailed description of link prediction on LBSNs. Some of the attributes are given in Table 1.
SVM: This is a supervised learning method. The data set is partitioned into a training set and a testing set. SVM uses the attribute vector $x_i$ of each relational label to train the model, and its decision boundary is the maximum-margin hyperplane that separates the learning samples. The learned parameters are used for classification. We implemented this algorithm by using the SVM-light package. We mainly focused on the penalty factor $C$ and $\gamma$ in SVM. In the phase of training, we tested each parameter with the grid search method to determine the optimal parameter values. In addition, we used ten-fold cross validation to group the data sets for training to avoid overfitting.
LP: This is a semi-supervised learning method. Label propagation (LP) [32] spreads labels based on proximity to the relation. Using the relation between samples, a complete graph model is established, which is suitable for undirected graphs. Each node label is propagated to the adjacent nodes according to trajectory similarity. At each step of propagation, each node updates its label according to the labels of its adjacent nodes. During LP, the labels of labelled data are kept unchanged so that they can be transmitted to unlabelled data. Lastly, when the iteration terminates, the probability distributions of similar nodes tend to be similar and the nodes can be grouped into a class. LP does not require parameter tuning because it is based on the network structure. In order to obtain the best classification results, the edge weight is specified according to the topological similarity and is used to determine the propagation priority.
The proposed method (TS-FGM): Our proposed model on factor graphs includes binary and ternary factors. In addition, we combine the factor graph model with the common binary and ternary factors, which is called the multivariate correlation factor graph model (MC-FGM). By comparing the effectiveness of the two methods as factor functions, we show that the proposed similarity-based multivariate correlation achieves better results in LBSNs. In experiments, we only divided the data set into a training set and a testing set. We do not use cross validation, because the data used in a factor graph is a complete network and the calculation of the probability distribution is based on the information transferred between nodes. The learning and prediction processes of these two methods are similar. In the phase of gradient descent, the proposed methods predict the unknown labels in each iteration. The parameter gradient becomes as small as possible after convergence. We change the step size dynamically to make the models converge fast. The definition of the factor function of MC-FGM is similar to (10).

Table 1 Summarisation of the attributes used in the basic classification method and our model, where $u$ and $v$ represent nodes, and the neighbours of node $u$ are represented by $\Gamma(u)$ (columns: Attribute, Equation, Example)
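The topological attributes named in this section can be computed from neighbour sets. The sketch below uses the standard textbook definitions of common neighbours, the Jaccard coefficient (JC) and preferential attachment (PA); these formulas are supplied here as an assumption, since the entries of Table 1 are not available, and the function names are hypothetical.

```python
def neighbours(edges):
    """Build {node: set of adjacent nodes} from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbours(adj, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(adj.get(u, set()) & adj.get(v, set()))


def jaccard(adj, u, v):
    """Shared neighbours divided by all neighbours of u or v (0 if none)."""
    union = adj.get(u, set()) | adj.get(v, set())
    if not union:
        return 0.0
    return len(adj.get(u, set()) & adj.get(v, set())) / len(union)


def preferential_attachment(adj, u, v):
    """Product of the degrees of u and v."""
    return len(adj.get(u, set())) * len(adj.get(v, set()))
```

Each function returns one scalar per candidate pair, so the values can be concatenated with the geographical features into the attribute vector fed to a classifier.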

Performance analysis
Accuracy performance analysis:
According to Table 2, the proposed TS-FGM method greatly improves the prediction accuracy compared with SVM, achieving around 24% improvement on the Brightkite data set and 15% improvement on the Gowalla data set. Compared with the LP method, the precision is improved by about 7%, and the prediction accuracy on positive and negative samples is also higher than that of LP. The MC-FGM method is a simplified version of TS-FGM, in which the similarity of trajectories is not taken into account in feature extraction, so no additional features are generated for the multivariate correlations. In terms of prediction performance, TS-FGM still improves the accuracy by about 5% compared with MC-FGM. Across the two data sets, the prediction performance on Brightkite was generally superior to that on Gowalla. According to the topological analysis of these two networks, the topology of Brightkite is more complex than that of Gowalla, so there are more multivariate correlations, e.g. ternary correlations. The best prediction accuracy of our method on Gowalla reached 88.75%. In contrast, the performance of SVM is the worst and that of LP is stable.

Factor contribution analysis:
In this section, we analyse the contribution of each factor by removing certain factors and combining factor functions, and observing the predictive performance of the model. As shown in Table 3, we added the three factor functions one by one to compare the prediction accuracy. The prediction accuracy is very low with only the attribute feature factor functions, and the prediction performance is greatly improved by adding the binary correlation factors in both data sets, i.e. Brightkite (+6%) and Gowalla (+10%). When the ternary factors are added, the ternary correlation in the Brightkite data greatly improves the prediction accuracy, bringing it to 93.65%, but the improvement in Gowalla is less pronounced than that obtained from the binary correlation. In summary, the proposed similarity factor does play an important role in prediction.

Analysis of feature function:
In our proposed factor functions, $H(\cdot)$ and $G(\cdot)$ represent the similarity feature functions under binary and ternary correlations, respectively. Generally speaking, a feature function needs to express two input variables as a valid feature value. In our model, we used the definitions of these two feature functions that achieve the best prediction results. According to Fig. 4, in both Brightkite and Gowalla the feature function of the binary correlations performed best with 'abs' (the absolute value of the difference in $\{S_u, S_v\}$), where $S_i$ ($i \in \{u, v\}$) represents the trajectory similarity between two users, while in the ternary association 'min' (the minimum value in $\{S_u, S_v\}$) performed better than 'max' (the maximum value in $\{S_u, S_v\}$) and 'abs'. It is worth noting that we can also use the sigmoid function to define the threshold used to represent features if we do not consider the time complexity.
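The candidate feature functions compared above, and the sigmoid alternative to a hard threshold, can be sketched as follows. The dictionary layout, the function name `soft_similar` and the `steepness` parameter are illustrative assumptions; the paper does not specify how the sigmoid would be parameterised.

```python
import math

# Candidate ways of combining two trajectory similarities into one feature.
FEATURES = {
    "abs": lambda s_u, s_v: abs(s_u - s_v),
    "min": lambda s_u, s_v: min(s_u, s_v),
    "max": lambda s_u, s_v: max(s_u, s_v),
}


def soft_similar(sim, threshold, steepness=1.0):
    """Sigmoid replacement for the hard test sim > threshold.

    Returns a value in (0, 1) that equals 0.5 exactly at the threshold.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (sim - threshold)))
```

A soft threshold keeps some information about how far a similarity lies from the cut-off, at the cost of evaluating an exponential for every edge.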

Conclusion
In this paper, we mainly studied how to extract the connections between users' geographical locations and build a model that predicts friend relations in social networks based on this hidden information and factor graphs, and we conducted experiments on two real LBSNs, i.e. Brightkite and Gowalla. The cardinality of the Gowalla data is five times that of Brightkite, so we used a sampling method for Gowalla to remove most nodes with no check-in information and those with very little information. Based on trajectory similarity, we studied the representation of binary and ternary network associations. Based on these preliminaries, we proposed the TS-FGM model. According to the experimental results, our method is better than other classification algorithms in prediction accuracy. In terms of efficiency, Gowalla contains a larger amount of data, which makes processing time consuming. In our future research, we will focus on reducing the time complexity of the message propagation process in the factor graph [23]. In addition, our experiments have also shown that location information is indeed effective in improving the accuracy. So, if we can extract more effective