Volume 10, Issue 4, p. 289-298
Article

Multi-object tracking using dominant sets

Yonatan T. Tesfaye, D.P.P.A.C., University IUAV, Venice, Italy
Eyasu Zemene, D.A.I.S., University Ca’ Foscari, Venice, Italy
Marcello Pelillo, D.A.I.S., University Ca’ Foscari, Venice, Italy
Andrea Prati (corresponding author), D.P.P.A.C., University IUAV, Venice, Italy; D.I.I., University of Parma, Italy

First published: 19 April 2016
Abstract

Multi-object tracking is an interesting but challenging task in the field of computer vision. Most previous works based on data association techniques merely take into account the relationship between detection responses in a locally limited temporal domain, which makes them inherently prone to identity switches and to difficulties in handling long-term occlusions. In this study, a dominant set clustering based tracker is proposed, which formulates the tracking task as a problem of finding dominant sets in an auxiliary edge-weighted graph. Unlike most techniques, which are limited in temporal locality (i.e. only a few frames are considered), the authors utilise pairwise relationships (in appearance and position) between different detections across the whole temporal span of the video, performing data association in a global manner. Meanwhile, a temporal sliding window technique is utilised to find tracklets and perform further merging on them. The authors’ robust tracklet merging step makes the tracker more robust to long-term occlusions. The authors present results on three different challenging datasets (i.e. PETS2009-S2L1, TUD-Stadtmitte and the ETH dataset (‘sunny day’ sequence)), and show significant improvements compared with several state-of-the-art methods.

1 Introduction

To ensure the security of people and places by means of intelligent video surveillance, the current trend is to increase the number of cameras installed, both as a deterrent and to better monitor the surveyed area. As a consequence, we have witnessed an exponential increase in the data to be watched and stored, inevitably requiring the use of automatic processing of videos for scene analysis and understanding, both for real time and a-posteriori mining. In this way, the large variety of video data provided by installed cameras is automatically analysed for event detection, object and people tracking as well as behaviour analysis. These capabilities offer valid support to investigations and crime detection [[1]-[3]].

Tracking targets is a challenging task: variations in the type of camera, lighting conditions, scene settings (e.g. crowds or occlusions), noise in images, the variable appearance of moving targets and the point of view of the camera must all be accounted for. Following multiple targets while robustly maintaining data association remains a largely open problem. In [[4]] a large experimental survey of various tracking approaches is presented, evaluating the suitability of each approach in different situations and under different constraints (e.g. assumptions on the background, motion model, occlusions, etc.).

In recent years, due to significant improvements in object detection, several researchers have proposed tracking methods that associate detection responses into tracks, also referred to as association based tracking (ABT) techniques [[2], [5]-[9]]. An offline-trained people detector is used to generate detection responses; tracklets are then produced by linking these responses and further associated into longer tracks. Similarity between tracklets (i.e. the linking probabilities) is based on motion smoothness and appearance similarity. The Hungarian algorithm is often used to find the global optimum [[5], [7]]. Compared with generative models, ABT is powerful in assessing the presence of objects in the scene, since it uses discriminatively trained detectors and needs no manual initialisation. Association-based approaches are able to handle long-term occlusions between targets, and their complexity is polynomial in the number of targets present. In order to differentiate between targets, speed and distance between tracklet pairs are often used as motion descriptors, whereas appearance descriptors are often based on global or part-based colour histograms. Nevertheless, how to deal with motion in moving cameras and how to better distinguish nearby targets remain key issues that limit the performance of ABT.

In most of the ABT works [[5], [10]], the affinity score between detection responses is computed once and kept fixed for all the later processes. Conversely, in this proposal we develop a more flexible approach where associations are made at two levels and the affinity measure is iteratively refined based on the knowledge retrieved from the previous level. Moreover, most of the previous methods [[11]-[14]] use locally-limited temporal information by focusing only on the pairwise relationship of the tracklets, rather than applying data association among multiple tracklets over the whole video in a global manner. As a consequence, existing approaches are prone to identity switches (IDS) (i.e. assigning different labels/IDs to the same target) in cases where targets with small spatial and appearance differences move together (which are rather common cases in real security videos).

In this paper, a dominant set clustering based tracker is proposed, which formulates the tracking task as the problem of finding dominant sets in an auxiliary undirected edge-weighted graph. Unlike previous approaches, the proposed method for data association combines both appearance and position information in a global manner. Object appearance is modelled with a 9 × 9 covariance matrix feature descriptor [[15]], while position is captured by the relative position between targets, which is less influenced by camera motion (viewing angle). Since all targets shift together when the camera moves, this motion feature is quite invariant to camera movements, making it a suitable representation also for videos acquired by a moving camera. A temporal sliding window technique is then utilised to find tracklets and perform further merging on them. More specifically, given the detection responses found along the temporal window, we represent them as a graph in which all the detections in each frame are connected to all the detections in the other frames, regardless of their closeness in time, and the edge weight encodes both appearance and position similarity between nodes.

A two-level association technique follows: first, a low-level association is performed, linking detection responses of the last two consecutive frames of the window, which helps differentiate difficult pairs of targets in a crowded scene; then, a global association is performed to form tracklets along the temporal window. Finally, different tracklets of the same target are merged to obtain the final trajectory.

The main contributions of this paper are:
  • this paper is the first example of the use of dominant set clustering for data association in multi-target tracking;

  • the two-level association technique proposed in this paper allows us to consider the temporal window at once efficiently (also thanks to the superior performance of dominant set clustering in identifying compact structures in graphs), performing data association in a global manner; this helps in handling long-lasting occlusions and target disappearance/reappearance more properly;

  • although consensus clustering has been used in other domains, the consensus clustering technique developed here to merge tracklets of the same target and obtain the final trajectories is first introduced in this paper;

  • the proposed technique outperforms state-of-the-art techniques on various publicly-available challenging data sets.

The rest of the paper is organised as follows: related works are discussed in Section 2; background knowledge on dominant set clustering frameworks is given in Section 3 while Section 4 details our tracking framework (DSC tracker, hereinafter). Experimental results are shown in Section 5, followed by conclusions and future works in Section 6.

2 Related work

Target tracking has been and still is an active research area in computer vision, and the amount of related literature is huge. Here, we concentrate on some of the earlier related works on tracking-by-detection (or ABT) methods.

Recently, tracking-by-detection methods [[2], [5]-[9]] have become the most exploited strategy for multi-target tracking. These methods obtain multiple detection responses from different frames and perform data association to generate consistent trajectories of multiple targets. As the number of frames and targets increases, so does the complexity of the data association problem. As a consequence, most approaches aim either to approximate the problem or to find locally-optimal solutions. Early methods for multi-target tracking include joint probabilistic data association [[16]] and the multi-hypothesis tracker [[17]]. These techniques aim to find an optimal assignment over a heuristically-pruned hypothesis tree built over several frames of the video.

More recently, researchers have formalised the data-association task as a matching problem, in which detections in consecutive frames with similar motion patterns and appearance are matched. Bipartite matching is the best known example of such methods [[18]]: the method is temporally local (considering only two frames) and utilises the Hungarian algorithm to find the solution. However, this approach suffers from its limited temporal locality in cases where target motion follows complex patterns, long-lasting occlusions are present or targets with similar positions and appearance exist. On the other hand, the researchers in [[9], [12], [19], [20]] follow a method generally termed global: this approach has attracted much attention due to its capability to remove the ‘limited-temporal-locality’ assumption and to incorporate more global properties of a target during optimisation (which, in the end, helps overcome the problems caused by noisy detection inputs).

In [[19]] the data association problem is mapped into a cost-flow network with a non-overlap constraint on trajectories. The optimal solution is found by a min-cost flow algorithm in the network. In [[20]] a graph similar to that of [[19]] is defined, and it is shown that dynamic programming can be used to find high-quality sub-optimal solutions. The work in [[12]] uses a grid of potential spatial locations of a target and solves the association problem using the K-shortest path algorithm. Unfortunately, their method completely ignores appearance features in the data association process, which results in unwarranted IDS in complex scenes. To overcome this limitation, the proposal in [[21]] incorporates global appearance constraints.

The above-mentioned global methods use a simplified version of the problem by only considering the relationships of detections in consecutive frames. Despite their good performance, these methods come up short in identifying targets with similar appearance and spatial positions. Conversely, in recent approaches [[2], [9]], no simplifications in the problem formulation are made. However, the proposed solutions are approximate due to the complexity of the models. More specifically, Andriyenko and Schindler [[2]] proposed a continuous energy minimisation based approach in which the local minimisation of a non-convex energy function is performed exploiting the conjugate gradient method and periodic trans-dimensional jumps. Compared with the aforementioned global methods, this approach is more suitable for real-world tracking scenarios. On the negative side, however, the solutions found for the non-convex function can be attracted by local minima.

Recently, in [[9]] the tracking problem is represented in a more complete manner, where pairwise relationships between different detections across the temporal span of the video are considered, such that a complete K-partite graph is built. A target track is represented by a clique (a subgraph where all the nodes are connected to each other), and the tracking problem is formulated as a constrained maximum weight clique problem. However, since a greedy local neighbourhood search algorithm is used to solve the problem, this approach is also prone to local minima. Moreover, due to the heuristic line-fitting approach used for outlier detection, the approach is prone to IDS, particularly when targets with close positions move together.

3 Dominant set clusters and their properties

The theoretical formulation of dominant set clustering has been introduced in [[22], [23]]. It is a combinatorial concept in graph theory that generalises the notion of the maximal clique to edge-weighted graphs. We can represent the data to be clustered as an undirected edge-weighted graph G = (V, E, w) with no self-loops, where V = {1,…, n} is the vertex set corresponding to data points, $E \subseteq V \times V$ is the edge set representing neighbourhood relationships, and $w : E \to \mathbb{R}^{*}_{+}$ is the (positive) weight function which quantifies the similarity of the linked objects. As customary, we represent the graph G with the corresponding weighted adjacency (or similarity) matrix, which is the n × n non-negative, symmetric matrix $A = (a_{ij})$ where:
$$a_{ij} = \begin{cases} w(i, j) & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases} \quad (1)$$
Since there are no self-loops in graph G, all entries on the main diagonal of A are zero. In an attempt to formally capture the notion of a cluster, we need some notation and definitions.
Let $S \subseteq V$ be a non-empty subset of vertices and $i \in V$. The (average) weighted degree of i w.r.t. S is defined as:
$$\mathrm{AWDeg}_S(i) = \frac{1}{|S|} \sum_{j \in S} a_{ij} \quad (2)$$
Moreover, when $j \notin S$, we can measure the similarity between nodes j and i, with respect to the average similarity between node i and its neighbours in S, as $\phi_S(i, j) = a_{ij} - \mathrm{AWDeg}_S(i)$.
We can compute the weight of $i \in S$ w.r.t. S recursively as:
$$w_S(i) = \begin{cases} 1 & \text{if } |S| = 1 \\ \sum_{j \in S \setminus \{i\}} \phi_{S \setminus \{i\}}(j, i)\, w_{S \setminus \{i\}}(j) & \text{otherwise} \end{cases} \quad (3)$$
where $S \setminus \{i\}$ refers to the set S without the element i. The total weight of the set S, W(S), is computed by summing up the weights $w_S(i)$. This recursive characterisation of the weights yields a measure of the overall similarity between a vertex i and the vertices of $S \setminus \{i\}$, with respect to the overall similarity among the vertices of $S \setminus \{i\}$. Based on it, [[22], [23]] characterise a set S as a dominant set if it satisfies the following two conditions:
  • i. $w_S(i) > 0$, for all $i \in S$;

  • ii. $w_{S \cup \{i\}}(i) < 0$, for all $i \notin S$.

It is evident from the definition that a dominant set satisfies the two basic properties of a cluster: internal coherence and external incoherence. Condition (i) indicates that a dominant set is internally coherent, while condition (ii) implies that this coherence would be destroyed by the addition of any vertex from outside. In other words, a dominant set is a maximally coherent set of data.
The main result presented in [[22], [23]] establishes an intriguing one-to-one correspondence between dominant sets and strict local maximisers of the problem:
$$\text{maximise } f(x) = x^{\mathsf{T}} A x \quad \text{subject to } x \in \Delta \quad (4)$$
where $\Delta = \{ x \in \mathbb{R}^n : \sum_i x_i = 1,\ x_i \geq 0 \}$ is the standard simplex of $\mathbb{R}^n$. Moreover, in [[24]] it is proven that if S is a dominant set, then its weighted characteristic vector $x^S$, defined below, is a strict local solution of problem (4):
$$x^S_i = \begin{cases} w_S(i) / W(S) & \text{if } i \in S \\ 0 & \text{otherwise} \end{cases} \quad (5)$$
Conversely, under mild conditions, it turns out that if x is a strict local solution of problem (4), then its ‘support’ S = {i ∈ V : xi > 0} is a dominant set. By virtue of this result, we can find a dominant set by first localising a solution of problem (4) with an appropriate continuous optimisation technique, and then picking up the support set of the found solution. In this sense, we indirectly perform combinatorial optimisation via continuous optimisation.
To find a local solution of the quadratic problem (4), replicator dynamics from evolutionary game theory can be used. Let A be a non-negative real-valued n × n matrix; the discrete version of the dynamics is defined as:
$$x_i(t + 1) = x_i(t)\, \frac{(A x(t))_i}{x(t)^{\mathsf{T}} A x(t)} \quad (6)$$
for i = 1, …, n. It is also possible to use a more efficient dynamics developed recently by Bulò et al. [[25]], whose computational complexity grows linearly with the number of vertices per step.
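For illustration, problem (4) and dynamics (6) translate into a few lines of NumPy. The following sketch is ours, not the authors' code: the starting point (simplex barycentre), tolerance and support threshold are arbitrary implementation choices.

```python
import numpy as np

def replicator_dynamics(A, tol=1e-6, max_iter=1000):
    """Iterate the discrete replicator dynamics (6) to find a local
    maximiser of x'Ax over the standard simplex (problem (4))."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)              # start from the simplex barycentre
    for _ in range(max_iter):
        x_new = x * (A @ x)              # component-wise payoff re-weighting
        x_new /= x_new.sum()             # normalising by the sum = dividing by x'Ax
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

def extract_dominant_set(A, support_thresh=1e-5):
    """Return the support of the converged solution, i.e. the dominant set,
    together with the characteristic vector itself."""
    x = replicator_dynamics(A)
    return np.where(x > support_thresh)[0], x
```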

4 Multi-object tracking through dominant set clusters

The overall scheme of the proposed system is depicted in Fig. 1. Our tracking framework is composed of three phases. In the first phase, objects of interest are detected in each frame. For this paper and the corresponding experiments in Section 5, we refer to people as targets/objects of interest, but the approach described in the following can be easily adapted to different objects, provided that a proper detector is available. In fact, any detection technique can be used, although in this paper we use the well-known people detector based on histograms of oriented gradients (HOG) proposed in [[26]]. Once objects are detected as candidate bounding boxes, labels must be consistently assigned to different instances of the same person. To this aim, in the second phase we employ a ‘sliding temporal window’ approach that iteratively constructs tracklets using dominant set clustering. Finally, the tracklets found along the whole video are merged to form a trajectory over the whole course of the video. The next subsections detail the last two phases, while readers interested in details about the first phase can refer to [[26]].

Fig. 1 Overview of the proposed approach

4.1 Tracking using sliding temporal windows

To obtain more accurate results with better efficiency (i.e. lower latency), it is vital to use sliding temporal windows to track targets. Moreover, over the whole video there might be many targets, most of which may not overlap in time. As a consequence, performing a global optimisation over all the video frames is not only impractical and inefficient, due to the huge search space and long latency, but also needless, as many targets appear only for a limited period rather than over the whole course of the video.

To generate tracklets over the sliding temporal window, we use overlapping neighbouring windows of frames. A toy example is given in Fig. 2 to ease the description of the approach. Given an input video and the detection responses found within the first sliding window, tracking is performed within the window. Then, the window is moved with a step of one frame at a time, so that the new window has more than 90% overlap with the previous one, and the detection algorithm is applied on the new frame (number 4 in Fig. 2) so as to generate detection responses. At this point, a low-level association algorithm (detailed in Section 4.2) is employed to associate the detection responses found in the last two frames (3 and 4) which have high appearance similarities and small spatial distances. Then, a global association algorithm (see Section 4.2) is applied over all the frames in the current window in order to create consistent tracklets. As a result, we are able to associate the new frame's detection responses to the right tracklets efficiently and accurately.

Fig. 2 Toy example of the sliding window approach. Bounding boxes denote detection responses, and the different colours represent different targets. The sliding window moves one frame per step. This figure is best viewed in colour

It is worth noting that selecting the size of the sliding window is crucial in many respects. We select the window size depending on both the dynamics and the frame rate of the video. In other words, if the dynamics of the video is very fast (meaning the objects are moving quickly), we consider a smaller window size, since detection responses across a larger window might be farther apart while still belonging to the same target. Generally speaking, the smaller the window size, the smaller the position change made by a target within it, and vice versa. In our experiments (reported in Section 5), we select a window size between 5 and 15 frames depending on the dynamics as well as the frame rate of the video: the faster the frame rate, the bigger the window size.

4.2 Tracklet generation in a single window

As described above, the generation of tracklets in each single window is obtained by applying a two-level association procedure, low-level and global.

4.2.1 Low-level association

Low-level association is performed to connect detection responses found in the last two frames (which have high appearance similarities and small spatial distances) into reliable tracklets (pairs of responses). This enables us to pair responses which have very small temporal differences and also to distinguish between targets with similar appearances in complex scenes, since people do not change drastically in appearance or position over just two consecutive frames.

Once extracted, the detection responses from the last two frames are represented as nodes of a graph (as described in Section 3), where edges define connections between them and edge weights reflect both spatial and appearance similarity between nodes. Let $G_s = (V_s, E_s, \omega_s)$ be the graph constructed using the detection responses of the last two frames, where the vertex $v^i_j$ represents the jth node from the ith frame. For the sake of simplicity, we denote with a and b the second-last and last frames, respectively. The set of edges $E_s$ is composed of edges which connect only nodes coming from different frames, while $\omega_s$ is the positive edge weight function.

As explained in Section 3, given n nodes, we can represent the graph with an n × n affinity matrix $A_s = (a_{ij})$ where $a_{ij} = \omega_s(i, j)$ if $(i, j) \in E_s$, and $a_{ij} = 0$ otherwise. Using this representation, an association pair of responses of person i is simply a dominant set $D_i$ which contains detection responses of target i only. Here the task is to find the pair of detection responses (if any) with maximum similarity. For this task we utilise the DSC technique described in detail in Section 3.
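As a rough sketch of how such a two-frame graph can be assembled (our illustration, not the authors' code), the affinity matrix leaves the within-frame blocks at zero so that only cross-frame pairs can be associated; the cross-frame similarities themselves are assumed to be computed as in Section 4.2.2.

```python
import numpy as np

def two_frame_affinity(S_cross):
    """Affinity matrix A_s for the low-level association graph G_s.

    S_cross[p, q] is the (assumed precomputed) similarity between detection p
    of frame a (second-last) and detection q of frame b (last). Within-frame
    blocks stay at zero, so edges connect only nodes from different frames."""
    na, nb = S_cross.shape
    A = np.zeros((na + nb, na + nb))
    A[:na, na:] = S_cross                # frame a -> frame b edges
    A[na:, :na] = S_cross.T              # symmetric counterpart
    return A                             # zero diagonal: no self-loops
```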

4.2.2 Global association

Low-level association results are useful to initialise tracklet generation based on reliable associations (based on two consecutive frames). However, it may happen that some people are not visible in the last two frames of the current window. As a toy example, the low-level association in Fig. 2c on frames 3 and 4 fails to create an association for the pink person, who is visible only in frame 4 and not in frame 3. As a consequence, a global association algorithm ought to be used over the whole window. As a result, the tracklet generated in the previous window for the pink person (between frames 1 and 2 in Fig. 2b) can be successfully associated to the new detection in frame 4 (see Fig. 2d).

Similarly to the low-level association case, we denote the detection responses to be tracked in one temporal window of f frames as an undirected edge-weighted graph $G_n = (V_n, E_n, \omega_n)$ with no self-loops, represented by a similarity matrix $A_n$. One critical point of the global association algorithm is that the knowledge about both the previous window and the low-level association is exploited to update the new association matrix $A_n$. In particular, let $G^p_k = (V^p_k, E^p_k, \omega^p_k)$ be the subgraph representing the tracklet of person k, with $|V^p_k|$ vertices, from the previous window. Then, the similarity matrix $A_n = (a_{ij})$ is computed as
$$a_{ij} = \begin{cases} 1 & \text{if } i, j \in V^p_k \text{ or } i, j \in T^l_k \\ 0 & \text{if } i \text{ and } j \text{ belong to tracklets of different targets} \\ \omega_n(i, j) & \text{otherwise} \end{cases} \quad (7)$$
where $T^l_k$ represents the tracklet of target k obtained from low-level association, and $i, j \in V^p_k$ or $i, j \in T^l_k$ means that nodes i and j both belong to the same tracklet of the kth target, obtained from the previous window or from low-level association, respectively. This results in the replacement of most of the values of our affinity matrix with the value 1, if the two nodes belong to the same cluster according to the results from the previous window or the low-level association, and with the value 0 otherwise. This helps us find identical results (local optima) in two consecutive sliding windows.
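A minimal sketch of our reading of (7) follows: memberships known from the previous window or from the low-level association override the raw affinities with 1 (same target) or 0 (different targets). The `groups` input format and function name are our own illustration.

```python
import numpy as np

def apply_prior_constraints(A, groups):
    """Overwrite affinities following (7).

    A      : raw affinity matrix A_n (entries omega_n(i, j)).
    groups : list of index arrays, one per tracklet already established by
             the previous window or by low-level association."""
    A = A.copy()
    known = np.concatenate(groups) if groups else np.array([], dtype=int)
    for g in groups:
        A[np.ix_(g, g)] = 1.0                    # same target -> 1
        others = np.setdiff1d(known, g)
        A[np.ix_(g, others)] = 0.0               # different targets -> 0
        A[np.ix_(others, g)] = 0.0
    np.fill_diagonal(A, 0.0)                     # keep the no-self-loop convention
    return A
```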

To capture the individual similarities between patches of the same target and to differentiate between different targets, it is essential that the graph is built using a meaningful and robust similarity measure. A node $v^i_j$ is represented with a location feature $L^i_j$, consisting of the two-dimensional spatial coordinates of the bottom centre and the upper left corner of the corresponding detection.

Moreover, we decided to model people's appearance using covariance matrix feature descriptors [[15]]. This representation has been adopted in various approaches [[27]-[30]] thanks to its robustness in capturing information such as colour, shape and location, its scale and rotation invariance, and its low dimensionality. Considering d different pixel features extracted independently from the image patch, the resulting covariance matrix C is a d × d square symmetric matrix whose diagonal entries represent the variance of each feature and whose off-diagonal entries represent their correlations.

Let Im be a three-channel colour image and Y be the W × H × d dimensional image feature extracted from Im, with Y(x, y) = ρ(Im, x, y), where the function ρ can be any mapping such as gradients, colour, filter responses, intensity, etc. Let $\{t_i\}_{i=1 \dots M}$ be the d-dimensional feature points inside Y, with M = W × H. The image Im is represented with the d × d covariance matrix of the feature points:
$$C_R = \frac{1}{M - 1} \sum_{i=1}^{M} (t_i - \mu)(t_i - \mu)^{\mathsf{T}} \quad (8)$$
where the vector μ represents the mean of the corresponding features of the points in the region R.

In our case, we decided to model each pixel within a people patch with its HSV colour values, its position (x, y), $G_x$ and $G_y$ (the first-order derivatives of the intensities computed through the Sobel operator with respect to x and y), the gradient magnitude $\mathrm{mag}(x, y) = \sqrt{G_x^2 + G_y^2}$ and the angle of the first derivative $\Theta(x, y) = \arctan(G_y / G_x)$. Therefore, each pixel of the patch is mapped to a nine-dimensional feature vector: $t_i = [x, y, H, S, V, G_x, G_y, \mathrm{mag}(x, y), \Theta(x, y)]^{\mathsf{T}}$. Based on this nine-dimensional feature representation, the covariance descriptor of a patch is a 9 × 9 matrix.
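As an illustrative sketch of (8) (not the authors' implementation), the 9 × 9 descriptor can be computed as follows; np.gradient stands in for the Sobel operator mentioned above, and the patch is assumed to be already converted to HSV.

```python
import numpy as np

def covariance_descriptor(patch_hsv):
    """9 x 9 covariance descriptor (8) of an H x W x 3 HSV people patch."""
    H, W, _ = patch_hsv.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    V = patch_hsv[..., 2].astype(float)
    Gy, Gx = np.gradient(V)                      # first-order derivatives
    mag = np.sqrt(Gx ** 2 + Gy ** 2)             # gradient magnitude
    theta = np.arctan2(Gy, Gx)                   # gradient angle
    feats = np.stack([xs, ys,
                      patch_hsv[..., 0], patch_hsv[..., 1], V,
                      Gx, Gy, mag, theta], axis=-1).reshape(-1, 9)
    return np.cov(feats, rowvar=False)           # 9 x 9 symmetric matrix
```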

The distance between covariance matrices is computed using the technique proposed in [[15], [31]], i.e. as the sum of the squared logarithms of their generalised eigenvalues. Formally, the distance between two matrices $C_i$ and $C_j$ is expressed as:
$$\rho(C_i, C_j) = \sqrt{\sum_{k=1}^{d} \ln^2 \lambda_k(C_i, C_j)} \quad (9)$$
where $\{\lambda_k(C_i, C_j)\}_{k=1 \dots d}$ are the generalised eigenvalues of $C_i$ and $C_j$.
The weight of an edge between two nodes, $\omega_n(i, j)$, represents both the appearance and spatial similarity (correlation) between detection responses. The distance between two nodes is computed as:
$$D(i, j) = \alpha\, \gamma(i, j) + \beta\, d(i, j) \quad (10)$$
where $\gamma(i, j) = \rho(C_i, C_j)$ is the appearance distance, $d(i, j) = \| L_i - L_j \|$ ($L_i$ and $L_j$ representing the location feature vectors of the corresponding nodes), and α and β are values used to control the contributions of the appearance and location features, respectively. It is also important to underline that both γ(.,.) and d(.,.) are normalised between 0 and 1.
The affinity between the two nodes is then defined as:
$$\omega_n(i, j) = \exp\left(-\frac{D(i, j)}{\sigma^2}\right) \quad (11)$$
where σ is a regularisation term. The smaller σ is, the less likely detection responses are to be associated, leading to fewer IDS but more fragments (interruptions of the correct trajectory), and vice versa.
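The following sketch combines (9)–(11) into a single edge weight (our reading; the normalisation of the two distance terms to [0, 1] is assumed to happen upstream). SciPy's generalised symmetric eigensolver provides the eigenvalues needed by (9).

```python
import numpy as np
from scipy.linalg import eigh

def cov_distance(C1, C2):
    """Distance (9): sqrt of the summed squared logarithms of the
    generalised eigenvalues of the covariance pair (C1, C2) [15, 31]."""
    lam = eigh(C1, C2, eigvals_only=True)        # generalised eigenvalues
    return np.sqrt(np.sum(np.log(lam) ** 2))

def edge_affinity(C1, C2, L1, L2, alpha=1.0, beta=1.0, sigma=1.0):
    """Edge weight omega_n(i, j) combining appearance and location, (10)-(11)."""
    gamma = cov_distance(C1, C2)                 # appearance term of (10)
    d = np.linalg.norm(np.asarray(L1) - np.asarray(L2))  # location term of (10)
    D = alpha * gamma + beta * d                 # combined distance (10)
    return np.exp(-D / sigma ** 2)               # affinity (11)
```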

Our task of finding the tracklet of one target in one temporal window requires identifying the target in each frame and then representing the feasible solution as a dominant set (subgraph) $G_f$, in which at most one node (detection response) is selected from each frame's detection responses, all highly coherent and similar to each other. Therefore, we represent the dominant set as a subgraph $G_f = (V_f, E_f, \omega_f)$. We ought to note that the feasible solution (tracklet) $G_f$ only contains the detection responses of one target, and not all the visible detections found in the temporal window. In Fig. 3c we can see one feasible solution found from a temporal window of size 3.

Fig. 3 Illustration of tracklet generation in one temporal window: (a) the possible detection responses in the window; (b) the graph built between the detection responses; (c) the tracking result from one iteration, containing the dominant set of one target as a subgraph; (d) the obtained trajectory. This figure is best viewed in colour

By solving DSC for the graph $G_n$ of the global association algorithm, we end up with an optimal solution that is a dominant set (≡ subgraph $G_f$) containing the detections of one target, which corresponds to the feasible solution that is most consistent and coherent in both spatial and appearance features over the whole span of the temporal window. Moreover, in order to find the tracklets of all the people found in a temporal window, the optimisation problem (4) needs to be solved several times. At each iteration, the algorithm finds the tracklet with the highest internal coherency. Then, the vertices selected in $G_f$ are removed from $G_n$ and the above optimisation process is repeated to find the tracklet of the next target, and so on until zero or few nodes remain in $G_n$. This approach is commonly referred to as the 'peeling-off' strategy, sketched in code below.
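A compact sketch of the peeling-off strategy, reusing the extract_dominant_set() helper sketched in Section 3; the min_nodes stopping criterion is our assumption, since the paper only requires that 'zero or few' nodes remain.

```python
import numpy as np

def peel_off(A, min_nodes=2):
    """Repeatedly extract a dominant set and remove its vertices from A_n,
    producing one tracklet (index set) per iteration."""
    active = np.arange(A.shape[0])
    tracklets = []
    while active.size >= min_nodes:
        sub = A[np.ix_(active, active)]
        support, _ = extract_dominant_set(sub)   # indices local to `sub`
        if support.size == 0:
            break
        tracklets.append(active[support])        # map back to original indices
        active = np.delete(active, support)
    return tracklets
```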

4.3 Tracklet merging using DSC

If a person is occluded or exits the scene for a period longer than the temporal window size, the association algorithm described in the previous subsection will not work properly, since a new label/ID will be assigned to the person when he/she reappears in the scene. As a consequence, we need to merge the tracklets representing the same target into one single trajectory across different windows.

For this data association problem, we once again utilise the same DSC-based data association method. It is worth emphasising that merging different tracklets of one person across different windows is by far the hardest task. In fact, it is very common that the person's appearance changes when he/she reappears in the scene, due to both possible changes in illumination conditions and the different pose with which the person re-enters the scene. To solve this problem, we need a robust approach accounting for the different situations which might arise: a target might appear in the scene heavily occluded and stay occluded for most of the tracklet; a target might enter the scene with a different pose but then get back to his/her original pose for the rest of the tracklet, resulting in tracklets similar in their (average) appearance; etc. It turns out that no single approach or similarity measure is capable of handling all these situations carefully. Therefore, we borrow the idea of consensus clustering [[32]-[34]]. Consensus clustering (also known as clustering ensemble) combines different clustering techniques, and is able to exploit the advantages of all of them and to handle all (or most of) the situations above.

First of all, we consider only tracklets longer than 10 nodes, since very short tracklets are considered false positives (FP). Then, three different approaches are used concurrently:
  • i. In the first approach, each tracklet is divided into two equal parts: each tracklet partition is represented as a node in a graph, and the weight between the nodes (tracklet halves) is computed as the mean distance between their appearance features. Let I and J be two tracklets (≡ nodes); then the distance between the two tracklets is formulated as:

    $$D_\mu(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I} \sum_{j \in J} \rho(C_i, C_j) \quad (12)$$
    The similarity matrix is built as $A^\mu = (a^\mu_{IJ})$ where $a^\mu_{IJ} = \exp\left(-D_\mu(I, J) / \sigma^2\right)$. This approach is adequate for merging two tracklets (of the same person) where the target appears with more or less the same appearance in both tracklets for most of the time. Since it exploits the mean distance over the appearance features, the approach is robust to a few big changes in appearance. However, it fails in cases where the target appears in a totally different pose for most of the considered frames.

  • ii. The second approach again divides each tracklet into two equal parts, represented as nodes in a graph. However, in this case the weight is computed by taking the minimum of the mean distances between their appearance features only:

    $$D_M(I, J) = \min_{p, q \in \{1, 2\}} D_\mu(I_p, J_q) \quad (13)$$
    The similarity matrix is built as $A^M = (a^M_{IJ})$ where $a^M_{IJ} = \exp\left(-D_M(I, J) / \sigma^2\right)$.

    Unlike the previous technique, this approach works best in merging tracklets where the target makes a big change in appearance and then stays the same for most of the tracklet. In such cases, taking the minimum distance between their appearances is the best choice, provided that there exist at least some frames/nodes where the appearance is similar. This approach comes up short when most of the targets have high appearance similarity to each other, causing IDS.

  • iii. The third approach represents each tracklet by means of the detection with the highest weighted characteristic vector value (the strict local solution of the optimisation (4)). In other words, the best representative detection of each tracklet is selected for comparison with the other tracklets. The distance between the nodes of the new graph is computed by taking the difference in appearance between these representatives. More specifically, the representative node i of a tracklet I, i ∈ I, is the one with the highest characteristic vector value:

    $$i = \operatorname*{arg\,max}_{k \in I} x^I_k \quad (14)$$
    where $x^I$ has been defined through (3) and (5). The similarity matrix is built as $A^C = (a^C_{IJ})$ where $a^C_{IJ} = \exp\left(-\rho(C_i, C_j) / \sigma^2\right)$, with i and j the representatives of tracklets I and J.

In cases where two separate tracklets are created due to an occlusion between people which lasts longer than the size of the temporal window, it is less likely that the person will have changed much in appearance. Hence, it is enough to represent the two tracklets with the two representative detection responses with the highest characteristic vector values. However, yet again this technique fails if the selected representative detection is a bad representative (e.g. it can be partially occluded), generating multiple IDS.
DSC is performed three times, each time using one of the similarity matrices listed above ($A^\mu$, $A^M$ and $A^C$). The clustering results are then combined into a single affinity matrix according to the evidence accumulation paradigm [[35]]. We build a matrix $B = (b_{ij})$, known as the co-association matrix, where $b_{ij} = 0$ if i = j, and $b_{ij} = \varphi(i, j)$ otherwise, $\varphi(i, j)$ being the fraction of times objects i and j are clustered together over all the clustering results C in the ensemble:
$$\varphi(i, j) = \frac{1}{|C|} \sum_{r=1}^{|C|} \delta\big(C_r(i), C_r(j)\big) \quad (15)$$
where $C_r(i)$ is the label of the ith object in the rth clustering result and δ(n, m) is Kronecker's delta function, i.e. δ(n, m) = 1 if n = m and 0 otherwise.

To get the final clustering result, we run the DSC on the co-association matrix.
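For illustration, a minimal sketch of the co-association matrix of (15), built from the label vectors of the three DSC runs; the function name and input format are our own.

```python
import numpy as np

def co_association(labelings):
    """Co-association matrix B of (15): b_ij is the fraction of clusterings
    in the ensemble that put objects i and j in the same cluster.

    labelings : list of integer label arrays, one per clustering result
                (e.g. the three DSC results from A_mu, A_M and A_C)."""
    labelings = [np.asarray(lab) for lab in labelings]
    n = labelings[0].size
    B = np.zeros((n, n))
    for lab in labelings:
        B += (lab[:, None] == lab[None, :]).astype(float)   # Kronecker delta
    B /= len(labelings)
    np.fill_diagonal(B, 0.0)                     # b_ij = 0 for i = j
    return B
```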

5 Experiments

5.1 Experimental setup

We evaluate our approach on three publicly available datasets: the S2L1 sequence from the PETS2009 benchmark [[1]], the TUD-Stadtmitte dataset [[2]], and the ‘sunny day’ sequence from the ETH mobile dataset [[3]], which are commonly used in previous multi-target tracking works. To make the comparison fair and transparent, we used publicly available detection responses [[36], [37]] and ground truth (GT) [[38]] in all our experiments. We also used publicly available code [[39]] for the evaluation.

5.1.1 Brief description of the datasets

PETS2009-S2L1-View one [[1]] consists of 795 frames and comprises challenges such as non-linear target motion, targets in close proximity and several occlusions in the scene.

TUD-Stadtmitte [[2]] consists of 179 frames recorded in a busy pedestrian street with a low camera angle, which generates frequent occlusions. Both these datasets are recorded using static cameras.

On the contrary, the ETH dataset (‘sunny day’ sequence) [[3]] is recorded with a pair of moving cameras in a busy street scene. The cameras move forward and pan at times, which potentially makes the exploitation of known motion models less reliable. Most targets move with a speed of 2–4 pixels per frame, whereas the camera exhibits a motion of around 20 pixels per frame. This causes imprecise results when using linear motion estimations. However, the relative positions of the targets do not change between two consecutive frames, since all targets move according to the same camera motion. The sequence contains 354 frames, and the size of detected people on the image plane varies significantly. Similarly to [[40]], we used the sequence from the left camera. No ground plane information is used.

5.1.2 Parameter analysis

There is no fixed window size which works for all the datasets; rather, it depends on the dynamics and frame rate of the video. Therefore, we set window sizes of 15, 10 and 5 for the PETS2009-S2L1, TUD-Stadtmitte and ETH-Sunnyday sequences, respectively. Our algorithm performs well over a wide range of σ (the scaling parameter in (11)), from 0.2 to 3.0. The good invariance (or low sensitivity) to the σ parameter is due to the exploitation of the results from both the previous window and the low-level association along the current window to update our affinity matrix, which results in the replacement of most of the values of the affinity matrix (with the value 1, if the two nodes belong to the same cluster according to the results from the previous window or the low-level association, and with the value 0 otherwise).

For PETS2009-S2L1 and TUD-Stadtmitte, α and β in (10) (the factors controlling the contributions of appearance and position information, respectively) are typically set to one, which gives equal weight to appearance and position features. However, based on our experiments, the appearance features are more informative than position in our formulation. In fact, in many cases using appearance features only is sufficient for tracklet generation. In the case of the ETH-Sunny day sequence, we set the α and β values to 1 and 1.25, respectively, to get the best result. In fact, since this sequence is recorded with a moving camera, large changes in the poses and sizes of the people are present, resulting in significant changes in their appearance. Consequently, appearance information is less informative than position in this case.

5.1.3 Evaluation methodology

The correct and fair evaluation of multi-target tracking systems relies mainly on two issues: the definition of proper and agreed evaluation metrics; and the use of publicly-available GT on which the evaluation can be based.

Regarding the evaluation metrics, we adopted those defined in [[39]], which uses the most widely accepted protocol, CLEAR MOT [[41]]. CLEAR MOT defines several values:
  • Multi-object tracking accuracy (MOTA) combines all error types [FP, false negatives/missing detections (FN) and IDS] – the higher the better.

  • Multi-object tracking precision (MOTP) measures the normalised distance between tracker output and GT location, i.e. the precision in the bounding box (or centre of gravity) localisation – the higher the better.

  • Mostly tracked (MT) measures how many GT trajectories are successfully tracked for at least 80% of frames – the higher the better.

  • Mostly lost (ML) measures how many of the GT trajectories are tracked for <20% of the whole trajectory – the lower the better.

  • Identity switches (IDS) counts the number of times that a tracked trajectory switches its matched GT identity – the lower the better.

In addition to those metrics, we also compute the following values:
  • False alarm per frame (FAF) measures the average false alarms per frame – the lower the better.

  • Precision (Prcsn) is the average of correctly matched detections per total detections in the tracking result – the higher the better.

  • Recall (Rcll) computes the average of matched detections per total detections in the GT – the higher the better.
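For reference, the headline MOTA score aggregates the three error types over all frames; in the CLEAR MOT protocol [[41]] it is defined as:
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t \right)}{\sum_t \mathrm{GT}_t}$$
where $\mathrm{GT}_t$ is the number of ground-truth objects in frame t.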

All the reported results in Tables 1 and 2 are generated using tracking outputs provided by the authors [[39]] with the same overlap threshold for CLEAR MOT metrics. Instead, quantitative results of the compared approach reported in Table 3 are taken from [[40]].
Table 1. Tracking results on PETS2009-S2L1 sequence. For all approaches the number of GT trajectories is the same (19)
Method MOTA, % MOTP, % Rcll, % Prcsn, % FAF MT ML IDS
CVPR 2011 [[20]] 81.8 71.5 89.2 96.6 0.19 17 0 97
ECCV 2012 [[42]] 53.4 70.9 72.6 80.3 1.04 7 1 65
CVPR 2013 [[43]] 90.8 74.2 97.1 94.4 0.34 18 0 26
CVPR 2015 [[39]] 87.9 64.5 98.6 90.8 0.59 19 0 29
Ours/DSC 90.0 56.8 91.7 98.5 0.08 17 0 15
Table 2. Tracking results on TUD-Stadtmitte sequence. For all approaches the number of GT trajectories is the same (10)
Method MOTA, % MOTP, % Rcll, % Prcsn, % FAF MT ML IDS
CVPR 2011 [[20]] 67.6 72.6 75.0 92.9 0.37 5 0 96
ECCV 2012 [[42]] 60.0 56.5 83.7 78.7 1.46 7 0 12
CVPR 2013 [[43]] 53.5 72.4 79.1 82.0 1.12 7 0 20
CVPR 2015 [[39]] 69.7 53.4 74.7 94.3 0.29 6 0 4
Ours/DSC 72.4 52.6 75.1 99.8 0.01 6 0 10
Table 3. Tracking results on ETH-Sunny day sequence. For all approaches the number of GT trajectories is the same (30)
Method MOTA, % MOTP, % Rcll, % Prcsn, % FAF MT ML IDS
CVPR 2011 [[40]] – – 77.9 86.7 0.65 22 3 1
Ours/DSC 61.5 66.8 71.4 89.4 0.45 19 3 8

Regarding the GT, in order to provide a common ground for further comparison, we used the publicly-available detection responses from [[37]] for the PETS2009 and TUD datasets and those from [[40]] for ETH. As GT, for all the datasets we used the one provided in [[38]]. It is worth noting that we achieved exactly the same results even when we used modified (stricter) GTs, ensuring that all people keep one unique ID throughout the whole video. In fact, it often happens that when a person disappears from the scene (due to occlusions or because he/she exits the scene temporarily), a different ID is assigned upon reappearance (also in the public GT). The stricter GT, instead, assigns the same ID, which is not always the case in [[38]].

5.2 Results on static camera videos

Let us first introduce the results in the two most used datasets for multi-target tracking (PETS2009-S2L1 [[1]] and TUD-Stadtmitte [[2]]) which are recorded with static cameras.

Our quantitative results on PETS2009-S2L1 are shown in Table 1. Compared with up-to-date methods, our DSC tracker attains the best performance on MOTA, precision, IDS and FAF over the majority of the approaches in Table 1, while keeping recall and MT comparable.

Some visual results are presented in the first two rows of Fig. 4 (the first row shows our results, while the second row reports the GT). Even if targets appear quite close and similar in appearance, our approach is still able to maintain correct IDs, as in the case of targets 3 and 7 on frame 149 or targets 3 and 8 on frames 149 and 161. Furthermore, thanks to our robust tracklet merging step, our approach is capable of correctly identifying a target on his/her reappearance after a long absence from the scene, as in the case of targets 1, 3, 4, 5 and 6 on frame 716. Please note that the GT (second row of Fig. 4) contains different IDs when the targets reappear in the scene (see also the comment about the GT reported above).

Fig. 4 Tracking results on the PETS2009-S2L1 and TUD-Stadtmitte datasets. The first two rows refer to PETS2009-S2L1 (first row: our results; second row: the GT), whereas the last two rows refer to TUD-Stadtmitte (third row: our results; fourth row: the GT). This figure is best viewed in colour

Quantitative results on the TUD-Stadtmitte dataset are provided in Table 2. Our method attains superior results in MOTA, precision and FAF while remaining comparable in ML and IDS. However, our approach obtains relatively lower performance in recall: this is mainly because the DSC tracker focuses on assigning the right person ID to the detection responses generated by the people detector, i.e. no motion model (linear or otherwise) is used to predict next locations. As a result, our method generates a higher number of FN.

Fig. 4 shows some visual tracking results on the TUD-Stadtmitte sequence (third and fourth rows). Our approach is able to maintain correct IDs, as in the case of targets 4 and 8 on frame 61 or targets 10 and 11 on frame 104, despite their similarity in position and appearance and the severe occlusions.

5.3 Results on moving camera videos

Table 3 shows the results on the ETH-Sunny day sequence, recorded from moving cameras. Compared with [[40]], our DSC tracker achieves the best performance on precision and false alarms per frame, while having a relatively higher number of IDS. This is mainly due to the appearance and size of targets, which vary greatly along with the camera movement. However, the higher precision of our approach shows that it is able to recover the correct IDs. Fig. 5 shows a few visual results of our method (the first row shows our results, while the second row represents the GT). Even if people are fully occluded for a long time before reappearing, our approach is able to correctly re-identify them, as in the case of target 5 on frames 79, 98 and 316. We ought to note that the GT (second row of Fig. 5) gives different IDs when a target reappears after a long-lasting occlusion, as in the case of targets 8 and 29 on frames 79 and 316, respectively.

Fig. 5 Sample tracking results on the ETH-Sunny day sequence (first row: our results; second row: the GT). This figure is best viewed in colour

6 Conclusions and future works

In this paper, a dominant set clustering based tracker is proposed, which formulates the tracking task as finding dominant sets (clusters) in a constructed undirected edge-weighted graph. The development of a complete multi-target tracking framework using dominant set clustering is the main contribution of this paper. We utilise both appearance and position information for data association in a global manner, avoiding the locally-limited approach typically present in previous works. Experimental results, compared with state-of-the-art tracking approaches, show the superiority of our tracker, especially in terms of MOTA and precision, as well as lower IDS in some cases. Generally speaking, the tradeoff between precision and recall in tracking is hard to balance, thus resulting in lower performance on one of the two values. Our claim is that precision is more relevant for tracking purposes, and that assigning the right ID to targets (people) when they re-appear after an occlusion or after exiting the scene should be the key objective of tracking systems.

Regarding the efficiency of the proposed approach, for a window size of 10 frames with approximately 10 pedestrians, processing a frame using the full proposed framework, excluding human detection, takes an average time of 1.6 s for affinity matrix generation and 0.003 s for tracklet generation. Moreover, the tracklet merging step takes only 0.001 s. These values are computed by running non-optimised Matlab code on a Core i5 2.5 GHz machine. Using an optimised parallel implementation in C, the algorithm is likely to work in real time.

As future directions of research, we would first evaluate our approach on other datasets with moving cameras. In fact, the current results on the ETH dataset suggest that further validation of the robustness of our approach to identity switches on this type of dataset is required. Moreover, we would extend the methodology to multi-camera setups. We believe that our approach can be straightforwardly extended to multiple cameras, since no motion or temporal information is used in it. Another possible future work consists in evaluating different similarity measures, as well as considering different types of targets instead of people.