Dual-scale weighted structural local sparse appearance model for object tracking

It is a great challenge to develop an effective appearance model for robust visual tracking due to various interfering factors such as pose change, occlusion, and background clutter. More and more visual tracking methods exploit local appearance models to deal with these challenges. In this study, the authors present a simple yet effective weighted structural local sparse appearance model, which better describes the target appearance through patch-based generative weights. To further improve the robustness of tracking, they implement this appearance model on patches of two scales. The two derived appearance models are then combined into a collaborative model that exploits their respective advantages. Extensive experiments on the tracking benchmark dataset show that the proposed method performs favourably against several state-of-the-art methods.


Introduction
Computer vision, which uses computers to simulate the human visual system in perceiving, understanding and analysing the surrounding environment, has a wide range of applications, such as scene labelling [1], road detection [2], automatic recognition [3, 4], image labelling [5], person re-identification [6], human pose recovery [7, 8] and so on. Visual tracking, one of the fundamental topics of computer vision, plays a critical role in numerous lines of research including video surveillance, vehicle navigation, motion analysis, aeronautics, and astronautics. Although significant progress has been made in the past few decades, visual tracking is still a challenging problem due to the numerous uncertain factors in the tracking process, such as partial occlusion (OCC), illumination change, background clutter and viewpoint variations.
Generally, an online tracking algorithm includes three main components: (i) a motion model, which models the temporal consistency of the target states and generates a number of candidate targets around the target of the current frame to predict the target in the next frame; (ii) an appearance model, which is used to represent the tracked target during tracking; (iii) an observation model, which evaluates the likelihood of each candidate being the true target and selects the best candidate as the tracking result. This study focuses on the appearance model, which is the core of a tracking algorithm.
Generative models formulate tracking as the problem of searching for the image observation with the minimal reconstruction error or the greatest similarity to the target model. In generative tracking methods, the representation of object appearance is very important. Many representation schemes have been proposed, including template-based [11, 15, 28], subspace-based [12, 29], and sparse representation-based [13, 30–32] models. In the template-based algorithms, colour histograms [33] and pixel intensities [34] are used to model object appearance for visual tracking. In recent years, subspace-based algorithms have been widely studied. Ross et al. [12] propose an incremental visual tracking (IVT) method, which represents the tracked target by a low-dimensional principal component analysis (PCA) subspace and updates the PCA subspace online to capture appearance changes of the target. Motivated by the work of Wright et al. [35], some researchers introduce sparse representation into visual tracking and achieve great success. Mei and Ling [13] adopt a holistic appearance model of the target and track the target by solving an l1 minimisation problem. Liu et al. [30] propose a method that exploits a local appearance model based on histograms of sparse coefficients and the mean-shift algorithm for tracking.
In this study, we present an efficient tracking algorithm based on a dual-scale weighted structural local sparse appearance (WLSA) model. The proposed method fully takes into account the similar local structure and inner geometric layout information of the target in an image sequence. Local patches in a fixed spatial layout within a target region are extracted and encoded by a dictionary established beforehand. In the local appearance model, all individual patches are stacked together to form a complete representation of the target, which means each local patch describes some extent of the target appearance information. Thus, patch-based weights representing the importance of local patches in describing the target appearance are computed through the structural reconstruction error. Based on the patch-based weights and the alignment-pooling method, the coding coefficients of the patches are integrated to obtain a robust representation of the target object. This helps locate the target accurately and handle partial OCC effectively. Combining spatial and structural information of different scales makes the representation more discriminative and robust, so the final likelihood of a candidate region is estimated from the association of patches extracted at two different scales. The contributions of this work are summarised as follows:
i. We propose a simple yet effective weight calculation method that helps local patches better describe the target appearance information.
ii. A novel WLSA model is presented for robust representation.
iii. To further improve the tracking performance, the proposed WLSA model is implemented at two different patch scales.
iv. The proposed method is evaluated on a large benchmark dataset and achieves competitive results compared with several state-of-the-art methods.
The rest of this work is arranged as follows. In Section 2, we briefly introduce the related work. In Section 3, we describe the proposed method in detail. Section 4 presents the experimental results of the proposed method and some state-of-the-art methods. Finally, a conclusion about this work is given in Section 5.

Related work
Sparse representation has been widely studied and successfully applied in the field of visual tracking [13, 30–32, 36–40]. With sparsity constraints, a signal can be represented as a linear combination of only a few basis vectors. In [13, 31], a candidate target is sparsely represented by a linear combination of the atoms of a dictionary composed of target templates and trivial templates. The sparse representation problem is then solved through l1 minimisation with non-negativity constraints. In [36], dynamic group sparsity constraints of spatial and temporal adjacency are incorporated to improve the efficiency and robustness of the tracking algorithm. Liu et al. [30] employ a local sparse representation scheme to model the target appearance and use a sparse coding histogram to represent the basic parts of the target. Zhang et al. [40] propose a multi-task tracking algorithm, which exploits the interdependency between candidate regions through joint group sparsity constraints. Jia et al. [41] extract local patches in a fixed spatial layout and adopt an alignment-pooling method across the local patches to obtain an adaptive structural local sparse appearance (ASLA) model for robust tracking. Zhao et al. [42] and Xie et al. [43] construct structural local sparse appearance models on patches of two scales and seven scales, respectively, and then associate these models for effective tracking. Our method is inspired by the work in [41–43]. First, we propose a novel object representation method, the WLSA model. To improve tracking performance, the proposed WLSA model is then carried out on local image patches of two scales, constructing two local appearance models. Finally, we combine these two local models into a dual-scale appearance model through a collaborative mechanism.

WLSA model
Most tracking methods use either a holistic template or local appearance model to represent the target. In this work, a local sparse representation is developed to model the appearance of target patches and represent them with the corresponding sparse coefficients. Given an object image, we can use a sliding window with a fixed size to extract a set of local image patches with a spatial layout. If we have a set of target templates T = [T 1 , T 2 , …, T n ], we extract local patches from each target template in the same way. Then a dictionary D = [d 1 , d 2 , …, d (n × N) ] ∈ R q × (n × N) to encode local patches inside the possible candidate regions can be obtained, where q is the dimension of the image patch vector, n is the number of target templates, N is the number of local patches sampled within each target template. Each column of the dictionary D is obtained by l 2 normalisation on a vectorised image patch extracted from T. The first n frames are tracked using the nearest neighbour searching algorithm [44] and tracking results (warped to a canonical size of 32 × 32) are used to form the target template set T. Similarly, we extract local patches within a candidate target and turn them into vectors in the same way, which are denoted as Y = [y 1 , y 2 , …, y N ] ∈ R q × N . Then, the following three stages are performed as shown in Fig. 1.
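As a concrete sketch of this patch extraction and dictionary construction step, the following Python code (function names and the grey-scale input are our own illustrative choices) slides a fixed-size window over a warped 32 × 32 template, vectorises each patch, and l2-normalises it into a dictionary column:

```python
import numpy as np

def extract_patches(image, patch_size, step):
    """Slide a fixed-size window over a warped target region and return
    each patch as an l2-normalised column vector."""
    h, w = image.shape
    cols = []
    for r in range(0, h - patch_size + 1, step):
        for c in range(0, w - patch_size + 1, step):
            p = image[r:r + patch_size, c:c + patch_size].ravel().astype(float)
            norm = np.linalg.norm(p)
            cols.append(p / norm if norm > 0 else p)
    return np.stack(cols, axis=1)          # shape: (q, N)

def build_dictionary(templates, patch_size=16, step=8):
    """Concatenate the patches of all n target templates into D of
    shape (q, n * N)."""
    return np.hstack([extract_patches(t, patch_size, step) for t in templates])
```

With the settings from the experimental setup (32 × 32 templates, 16 × 16 patches, step 8), each template yields N = 9 patches of dimension q = 256, so n = 10 templates give a 256 × 90 dictionary.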

Sparse coding:
With the dictionary D, each patch y_i can be sparsely represented as a linear combination of the atoms of the dictionary by solving

min_{b_i} ||y_i − Db_i||_2^2 + λ||b_i||_1, s.t. b_i ≥ 0,

where ||·||_2 and ||·||_1 denote the l2 and l1 norms, respectively, λ is the regularisation parameter, b_i ∈ R^((N × n) × 1) is the sparse code of the patch y_i, and b_i ≥ 0 means all the elements of b_i are non-negative. B = [b_1, b_2, …, b_N] represents the sparse codes of the N local patches within the candidate region.
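The non-negative l1 problem above can be sketched with a simple projected ISTA solver (the paper solves it with the SPAMS package; this numpy version is only an illustrative first-order substitute):

```python
import numpy as np

def nonneg_sparse_code(y, D, lam=0.01, n_iter=500):
    """Solve min_b ||y - D b||_2^2 + lam * ||b||_1  s.t. b >= 0
    by projected ISTA: gradient step, l1 shrinkage, projection onto b >= 0."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
    b = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ b - y)
        b = np.maximum(b - (grad + lam) / L, 0.0)  # shrink and project
    return b
```

Because b is constrained to be non-negative, ||b||_1 reduces to the sum of its entries, so the shrinkage and the projection combine into a single `maximum` step.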

Accumulation and weighting:
According to the target templates, the sparse coefficient vector of the patch y_i is split into n (the number of the target templates) segments,

b_i = [(b_i^(1))^T, (b_i^(2))^T, …, (b_i^(n))^T]^T,

and these segments are accumulated into

v_i = (1/C) Σ_{k=1}^{n} b_i^(k),

where the vector v_i corresponds to the patch y_i and C is a normalisation term. In the local appearance model, each local patch represents a fixed part of the target object. Hence, each patch describes a different extent of the target appearance information, i.e. each patch has a different importance. Here, we introduce a set of weights to embody the importance of the patches. After obtaining the sparse code b_i of the patch y_i, we reformulate it as

b̃_i = ω_i ⊙ b_i,

where ω_i is an indicator vector and ⊙ denotes element-wise multiplication, so that the reconstruction error of the patch y_i is

ε_i = ||y_i − D(ω_i ⊙ b_i)||_2^2.

To make full use of the sparse code b_i, a penalty term γ||1 − ω_i ⊙ b_i||_1 is added, so the reconstruction error ε_i can be rewritten as

ε_i = ||y_i − D(ω_i ⊙ b_i)||_2^2 + γ||1 − ω_i ⊙ b_i||_1,

where γ is a control parameter. If the candidate is accurate, the local patch y_i can be represented well by the sub-dictionary D_i and, at the same time, the penalty term will be very small; otherwise both will be large.
The weight of the patch y_i is calculated by

w_i = exp(−ε_i).

Fig. 2 shows the calculation process of the patch-based weight. Then, we use the following formula to perform the weighting operation:

z_i = w_i v_i.
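Assuming the weight takes the exponential form w_i = exp(−ε_i) (our reading, consistent with the "small error, large weight" behaviour described above), the per-patch weight can be sketched as:

```python
import numpy as np

def patch_weight(y, D, b, omega, gamma=0.01):
    """Weight of one patch: reconstruction error under the masked sparse
    code plus the penalty term, mapped through exp(-error) so that a
    well-reconstructed patch receives a weight close to 1."""
    masked = omega * b                                   # element-wise selection
    err = np.sum((y - D @ masked) ** 2) + gamma * np.sum(np.abs(1.0 - masked))
    return np.exp(-err)
```

A clean patch (small reconstruction error) thus gets a larger weight than a contaminated one, which is the behaviour the weighting scheme is designed to produce.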

Alignment pooling:
The candidate target contains N local image patches, and all their vectors z_i form a square matrix Z = [z_1, z_2, …, z_N]. According to the spatial layout of the target, the appearance variation of a local patch can be best described by the patches at the same position of the target templates (i.e. using the sparse coefficients with the aligned position). As in [41], we take the diagonal elements of the square matrix Z as the pooled features of the local patches of the candidate region:

f = diag(Z) = [z_11, z_22, …, z_NN]^T,

where f is the vector of pooled features, and its ith element reflects the structural similarity between the ith patch of the candidate target and the ith patch of the target object. After the above three stages, the likelihood of the candidate target being the true target is measured by

L_c = Σ_{i=1}^{N} f_i.

Different from the model in [41], which utilises the same weight for each patch, the proposed method gives a smaller weight to a contaminated patch and a larger weight to a reliable patch, thereby mitigating the influence of noise introduced by target appearance variations.
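A minimal sketch of the alignment-pooling step, with the candidate score taken as the sum of the pooled features (the summation form is our assumption for the similarity measure):

```python
import numpy as np

def alignment_pool(Z):
    """Take the diagonal of the square matrix Z = [z_1, ..., z_N] as the
    pooled feature vector f, and sum it to score the candidate."""
    f = np.diag(Z)
    return f, float(np.sum(f))
```

Only the aligned (diagonal) coefficients contribute, so a patch is scored against the template patches at the same spatial position, as described above.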

Dual-scale WLSA (DSWLSA) model
To further improve tracking performance, we adopt representations of two different patch sizes to model the target appearance. We extract local patches of different sizes within the target templates and construct a dictionary at each scale,

D^s = [d_1^s, d_2^s, …, d_(n × N_s)^s] ∈ R^(q_s × (n × N_s)), s = 1, 2,

where d_j^s is the jth column, a vectorised and l2-normalised image patch at the sth scale, q_s is the dimensionality of the image patches at the sth scale, and N_s is the number of local image patches at the sth scale. Let Y^s = [y_1^s, y_2^s, …, y_(N_s)^s] be the vectorised image patches of the different scales extracted from a candidate target, where y_i^s ∈ R^(q_s × 1) is the ith local patch at the sth scale. Following the first stage of Section 3.1, we generate sparse coefficients for each local patch: with the sparsity assumption, each local patch y_i^s can be encoded with the dictionary D^s. The second stage then applies the accumulation and weighting operations of Section 3.1 at each scale. In the third stage, the vector of pooled features at the sth scale is obtained by alignment pooling. The detailed process of the DSWLSA model can be seen in Fig. 3.

Collaborative tracking model
In this study, object tracking is carried out within the particle filter framework. Let the affine parameters x_t = (l_x, l_y, μ_1, μ_2, μ_3, μ_4) represent the target state, where l_x and l_y denote the horizontal and vertical translation, μ_1 the rotation angle, μ_2 and μ_3 the scale and aspect ratio, and μ_4 the skew parameter. Given the observation set of the target o_{1:t} = {o_1, o_2, …, o_t} up to frame t, the current target state is obtained by the maximum a posteriori estimate

x̂_t = arg max_{x_t^i} p(x_t^i | o_{1:t}),

where x_t^i indicates the ith sample at frame t. Based on the Bayes theorem, the posterior probability p(x_t^i | o_{1:t}) can be estimated recursively by

p(x_t | o_{1:t}) ∝ p(o_t | x_t) ∫ p(x_t | x_{t−1}) p(x_{t−1} | o_{1:t−1}) dx_{t−1},

where p(x_t | x_{t−1}) describes the state transition of the target between consecutive frames and is often formulated as a Gaussian distribution

p(x_t | x_{t−1}) = N(x_t; x_{t−1}, Σ),

where Σ is a diagonal covariance matrix whose elements are the variances of the affine parameters.
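The Gaussian state transition can be sketched as follows; note that the diagonal of Σ holds variances, so sampling multiplies standard normal noise by their square roots:

```python
import numpy as np

def propagate_particles(particles, variances, rng):
    """Draw new particle states from p(x_t | x_{t-1}) = N(x_t; x_{t-1}, Sigma)
    with a diagonal Sigma. Each particle row holds the six affine
    parameters (lx, ly, rotation, scale, aspect, skew)."""
    std = np.sqrt(np.asarray(variances, dtype=float))
    return particles + rng.normal(0.0, 1.0, size=particles.shape) * std
```

With the variances from the experimental setup, the aspect-ratio and skew components have zero variance and therefore stay fixed across frames.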
The observation model p(o_t | x_t^i) describes the likelihood of the observation o_t at state x_t^i belonging to the target class and plays a key role in robust tracking. In our method, the observation model is constructed as

p(o_t | x_t^i) = L^(1) + η L^(2),

where the first term on the right side denotes the similarity between the candidate and the target at the first scale, the second term denotes the similarity at the second scale, and η is a ratio factor balancing the contributions of the two scale models.
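The collaborative observation model then reduces to a one-line combination (the additive form follows the description in this section, and η = 0.2 is the value used in the experimental setup):

```python
def collaborative_likelihood(score_scale1, score_scale2, eta=0.2):
    """Combine the two single-scale similarity scores into the
    observation likelihood: first-scale score plus eta times the
    second-scale score."""
    return score_scale1 + eta * score_scale2
```

With η < 1, the first (coarser) scale dominates while the second scale contributes complementary fine-grained evidence.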

Template update
The appearance of the target often changes significantly during the tracking process, because of various interferences such as illumination change, background clutter, and so on. Visual tracking with fixed templates cannot adapt to the appearance variation of the target and is prone to fail. In contrast, if we update the template too frequently with new observations, errors are likely to accumulate, which will cause the tracker to drift away from the target. Many methods have been proposed for template update [12, 13, 31, 41, 45]. The authors of [13, 31] give each target template in the template set a weight indicating the importance of that template in representing the tracking results. The target template with the smallest weight is replaced by the current tracking result. Under this update scheme, the template set is easily polluted by a contaminated tracking result, which causes tracking failure. Ross et al. [12] propose an incremental PCA algorithm to update the eigenbasis and the mean vector as new observations arrive. However, PCA is sensitive to partial OCC. Jia et al. [41] introduce subspace learning into sparse representation and reconstruct a new template to replace an old one, which reduces the influence of occluded target templates. In this work, we update the PCA subspace (including the PCA basis and the mean vector) and the target template similarly to [12, 41]. After an estimated target x is obtained, it can be modelled by a linear combination of PCA basis vectors and additional trivial templates:

x = Ua + e,

where U is the matrix of PCA basis vectors, a is the vector of coefficients of the PCA basis vectors, and e indicates the corrupted or occluded pixels in x. As the error caused by OCC and noise is arbitrary and sparse, this problem can be solved as an l1-regularised least squares problem

min_c ||x − Hc||_2^2 + λ||c||_1,

where H = [U I], c = [a; e] and λ is the regularisation parameter. Then, we reconstruct a new template by

T_new = Ua,

which discards the sparse error term. The new template T_new is used for updating the target template set T.
To do this, we generate a cumulative probability sequence and a random number, which together determine the template in T to be replaced by T_new.
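The template reconstruction step can be sketched with a simple alternating solver (our simplification of the joint l1 solve over c = [a; e]): estimate a by projection onto the PCA basis, estimate the sparse error e by soft thresholding the residual, and return the cleaned template:

```python
import numpy as np

def update_template(x, U, lam=0.01, n_iter=50):
    """Alternately estimate the subspace coefficients a (projection onto
    the orthonormal PCA basis U) and the sparse error e (soft
    thresholding), then return the cleaned template T_new = U a."""
    e = np.zeros_like(x)
    for _ in range(n_iter):
        a = U.T @ (x - e)                                   # least-squares fit
        r = x - U @ a
        e = np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)   # soft threshold
    return U @ a
```

Because occlusion-like corruption ends up absorbed by the sparse error e, the returned template is close to the clean subspace reconstruction rather than the corrupted observation.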

Experimental setup
The proposed algorithm is implemented in MATLAB and runs at 0.5 frames per second on an Intel Core i7-4770 CPU (3.4 GHz) machine. The number of templates n is 10. The l1 minimisation problem is solved with the SPAMS package [46]. The regularisation parameter λ and the control parameter γ are both set to 0.01. For the motion model, the variances of the affine parameters are set to (l_x, l_y, μ_1, μ_2, μ_3, μ_4) = (6, 6, 0.01, 0, 0.005, 0). The number of samples for the particle filter is 600. In our experiments, the target and each candidate are warped to 32 × 32 pixels. Local image patches of two different scales are extracted within the target and each candidate region with a step length of 8 pixels. The first patch size is 16 × 16 and the second is 8 × 8.
For the template update, ten eigenvectors are used to implement incremental subspace learning and the update frequency ν is set to 5. In our approach, the ratio factor η in the observation model is set to 0.2. All the parameters are fixed for all the experiments.

Quantitative evaluation:
Two traditional metrics are used to evaluate the above-mentioned methods. The first metric is the centre error, defined as the Euclidean distance between the centre locations of the tracked objects and the corresponding labelled ground truth. The second is the overlap rate, usually defined as score = area(R_T ∩ R_G)/area(R_T ∪ R_G), where R_T is the tracking bounding box and R_G is the ground-truth bounding box. Table 1 reports the average centre errors in pixels, where a smaller average error means a more accurate result. From Table 1, we can see that the proposed tracker ranks within the top three on 20 out of 29 test sequences. Table 2 presents the average overlap rates, where a larger value means a more accurate result. Table 2 shows that the proposed tracker ranks within the top three on 23 out of 29 test sequences. From Tables 1 and 2, the proposed tracking algorithm achieves excellent performance on most of the test sequences. However, there are outliers (nine sequences in Table 1 and six sequences in Table 2). The reason may be twofold. First, we employ a fixed rather than adaptive ratio factor for connecting the two scales, which compromises the performance on some sequences. Second, the proposed method is not robust to low resolution (LR), motion blur (MB), fast motion (FM) and out of view, as can be seen from the quantitative results on the 'dudek' and 'boy' sequences.

Qualitative evaluation:
Illumination variations (IVs): Fig. 4 presents the tracking results on several sequences with IVs. In the fish sequence, the target undergoes obvious light changes while the camera moves fast. The DLT method cannot perform well from the start of the sequence (e.g. #20). The SCM and VTD methods perform unstably and deviate from the target position during tracking (see #360 and #420). In contrast, the other seven methods successfully track the target from start to end. For the Sylvester sequence, the Struck, TGPR and our DSWLSA method perform better than the other methods and achieve more accurate tracking results. The target in the doll sequence undergoes long-term scale and illumination changes. The VTD, KCF, TGPR, Struck and TLD methods cannot deal with SV well (see #940, #1950 and #3872). The DLT, SCM, IVT and ASLA methods cannot accurately track the target at the end (see #3872). Our DSWLSA method precisely tracks the target till the end. In the car4 sequence, the target undergoes large IV when the car passes under a bridge. Most trackers are able to track the target in all frames. However, the VTD method drifts off the target after the car passes under the bridge (see #240, #400 and #500). The TLD method shows a large deviation when the target undergoes a drastic illumination change (e.g. #240). In the cardark sequence, the TLD, VTD, DLT and IVT methods fail to track the target at the end (see #393). The KCF method exhibits a slight deviation during tracking (see #200, #240 and #393). The ASLA, Struck, SCM, TGPR and our DSWLSA method precisely lock onto the target for the whole sequence.
Rotation: Fig. 5 shows some sampled results on five sequences with rotation. In the david2 sequence, the head of the man swings randomly. Except for DLT, the other methods can overcome the challenge of rotation and achieve good performance. In the dog1 sequence, the target experiences rotation and SV. We can see that our DSWLSA tracker performs best on this sequence. In the girl sequence, the target goes through 360° out-of-plane rotation (OPR). The TLD method loses the target at frame #100, but recaptures the correct target in subsequent frames by using a detector. The DLT method drifts away, and the KCF and TGPR methods lock onto another person at the end (see #500). The IVT method fails to track after 200 frames, as can be seen from #240, #350 and #500. The VTD method deviates from the target when the girl rotates her head (e.g. #240 and #350). The ASLA, Struck, SCM and our DSWLSA tracker persistently track the target throughout the whole sequence. In the carscale sequence, the target undergoes rotation and OCC. The ASLA, IVT, DLT and our DSWLSA method perform better than the other methods. The boy sequence is challenging due to many factors: FM, MB, as well as rotation. The ASLA, SCM and IVT trackers perform poorly (see #340, #500 and #602). Unlike these three methods, the other seven methods robustly handle the multiple challenges and steadily track the target to the end.
OCC: For visual tracking, OCC is one of the most common and critical challenges. Fig. 6 illustrates tracking results on six challenging sequences where the targets are severely or long-term occluded. In the faceocc1 sequence, the woman frequently occludes her face with a book. Except for the ASLA method, the other methods are able to track the target to some extent, but the SCM, TGPR and our DSWLSA method perform favourably against the others. The target in the faceocc2 sequence undergoes heavy OCC. Most trackers can successfully track the target from start to end. However, the ASLA method drifts to the background at the end (see #812). In the jogging2 sequence, a person is almost completely occluded by a lamp post for a short term. The ASLA, DLT, Struck, VTD, KCF and IVT methods cannot reacquire the target and undergo large drift after the person passes the obstacle (see #100 and #150). The TLD method locks onto another person at frame #100, but it can re-detect the target in subsequent frames. The TGPR, SCM and our DSWLSA method perform well and precisely track the person to the end of the sequence. In the suv sequence, the target is heavily occluded by dense tree branches and another bus. The IVT and Struck methods are sensitive to OCC and cannot track the target properly (see #95 and #550). The ASLA method fails to track the target from frame #700 to the end. The TGPR, VTD and TLD methods lose the target at frame #550, but the TLD method is able to reacquire the target and track it to the end. The SCM, KCF, DLT and our DSWLSA method successfully track the target throughout the entire sequence. In the woman sequence, the person is partially occluded by cars for a long time. The KCF, TGPR, SCM, Struck, DLT and our DSWLSA method perform well and obtain satisfying results. In the walking2 sequence, the walking woman undergoes a long-term OCC by a man together with SV. The TLD, VTD, ASLA and KCF methods severely drift away when the woman reappears (e.g. #240). The TGPR and Struck methods cannot handle SV well (e.g. #400 and #500). The SCM, IVT, DLT and our DSWLSA method perform better than the other methods and achieve more accurate results.
SVs: Fig. 7 shows some representative results on five sequences with scale variations (SVs). The singer1 sequence is very challenging as the target moves far away from the camera with a large scale change. Furthermore, the stage light changes drastically and the background is cluttered. The TGPR method performs poorly when significant IVs occur (e.g. #140, #200 and #350). The IVT method slightly deviates from the target position (e.g. #140, #200 and #250). The Struck, KCF and VTD methods fail to follow the SV of the target. Unlike the above methods, the ASLA, TLD, SCM, DLT and our DSWLSA method perform very well on this sequence. For the walking sequence, the ASLA, IVT and our DSWLSA method achieve better performance. In the freeman1 sequence, a person undergoes a large SV and rotation of his face. The DLT, KCF, ASLA and TLD methods cannot track the target from frame #150 to the end. The IVT method loses the target at the end (e.g. #326). The VTD, TGPR and Struck methods perform unstably and drift away from the target during tracking (e.g. #90 and #150). The SCM and our DSWLSA method can track the target well. In the freeman3 sequence, a person moves towards the camera with a large SV of his face. Along with the scale change come pose variation and LR, which make the tracking task more difficult. The VTD, IVT and Struck methods drift to the background regions (e.g. #400 and #460). The TGPR method fails to track the target from the beginning of the sequence (e.g. #50). The TLD method cannot stably track the target and loses it, as can be seen at frames #140, #400 and #460. The KCF method fails to keep track of the target at the end (e.g. #460). The ASLA, SCM, DLT and our DSWLSA method successfully track the target throughout the entire sequence, and our DSWLSA method performs best.
Background clutters (BCs): Fig. 8 compares all the trackers in handling background clutter on four sequences. In the subway sequence, the DLT method cannot track the target correctly from the start of the sequence (see #20). The VTD, ASLA, IVT and TLD methods fail and drift away to the background when the tracked person is occluded by another walking person (e.g. #42). The SCM, Struck, TGPR, KCF and our DSWLSA method precisely track the target to the end. In the football sequence, the scene is cluttered, and the similarity between the target and the background makes tracking more challenging. Only the TGPR and our DSWLSA method robustly track the target in all frames. The trellis sequence contains many challenging factors, such as background clutter, OPR and IV. The TLD and VTD methods perform unstably and some of their tracking results frequently jump away (see #400 and #500). The DLT and IVT methods fail to track the target after 200 frames (e.g. #270, #400 and #500). The ASLA method does not perform well at frame #400. The SCM and our DSWLSA method lock onto the target more stably and obtain better accuracy than the Struck, TGPR and KCF methods. For the dudek sequence, all the methods perform well. In the MountainBike sequence, the target undergoes background clutter as well as rotation. Except for the TLD method, the other nine methods are able to reliably track the target to the end.
Deformation (DEF): In Fig. 9, the sequences skating1, mhyang, david and crossing are selected to evaluate the robustness of the trackers against non-rigid DEF. In the skating1 sequence, the dancer continuously changes pose. At the same time, the target suffers from drastic illumination change and background clutter. The IVT method performs poorly (see #50, #150 and #200). The TLD method is unable to persistently locate the target (e.g. #150, #280 and #310). The TGPR, ASLA, Struck, SCM and DLT methods severely deviate from the target in the end (see #310). The VTD and our DSWLSA method successfully track the target and achieve the most stable performance. In the mhyang sequence, the ASLA, DLT and our DSWLSA method perform better than the other methods and achieve more accurate tracking results. In the david sequence, a person moves from a dark room to a bright area with a large IV, and he changes the orientation of his face from time to time. The DLT and Struck methods fail to lock onto the target (see #459, #599, #649 and #770). The ASLA, SCM and our DSWLSA method achieve the best performance. For the crossing sequence, non-rigid DEF, scale and IVs are the main challenges. The TLD and VTD methods lose the target (see #80 and #120). The IVT method is less effective on this sequence (e.g. #60, #80 and #120). The remaining seven methods successfully track the target to the end, and our DSWLSA method achieves the highest accuracy.

Benchmark evaluation
In this subsection, we evaluate our algorithm on the recent CVPR13 benchmark [47]. The precision plot and the success plot are used to evaluate all the compared trackers. The success plot is based on the overlap rate and shows the percentage of frames whose overlap rate is higher than a given threshold t0 ∈ [0, 1]. The area under the curve (AUC) of each success plot is often used to rank tracking algorithms. The precision plot shows the percentage of frames in which the tracked location is within a given threshold distance of the ground truth. In Fig. 10, we report the results of one-pass evaluation based on the average success and precision rates for the ten trackers. It can be seen from this figure that our DSWLSA method achieves very promising tracking performance in the benchmark evaluation, ranking first in success rate and second in precision rate.
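The overlap rate and the success plot/AUC computation can be sketched as follows (boxes given as (x, y, w, h); averaging the success curve as an AUC approximation is a common shortcut):

```python
import numpy as np

def overlap_rate(box_t, box_g):
    """IoU between the tracker box and the ground-truth box,
    boxes given as (x, y, w, h)."""
    xa, ya = max(box_t[0], box_g[0]), max(box_t[1], box_g[1])
    xb = min(box_t[0] + box_t[2], box_g[0] + box_g[2])
    yb = min(box_t[1] + box_t[3], box_g[1] + box_g[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_t[2] * box_t[3] + box_g[2] * box_g[3] - inter
    return inter / union if union > 0 else 0.0

def success_curve(overlaps, thresholds=np.linspace(0, 1, 101)):
    """Fraction of frames whose overlap exceeds each threshold; the mean
    of this curve approximates the AUC used to rank trackers."""
    overlaps = np.asarray(overlaps)
    curve = np.array([(overlaps > t).mean() for t in thresholds])
    return curve, float(curve.mean())
```

The precision plot is computed analogously, replacing the overlap rate by the centre-location error and the threshold sweep by distances in pixels.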
In the recent CVPR13 benchmark, all 50 video sequences are annotated with 11 different attributes: LR, in-plane rotation (IPR), OPR, SV, OCC, DEF, BCs, IV, MB, FM, and out-of-view (OV). Results on these annotated attributes can be used to analyse the strengths and weaknesses of each tracker. Fig. 11 shows the success plot for each attribute. We note that the proposed tracker achieves top-three results on 7 out of 11 attributes, and outperforms the baseline ASLA method on all 11 attributes. On the sequences with the OPR and SV attributes, the proposed tracker ranks first among all the evaluated trackers. For the sequences with the OCC and IPR attributes, the proposed tracker ranks second, with a narrow margin (<1%) to the best performing method, KCF. On the sequences with the IV, DEF and background clutter attributes, the proposed tracker ranks third, behind the KCF and TGPR methods. The proposed tracker ranks outside the top three on the sequences with the MB and FM attributes, since it employs a simple motion model based on stochastic search. We expect that using optical flow to obtain prior knowledge of the target location, as done in [9], would enable more robust tracking. For the sequences with the OV attribute, the proposed tracker ranks in the middle because we do not recover the target once it moves out of view. Since we build the target templates and candidate targets with simple grey-scale features, the proposed tracker does not achieve ideal performance on sequences with the LR attribute and ranks sixth. Table 3 reports the comparison of tracking speed in frames per second (fps). All tested trackers are run under the same configuration for a fair comparison. From Table 3, KCF, IVT and TLD rank in the top three. The proposed tracker is slow because its implementation is not optimised for speed in this work.

Proposed algorithm with different ratio factors
The ratio factor η is a crucial parameter in our method, which controls the trade-off between the contributions of two different scale models. In this subsection, we investigate the tracking performance of the proposed algorithm with different ratio factors in the collaborative model. Fig. 12 shows the quantitative results on the recent CVPR13 benchmark. Experimental results show that the proposed algorithm achieves the best performance when the value of the ratio factor η is 0.2.

Effectiveness of patch-based weight
In this subsection, we present experimental results to validate the effectiveness of the proposed patch-based weight. We evaluate the ASLA model and our WLSA model on the recent CVPR13 benchmark. Here, our WLSA tracker and the baseline ASLA tracker use the default patch size of 16 × 16. Fig. 13 shows the precision and success plots of these two trackers. From the results, we can see that our WLSA tracker improves the tracking performance of the original ASLA tracker by 7.6% and 9.9% in success score and precision score, respectively. This can be attributed to the added patch-based weight: our WLSA model assigns smaller weights to contaminated local patches and larger weights to reliable ones, which reduces the influence of noise caused by target appearance variations.

Validation of dual-scale design
In this subsection, we present experimental results to validate the effectiveness of our dual-scale design. Here, we use three trackers, including the dual-scale model DSWLSA and two single-scale models. WLSA (8 × 8) and WLSA (16 × 16) are utilised to denote the two single-scale models. Fig. 14 shows that the dual-scale tracker performs more robustly than the two single-scale trackers, which demonstrates the effectiveness of our dual-scale design.

Conclusion
In this study, we propose an effective WLSA model, which fully exploits the structural and local information of the target through the patch-based weight and the alignment-pooling method. This appearance model is then applied to patches of two scales to model the target appearance. The contributions of the two derived appearance models are integrated in a unified manner to construct the final dual-scale tracker, which increases the accuracy and robustness of tracking by combining their advantages. Quantitative and qualitative comparisons with numerous state-of-the-art methods on a large benchmark dataset demonstrate the effectiveness and superiority of the proposed algorithm. Future work includes investigating a particle selection method to speed up the proposed tracker. In addition, we plan to utilise an efficient re-detection mechanism and an online-learned location prior for more effective tracking.