This code shows how to track pedestrians using a camera mounted in a moving car.
This code shows how to automatically detect and track people in a video from a moving camera. It demonstrates the flexibility of a tracking system adapted to a moving camera, which is ideal for automotive safety applications. Unlike the stationary-camera example, Motion-Based Multiple Object Tracking, this code contains several additional algorithmic steps: people detection, customized non-maximum suppression, and heuristics to identify and eliminate false-alarm tracks.
Auxiliary Input and Global Parameters of the Tracking System:
This tracking system requires a data file that contains information that relates the pixel location in the image to the size of the bounding box marking the pedestrian’s location. This prior knowledge is stored in a vector, pedScaleTable. The n-th entry in pedScaleTable represents the estimated height of an adult person in pixels. The index n references the approximate Y-coordinate of the pedestrian’s feet.
To obtain such a vector, a collection of training images was taken from the same viewpoint and in a similar scene to the testing environment. The training images contained pedestrians at varying distances from the camera. Using the Image Labeler app, bounding boxes of the pedestrians in the images were manually annotated. The heights of the bounding boxes, together with the locations of the pedestrians in the image, were used to generate the scale data file through regression. Here is a helper function to show the algorithmic steps to fit the linear regression model:
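The regression step described above can be sketched as follows. This is a minimal Python illustration, not the example's helper function: it assumes a straight-line fit of bounding-box height against the Y-coordinate of the feet, and the names fit_line, build_scale_table, and the sample annotations are hypothetical. Python rows are 0-based, whereas a MATLAB vector would be 1-based.

```python
def fit_line(samples):
    """Least-squares fit of height = slope * y_feet + intercept."""
    n = len(samples)
    mean_y = sum(y for y, _ in samples) / n
    mean_h = sum(h for _, h in samples) / n
    sxx = sum((y - mean_y) ** 2 for y, _ in samples)
    sxy = sum((y - mean_y) * (h - mean_h) for y, h in samples)
    slope = sxy / sxx
    return slope, mean_h - slope * mean_y


def build_scale_table(samples, image_height):
    """Expected pedestrian height (pixels) for feet at each image row."""
    slope, intercept = fit_line(samples)
    # Rows where the fitted line goes negative are clamped to 0.
    return [max(0.0, slope * row + intercept) for row in range(image_height)]


# Hypothetical annotations: (Y-coordinate of feet, box height), in pixels.
samples = [(200, 40), (300, 90), (400, 140)]
ped_scale_table = build_scale_table(samples, image_height=480)
print(ped_scale_table[300])
```

With these made-up samples the fitted line is height = 0.5 * y - 60, so the table is zero near the top of the image and grows toward the bottom, where pedestrians are closer to the camera.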
There is also a set of global parameters that can be tuned to optimize the tracking performance. The descriptions below explain how each parameter affects the tracking performance.
ROI: Region-Of-Interest in the form of [x, y, w, h]. It limits the processing area to ground locations.
scThresh: Tolerance threshold for scale estimation. When the difference between the detected scale and the expected scale exceeds the tolerance, the candidate detection is considered to be unrealistic and is removed from the output.
gatingThresh: Gating parameter for the distance measure. When the cost of matching the detected bounding box and the predicted bounding box exceeds the threshold, the system removes the association of the two bounding boxes from tracking consideration.
gatingCost: Value placed in the assignment cost matrix to discourage a track-to-detection assignment that was rejected by gating.
costOfNonAssignment: Value for the assignment cost matrix for not assigning a detection or a track. Setting it too low increases the likelihood of creating a new track, and may result in track fragmentation. Setting it too high may result in a single track corresponding to a series of separate moving objects.
timeWindowSize: Number of frames required to estimate the confidence of the track.
confidenceThresh: Confidence threshold to determine if the track is a true positive.
ageThresh: Minimum length (age) of a track for it to be considered a true positive.
visThresh: Minimum visibility threshold to determine if the track is a true positive.
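To make the roles of ROI and scThresh concrete, here is a small Python sketch of a plausibility check on a candidate detection. It is an illustration under stated assumptions rather than the example's code: the function name is_plausible is hypothetical, the ROI test is applied to the feet point of the box, and the scale tolerance is treated as a relative error against the scale table.

```python
def is_plausible(bbox, scale_table, roi, sc_thresh):
    """Keep a detection only if its feet lie in the ROI and its height
    is close to the height expected at that image row."""
    x, y, w, h = bbox
    rx, ry, rw, rh = roi
    feet_x, feet_y = x + w / 2, y + h
    if not (rx <= feet_x <= rx + rw and ry <= feet_y <= ry + rh):
        return False
    row = min(int(feet_y), len(scale_table) - 1)
    expected = scale_table[row]
    if expected <= 0:
        return False
    # Assumption: tolerance is a relative difference in height.
    return abs(h - expected) / expected <= sc_thresh


# Hypothetical scale table: expected height is half the row index.
scale_table = [0.5 * row for row in range(481)]
roi = (0, 200, 640, 280)

ok = is_plausible((100, 300, 40, 100), scale_table, roi, sc_thresh=0.7)
too_small = is_plausible((100, 390, 5, 10), scale_table, roi, sc_thresh=0.7)
outside = is_plausible((100, 50, 40, 100), scale_table, roi, sc_thresh=0.7)
```

A tighter sc_thresh rejects more detections whose size does not match the scene geometry, at the risk of discarding children or partially occluded people.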
Create System Objects for the Tracking System Initialization:
The setupSystemObjects function creates system objects used for reading and displaying the video frames and loads the scale data file.
The pedScaleTable vector, which is stored in the scale data file, encodes our prior knowledge of the target and the scene. Once you have the regressor trained from your samples, you can compute the expected height at every possible Y-position in the image. These values are stored in the vector. The n-th entry in pedScaleTable represents our estimated height of an adult person in pixels. The index n references the approximate Y-coordinate of the pedestrian’s feet.
The initializeTracks function creates an array of tracks, where each track is a structure representing a moving object in the video. The purpose of the structure is to maintain the state of a tracked object. The state consists of information used for detection-to-track assignment, track termination, and display.
The structure contains the following fields:
id: An integer ID of the track.
color: The color of the track for display purpose.
bboxes: An N-by-4 matrix representing the bounding boxes of the object, with the current box in the last row. Each row has the form [x, y, width, height].
scores: An N-by-1 vector to record the classification score from the person detector with the current detection score at the last row.
kalmanFilter: A Kalman filter object used for motion-based tracking. We track the center point of the object in the image.
age: The number of frames since the track was initialized.
totalVisibleCount: The total number of frames in which the object was detected (visible).
confidence: A pair of numbers representing how much we trust the track. It stores the maximum and the average detection scores within a predefined time window of past frames.
predPosition: The predicted bounding box in the next frame.
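The fields listed above map naturally onto a record type. The following Python dataclass is only a sketch of that layout, not the example's MATLAB structure. The snake_case names, the initial values for age and total_visible_count, and the helper new_track are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    id: int
    color: tuple                                 # display color, e.g. (R, G, B)
    bboxes: list = field(default_factory=list)   # rows of (x, y, width, height)
    scores: list = field(default_factory=list)   # one detector score per box
    kalman_filter: object = None                 # motion model for the centroid
    age: int = 1                                 # frames since initialization
    total_visible_count: int = 1                 # frames with a detection
    confidence: tuple = (0.0, 0.0)               # (max score, average score)
    pred_position: tuple = None                  # predicted box for next frame


def new_track(track_id, bbox, score, color=(255, 0, 0)):
    """Start a track from a single detection (hypothetical helper)."""
    return Track(id=track_id, color=color, bboxes=[bbox], scores=[score],
                 confidence=(score, score), pred_position=bbox)


track = new_track(1, (10, 20, 30, 60), 0.8)
```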
The detectPeople function returns the centroids, the bounding boxes, and the classification scores of the detected people. It performs filtering and non-maximum suppression on the raw output of the detector returned by
centroids: An N-by-2 matrix with each row in the form of [x,y].
bboxes: An N-by-4 matrix with each row in the form of [x, y, width, height].
scores: An N-by-1 vector in which each element is the classification score of the corresponding bounding box.
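As a rough illustration of the suppression step, the Python sketch below applies plain greedy non-maximum suppression and then derives the centroids. It is a generic stand-in, not the customized suppression used by the example, and the function names and the 0.5 overlap threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def non_max_suppression(bboxes, scores, overlap_thresh=0.5):
    """Greedy NMS: visit boxes by descending score and drop any box that
    overlaps an already kept box by more than overlap_thresh."""
    order = sorted(range(len(bboxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(bboxes[i], bboxes[k]) <= overlap_thresh for k in keep):
            keep.append(i)
    return keep


def centroids_of(bboxes):
    return [(x + w / 2, y + h / 2) for x, y, w, h in bboxes]


raw_bboxes = [(10, 10, 40, 80), (12, 12, 40, 80), (200, 50, 30, 60)]
raw_scores = [0.9, 0.6, 0.7]
kept = non_max_suppression(raw_bboxes, raw_scores)
bboxes = [raw_bboxes[i] for i in kept]
scores = [raw_scores[i] for i in kept]
centroids = centroids_of(bboxes)
```

Here the second box nearly coincides with the first and has a lower score, so it is suppressed, leaving one box per person.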
Predict New Locations of Existing Tracks:
Use the Kalman filter to predict the centroid of each track in the current frame, and update its bounding box accordingly. We take the width and height of the bounding box in the previous frame as our current prediction of the size.
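The prediction step can be illustrated with a small constant-velocity Kalman filter written out by hand. This Python sketch is an assumption-laden stand-in for vision.KalmanFilter: it filters each image axis independently, and the class name AxisKalman, the noise values, and predict_bbox are hypothetical.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one image axis.
    State is (position, velocity); only position is measured."""

    def __init__(self, pos, init_var=100.0, q=1.0, r=25.0):
        self.p, self.v = pos, 0.0
        self.P = [[init_var, 0.0], [0.0, init_var]]  # state covariance
        self.q, self.r = q, r                        # process / measurement noise

    def predict(self):
        self.p += self.v
        (a, b), (c, d) = self.P
        # P <- F P F^T + q*I with F = [[1, 1], [0, 1]]
        self.P = [[a + b + c + d + self.q, b + d], [c + d, d + self.q]]
        return self.p

    def correct(self, z):
        (a, b), (c, d) = self.P
        s = a + self.r                 # innovation variance
        k0, k1 = a / s, c / s          # Kalman gain
        resid = z - self.p
        self.p += k0 * resid
        self.v += k1 * resid
        # P <- (I - K H) P with H = [1, 0]
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]
        return self.p


def predict_bbox(kx, ky, last_bbox):
    """Predicted centroid from the filters, size from the previous box."""
    cx, cy = kx.predict(), ky.predict()
    _, _, w, h = last_bbox
    return (cx - w / 2, cy - h / 2, w, h)


# A centroid moving right by 4 pixels per frame at constant height.
kx, ky = AxisKalman(100.0), AxisKalman(50.0)
for cx, cy in [(104.0, 50.0), (108.0, 50.0)]:
    kx.predict(); ky.predict()
    kx.correct(cx); ky.correct(cy)
pred = predict_bbox(kx, ky, last_bbox=(88.0, 10.0, 40.0, 80.0))
```

After two corrections the filter has picked up the rightward velocity, so the predicted box is centered beyond the last measured centroid while keeping the previous width and height.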
Assign Detections to Tracks:
Object detections in the current frame are assigned to existing tracks by minimizing cost. The cost is computed using the bboxOverlapRatio function and is based on the overlap ratio between the predicted bounding box and the detected bounding box, so that a larger overlap yields a lower cost. In this example, we assume that a person moves gradually between consecutive frames because of the high frame rate of the video and the low motion speed of a person.
The algorithm involves two steps:
Step 1: Compute the cost of assigning every detection to each track using the bboxOverlapRatio measure. As people move towards or away from the camera, their motion is not accurately described by the centroid point alone. The cost takes into account the distance on the image plane as well as the scale of the bounding boxes. This prevents assigning detections far away from the camera to tracks closer to the camera, even if their centroids coincide. This choice of cost function keeps the computation simple and avoids the need for a more sophisticated dynamic model. The results are stored in an M-by-N matrix, where M is the number of tracks and N is the number of detections.
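Step 1 can be sketched as below. This Python illustration assumes the cost is one minus the intersection-over-union of the two boxes, which is one common way to turn an overlap ratio into a cost. The function names and the default gating values are hypothetical and are not taken from the example.

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def overlap_cost_matrix(pred_bboxes, det_bboxes,
                        gating_thresh=0.9, gating_cost=100.0):
    """M-by-N cost matrix: rows are tracks, columns are detections."""
    cost = []
    for pred in pred_bboxes:
        row = []
        for det in det_bboxes:
            c = 1.0 - iou(pred, det)          # assumed cost: 1 - overlap
            # Gating: implausible pairs get a large fixed cost.
            row.append(gating_cost if c > gating_thresh else c)
        cost.append(row)
    return cost


pred_bboxes = [(10, 10, 40, 80), (200, 50, 30, 60)]
det_bboxes = [(202, 52, 30, 60), (12, 12, 40, 80), (400, 100, 20, 40)]
cost = overlap_cost_matrix(pred_bboxes, det_bboxes)
```

Because the overlap ratio depends on both position and size, a small distant box centered on a large nearby box still receives a high cost, which matches the motivation given in Step 1.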
Step 2: Solve the assignment problem represented by the cost matrix using the assignDetectionsToTracks function. The function takes the cost matrix and the cost of not assigning any detections to a track.
The value for the cost of not assigning a detection to a track depends on the range of values returned by the cost function. This value must be tuned experimentally. Setting it too low increases the likelihood of creating a new track, and may result in track fragmentation. Setting it too high may result in a single track corresponding to a series of separate moving objects.
The assignDetectionsToTracks function uses Munkres’ version of the Hungarian algorithm to compute an assignment that minimizes the total cost. It returns an L-by-2 matrix, where L is the number of assigned pairs, containing the corresponding indices of assigned tracks and detections in its two columns. It also returns the indices of tracks and detections that remained unassigned.
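To show what the assignment step optimizes, the Python sketch below solves the same kind of problem by exhaustive search instead of the Hungarian algorithm, so it is only suitable for a handful of tracks and detections. It assumes non-negative costs and charges the non-assignment cost once for every unassigned track and every unassigned detection. The function name assign_detections is hypothetical.

```python
def assign_detections(cost, num_detections, cost_of_non_assignment):
    """Exhaustive minimum-cost matching; assumes non-negative costs."""
    num_tracks = len(cost)
    best = {"total": float("inf"), "pairs": []}

    def search(track, used, pairs, total):
        if total >= best["total"]:
            return  # cannot beat the best complete solution found so far
        if track == num_tracks:
            # Each detection left unmatched pays the non-assignment cost.
            total += cost_of_non_assignment * (num_detections - len(used))
            if total < best["total"]:
                best["total"], best["pairs"] = total, list(pairs)
            return
        # Option 1: leave this track unassigned.
        search(track + 1, used, pairs, total + cost_of_non_assignment)
        # Option 2: match this track with any detection that is still free.
        for det in range(num_detections):
            if det not in used:
                used.add(det)
                pairs.append((track, det))
                search(track + 1, used, pairs, total + cost[track][det])
                pairs.pop()
                used.remove(det)

    search(0, set(), [], 0.0)
    assigned_tracks = {t for t, _ in best["pairs"]}
    assigned_dets = {d for _, d in best["pairs"]}
    unassigned_tracks = [t for t in range(num_tracks) if t not in assigned_tracks]
    unassigned_dets = [d for d in range(num_detections) if d not in assigned_dets]
    return best["pairs"], unassigned_tracks, unassigned_dets


cost = [[100.0, 0.14, 100.0],
        [0.18, 100.0, 100.0]]
pairs, lost_tracks, new_dets = assign_detections(cost, 3, 0.5)
```

In this made-up case both tracks are matched and the third detection stays unassigned, because pairing it with either track would cost far more than the non-assignment cost.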
Update Assigned Tracks:
The updateAssignedTracks function updates each assigned track with the corresponding detection. It calls the correct method of vision.KalmanFilter to correct the location estimate. Next, it stores the new bounding box, whose size is the average of up to four recent boxes, and increases the age of the track and the total visible count by 1. Finally, the function adjusts the confidence score for the track based on the previous detection scores.
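The size smoothing and confidence update described above might look like the following Python sketch. Several details are assumptions: the track is a plain dictionary, the detection centroid stands in for the Kalman-corrected centroid, the averaging window counts the new detection among the four boxes, and the function name and default time_window_size are hypothetical.

```python
def update_assigned_track(track, det_bbox, det_score,
                          time_window_size=16, history=4):
    x, y, w, h = det_bbox
    # Stand-in for the Kalman-corrected centroid.
    cx, cy = x + w / 2, y + h / 2

    # Average width and height over the most recent boxes (new one included).
    sizes = [(b[2], b[3]) for b in track["bboxes"]] + [(w, h)]
    sizes = sizes[-history:]
    avg_w = sum(s[0] for s in sizes) / len(sizes)
    avg_h = sum(s[1] for s in sizes) / len(sizes)

    track["bboxes"].append((cx - avg_w / 2, cy - avg_h / 2, avg_w, avg_h))
    track["scores"].append(det_score)
    track["age"] += 1
    track["total_visible_count"] += 1

    # Confidence: (maximum, average) score within the time window.
    window = track["scores"][-time_window_size:]
    track["confidence"] = (max(window), sum(window) / len(window))
    return track


track = {"bboxes": [(10, 10, 40, 80)], "scores": [0.5],
         "age": 1, "total_visible_count": 1, "confidence": (0.5, 0.5)}
update_assigned_track(track, (14, 12, 44, 84), 0.9)
```

Averaging the size damps frame-to-frame jitter in the detector's box dimensions while the position still follows the latest measurement.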
Update Unassigned Tracks:
The updateUnassignedTracks function marks each unassigned track as invisible, increases its age by 1, and appends the predicted bounding box to the track. The confidence is set to zero because we cannot tell why no detection was assigned to the track.
Delete Lost Tracks:
The deleteLostTracks function deletes tracks that have been invisible for too many consecutive frames. It also deletes recently created tracks that have been invisible for many frames overall.
Noisy detections tend to result in the creation of false tracks. For this example, we remove a track under the following conditions:
- The object was tracked for a short time. This typically happens when a false detection shows up for a few frames and a track was initiated for it.
- The track was marked invisible for most of the frames.
- It failed to receive a strong detection within the past few frames, which is expressed as the maximum detection confidence score.
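One way to combine these conditions is sketched below in Python. How the three tests are combined is an assumption here: a track is dropped if it is both young and mostly invisible, or if its maximum recent detection score is weak. The function name is_lost and the threshold values are hypothetical.

```python
def is_lost(track, age_thresh=8, vis_thresh=0.3, confidence_thresh=2.0):
    """Return True if the track should be deleted."""
    visibility = track["total_visible_count"] / track["age"]
    max_confidence = track["confidence"][0]
    young_and_hidden = track["age"] <= age_thresh and visibility <= vis_thresh
    weak = max_confidence <= confidence_thresh
    return young_and_hidden or weak


tracks = [
    {"id": 1, "age": 5, "total_visible_count": 1, "confidence": (3.0, 2.5)},
    {"id": 2, "age": 30, "total_visible_count": 28, "confidence": (3.5, 3.0)},
    {"id": 3, "age": 30, "total_visible_count": 28, "confidence": (1.0, 0.5)},
]
surviving = [t["id"] for t in tracks if not is_lost(t)]
```

Track 1 is removed as a short, mostly invisible track, track 3 is removed for lacking a strong recent detection, and only track 2 survives.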
Create New Tracks:
Create new tracks from unassigned detections. Assume that any unassigned detection is the start of a new track. In practice, you can use other cues to eliminate noisy detections, such as size, location, or appearance.
Display Tracking Results:
The displayTrackingResults function draws a colored bounding box for each track on the video frame. The level of transparency of the box, together with the displayed score, indicates the confidence of the detections and tracks.