YOLOv3 Real-time Object Detection (Review 2023)

YOLOv3 (You Only Look Once, Version 3) is a real-time object detection algorithm that identifies specific objects in videos, live feeds, or images. YOLO uses features learned by a deep convolutional neural network to detect objects. Versions 1-3 of YOLO were created by Joseph Redmon and Ali Farhadi.

The first version of YOLO was created in 2016, and version 3, which is discussed in this article, was made two years later in 2018. YOLOv3 is an improved version of YOLO and YOLOv2.

YOLOv3 is a real-time detection neural network

YOLOv3 Architecture

Darknet-53

The network architecture used by YOLOv2 is Darknet-19, with a total of 24 layers: 19 convolutional layers (hence the name Darknet-19) and 5 max-pooling layers. YOLOv2 is not very effective at detecting small targets, because fine-grained features are lost when the input is downsampled. To mitigate this, YOLOv2 uses a passthrough (identity mapping) connection that concatenates feature maps from an earlier layer to recover low-level features.

However, YOLOv2's network architecture still lacked elements that are now considered state of the art, such as residual blocks, skip connections and upsampling. The exciting thing is that all of them were added in YOLOv3.

YOLOv3 uses a new Darknet backbone with 53 convolutional layers (hence Darknet-53), pretrained as a classifier on the ImageNet dataset.
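To make the residual idea concrete, here is a minimal sketch of the 1 × 1 → 3 × 3 residual unit that Darknet-53 stacks repeatedly. It is written in PyTorch purely for illustration; the class name and layer choices are ours, not the official Darknet code.

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """Illustrative Darknet-53-style residual unit: 1x1 reduce -> 3x3 expand + skip."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels // 2)
        self.conv2 = nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.act(self.bn2(self.conv2(out)))
        return x + out  # the skip connection that Darknet-19 lacked

x = torch.randn(1, 64, 208, 208)   # e.g. an early feature map
y = DarknetResidual(64)(x)         # same shape out: (1, 64, 208, 208)
```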

Detection at Three Scales

YOLOv3 makes predictions at three scales, obtained by downsampling the dimensions of the input image by factors of 32, 16 and 8, respectively.

The first detection is made by the 82nd layer. Over the first 81 layers, the network downsamples the image so that layer 81 has a stride of 32: for a 416 × 416 input, the resulting feature map is 13 × 13. A 1 × 1 detection kernel is applied here to produce a 13 × 13 × 255 detection feature map. Next, the feature map from layer 79 passes through a few convolutional layers and is then upsampled 2× to 26 × 26. This feature map is depth-concatenated with the feature map from layer 61. The combined map again goes through a few 1 × 1 convolutional layers to fuse the features from the earlier layer (61). The second detection is then made at layer 94, yielding a 26 × 26 × 255 detection feature map.

Similarly, the feature map from layer 91 passes through a few convolutional layers, is upsampled 2× to 52 × 52, and is depth-concatenated with the feature map from layer 36. As before, a few 1 × 1 convolutional layers follow to fuse the information from the earlier layer (36). The third detection is made at layer 106, producing a 52 × 52 × 255 detection feature map.
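A quick back-of-the-envelope check shows where the three grid sizes and the 255-channel depth come from, assuming the standard COCO setup of 80 classes and 3 anchors per scale:

```python
# Detection scales for a 416 x 416 input (80 COCO classes, 3 anchors per scale).
input_size = 416
num_classes = 80
anchors_per_scale = 3

# 4 box coordinates + 1 objectness score + 80 class scores, per anchor
channels = anchors_per_scale * (4 + 1 + num_classes)
print(channels)  # 255

for stride in (32, 16, 8):
    grid = input_size // stride
    print(f"stride {stride}: {grid} x {grid} x {channels}")
# stride 32: 13 x 13 x 255
# stride 16: 26 x 26 x 255
# stride 8: 52 x 52 x 255
```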

Better Detection of Small Targets

Detecting at different scales helps solve the problem of small-target detection: the upsampled layers, concatenated with feature maps from earlier layers, preserve the fine-grained features that small objects require.

The 13 × 13 layer is responsible for detecting large objects, the 26 × 26 layer detects medium-sized objects, and the 52 × 52 layer detects smaller objects.

YOLOv3 neural network architecture

Anchor Boxes

YOLOv3 uses a total of 9 anchor boxes, three for each scale. If you train YOLO on your own dataset, you should use K-means clustering to generate 9 anchors.

Then, sort the anchor boxes in descending order of size. Assign the three largest anchor boxes to the first (coarsest) scale, the next three to the second scale, and the last three to the third scale.
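For illustration, here is a minimal NumPy sketch of that K-means step using the 1 − IoU distance introduced with YOLOv2 (picking the highest-IoU anchor is the same as picking the smallest 1 − IoU). The function names and defaults are ours; boxes is an (N, 2) array of ground-truth box widths and heights from your dataset.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (N, 2) box shapes and (k, 2) anchor shapes, aligned at the origin."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
            + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100):
    boxes = np.asarray(boxes, dtype=float)
    anchors = boxes[np.random.choice(len(boxes), k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # closest anchor = highest IoU
        for i in range(k):
            if np.any(assign == i):
                anchors[i] = boxes[assign == i].mean(axis=0)
    # sort by area, largest first, so the first three go to the coarse 13 x 13 scale
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])[::-1]]
```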

More Bounding Boxes

For input images of the same size, YOLOv3 predicts more bounding boxes than YOLOv2. At YOLOv2's native resolution of 416 × 416, it predicts 13 × 13 × 5 = 845 boxes: each grid cell predicts 5 boxes using its 5 anchors.

YOLOv3, on the other hand, predicts boxes at 3 different scales. For the same 416 × 416 image, the number of predicted boxes is 10,647, more than 10 times the number predicted by YOLOv2. That is why it is slower than YOLOv2. At each scale, every grid cell predicts 3 boxes using 3 anchors, and since there are three scales, 9 anchor boxes are used in total, three per scale.
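The arithmetic is easy to verify:

```python
# Box counts for a 416 x 416 input.
yolov2_boxes = 13 * 13 * 5                           # 845
yolov3_boxes = sum(g * g * 3 for g in (13, 26, 52))  # 10647
print(yolov2_boxes, yolov3_boxes, yolov3_boxes / yolov2_boxes)  # 845 10647 ~12.6
```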

YOLOv3 real-time object detection demonstration

Softmax Abandoned

YOLOv3 performs multi-label classification on objects detected in images. Earlier versions of YOLO applied a softmax to the class scores and took the class with the highest score as the class of the object in the bounding box. This was changed in YOLOv3.

Softmaxing is based on the assumption that classes are mutually exclusive, or simply put, that if an object belongs to one class, it cannot belong to another. This works well on the COCO dataset.

However, when a dataset has overlapping classes like Person and Woman, that assumption fails. That is why the YOLOv3 authors abandoned softmax. Instead, independent logistic classifiers predict a score for each class, and a threshold is used to assign multiple labels to a target: every class whose score exceeds the threshold is assigned to the box.
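Here is a minimal sketch of the idea in NumPy; the logits and the 0.5 threshold are illustrative, not values from the paper.

```python
import numpy as np

classes = ["person", "woman", "car"]
logits = np.array([2.3, 1.7, -3.0])    # raw per-class scores from the network

# Independent sigmoid per class, instead of one softmax over all classes
probs = 1.0 / (1.0 + np.exp(-logits))

# Every class above the threshold is kept, so one box can carry multiple labels
labels = [c for c, p in zip(classes, probs) if p > 0.5]
print(labels)  # ['person', 'woman']
```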

How to use YOLOv3

YOLOv3 was the first state-of-the-art artificial neural network added to Theos AI, the most powerful development platform for AI, which gives every developer, even without any AI experience, the power to create a model of their own in the easiest and fastest way possible. Best of all, you can try it now; it's free forever.
