
Instance Segmentation with a Combined YOLOv3 + FCN Model
Keywords: Instance Segmentation, FCN, YOLO, Mask R-CNN
Abstract:
Mask R-CNN is one of the state-of-the-art models for instance segmentation, but its inference time has been criticized as slow. To make use of the speed of the YOLOv3 object detector, I propose combining the YOLOv3 model with an FCN model for this task. YOLOv3, which uses Darknet-53 as its backbone, is much faster than the ResNet backbone used in Mask R-CNN. Many segmentation models approach this task slightly differently, but one thing they usually have in common is that where one would normally find fully connected layers, these are replaced by deconvolutional / upsampling layers. As a result, the output keeps the original size of the input image while also highlighting which pixels in the image correspond to which class. The new combined model is innovative, but some adjustments are still needed before it can beat the Mask R-CNN model.
Instance Segmentation

Instance segmentation is an advanced technique that, in addition to predicting bounding boxes, labels every pixel of each object in the image. The technique produces a separate mask for every object instance, which makes it easier to tell similar objects apart, so that, for example, a robot will not mix them up. Since its introduction, the technique has been widely used in the development of self-driving cars.

Mask R-CNN
It is one of the state-of-the-art models for instance segmentation. The model extends Faster R-CNN by adding a branch that predicts an object mask in parallel with the existing branch for bounding box regression. The backbone is one of the ResNet architectures, which extracts features from the image; the feature map is then passed through a Region Proposal Network (RPN) to generate Regions of Interest (ROIs). An ROI-Align layer then warps each ROI to a fixed size, and the result is fed, in parallel, into fully connected layers for object classification and bounding box prediction and into a mask classifier that generates the image mask. Mask R-CNN was ground-breaking at the time because it outperformed many other state-of-the-art models while remaining simple and fast to run.
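For reference, a COCO-pretrained Mask R-CNN can be loaded directly from torchvision and used as the comparison baseline. The snippet below is only an illustrative sketch, not part of the original implementation, and assumes a recent torchvision release:

```python
# Illustrative only: a COCO-pretrained Mask R-CNN (ResNet-50 + FPN backbone)
# from torchvision, used here as the comparison baseline.
import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# The model expects a list of 3xHxW float tensors scaled to [0, 1].
dummy_image = torch.rand(3, 480, 640)
with torch.no_grad():
    prediction = model([dummy_image])[0]

# Each prediction holds boxes, labels, scores and one soft mask per instance.
print(prediction.keys())  # dict_keys(['boxes', 'labels', 'scores', 'masks'])
```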
YOLOv3 + FCN

The model proposed here consists of four main stages. In the first stage, the image is passed to the YOLOv3 network, which uses Darknet-53 as its backbone and was pretrained on the COCO dataset; it outputs the object classes, the detection confidences and the bounding box of each object. In stage 2, the detected bounding boxes are extracted from the image as separate sub-images. In stage 3, each of these bounding-box images is fed to a pretrained Fully Convolutional Network (FCN) for pixel-wise classification, generating a mask for the object inside the box. In the final stage, the masks are pasted back into their original bounding-box locations in the full image.
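A minimal sketch of this four-stage pipeline is shown below. The helper functions run_yolov3 and run_fcn_mask are hypothetical placeholders standing in for the detector and the segmentation network; they are not the actual project code.

```python
# Hypothetical end-to-end sketch of the four-stage pipeline described above.
import numpy as np

def instance_segment(image, run_yolov3, run_fcn_mask):
    """image: HxWx3 uint8 array; returns a list of (box, full-size mask) pairs."""
    h, w = image.shape[:2]
    results = []
    # Stage 1: object detection -> (class, confidence, box) per object
    for cls, conf, (x1, y1, x2, y2) in run_yolov3(image):
        # Stage 2: crop the bounding box out of the image
        crop = image[y1:y2, x1:x2]
        # Stage 3: pixel-wise binary classification inside the crop
        crop_mask = run_fcn_mask(crop)            # (y2-y1) x (x2-x1) boolean array
        # Stage 4: paste the mask back at the original bounding-box location
        full_mask = np.zeros((h, w), dtype=bool)
        full_mask[y1:y2, x1:x2] = crop_mask
        results.append(((x1, y1, x2, y2), full_mask))
    return results
```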

Dataset:
The dataset used was a subset of the COCO dataset. COCO is renowned for its well-labelled data and is used as a benchmark for many image recognition tasks. In our investigation we kept five categories: 'person', 'chair', 'couch', 'dining table' and 'toilet'; the ground-truth masks include only these five groups. There were two main reasons for choosing these categories. First, these five classes are typically observed indoors, and the indoor environment is the most relevant to our research domain. Second, time constraints: focusing on five categories is sufficient to test the model.
To build our own subset of COCO, a new JSON file with the same format as the COCO annotation files was created. Using the COCO API, we parsed the entire dataset, extracting all the image and annotation information pertaining to our five classes and appending each entry to a list. As a result, we could easily copy all the necessary images into a separate folder. In addition, since our JSON file follows the COCO format, we could use the COCO API to extract the masks and bounding boxes when building the dataset class for our data loader.
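A sketch of this filtering step with the official COCO API (pycocotools) might look like the following; the annotation file path is a placeholder:

```python
# Sketch of building the 5-class subset with the COCO API (pycocotools).
from pycocotools.coco import COCO

TARGET_CLASSES = ['person', 'chair', 'couch', 'dining table', 'toilet']
coco = COCO('annotations/instances_train2017.json')  # placeholder path

cat_ids = coco.getCatIds(catNms=TARGET_CLASSES)

# Collect every image that contains at least one target class, together with
# the annotations belonging to those classes only.
img_ids = set()
anns = []
for cat_id in cat_ids:
    ids = coco.getImgIds(catIds=[cat_id])
    img_ids.update(ids)
    anns.extend(coco.loadAnns(coco.getAnnIds(imgIds=ids, catIds=[cat_id])))

images = coco.loadImgs(list(img_ids))
print(f'{len(images)} images, {len(anns)} annotations for {TARGET_CLASSES}')
```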
Implementation

Stage 1: Deploying YOLOv3
In this stage, we use YOLOv3 to detect objects in the images and output the bounding boxes, confidence scores and class numbers. The backbone network used in YOLOv3 is Darknet-53. In our implementation, the Darknet network was provided as source code and was pretrained on the COCO dataset, using the pretrained weights released with the original paper.
To begin with, the input images have to be preprocessed and resized into (Batch x Channel x Height x Width), with the height and width equal to 256 x 256 to match the network input. After a single forward pass through Darknet, the output tensor has a size of (Batch x Observations x 85), where the number of observations is 4032: Darknet-53 outputs observations at three different scales (192, 768 and 3072 observations), giving 4032 in total. The final dimension of 85 consists of the bounding box coordinates (4), the confidence score (1) and the class confidences (80). Non-max suppression is then applied to the output tensor, keeping the bounding boxes with the highest class confidence scores, after which observations with confidence scores lower than a threshold value are suppressed. Note that this threshold is a hyperparameter to be tuned and revised; initially it was set to 0.75.
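The sketch below shows how the (Batch x 4032 x 85) output can be decoded and filtered with the 0.75 confidence threshold; the exact tensor layout is assumed from the description above, and the full non-max suppression step is omitted for brevity:

```python
# Sketch of decoding the YOLOv3 output tensor and applying the confidence
# threshold described above (full NMS omitted).
import torch

def filter_detections(raw, conf_threshold=0.75):
    """raw: (batch, 4032, 85) tensor = 4 box coords + 1 confidence + 80 class scores."""
    boxes = raw[..., 0:4]                      # (batch, 4032, 4)
    confidence = raw[..., 4]                   # (batch, 4032)
    class_probs = raw[..., 5:]                 # (batch, 4032, 80)

    # Keep, for every observation, only its most confident class
    class_conf, class_id = class_probs.max(dim=-1)

    keep = confidence > conf_threshold         # suppress low-confidence observations
    return boxes[keep], confidence[keep], class_conf[keep], class_id[keep]

raw = torch.rand(1, 4032, 85)                  # stand-in for a real forward pass
boxes, scores, class_conf, class_id = filter_detections(raw)
```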
As mentioned above, the model was loaded with the weights pretrained on the COCO dataset. One thing we needed to handle was suppressing the classes we were not interested in. After that, the output from the above process was converted into a prediction variable of shape (Number of remaining observations x 8). The 8 columns comprise the image number (1), the bounding box coordinates (4), the confidence (1), the class score (1) and the class number (1). The bounding box coordinates were converted back to match the original image size. These tensors carry the information needed for the next stage.
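Because the network input is 256 x 256, the predicted boxes have to be scaled back to the original image resolution. A sketch of this conversion, assuming an (x1, y1, x2, y2) box format and a plain resize without letterbox padding, is shown below:

```python
# Sketch of mapping boxes from 256x256 network coordinates back to the
# original image resolution (assumes (x1, y1, x2, y2) boxes, plain resize).
import torch

def rescale_boxes(boxes, input_size, original_w, original_h):
    """boxes: (N, 4) tensor in network-input coordinates."""
    scale_x = original_w / input_size
    scale_y = original_h / input_size
    scaled = boxes.clone()
    scaled[:, [0, 2]] *= scale_x   # x1, x2
    scaled[:, [1, 3]] *= scale_y   # y1, y2
    # Clamp to the image borders to avoid out-of-range crops later on
    scaled[:, [0, 2]] = scaled[:, [0, 2]].clamp(0, original_w - 1)
    scaled[:, [1, 3]] = scaled[:, [1, 3]].clamp(0, original_h - 1)
    return scaled
```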
Stage 2: Extracting the Bounding Boxes
In this stage, the prediction variable is sliced to keep only the bounding box coordinates, which are converted back to integer values to obtain valid pixel coordinates. Each bounding-box image is an array, since it is sliced directly out of the original image; it is then transformed into a tensor to meet the PyTorch input requirements.
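A sketch of this cropping step, assuming the image is held as an HxWx3 NumPy array and the boxes are in (x1, y1, x2, y2) format:

```python
# Sketch of stage 2: slicing each bounding box out of the original image
# array and converting the crop to a tensor for the FCN.
import numpy as np
import torch

def extract_crops(image, boxes):
    """image: HxWx3 uint8 array; boxes: (N, 4) array of (x1, y1, x2, y2) floats."""
    crops = []
    for x1, y1, x2, y2 in boxes.astype(int):   # back to integer pixel coordinates
        crop = image[y1:y2, x1:x2]             # numpy slice of the original image
        # HWC uint8 -> CHW float tensor, as expected by PyTorch models
        tensor = torch.from_numpy(crop.copy()).permute(2, 0, 1).float() / 255.0
        crops.append(tensor)
    return crops
```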

Stage 3: Training and Deploying FCN-8s
As discussed before, FCN-8 had already been trained on the Pascal-VOC 2012 dataset and produced very good results. In our case, however, we are interested in using FCN-8 as a binary classifier, generating a mask only for background and object. For this to work, the last six convolutional layers were modified to make predictions for two classes only, and their new weights were generated using He initialisation. Furthermore, the Pascal-VOC 2012 data used to train the model was slightly modified to work for binary classification: since we are no longer interested in the model's ability to distinguish between the different classes in an image, the ground truths were modified to produce a single mask covering all the classes present. An example of this can be seen in Fig. 3, where a single mask is produced for the two classes present. All images fed to the model were resized to (224, 224) to allow the use of mini-batches. The pre-processing of the images followed the original FCN-8 work, in which the per-channel mean is subtracted from the image tensor.
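A sketch of the head modification and the pre-processing is given below; only the 1x1 score convolutions are shown. The attribute names score_fr, score_pool3 and score_pool4 are assumptions about the FCN-8s implementation, and the per-channel means are the values commonly used in Caffe-based VGG ports rather than values taken from this project:

```python
# Sketch of adapting an FCN-8s-style head to binary (background / object)
# segmentation with He initialisation, plus per-channel mean subtraction.
import torch
import torch.nn as nn

NUM_CLASSES = 2  # background + object

def to_binary_head(fcn8s):
    # Replace each 1x1 "score" convolution so it predicts 2 classes only
    # (layer names are assumptions about the FCN-8s implementation).
    for name in ['score_fr', 'score_pool3', 'score_pool4']:
        old = getattr(fcn8s, name)
        new = nn.Conv2d(old.in_channels, NUM_CLASSES, kernel_size=1)
        nn.init.kaiming_normal_(new.weight, nonlinearity='relu')  # He initialisation
        nn.init.zeros_(new.bias)
        setattr(fcn8s, name, new)
    return fcn8s

# Per-channel mean subtraction; these BGR means are the values commonly used
# in Caffe-based VGG/FCN ports, assumed rather than taken from this project.
MEAN = torch.tensor([104.00699, 116.66877, 122.67892]).view(3, 1, 1)

def preprocess(image_tensor):
    """image_tensor: 3xHxW float tensor in [0, 255], already resized to 224x224."""
    return image_tensor - MEAN
```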
Stage 4: Masking Process
In this stage, we combine all the information obtained above to form the instance segmentation. FCN-8s outputs a two-channel tensor with the same spatial dimensions as its input. Here, we detach the tensors and convert them to NumPy arrays for easier manipulation. Each mask is then padded with zeros, in the areas outside the bounding box, back to the original image size.
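A sketch of this padding step, assuming the FCN output has already been resized back to the crop dimensions:

```python
# Sketch of stage 4: turn the 2-channel FCN output into a binary mask and
# zero-pad it back to the original image size at the bounding-box location.
import numpy as np

def paste_mask(fcn_output, box, image_h, image_w):
    """fcn_output: (2, h, w) logits tensor for the crop; box: (x1, y1, x2, y2) ints."""
    # argmax over the 2 channels -> 0 = background, 1 = object
    crop_mask = fcn_output.argmax(dim=0).detach().cpu().numpy().astype(np.uint8)
    x1, y1, x2, y2 = box
    full_mask = np.zeros((image_h, image_w), dtype=np.uint8)  # zeros = not masked
    full_mask[y1:y2, x1:x2] = crop_mask                       # paste at the box location
    return full_mask
```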
In the last step, we need to handle the problem of overlapping mask areas. Consider two overlapping bounding boxes, as shown in the following figure: both the green bounding box and the red bounding box cover the person, so there is an overlapping area between them, and both boxes mask the head of the person. Because the binary classifier only separates object from background, the green box may also treat the head of the person as its own object and mask it.
To solve this problem, the confidence scores of the objects in the two bounding boxes are compared. In the overlapping region, the mask of the box with the higher confidence score is kept, and the overlapping area is removed from the box with the lower confidence score. For example, in the figure, the confidence score of the red bounding box is 0.9678 and that of the green one is 0.8998; since the red score is higher, the overlapping area is assigned to the red box.
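A sketch of this overlap rule, applied to the full-size masks produced in the previous step:

```python
# Sketch of the overlap rule: where two masks overlap, the pixels are kept
# only by the detection with the higher confidence score.
import numpy as np

def resolve_overlaps(masks, scores):
    """masks: list of HxW uint8 masks; scores: list of confidence scores."""
    order = np.argsort(scores)[::-1]                # highest confidence first
    claimed = np.zeros_like(masks[0], dtype=bool)   # pixels already owned
    resolved = []
    for idx in order:
        mask = masks[idx].astype(bool) & ~claimed   # drop pixels owned by a better box
        claimed |= mask
        resolved.append((idx, mask.astype(np.uint8)))
    return resolved
```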


Evaluation Metric

The evaluation metric used was Mean Average Precision at 50% IoU (mAP50) and at 75% IoU (mAP75), which is widely used for object detection and segmentation benchmarks in the machine learning community.
Average precision is defined as AP = ∫₀¹ p(r) dr, where p(r) is the precision as a function of recall; in other words, it is the area under the precision-recall curve. Different benchmarks, such as the COCO and Pascal VOC challenges, follow slightly different practices in the calculation. In our investigation, the following approach was used to estimate the mAP50 and mAP75 values.
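A sketch of the per-mask IoU test underlying the mAP50 and mAP75 scores is shown below; a predicted instance counts as a true positive when its mask IoU with a ground-truth mask of the same class reaches the threshold (0.5 or 0.75). The one-to-one matching bookkeeping is omitted for brevity:

```python
# Sketch of the mask IoU test used for the mAP50 / mAP75 metrics.
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """Both masks are HxW boolean arrays."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else 0.0

def is_true_positive(pred_mask, gt_masks, iou_threshold=0.5):
    """True if the prediction overlaps some ground-truth mask above the threshold."""
    return any(mask_iou(pred_mask, gt) >= iou_threshold for gt in gt_masks)
```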

Conclusion

The test results are shown below:

The results shown above are based on the scores recorded on the first 1000 images of the COCO dataset. The new model did not perform as well as the state-of-the-art Mask R-CNN model, and the gap widens for AP75. Notably, the AP50 score of the new model for the 'person' class was exceptionally good compared with the other classes. The inference times of both models, however, were remarkably fast on the GPU.
Evaluation and Future Work:
The new model was inspired by the concept of hard attention in the image captioning problem. With hard attention, the image is cropped so that the other parts of the image do not have to be carried through backpropagation. The same cropping technique can also be applied to other problems, such as object detection.

From the above experiments and investigation, several amendments could be made to improve performance. One drawback of the new model is that it depends heavily on the bounding boxes output by YOLO. YOLOv3 is one of the best object detection models available, so the poor detection accuracy we experienced may be due to a poorly chosen threshold value in the non-max suppression; this value could be fine-tuned to optimise detection. To further improve efficiency, we could dig into the model and combine the RPN concept from Mask R-CNN with the three output scales of Darknet-53 to generate the masks directly, which would reduce the overall inference time.

Another drawback of the model is that the FCN uses VGG-16 to extract a feature map. This is quite inefficient, since a feature map of the original image already exists once the image has passed through the YOLOv3 backbone. It was also the source of the multiple-object stacking error, since the feature extractor in FCN-8 was trained on Pascal-VOC, which contains other classes we were not interested in. Future work could reuse the YOLOv3 feature map alongside an RPN, thus reducing inference time. Furthermore, future work could fine-tune the entire FCN-8 model to generalise better to the problem domain.