Real-time Portrait Segmentation on Smartphones

Anton Lebedev
Prisma Labs Blog
Oct 17, 2018

Demonstration of segmentation value: original image and simulated bokeh effect

Various image effects have been receiving increasing attention in recent years. A popular example is bokeh, the blur in the out-of-focus regions of a photograph. This effect is achieved by using a fast camera lens with a wide aperture. Unfortunately, it is almost impossible to reproduce this effect with mobile phone cameras, because their small lenses and sensors cannot produce a sufficiently shallow depth of field. However, if each image pixel is classified into person and background categories, the bokeh effect can be simulated by blurring just the background. Classifying each pixel into categories is called semantic segmentation, and it can be used in various ways, such as changing the image background or applying separate filters to the foreground and background.
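
As a quick illustration, here is a minimal Python sketch (not our production pipeline) of how a soft person mask can be combined with a blurred copy of the frame to fake bokeh. The run_segmentation helper in the usage comment is a hypothetical stand-in for the network described below.

import cv2
import numpy as np

def simulate_bokeh(image, person_mask, blur_kernel=(21, 21)):
    # image: HxWx3 uint8 frame; person_mask: HxW float32 in [0, 1], 1 where a person is
    blurred = cv2.GaussianBlur(image, blur_kernel, 0)
    alpha = person_mask[..., None]                       # HxWx1 for broadcasting
    composite = alpha * image + (1.0 - alpha) * blurred  # keep the person sharp
    return composite.astype(np.uint8)

# Usage, with run_segmentation as a hypothetical helper returning an HxW mask in [0, 1]:
# frame = cv2.imread("portrait.jpg")
# cv2.imwrite("bokeh.jpg", simulate_bokeh(frame, run_segmentation(frame)))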

Some devices use stereo cameras that can extract depth information in order to segment an image. The approach in this article, however, aims to build a segmentation system that infers the desired information from a single RGB image. This allows us to apply portrait effects on the front-facing camera of a much broader set of devices.

Significant progress has been made over the years in computer vision and, in particular, in semantic segmentation, largely thanks to Convolutional Neural Networks. Such networks are effective at abstracting information from an image and inferring its properties, including segmentation maps. However, most architectures require a lot of computational power and good parallelization on the device side. This requirement, together with the widespread popularity of neural networks, has encouraged device manufacturers to create new frameworks that allow neural networks to be evaluated on the mobile GPU. In most cases, these solutions achieve greater performance than CPU-based ones.

Despite the popularity of deep learning in computer vision, all of the segmentation models that we found have some disadvantages. In most cases, they are too slow to run in real time on mobile devices. Another complication is compatibility with mobile frameworks, since not all layers are currently supported. For this reason, we designed a custom architecture that fits all of these restrictions.

The last challenge is the training dataset. To achieve the best performance, we collected a dataset specifically for this task and created a dedicated augmentation pipeline.

Dataset

Neural networks need training, so a major part of building this type of algorithm is the dataset. We wanted the segmentation maps to be as accurate as possible, so all of the images in our dataset were manually labelled with a professional editing program. Furthermore, we wanted the dataset to be diverse, since a lack of diversity may cause the algorithm to make structural mistakes. For example, if a model can’t handle a specific pose, it may be that the dataset is not representative enough. We used two ways to improve dataset quality: augmentation and collection of extra images. The number of images in our dataset grows incrementally over time. At the moment, we have about 13 thousand images in the training set and an additional thousand in the test set, with people in a variety of poses, covering full-body, portrait and selfie photos.

Image augmentation: original image and two examples of random augmentation

Architecture

Neural networks need an architectural design. There are plenty of different layers and methods for combining them, each with its own advantages. Before specifying our network, we need to understand which features of the architecture are the most desirable. We want it to be fast, lightweight and compatible with most mobile frameworks. To achieve this, we sacrifice convergence rate and skip regularization layers, preventing overfitting instead with a small model size and heavy augmentation.

Furthermore, it is crucial to use layers which are compatible with the target platforms. Each framework has its own set of implemented operations, and we have to restrict ourselves to the intersection of all these sets. We use the following layers: convolution, transposed convolution, ReLU, sigmoid and element-wise addition.

The last architectural challenge is efficiency. One of the key parameters is the input image shape. We can run our model on an image scaled down to a lower resolution and upscale the output back to the original shape. The lower the input resolution, the faster the inference. However, making the resolution too low degrades the resulting semantic mask due to upscale artefacts. The highest resolution that we were able to process in real time was 256×256.
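
For illustration, a minimal sketch of this resize-and-upscale step, assuming a hypothetical predict_256 callable in place of the actual network:

import cv2
import numpy as np

NET_SIZE = 256  # network input resolution: lower is faster, but hurts mask quality

def segment(frame, predict_256):
    # predict_256 is a hypothetical callable mapping a 256x256x3 float image in [0, 1]
    # to a 256x256 probability mask
    h, w = frame.shape[:2]
    small = cv2.resize(frame, (NET_SIZE, NET_SIZE)).astype(np.float32) / 255.0
    mask = predict_256(small)
    # upscale the low-resolution mask back to the original frame size
    return cv2.resize(mask, (w, h), interpolation=cv2.INTER_LINEAR)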

The final topology of the network is based on a U-Net encoder-decoder structure. In our experience, it is the best starting point for image segmentation with a low number of classes. Nevertheless, our architecture incorporates a few significant changes.

To improve convergence and make the model more accurate, we replaced the concatenation of features after upsampling with element-wise addition.
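
A rough PyTorch sketch of such a decoder merge step; the layer sizes here are assumptions, not the exact production values:

import torch.nn as nn

class DecoderMerge(nn.Module):
    # Upsample decoder features with a transposed convolution, then merge with the
    # encoder skip connection by element-wise addition instead of concatenation.
    def __init__(self, channels):
        super().__init__()
        self.up = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, decoder_features, encoder_features):
        return self.relu(self.up(decoder_features) + encoder_features)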

The next key feature is the depth-wise separable convolution. Compared with ordinary convolutions, it achieves the same quality at a lower computational cost and with smaller memory requirements. We use a residual block with depth-wise separable convolutions as the base block in our architecture instead of the original Convolution + ReLU blocks. Another feature of our architecture is asymmetry: the decoder contains more blocks than the encoder. This allows us to achieve better accuracy at the same computational cost. Finally, to achieve real-time inference, we sought to reduce the number of layers and feature maps as much as possible.
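
As an illustration, here is a rough PyTorch sketch of a residual block built from depth-wise separable convolutions; the real channel counts and the number of such blocks in our network are not listed here, so treat the specifics as assumptions.

import torch.nn as nn

class SeparableResidualBlock(nn.Module):
    # Residual block built from depth-wise separable convolutions: a depth-wise 3x3
    # convolution (one filter per channel) followed by a point-wise 1x1 convolution.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),  # depth-wise
            nn.Conv2d(channels, channels, 1),                              # point-wise
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # element-wise addition keeps the layer set mobile-friendly
        return self.relu(self.body(x) + x)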

The final architecture is depicted in the figure below.

Segmentation network topology

Training tricks

The next important stage is training. We have a two-step training process that leads to a better-quality result. First, we pre-train the model on all pictures from the dataset. The next step is training a model on the selfie and portrait subsets, using the result of the previous step as the initial weights. To broaden our dataset during both steps, we make heavy use of augmentation. Examples of this augmentation are image rotation, curve adjustments and shade simulation. Our augmentation pipeline is built on two main concepts. The first is not to spoil the images too much: after augmentation, the image should still look real. The second is to try to mask the flaws of the dataset. For example, in our experiments, motion blur imitation had a significant impact on the stability of video processing.
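
To give a flavour of this, below is a simplified Python sketch of a few of the augmentations mentioned above (rotation, a gamma curve adjustment and motion blur). The probabilities and parameter ranges here are assumptions; the actual pipeline is more elaborate.

import random
import cv2
import numpy as np

def augment(image, mask):
    # Geometric transforms are applied to the image and the mask alike;
    # photometric ones only to the image.
    h, w = image.shape[:2]

    if random.random() < 0.5:  # small random rotation
        m = cv2.getRotationMatrix2D((w / 2, h / 2), random.uniform(-15, 15), 1.0)
        image = cv2.warpAffine(image, m, (w, h))
        mask = cv2.warpAffine(mask, m, (w, h))

    if random.random() < 0.5:  # curve (gamma) adjustment
        gamma = random.uniform(0.7, 1.4)
        lut = ((np.linspace(0, 1, 256) ** gamma) * 255).astype(np.uint8)
        image = cv2.LUT(image, lut)

    if random.random() < 0.3:  # horizontal motion blur to mimic camera shake
        k = random.choice([3, 5, 7])
        kernel = np.zeros((k, k), np.float32)
        kernel[k // 2, :] = 1.0 / k
        image = cv2.filter2D(image, -1, kernel)

    return image, mask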

Another point we want to mention is output artefacts. The neural network tends to generate checkerboard artefacts: a grid pattern that becomes visible after background subtraction. This type of artefact does not affect the mIoU metric, but it creates a poor visual experience. To solve this issue, we add extra residual blocks at the final output resolution and give the model extra training time.
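
As a sketch, such a refinement head might look like the following, reusing the SeparableResidualBlock class from the earlier sketch; the channel count is an assumption.

import torch.nn as nn

# refinement head appended after the last upsampling stage; SeparableResidualBlock
# is the class from the earlier sketch, and the channel count is an assumption
refine_head = nn.Sequential(
    SeparableResidualBlock(16),
    SeparableResidualBlock(16),
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
    nn.Sigmoid(),  # per-pixel person probability
)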

Results

Segmentation output: original image, background subtraction, foreground subtraction

Finally, we get a portrait segmentation model that is well balanced between quality and speed. It takes up 3.7 MB in fp32 ONNX format. Our algorithm is transferable to a majority of frameworks, including CoreML, MetalPerformanceShaders, SNPE, Huawei Kirin and OpenVINO. Specifically, on an iPhone 7 and a Mi Mix 2S, this model runs faster than the 30 fps threshold. The pipeline includes getting a pixel buffer from the camera, preprocessing, inference, postprocessing and rendering the final image with the background removed on the screen.
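
For readers without a phone at hand, here is a rough desktop sketch of the same stages using ONNX Runtime. The file name, NCHW layout and normalization are assumptions, and the real on-device pipeline goes through CoreML, MetalPerformanceShaders or SNPE instead.

import time
import cv2
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("portrait_segmentation.onnx")  # file name is an assumption
input_name = session.get_inputs()[0].name

def process(frame_bgr):
    h, w = frame_bgr.shape[:2]
    x = cv2.resize(frame_bgr, (256, 256)).astype(np.float32) / 255.0  # preprocessing
    x = x.transpose(2, 0, 1)[None]                                    # NCHW layout (assumed)
    mask = session.run(None, {input_name: x})[0].squeeze()            # inference
    mask = cv2.resize(mask, (w, h))                                   # postprocessing
    return (frame_bgr * mask[..., None]).astype(np.uint8)             # background removed

# frame = cv2.imread("selfie.jpg")
# start = time.time()
# out = process(frame)
# print("%.1f fps for a single frame" % (1.0 / (time.time() - start)))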

One more example of segmentation
Bokeh simulation: image with and without background blur

Post Scriptum

The presented portrait segmentation system was created in collaboration with my remarkable colleagues. Unfortunately, there is no official co-authoring option on Medium, but they are mentioned below:

Lebedev Anton, Konstantin Semyanov, Artur Chakhvadze — training pipeline and network architecture
Roman Kucev, Pavel Voropaev — data collection and annotation process management
Maxim Skorokhodov, Vyacheslav Tarasov — Android integration
Oleg Poyaganov, Andrey Volodin — iOS integration
MR — review of this article
