OpenCV
4.10.0
Open Source Computer Vision
Next Tutorial: Using a cv::cuda::GpuMat with thrust
In the Video Input with OpenCV and similarity measurement tutorial I already presented the PSNR and SSIM methods for checking the similarity between two images. As you could see, the execution takes quite some time, especially in the case of the SSIM. However, if the performance numbers of an OpenCV implementation for the CPU do not satisfy you and you happen to have an NVIDIA CUDA GPU device in your system, all is not lost. You may try to port or write your own algorithm for the video card.
This tutorial will give you a good grasp of how to approach coding with the GPU module of OpenCV. As a prerequisite you should already know how to handle the core, highgui and imgproc modules. So, our main goals are:
You may also find the source code and the video file in the samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity directory of the OpenCV source library or download it from here. The full source code is quite long (due to the command line argument handling and the performance measurement). Therefore, to avoid cluttering up these sections, you'll find here only the functions themselves.
The PSNR returns a floating-point number that, if the two inputs are similar, lies between 30 and 50 (higher is better).
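To make the number above concrete, here is a minimal sketch of the usual PSNR definition for 8-bit data, written with the standard library only. The helper name `psnr8u` and the choice to return 0 for identical or invalid inputs are assumptions made for this illustration; this is not the tutorial's own function.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not the tutorial's function): PSNR of two equally
// sized 8-bit buffers, using PSNR = 10 * log10(255^2 / MSE).
double psnr8u(const std::vector<std::uint8_t>& a,
              const std::vector<std::uint8_t>& b)
{
    if (a.size() != b.size() || a.empty())
        return 0.0; // assumption: report 0 when no measurement is possible

    double sse = 0.0; // sum of squared errors over all elements
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sse += d * d;
    }

    if (sse <= 1e-10)
        return 0.0; // assumption: identical inputs, avoid dividing by zero

    const double mse = sse / static_cast<double>(a.size());
    return 10.0 * std::log10((255.0 * 255.0) / mse);
}
```

Two buffers that differ by one grey level everywhere have an MSE of 1 and therefore a PSNR of about 48.13, which sits at the upper end of the range quoted above.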
The SSIM returns the MSSIM of the images. This too is a floating-point number between zero and one (higher is better); however, we have one for each channel. Therefore, we return a Scalar OpenCV data structure.
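For reference, the structural similarity index is commonly defined as below. This is the standard textbook form, given here as an assumption about what is being computed rather than a quotation from the tutorial's code. Here \mu denotes a local (Gaussian-weighted) mean, \sigma^2 a local variance, \sigma_{xy} the local covariance, and the constants shown are the usual choices for 8-bit images.

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)},
\qquad
C_1 = (0.01 \cdot 255)^2 = 6.5025, \quad
C_2 = (0.03 \cdot 255)^2 = 58.5225

\mathrm{MSSIM} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{SSIM}(x_i, y_i)
```

The MSSIM is the mean of the per-pixel SSIM map over all N positions, computed separately for each channel.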
As seen above, we have three types of functions for each operation: one for the CPU and two for the GPU. The reason I made two for the GPU is to illustrate that simply porting your CPU code to the GPU will often make it slower. If you want some performance gain you will need to remember a few rules, which I will go into in detail later on.
The GPU module was developed to resemble its CPU counterpart as much as possible. This makes the porting process easier. The first thing you need to do before writing any code is to link the GPU module to your project and include the header file for the module. All the functions and data structures of the GPU are in a cuda sub-namespace of the cv namespace. You may add this to the default one via a using namespace directive, or mark it everywhere explicitly via the cv:: prefix to avoid confusion. I'll do the latter.
GPU stands for "graphics processing unit". It was originally built to render graphical scenes. These scenes are built from a lot of data. However, the data items do not all depend on one another in a sequential way, so they can be processed in parallel. Because of this, a GPU contains many smaller processing units. These aren't state-of-the-art processors, and in a one-on-one test against a CPU core each of them will fall behind. However, the GPU's strength lies in its numbers. In recent years there has been an increasing trend to harness this massive parallel power of the GPU for non-graphical tasks as well as for rendering. This gave birth to general-purpose computation on graphics processing units (GPGPU).
The GPU has its own memory. When you read data from the hard drive with OpenCV into a Mat object, that takes place in your system's memory. The CPU works more or less directly on this (via its cache); the GPU, however, cannot. It has to transfer the information required for calculations from the system memory to its own. This is done via an upload process and is time consuming. In the end the result has to be downloaded back to your system memory for your CPU to see and use it. Porting small functions to the GPU is not recommended, as the upload/download time will be larger than the time you gain from parallel execution.
Mat objects are stored only in the system memory (or the CPU cache). To get an OpenCV matrix onto the GPU you'll need to use its GPU counterpart, cv::cuda::GpuMat. It works similarly to Mat, with a 2D-only limitation and no reference returning for its functions (you cannot mix GPU references with CPU ones). To upload a Mat object to the GPU you need to call the upload function after creating an instance of the class. To download you may use simple assignment to a Mat object or use the download function.
Once you have your data up in the GPU memory you may call the GPU-enabled functions of OpenCV. Most of the functions keep the same name as on the CPU, with the difference that they only accept GpuMat inputs.
Another thing to keep in mind is that you cannot make efficient GPU algorithms for every channel count. Generally, I found that the input images for the GPU need to have either one or four channels, with char or float as the element type. No double support on the GPU, sorry. Passing other types of objects to some functions will result in a thrown exception and an error message on the error output. In most places the documentation details the types accepted for the inputs. If you have three-channel images as an input you can do two things: either add a new channel (and use char elements) or split up the image and call the function for each channel. The first one isn't really recommended, as it wastes memory.
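The two options for three-channel input can be pictured on a plain interleaved byte buffer. The sketch below uses only the standard library; `splitChannels` and `padToFourChannels` are hypothetical helpers written for this illustration and are not OpenCV functions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: split an interleaved 3-channel buffer
// (c0 c1 c2 c0 c1 c2 ...) into three single-channel planes.
std::array<std::vector<std::uint8_t>, 3>
splitChannels(const std::vector<std::uint8_t>& interleaved)
{
    const std::size_t pixels = interleaved.size() / 3;
    std::array<std::vector<std::uint8_t>, 3> planes;
    for (auto& plane : planes)
        plane.resize(pixels);

    for (std::size_t i = 0; i < pixels; ++i)
        for (std::size_t c = 0; c < 3; ++c)
            planes[c][i] = interleaved[3 * i + c];
    return planes;
}

// Hypothetical helper: pad every pixel with a fourth channel.
// The result is 4/3 the size of the input, which is the memory cost
// the text warns about.
std::vector<std::uint8_t>
padToFourChannels(const std::vector<std::uint8_t>& interleaved,
                  std::uint8_t fill = 255)
{
    const std::size_t pixels = interleaved.size() / 3;
    std::vector<std::uint8_t> padded;
    padded.reserve(pixels * 4);

    for (std::size_t i = 0; i < pixels; ++i)
    {
        padded.push_back(interleaved[3 * i + 0]);
        padded.push_back(interleaved[3 * i + 1]);
        padded.push_back(interleaved[3 * i + 2]);
        padded.push_back(fill); // the extra, otherwise unused channel
    }
    return padded;
}
```

Splitting keeps the total amount of data unchanged but means one function call per plane; padding keeps a single call but stores one extra byte per pixel.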
For some functions, where the position of the elements (neighbor items) doesn't matter, the quick solution is to reshape the image into a single-channel one. This is the case for the PSNR implementation, where the value of the neighbors is not important for the absdiff method. However, for the GaussianBlur this isn't an option, so we need to use the split method for the SSIM. With this knowledge you can write viable GPU code (like my GPU one) and run it. You'll be surprised to see that it might turn out slower than your CPU implementation.
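Why the reshape trick is safe for element-wise operations can be checked on a plain buffer: a sum of squared differences gives the same total whether the data is treated as one flat channel or as three interleaved ones. The sketch below uses only the standard library, and both function names are hypothetical helpers for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sum of squared differences over the buffer viewed as
// ONE flat channel, which is what reshaping to a single channel amounts to.
std::uint64_t sumSquaredDiffFlat(const std::vector<std::uint8_t>& a,
                                 const std::vector<std::uint8_t>& b)
{
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
    {
        const int d = static_cast<int>(a[i]) - static_cast<int>(b[i]);
        sum += static_cast<std::uint64_t>(d * d);
    }
    return sum;
}

// Hypothetical helper: the same quantity, accumulated separately for each of
// the three interleaved channels.
std::array<std::uint64_t, 3>
sumSquaredDiffPerChannel(const std::vector<std::uint8_t>& a,
                         const std::vector<std::uint8_t>& b)
{
    std::array<std::uint64_t, 3> sums{{0, 0, 0}};
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
    {
        const int d = static_cast<int>(a[i]) - static_cast<int>(b[i]);
        sums[i % 3] += static_cast<std::uint64_t>(d * d);
    }
    return sums;
}
```

The flat total always equals the sum of the three per-channel totals, because each element is processed on its own. A neighborhood operation such as a blur does not have this property: on the flat view it would mix values from different channels, which is why the SSIM path needs the split.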
The reason for this is that you're throwing out the window the price of memory allocation and data transfer. And on the GPU this is damn high. Another possibility for optimization is to introduce asynchronous OpenCV GPU calls with the help of cv::cuda::Stream.
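One common way to avoid paying the allocation price on every call is to keep scratch buffers alive between frames and reuse them. The sketch below shows that pattern on the CPU with the standard library only; it is an assumption about how the cost can be reduced, not the tutorial's code. `DiffBuffer`, its `allocations` counter and `sumSquaredDiff` are hypothetical names. The same idea applies in principle to GPU-side buffers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scratch buffer, owned by the caller and kept between frames.
struct DiffBuffer
{
    std::vector<float> diff;     // reused storage for per-element differences
    std::size_t allocations = 0; // counts how often the storage had to grow
};

// Hypothetical helper: sum of squared differences that writes its
// intermediate result into the caller's buffer instead of a fresh one.
// Returns -1.0 when the input sizes differ.
double sumSquaredDiff(const std::vector<std::uint8_t>& a,
                      const std::vector<std::uint8_t>& b,
                      DiffBuffer& buf)
{
    if (a.size() != b.size())
        return -1.0;

    if (buf.diff.capacity() < a.size())
    {
        buf.diff.reserve(a.size()); // the only place memory is allocated
        ++buf.allocations;
    }
    buf.diff.resize(a.size()); // no reallocation once capacity suffices

    double sse = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        buf.diff[i] = static_cast<float>(a[i]) - static_cast<float>(b[i]);
        sse += static_cast<double>(buf.diff[i]) * buf.diff[i];
    }
    return sse;
}
```

Calling the function for a whole video with one `DiffBuffer` triggers a single allocation for the first frame; every later frame of the same size reuses that memory.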
On an Intel P8700 laptop CPU paired with a low-end NVIDIA GT220M, here are the performance numbers:
In both cases we managed a performance increase of almost 100% compared to the CPU implementation. It may be just the improvement needed for your application to work. You may observe a runtime instance of this on YouTube here.