jialinding/ee180lab2

David Pan (napdivad), Jialin Ding (jding09)
group18

Optimizations:
1) We added compiler flags to the Makefile to encourage auto-vectorization. The flags we added were '-mfpu=neon -O3 -ftree-vectorize -funsafe-math-optimizations -c'. With these flags, the single-threaded implementation sped up to around 20 fps.

2) We restructured the code to encourage the compiler to vectorize, primarily in the grayScale and sobelCalc functions. First, we used const local variables as the end conditions of our loops, which lets the compiler determine the trip count of each loop. Second, we applied '& ~3' to the end condition, which rounds it down to a multiple of 4 and so tells the compiler that the trip count is divisible by 4. Third, we used local arrays in our key loops instead of the Mat inputs, which also increased the fps for reasons we could not confirm. We hypothesize that the compiler can reason about plain local arrays more easily than about OpenCV objects like Mat, and is therefore able to vectorize loops over them more readily. All key loops were vectorized except for the loop that calculates the x-convolution in sobelCalc. It was unclear what prevented that loop from vectorizing, since it is structurally very similar to the loop that calculates the y-convolution. We tried reordering the additions and multiplications in each loop, but we were unable to determine the cause of the discrepancy. With these changes, the single-threaded implementation sped up to 39 fps.
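The loop shape described in (2) can be sketched as follows. This is a minimal illustration, not the code from this repository: the function name grayscaleRestructured, the helper weightedGray, the BGR channel order, the integer weights 29/150/77, and the use of raw pointers in place of local arrays are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: integer approximation of a luma conversion.
// The weights (29, 150, 77) sum to 256 and are an assumption of this sketch.
static inline std::uint8_t weightedGray(int b, int g, int r)
{
    return static_cast<std::uint8_t>((29 * b + 150 * g + 77 * r) >> 8);
}

// Hypothetical function: converts numPixels interleaved BGR pixels to gray.
// Plain pointers stand in for the local arrays mentioned in the text.
void grayscaleRestructured(const std::uint8_t* bgr,
                           std::uint8_t* gray,
                           std::size_t numPixels)
{
    // Const local bound, rounded down to a multiple of 4, so the compiler
    // can see a fixed trip count that is divisible by 4.
    const std::size_t vecEnd = numPixels & ~static_cast<std::size_t>(3);

    // Main loop: the candidate for auto-vectorization.
    for (std::size_t i = 0; i < vecEnd; ++i) {
        gray[i] = weightedGray(bgr[3 * i], bgr[3 * i + 1], bgr[3 * i + 2]);
    }

    // Tail loop: handles the 0 to 3 pixels dropped by the rounding above.
    for (std::size_t i = vecEnd; i < numPixels; ++i) {
        gray[i] = weightedGray(bgr[3 * i], bgr[3 * i + 1], bgr[3 * i + 2]);
    }
}
```

Rounding the bound down means any leftover pixels need a separate scalar tail loop, as above, unless the image width is already a multiple of 4.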

3) We also tried using intrinsics in grayScale and sobelCalc. Our attempt can be seen in the attached file sobel_calc_intrinsics.cpp. However, this did not result in a significant increase in fps, likely because our implementation was not well optimized. We could have tried restructuring the code to perform fewer loads and stores or to improve locality. In the end, we found that pure code restructuring gave higher fps than our intrinsics version.
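For readers unfamiliar with NEON intrinsics, a grayscale conversion written with them might look roughly like the sketch below. It is not taken from sobel_calc_intrinsics.cpp: the function name grayscaleNeon, the helper weightedGrayScalar, the BGR channel order, and the weights 29/150/77 are assumptions, and a scalar fallback is included so the example also builds on non-ARM machines.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define SKETCH_HAVE_NEON 1
#else
#define SKETCH_HAVE_NEON 0
#endif

// Hypothetical scalar helper, used for the tail and as the non-ARM fallback.
static inline std::uint8_t weightedGrayScalar(int b, int g, int r)
{
    return static_cast<std::uint8_t>((29 * b + 150 * g + 77 * r) >> 8);
}

// Hypothetical function: converts numPixels interleaved BGR pixels to gray,
// 8 pixels per iteration when NEON is available.
void grayscaleNeon(const std::uint8_t* bgr,
                   std::uint8_t* gray,
                   std::size_t numPixels)
{
    std::size_t i = 0;

#if SKETCH_HAVE_NEON
    const uint8x8_t wB = vdup_n_u8(29);
    const uint8x8_t wG = vdup_n_u8(150);
    const uint8x8_t wR = vdup_n_u8(77);
    const std::size_t vecEnd = numPixels & ~static_cast<std::size_t>(7);

    for (; i < vecEnd; i += 8) {
        // De-interleaving load: val[0] = B, val[1] = G, val[2] = R.
        const uint8x8x3_t px = vld3_u8(bgr + 3 * i);

        // Widening multiply-accumulate into 16-bit lanes.
        uint16x8_t acc = vmull_u8(px.val[0], wB);
        acc = vmlal_u8(acc, px.val[1], wG);
        acc = vmlal_u8(acc, px.val[2], wR);

        // Shift right by 8 and narrow back to 8 bits, then store 8 results.
        vst1_u8(gray + i, vshrn_n_u16(acc, 8));
    }
#endif

    // Scalar path: the remaining pixels, or all pixels without NEON.
    for (; i < numPixels; ++i) {
        gray[i] = weightedGrayScalar(bgr[3 * i], bgr[3 * i + 1], bgr[3 * i + 2]);
    }
}
```

Because the weights sum to 256, the 16-bit accumulator cannot overflow for 8-bit inputs, and the NEON and scalar paths produce identical results.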

4) We implemented multithreading by giving half of each image to each of two threads. The two threads call grayScale and sobelCalc in parallel, each processing its half of the image. To synchronize the two threads, we created two barriers: one for grayScale and one for sobelCalc. The threads synchronize four times per frame: before calling grayScale, after returning from grayScale, before calling sobelCalc, and after returning from sobelCalc. With these changes, the multi-threaded implementation sped up to 56 fps.
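The two-thread split with a barrier between stages can be sketched as below. This is a simplified illustration rather than the code from this repository: SimpleBarrier and processFrameTwoThreads are hypothetical names, the barrier is hand-written with a mutex and condition variable as a stand-in for whatever barrier primitive the project uses, only the barrier between the two stages is shown, and the second stage is a plain vertical difference instead of the full Sobel filter.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical reusable barrier for a fixed number of threads.
class SimpleBarrier {
public:
    explicit SimpleBarrier(int count)
        : threshold_(count), waiting_(0), generation_(0) {}

    void wait()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned gen = generation_;
        if (++waiting_ == threshold_) {
            // Last thread to arrive releases the others.
            ++generation_;
            waiting_ = 0;
            cv_.notify_all();
        } else {
            cv_.wait(lock, [&] { return gen != generation_; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const int threshold_;
    int waiting_;
    unsigned generation_;
};

// Hypothetical function: each thread grayscales its half of the rows, waits
// at the barrier, then computes a vertical difference on its half.
std::vector<std::uint8_t> processFrameTwoThreads(
    const std::vector<std::uint8_t>& bgr, int rows, int cols)
{
    const std::size_t n = static_cast<std::size_t>(rows) * cols;
    std::vector<std::uint8_t> gray(n), edges(n, 0);
    SimpleBarrier afterGray(2);

    auto worker = [&](int rowBegin, int rowEnd) {
        // Stage 1: grayscale for this thread's rows (assumed BGR weights).
        for (int r = rowBegin; r < rowEnd; ++r) {
            for (int c = 0; c < cols; ++c) {
                const std::size_t p = static_cast<std::size_t>(r) * cols + c;
                gray[p] = static_cast<std::uint8_t>(
                    (29 * bgr[3 * p] + 150 * bgr[3 * p + 1] + 77 * bgr[3 * p + 2]) >> 8);
            }
        }

        // Stage 2 reads rows owned by the other thread near the split, so
        // both halves of the gray image must be complete first.
        afterGray.wait();

        for (int r = std::max(rowBegin, 1); r < std::min(rowEnd, rows - 1); ++r) {
            for (int c = 0; c < cols; ++c) {
                const std::size_t p = static_cast<std::size_t>(r) * cols + c;
                edges[p] = static_cast<std::uint8_t>(
                    std::abs(static_cast<int>(gray[p + cols]) - static_cast<int>(gray[p - cols])));
            }
        }
    };

    const int mid = rows / 2;
    std::thread t0(worker, 0, mid);
    std::thread t1(worker, mid, rows);
    t0.join();
    t1.join();
    return edges;
}
```

The barrier matters because the filter at the boundary row of one half needs grayscale values from the other half. Without it, a thread could read pixels that the other thread has not written yet.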
