Machine Learning – Halide Programming Insights

In the last decade, Moore’s Law continued to provide more and more transistors per unit area. Initially, computer architects used those transistors to increase single-thread performance, designing power-hungry central processing units (CPUs).

Dusan Nastic and Ugljesa Milic, Software developers at HTEC Group

The Need for Heterogeneous SoCs in Mobile Industry

With higher clock frequencies and an immense variety of microarchitectural enhancements, programmers could write single-threaded code, wait for the next generation of CPUs, and expect their applications to run faster as if by magic. At some point, the power wall became the first-order constraint, creating a need for more energy-efficient processors.

As transistor counts kept growing, chip manufacturers started to put different types of specialized processing units on a single chip or within a single compute node. The main idea was to keep CPUs for typical, sequential workloads while offloading parallel and specialized applications to compute accelerators.

Today, such accelerators come in many flavors: the big.LITTLE architecture introduced by ARM, MICs from Intel integrated into supercomputers, GPUs from NVIDIA and AMD used for graphics, DSP units that handle voice and image processing, TPUs from Google, HPUs from Microsoft, and so on.

In the mobile industry, with limited power supply, cooling capability, and area, energy efficiency is crucial to providing a better user experience. The typical systems-on-chip (SoCs) found in mobile devices today are all heterogeneous, with different accelerators available.

For example, internet browsing and text editing run on CPUs, games run on GPUs, and DSPs execute various signal processing applications. Again, the benefit is not only higher performance but also lower energy consumption. Still, it comes at a cost: these systems are not easy to program, and they require extensive HW-SW expertise to squeeze out performance and use all the resources available on a given SoC.

Deep Learning Algorithms Becoming Popular

Recently, we have witnessed significant breakthroughs in machine learning, from both industry and academia. Processing huge amounts of data to train neural networks is usually done server-side, using more capable but power-hungry CPUs and GPUs. However, it is crucial to move the computation, both network training and inference, from the server side to edge devices, so that processing is done on your mobile phone rather than by sending data back to the cloud.

On-device processing is more secure, quicker, and more reliable, and it costs less for both the customer and the provider. The latest generation of mobile phones from Apple, Samsung, and Huawei all use some form of deep learning to provide user-friendly features such as face unlock. Moreover, some of them include dedicated compute units inside their SoCs, such as Apple’s Neural Engine in the A11 chip, to execute these popular workloads efficiently.

Improved Programmability by Using Halide

Programming heterogeneous systems is not an easy task. On the one hand, programming models can provide a rich set of instructions and API calls that allow finer control over the available resources, at the cost of complex code and a need for in-depth knowledge of HW-SW interaction.

On the other hand, they can provide high-level abstractions that produce lean and clean code, at the cost of performance. The Halide programming model tries to combine the best of both worlds. It was originally designed at MIT, with a focus on image processing pipelines. The main idea is to decouple user code into two parts: the algorithm and the schedule. The algorithm defines the logical part of the computation: what the inputs and outputs are, and what should be computed. The schedule defines how this algorithm should be executed on the available hardware.

For example, if a user wants to blur a two-dimensional array of pixels (an image), the algorithm implements the blur itself, while the schedule tells the compiler which parts should be parallelized, vectorized, prefetched, decomposed, tiled, and so on. Figure 1 shows a code snippet for this example. The main benefit of using Halide is the quick and easy development of algorithms with high-level wrapper definitions, followed by further performance improvement through scheduling optimizations.

Figure 1: Halide code for a 3×3 blur pipeline over a 2D input image. The algorithm is defined in just two lines, while the schedule tailors its execution to a given processor architecture (breaking the input image into tiles to improve locality, vectorizing across the inner dimension to utilize vector units, parallelizing across the outer dimension to run on multiple threads or cores, etc.). For more examples, take a look at the Halide homepage.
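The algorithm/schedule split can also be illustrated with a minimal plain-Python sketch. This is not Halide code and not the authors' implementation: the names `blur_x`, `blur_y`, `run_row_major`, and `run_tiled`, the two-stage formulation, the clamped borders, and the tile size are all assumptions made for illustration. The two pure functions play the role of the algorithm, while the two `run_*` functions are different schedules that visit the same output pixels in a different order.

```python
def clamp(v, lo, hi):
    """Keep an index inside [lo, hi] (assumed border handling for this sketch)."""
    return max(lo, min(v, hi))


# --- Algorithm: WHAT is computed, as pure functions of (x, y) ---

def blur_x(img, x, y):
    """Horizontal 3-tap average around pixel (x, y)."""
    w = len(img[0])
    left = img[y][clamp(x - 1, 0, w - 1)]
    right = img[y][clamp(x + 1, 0, w - 1)]
    return (left + img[y][x] + right) / 3.0


def blur_y(img, x, y):
    """Vertical 3-tap average of blur_x, giving a 3x3 blur overall."""
    h = len(img)
    up = blur_x(img, x, clamp(y - 1, 0, h - 1))
    down = blur_x(img, x, clamp(y + 1, 0, h - 1))
    return (up + blur_x(img, x, y) + down) / 3.0


# --- Schedules: HOW (in which order) the same outputs are produced ---

def run_row_major(img):
    """Visit output pixels row by row."""
    h, w = len(img), len(img[0])
    return [[blur_y(img, x, y) for x in range(w)] for y in range(h)]


def run_tiled(img, tile=2):
    """Visit output pixels tile by tile, a common locality optimization."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for ty in range(0, h, tile):          # outer loops walk over tiles
        for tx in range(0, w, tile):
            for y in range(ty, min(ty + tile, h)):      # inner loops stay
                for x in range(tx, min(tx + tile, w)):  # inside one tile
                    out[y][x] = blur_y(img, x, y)
    return out
```

Both schedules return identical images because only the traversal order changes. In this sketch the schedule is written by hand as loops, whereas in Halide it is stated declaratively and the compiler generates the loops.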

Putting It All Together

Here at HTEC, we are using the Halide programming model to implement and optimize deep learning workloads for heterogeneous mobile SoCs. Although Halide is focused on image processing, it can also be used to implement deep learning applications efficiently. Instead of pixels, convolutional networks operate on different element types (half-words and words), but they use the same data structures: N-dimensional arrays (vectors, matrices, tensors). Moreover, industry is supporting the Halide community in generating better code for different architectures and ISAs, which makes Halide suitable for heterogeneous mobile systems. Specifically, we are closely collaborating with Qualcomm, porting and optimizing the TensorFlow deep learning framework to run on their Hexagon DSPs.
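To make the link between image pipelines and convolutional layers concrete, here is a small plain-Python sketch of a 2D convolution over nested lists. The function name `conv2d_valid`, the kernels, and the toy image are hypothetical and chosen only for illustration; this is neither Halide nor TensorFlow code. It shows that a box blur and a learned-style filter share the same loop structure and differ only in the kernel weights.

```python
def conv2d_valid(inp, kernel):
    """Slide `kernel` over `inp` without padding ("valid" mode).

    The kernel is not flipped, which is the usual convention for
    convolution layers in deep learning.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(inp) - kh + 1
    out_w = len(inp[0]) - kw + 1
    return [
        [
            sum(
                inp[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# An unnormalized 3x3 box filter: dividing each result by 9 gives a blur.
box_kernel = [[1, 1, 1] for _ in range(3)]

# A vertical-edge filter of the kind a convolutional layer might learn.
edge_kernel = [[-1, 0, 1] for _ in range(3)]

# Toy 3x4 image with a dark left half and a bright right half.
image = [[0, 0, 1, 1] for _ in range(3)]

box_response = conv2d_valid(image, box_kernel)
edge_response = conv2d_valid(image, edge_kernel)
```

Because the loop nest is the same in both cases, scheduling ideas developed for image pipelines (tiling, vectorization, parallelization) carry over to such layers.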

Quite unlike a CPU, Qualcomm’s programmable DSP is an in-order VLIW core that runs up to four hardware threads and implements two wide vector units. Because deep learning algorithms are rich in data-level parallelism, they map naturally onto vector execution, which makes the DSP particularly interesting. Figure 2 compares the DSP with the CPU and GPU in terms of performance and energy consumption. If exploited efficiently, the Hexagon DSP can provide more performance than a CPU or even a GPU while consuming less energy: a win-win scenario for mobile devices.
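The data-level parallelism mentioned above can be sketched in plain Python with a dot product, the core operation of most neural network layers, processed in fixed-width chunks. The name `dot_in_lanes` and the lane count of 4 are assumptions for illustration only; they do not reflect the actual Hexagon vector width or instruction set.

```python
LANES = 4  # assumed lane count for this sketch, not a Hexagon parameter


def dot_in_lanes(a, b, lanes=LANES):
    """Dot product computed chunk by chunk, mimicking a vector unit."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")

    acc = [0] * lanes                      # one partial sum per lane
    n_full = len(a) - len(a) % lanes       # elements covered by full chunks

    for base in range(0, n_full, lanes):
        # Each lane works on its own element, independently of the others.
        # A vector unit would typically perform this inner loop as one
        # multiply-accumulate operation over all lanes at once.
        for lane in range(lanes):
            acc[lane] += a[base + lane] * b[base + lane]

    total = sum(acc)                       # horizontal reduction over lanes

    for i in range(n_full, len(a)):        # scalar tail for leftover elements
        total += a[i] * b[i]
    return total
```

Since no lane depends on another within a chunk, wider vector units can process more elements per step without changing the algorithm.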

Figure 2: According to Qualcomm, the Hexagon DSP is capable of providing 8X and 4X performance improvements compared to their Kryo CPU and Adreno GPU, respectively. At the same time, it is 8X and 25X more energy-efficient. More interesting comparisons can be found here.
