Deep Learning Compilers

Soham Bhure
Dec 11, 2020

INTRODUCTION

Deep learning has had a profound impact on the technological world, powering intelligent applications across a multitude of fields. With the emergence of deep learning models such as CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), LSTMs (Long Short-Term Memory networks) and GANs (Generative Adversarial Networks), deep learning has truly pushed the limits of Artificial Intelligence.

Most applications of deep learning use what is known as a deep learning framework: a library or tool that lets developers build deep learning models quickly and easily. Frameworks are preferred over writing code from scratch because they save time, increase efficiency and are robust. Since deep learning has such a variety of applications, multiple frameworks including TensorFlow, Keras, PyTorch and MXNet have been developed to cater to specific needs. ONNX, for instance, was built primarily to increase interoperability between models from different frameworks.

Every DL framework relies on an important component: a compiler. Not unlike traditional compilers, which translate and optimize a piece of source code, deep learning compilers are used by deep learning frameworks to build the required models and perform the necessary functions. Deep learning compilers take framework models as input and generate optimized code for a variety of deep learning hardware as output. With the constantly increasing need for speed, DL compilers have to be efficient in design and optimized for heavy usage.
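To make this concrete, here is a minimal sketch of that flow using Apache TVM, one open-source deep learning compiler; the ONNX file name, input name and input shape are placeholders for whatever model is actually being compiled.

```python
# Sketch: framework/exchange-format model in, hardware-specific code out.
# "model.onnx", the input name and its shape are placeholders.
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")                       # model exported from a DL framework
mod, params = relay.frontend.from_onnx(
    onnx_model, shape={"input": (1, 3, 224, 224)})         # convert to the compiler's graph-level IR
with tvm.transform.PassContext(opt_level=3):               # apply graph- and operator-level optimizations
    lib = relay.build(mod, target="llvm", params=params)   # generate code for a concrete target (here: CPU)
lib.export_library("compiled_model.so")                    # save the hardware-specific artifact
```

The same idea holds for other DL compilers: a framework-level model goes in, and a library tuned for a specific hardware target comes out.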

Before diving into the architecture of a deep learning compiler, let’s take a look at the most popular deep learning frameworks and deep learning hardware categories.

FRAMEWORKS

TensorFlow:

TensorFlow is an open-source framework created by the Google Brain team for numerical computation and large-scale machine learning. It supports multiple languages including C++, Java, R and of course Python. It is built around dataflow graphs, which describe how data moves through a series of operations. TensorFlow is also designed for mobile and embedded deep learning and supports the Android Neural Networks API. It is the most popular and widely used deep learning framework.
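As a small illustration of the dataflow-graph idea, the sketch below uses tf.function to trace an ordinary Python function into a TensorFlow graph; the function and tensor shapes are invented for the example.

```python
# tf.function traces the Python code into a dataflow graph that TensorFlow
# can optimize and execute as a whole.
import tensorflow as tf

@tf.function
def affine(x, w, b):
    return tf.matmul(x, w) + b        # these ops become nodes in the dataflow graph

x = tf.random.normal([2, 3])
w = tf.random.normal([3, 4])
b = tf.zeros([4])
print(affine(x, w, b).shape)          # (2, 4)
print(affine.get_concrete_function(x, w, b).graph)   # the traced graph object
```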

Keras:

Keras is not a deep learning framework on its own; instead, it provides a high-level API that runs on top of backends such as TensorFlow and MXNet. Unlike TensorFlow, Keras is written in pure Python, and its main goal is to offer a user-friendly interface for building machine learning models.
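Here is a minimal sketch of that user-friendly interface: a tiny classifier defined and compiled with the Keras API. The layer sizes and the commented-out training data are placeholders.

```python
# A small Keras model: define, compile, and it is ready to train.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=5)   # x_train / y_train are placeholders
```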

PyTorch:

PyTorch was built by Facebook primarily for scientific computing and deep learning. It originated as a Python rewrite of Torch, a Lua-based deep learning framework. It is a dynamic framework that offers flexibility and high speed thanks to its hybrid frontend. PyTorch also provides a C++ frontend, which is useful for performance-critical work and for integrating closely with CUDA.
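A minimal sketch of PyTorch's define-by-run style is shown below; TinyNet and its layer sizes are hypothetical, and a GPU is used only if one is available.

```python
# The graph is built on the fly as ordinary Python code executes.
import torch
import torch.nn as nn

class TinyNet(nn.Module):                  # hypothetical example model
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

device = "cuda" if torch.cuda.is_available() else "cpu"
net = TinyNet().to(device)
out = net(torch.randn(32, 784, device=device))
print(out.shape)                           # torch.Size([32, 10])
```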

Caffe:

Caffe stands for Convolutional Architecture for Fast Feature Embedding. It was developed for deep learning and image classification at the University of California, Berkeley, and is used mainly for academic research projects as well as large-scale industrial applications in computer vision and speech recognition. Caffe is written in C++ and provides a Python interface for ease of use.

MXNet:

MXNet is yet another open-source deep learning framework designed to manage neural networks. It is highly scalable and supports a myriad of programming languages, including C++, Python, MATLAB, JavaScript, R, Go, Perl, Scala, Wolfram and Julia. MXNet also offers Gluon, a high-level interface similar in spirit to Keras.

Theano:

Theano is an open-source library for scientific computing. Available since 2007, it is one of the oldest deep learning frameworks and therefore has plenty of documentation for developers to refer to. It can run tasks faster than most other frameworks in a single-GPU environment, but it cannot hold the top position in multi-GPU environments.

ONNX:

ONNX (Open Neural Network Exchange) was developed by Microsoft in 2017 with the support of Facebook and Amazon. It enables users to define extensible computational graphs along with built-in operators and standard data types. It simplifies the process of transferring models between frameworks; for example, a model trained in PyTorch can be exported to ONNX and then loaded in Caffe2.
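As a small example of the exchange idea, the sketch below exports a toy PyTorch model to an ONNX file that another framework or runtime can load; the model, file name and tensor names are placeholders.

```python
# Export a PyTorch model to the ONNX format.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
dummy_input = torch.randn(1, 784)          # example input that fixes the graph's shapes
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["logits"])

# The resulting model.onnx can then be loaded elsewhere, e.g. with ONNX Runtime:
# import onnxruntime as ort
# session = ort.InferenceSession("model.onnx")
# outputs = session.run(None, {"input": dummy_input.numpy()})
```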

HARDWARE

DL hardware can be categorized into two main sections:

1. General-Purpose Hardware:

General-purpose hardware includes GPUs (graphics processing units), which achieve a high degree of parallelism thanks to their many cores. GPUs accelerate the matrix and vector operations at the heart of deep learning, improving both efficiency and speed of execution. One of the major companies producing GPUs is Nvidia.
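As a quick illustration, the snippet below asks TensorFlow which GPUs are visible and pins a matrix multiplication to the first one; it assumes nothing beyond a standard TensorFlow installation.

```python
# List the visible GPUs and run a computation on one of them.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("GPUs available:", gpus)

if gpus:
    with tf.device("/GPU:0"):              # place the matmul on the first GPU
        a = tf.random.normal([1024, 1024])
        b = tf.random.normal([1024, 1024])
        c = tf.matmul(a, b)
    print(c.device)
```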

2. Dedicated Hardware:

Although GPUs speed up calculations by performing them in parallel, they serve a variety of applications, including gaming and video rendering, and are not built solely for deep learning. The rapid expansion of DL applications has therefore driven major development of DL-specific hardware. The most famous device in this category is Google's TPU (Tensor Processing Unit). A TPU includes a Matrix Multiplier Unit (MXU), a Unified Buffer (UB) and an Activation Unit (AU), and is driven by CISC instructions from the host processor. TPUs are available in the cloud, with pricing that varies from region to region.

Google’s TPU
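For completeness, here is the usual sketch for targeting a Cloud TPU from TensorFlow (for example in Colab or on a Cloud TPU VM); it assumes a TPU is actually attached to the environment.

```python
# Connect to an attached Cloud TPU and place a Keras model on it.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()   # locate the TPU
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():                      # variables and compute go to the TPU cores
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")
```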

ARCHITECTURE

The architecture of a deep learning compiler is broadly similar to that of a traditional compiler: it is made up of two parts, a frontend and a backend. As in a traditional compiler, the frontend focuses on software-level (hardware-independent) optimization while the backend focuses on hardware-specific optimization. An intermediate representation (IR) spans both parts: the IR in the frontend is known as high-level IR or graph IR, while the IR in the backend is known as low-level IR or operator IR.

Computational Graph

Frontend -

The frontend of the compiler can be broadly divided into two main functions-

  1. Conversion: The frontend receives a DL framework model as input and converts it into a high-level IR, i.e. a computational graph. The high-level IR, also known as graph IR, represents computation and control flow; control flow simply means the order of instructions or statements. Because the high-level IR lives in the frontend of the DL compiler, it is hardware independent and aims to establish the control flow and the dependencies between operators and data (a toy sketch of such a graph follows this list).
  2. Optimize: Computational-graph optimizations are then performed, which reduce redundancy and boost efficiency. It is important to note that all of these optimizations are hardware independent and therefore generalize across diverse DL frameworks.
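Below is a toy, hardware-independent sketch of what such a graph IR might look like; the Node class and the operator names are invented purely for illustration and do not correspond to any particular compiler's IR.

```python
# A toy "graph IR": operators, their data dependencies, and nothing hardware-specific.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    op: str                                         # e.g. "input", "const", "matmul", "add", "relu"
    inputs: list = field(default_factory=list)      # names of producer nodes
    attrs: dict = field(default_factory=dict)

# y = relu(x @ W + b) expressed as a small computational graph
graph = [
    Node("x",   "input"),
    Node("W",   "const",  attrs={"shape": (784, 128)}),
    Node("b",   "const",  attrs={"shape": (128,)}),
    Node("mm",  "matmul", inputs=["x", "W"]),
    Node("add", "add",    inputs=["mm", "b"]),
    Node("y",   "relu",   inputs=["add"]),
]

for node in graph:
    print(f"{node.name:>4} = {node.op}({', '.join(node.inputs)})")
```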

Backend -

The backend of a DL compiler works in the following way-

  1. Conversion: The optimized computation graph (high-level IR) produced by the frontend is received as input. The high-level IR is then lowered to a low-level IR, which is designed specifically for hardware-specific optimization. As a result, the low-level IR generated on different systems will differ, reflecting the hardware characteristics of each particular system.
  2. Optimize: Hardware-specific optimizations are then applied to the low-level IR. These include intrinsic mapping, memory allocation and fetching, memory latency hiding, parallelization and loop-oriented optimizations (a toy illustration of this loop-level view follows this list). Finally, the optimized low-level IR is compiled with a just-in-time (JIT) or ahead-of-time (AOT) compiler to generate hardware-specific code.
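To give a feel for the loop-level view the backend works with, here is a toy sketch that writes out a matmul node as an explicit loop nest in plain Python; real compilers express this in their low-level IR and then generate machine code from it.

```python
# Lowering: the graph-level "matmul" operator as the explicit loop nest the
# backend optimizes (tiling, vectorization and parallelization happen here).
import numpy as np

def lowered_matmul(A, B):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):            # candidate for parallelization across cores
        for j in range(m):        # candidate for tiling / vectorization
            acc = 0.0
            for p in range(k):    # reduction loop; a target for unrolling
                acc += A[i, p] * B[p, j]
            C[i, j] = acc
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(lowered_matmul(A, B), A @ B)
```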

OPTIMIZATIONS

Frontend-

Frontend optimizations are computational-graph optimizations performed at the software level. At the most basic level, node-level optimizations are performed. These remove redundant nodes and replace high-cost nodes with lower-cost ones; NOP eliminations happen at this level. For instance, assume that A is a zero tensor and B is a constant tensor; then the node that sums A and B can be replaced with the already existing constant node B without affecting correctness.

At the next level, block-level optimizations are performed. These include algebraic simplification, which applies the rules of commutativity, associativity and distributivity, and operator fusion and operator sinking, which combine or reorder operators to simplify computation.

Finally, dataflow-level optimizations are performed. These include static memory planning, which enables efficient use of memory, and dead code elimination, which removes the parts of the graph that do not contribute to the program's output.
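Here is a self-contained toy sketch of two such passes on a small expression graph: eliminating the no-op addition from the example above, and then removing dead nodes; the graph encoding is invented for illustration.

```python
# Toy frontend passes: NOP elimination (add with a zero tensor) followed by
# dead code elimination on a tiny graph of (operator, inputs) entries.
graph = {
    "A":   ("zeros", []),          # zero tensor
    "B":   ("const", []),          # constant tensor
    "sum": ("add",   ["A", "B"]),  # A + B, which is just B
    "tmp": ("mul",   ["B", "B"]),  # computed but never used -> dead code
    "out": ("relu",  ["sum"]),
}
outputs = ["out"]

# Node-level pass: an add with a zero-tensor operand is a no-op, so rewire its
# consumers to read the other operand directly.
redirect = {}
for name, (op, ins) in graph.items():
    if op == "add" and any(graph[i][0] == "zeros" for i in ins):
        redirect[name] = next(i for i in ins if graph[i][0] != "zeros")
graph = {n: (op, [redirect.get(i, i) for i in ins]) for n, (op, ins) in graph.items()}

# Dataflow-level pass: dead code elimination keeps only nodes reachable from
# the graph outputs.
live, stack = set(), list(outputs)
while stack:
    n = stack.pop()
    if n not in live:
        live.add(n)
        stack.extend(graph[n][1])
graph = {n: v for n, v in graph.items() if n in live}

print(sorted(graph))   # ['B', 'out'] -- the zero tensor, the no-op add and 'tmp' are gone
```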

Backend-

Backend optimizations are hardware-dependent and vary from one target to another. They include efficient memory allocation and mapping, and better data reuse through techniques such as loop fusion and sliding windows.

Another optimization is auto-tuning, which chooses the set of parameters that lets the hardware process a model fastest. Auto-tuning can accelerate the search through parallelization and can reduce search time by applying genetic algorithms to the search space. Depending on the hardware, different cost models are selected to boost DL hardware performance.
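The sketch below captures the auto-tuning idea in miniature: time a tiled NumPy matmul for a handful of candidate tile sizes and keep the fastest. Real auto-tuners explore far larger parameter spaces, guided by cost models or genetic search.

```python
# Toy auto-tuning: benchmark a kernel under several parameter values and pick the best.
import time
import numpy as np

def tiled_matmul(A, B, tile):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.rand(512, 512), np.random.rand(512, 512)
timings = {}
for tile in (16, 32, 64, 128):             # the (tiny) search space of tunable parameters
    start = time.perf_counter()
    tiled_matmul(A, B, tile)
    timings[tile] = time.perf_counter() - start

best = min(timings, key=timings.get)
print("best tile size:", best, timings)
```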

TensorFlow’s XLA -

XLA (Accelerated Linear Algebra) is a domain-specific compiler for linear algebra used by the TensorFlow framework. It can accelerate TensorFlow models with potentially no source code changes.

Without XLA, when a TensorFlow program is run, all of the operations are executed individually by the TensorFlow executor, each dispatching to a precompiled GPU kernel implementation. XLA instead compiles clusters of operations into optimized kernels, improving speed and memory usage: most internal benchmarks run ~1.15x faster after XLA is enabled.

source — https://www.tensorflow.org/xla

XLA Architecture
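As a small example, the snippet below enables XLA for a single function via tf.function's jit_compile flag (older TensorFlow 2.x releases spelled it experimental_compile); the shapes are arbitrary.

```python
# Ask XLA to compile the traced function into fused, optimized kernels.
import tensorflow as tf

@tf.function(jit_compile=True)
def dense_layer(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([64, 256])
w = tf.random.normal([256, 128])
b = tf.zeros([128])
print(dense_layer(x, w, b).shape)    # (64, 128)
```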

Do check out my other work here: https://www.sohambhure.com/

Authors

  1. Tanya Agrawal (Medium: Tanya Agrawal)
  2. Soham Bhure (Medium: Soham Bhure)
  3. Mihir Tale (Medium: Mihir Tale)
  4. Ganesh Tarone (Medium: GANESH TARONE)
  5. Rutuja Walke (Medium: Rutuja walke)
