Cheap Deep Convolutional Neural Network Architectures


    Sources are listed at the end of the article

    Neural networks have existed since the late 1950s, but only developments in the last decade have invigorated the scientific community. Convolutional neural networks (CNNs) now outperform almost all traditional computer vision algorithms. As opposed to fully connected neural networks, CNNs make the inherent assumption that the input is an image (an n-dimensional array of pixels or numbers).

    In a traditional neural network, every node in layer (n+1) is connected to every node in layer n. This topology does not scale well with the size of the input. If we restrict a conventional neural network's connections, we can enable the network to learn some pretty interesting features. The input layer is convolved with a set of randomly initialized weights (visualized as filter kernels, as in a standard convolution such as Gaussian or Sobel filtering). A CNN layer is built from a set of predefined functions: a typical layer might consist of convolution, pooling (max or average) and a nonlinearity, and this pattern is repeated d times for a CNN of total depth c = d + f, where f is the number of fully connected layers. The fully connected layers are just a traditional neural network with a softmax function at the end to convert the outputs into probabilities for classification. The number of outputs typically denotes the number of classes the network can identify.
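
    To make this pattern concrete, here is a minimal sketch in PyTorch (a Python successor to the Torch framework mentioned below; any of the listed tools would do). The layer sizes are illustrative, not a reference design: d = 2 convolution/pooling blocks followed by f = 2 fully connected layers.

        import torch
        import torch.nn as nn

        model = nn.Sequential(
            # block 1 of d = 2: convolution, nonlinearity, pooling
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # block 2
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            # f = 2 fully connected layers; 32*8*8 assumes 32x32 RGB inputs
            nn.Linear(32 * 8 * 8, 128),
            nn.ReLU(),
            nn.Linear(128, 10),   # 10 outputs = 10 classes
            nn.Softmax(dim=1),    # converts scores into probabilities
        )

        probs = model(torch.randn(1, 3, 32, 32))  # one random 32x32 RGB image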

    There are a number of scientific computing frameworks (Torch, Caffe, Theano, TensorFlow) that let one build, tweak and train CNNs, with NVIDIA CUDA support available for spawning parallel computing threads. These tools have been a monumental aid for the average non-mathematician computer science student in building deeper and sometimes more efficient CNNs, but the very nature of these networks makes them bulky and memory intensive.

    While GPUs are essential aids in training complex and efficient CNNs (and are highly efficient at the job, owing to the massive parallelism of their architecture), making this technology available in the form of an SoC is still a dream. CNNs and neural networks in general can be used for a number of complex computing tasks (case in point: computer vision). Computer engineers in research groups all over the world are moving towards 'neuromorphic computing', and this paradigm has given rise to the need for efficient neuromorphic architectures.

    Realizing neuromorphic architectures in silicon poses challenges. These networks are memory and computation intensive, which makes memory organization a daunting task, as reading from and writing to DRAM costs clock cycles and energy. Efficient state-of-the-art CNN hardware accelerators today are primarily designed for inference, for the reasons listed below.

    Trained networks are massive. In AlexNet, a popular CNN, there are 55*55*96 = 290,400 neurons in the first convolutional layer, and each has 11*11*3 = 363 weights and 1 bias. Together, this adds up to 290,400 * 364 = 105,705,600 parameters in the first layer of the ConvNet alone, before weight sharing is taken into account (thanks, Andrej Karpathy's CS231n at Stanford). Even state-of-the-art GPUs are hard pressed to bring training time down from days to hours (using single precision). Custom hardware can organize the memory bandwidth for full utilization and speed up the process.
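
    Since that headline figure counts connections rather than stored weights, the arithmetic is worth a quick back-of-the-envelope check in Python (figures from the CS231n notes cited below):

        neurons = 55 * 55 * 96              # first-layer output volume: 290,400
        per_neuron = 11 * 11 * 3 + 1        # one 11x11x3 filter plus 1 bias = 364
        connections = neurons * per_neuron  # 105,705,600 without weight sharing
        # Weight sharing reuses each of the 96 filters across all 55*55 positions,
        # so only 96 * (11 * 11 * 3) + 96 = 34,944 parameters are actually stored,
        # but the multiply-accumulate count stays at roughly 105.7 million.
        shared = 96 * (11 * 11 * 3) + 96
        print(neurons, connections, shared)  # 290400 105705600 34944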

    Energy: GPUs consume a phenomenal amount of energy, and custom hardware can reduce that consumption. The aim of building a computation SoC is to drive down energy per computation (one of the holy grails of artificial intelligence).

    On-chip learning is currently not practical: apart from the massive time needed to train a network on a pre-decided number of classes, application-oriented machine learning hardware today is limited to carrying out inference in a feed-forward path.

    So what do you need to build custom CNN hardware? First of all, a reasonably small network. Custom hardware for deep learning is usually tailor-made for a task, e.g. MNIST or CIFAR-10. This keeps the parameter count low and hence DRAM communication to a minimum. Designing a deep network for custom hardware sometimes also involves restricting the parameters' data type; you basically have to figure out what works best for your desired accuracy. It goes without saying that the new network has to be extensively tested with a standard CNN framework such as Torch.
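
    As a rough illustration of what 'reasonably small' means, here is an MNIST-scale network sketched in PyTorch together with its parameter count (the architecture is made up for illustration, not a reference design):

        import torch.nn as nn

        small_net = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 10),  # a 28x28 MNIST input shrinks to 4x4 here
        )
        print(sum(p.numel() for p in small_net.parameters()))
        # ~6,000 parameters: plausible to keep on-chip, versus AlexNet's ~60 million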

    The second most important skill you need is hardware design, more specifically ASIC design experience with an emphasis on memory organization. The computations in deep networks are simple functions; it is the memory organization that decides whether your custom deep network hardware will be fast or slow. It is basically a trade-off between accuracy and memory (e.g. integers are easier to store and faster to operate on than floating point numbers, but then accuracy takes a hit compared to single precision on GPUs).
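
    A hedged sketch of that trade-off, again assuming PyTorch ('model' and 'test_loader' below are placeholders for a trained network and its test data): round every weight onto a simulated 8-bit fixed-point grid while keeping the arithmetic in float32, then re-measure accuracy to see what the cheaper storage costs.

        import torch

        def quantize_(net, bits=8):
            # Round each weight tensor onto a signed fixed-point grid of 2**bits levels.
            with torch.no_grad():
                for p in net.parameters():
                    scale = p.abs().max() / (2 ** (bits - 1) - 1)  # per-tensor scale
                    p.copy_(torch.round(p / scale) * scale)        # quantize in place

        def accuracy(net, loader):
            correct = total = 0
            with torch.no_grad():
                for images, labels in loader:
                    correct += (net(images).argmax(dim=1) == labels).sum().item()
                    total += labels.numel()
            return correct / total

        baseline = accuracy(model, test_loader)
        quantize_(model, bits=8)
        print(baseline, accuracy(model, test_loader))  # how much did 8 bits cost?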

    A more elegant way to curb this problem is to design a smaller network, or to make an existing network smaller through techniques like pruning and binary weights, a full treatment of which is beyond the scope of this post. The point, though, is that effective algorithmic, circuit and architecture techniques are essential to put these networks on hardware accelerators.
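
    Of these, magnitude pruning is the simplest to illustrate: zero out the smallest fraction of weights in each tensor, since zeroed weights need neither storage nor multiply-accumulates. A minimal sketch, assuming PyTorch and a trained network named 'model':

        import torch

        def prune_(net, fraction=0.5):
            with torch.no_grad():
                for p in net.parameters():
                    if p.dim() > 1:  # prune weight tensors, leave biases alone
                        k = max(1, int(fraction * p.numel()))
                        threshold = p.abs().flatten().kthvalue(k).values
                        p.mul_((p.abs() > threshold).to(p.dtype))

        prune_(model, fraction=0.5)  # half the weights become zero; re-test accuracy afterwards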

    Machine learning algorithms are by their very nature data intensive, and to accelerate them, modern SoC floorplans will dedicate specialized architectures to ML acceleration. A powerful platform for reconfigurable computing is the FPGA, which is increasingly finding use in machine learning hardware acceleration; some companies are even exploring hybrid CPU-FPGA architectures. While an efficient architecture can be laid out on an FPGA with relatively low man-hour investment and capital overhead, work needs to be done to bring down the energy consumption, as data flow from DRAM to logic consumes a lot of power and dominates the power consumption of a full CNN architecture. Ideas like moving computation closer to the memory, or 'in-memory computing', can help here. Architectures will continue to evolve, and very soon von Neumann architectures will see their silicon real estate being shared by new logic blocks designed especially for machine learning tasks. There is still a long way to go, but the journey has begun.

    Acknowledgement: Thanks to my colleague Isha Garg. Stimulating discussions with her shaped this article.

    References:
    CS231n: Convolutional Neural Networks for Visual Recognition, Stanford
    Alex Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks
    ImageNet Large Scale Visual Recognition Challenge (ILSVRC)



