Unlocking the promise of approximate computing for on-chip AI acceleration


Bruce Fleischer, Sunil Shukla


Although matrix multiplication is dominant, optimizing performance efficiency while maintaining accuracy requires the core architecture to efficiently support all of the auxiliary functions. Furthermore, our architecture offers native support for convolutional operations, allowing deep learning training and inference tasks on image and speech data to run with exceptional efficiency on the core. Credit: IBM

As an illustration of how the core architecture has been optimized for a variety of deep learning functions, Figure 1 shows the breakdown of operation types within deep learning algorithms across a spectrum of application domains. The dominant matrix multiplication components are computed in the core by a customized dataflow organization of the Processing Elements shown in Figures 2 and 3, where reduced-precision computation can be efficiently exploited. The remaining vector functions (all of the non-red bars in Figure 1) are executed in either the Processing Elements or the Special Function Units shown in Figures 3 and 4, depending on the precision needs of each function.

Using this testchip, built in 14LPP technology, we have successfully demonstrated both training and inferencing across a broad deep learning library, exercising all operations commonly used in deep learning tasks, including matrix multiplications, convolutions and various non-linear activation functions.
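To make the precision split concrete, the sketch below mimics in software how the dominant matrix-multiply work could be routed to reduced-precision arithmetic while auxiliary vector functions run at higher precision. It is a minimal illustration using NumPy with invented names (dispatch, REDUCED_PRECISION_OPS); it is not the core's actual programming interface.

```python
import numpy as np

# Hypothetical precision assignment: matrix multiplications and convolutions
# tolerate reduced precision (fp16), while auxiliary vector functions such as
# activations are kept at higher precision (fp32).
REDUCED_PRECISION_OPS = {"matmul", "conv"}

def dispatch(op_name, *tensors):
    """Route an operation to reduced- or full-precision arithmetic.

    This mimics, in software, sending dominant matrix-multiply work to
    reduced-precision Processing Elements while other vector functions
    execute at higher precision. Names here are illustrative only.
    """
    if op_name in REDUCED_PRECISION_OPS:
        # Cast inputs to fp16 to model reduced-precision compute,
        # then accumulate the product in fp32 to preserve accuracy.
        a, b = (t.astype(np.float16) for t in tensors)
        return np.matmul(a.astype(np.float32), b.astype(np.float32))
    else:
        # Auxiliary vector functions stay in fp32; ReLU as an example.
        (x,) = tensors
        return np.maximum(x.astype(np.float32), 0.0)

# Example: a reduced-precision matmul followed by a full-precision activation.
a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 32).astype(np.float32)
y = dispatch("relu", dispatch("matmul", a, b))
print(y.dtype, y.shape)  # float32 (64, 32)
```

The point of the split is that the accuracy-sensitive steps (accumulation, activations, normalization-style vector work) keep higher precision, so the reduced-precision matrix engine can deliver its efficiency gains without degrading end-to-end training or inference accuracy.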

