Machine learning (ML) accelerators have been studied and used extensively to
compute ML models with high performance and low power. However, designing such
accelerators normally takes a long time and requires significant effort.
Unfortunately, the pace of development of ML software models is much faster
than the accelerator design cycle, leading to frequent and drastic
modifications in the model architecture, thus rendering many accelerators
obsolete. Existing design tools and frameworks can provide quick accelerator
prototyping, but only for a limited range of models that can fit into a single
hardware device, such as an FPGA. Furthermore, with the emergence of large
language models, such as GPT-3, there is an increased need for hardware
prototyping of these large models within a many-accelerator system to ensure
the hardware can scale with the ever-growing model sizes. In this paper, we
propose an efficient and scalable approach for exploring accelerator systems to
compute large ML models. We developed a tool named MASE that can directly map
large ML models onto an efficient streaming accelerator system. Over a set of
ML models, we show that MASE can achieve better energy efficiency to GPUs when
computing inference for recent transformer models. Our tool will open-sourced
upon publication.