Thread 01: Merged Logic and Memory Fabrics

Of great interest are hardware architectures composed of new device technologies that are (i) compatible with existing MOSFET technology and (ii) enable merged logic and memory fabrics, where logic and the memory that stores its data are integrated at fine granularities rather than on separate chips. These fabrics can support computations central to ML algorithms such as multiply-accumulate (MAC) operations (y = Σ wi·xi, where xi represents an input pixel value and wi represents a learned filter weight). Recent work suggests that a single MAC operation requires ~3.2 pJ of energy. However, if a weight value wi is stored in off-chip memory and must be brought to the processor for computation, 640 pJ of energy are required just to fetch the filter value! The energy of the memory request overwhelms that of the computation itself. Co-locating more weight data with logic for ML computations could enable more sophisticated algorithms in battery-powered edge devices in the Internet of Things (IoT). Merged logic and memory fabrics may also lead to more efficient encryption algorithms, non-volatile processors, etc.
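The imbalance above can be made concrete with a back-of-the-envelope sketch using the two figures quoted in the text (~3.2 pJ per MAC, ~640 pJ per off-chip weight fetch); the helper function and layer sizes are illustrative, not from the original:

```python
# Energy cost of a multiply-accumulate (MAC) vs. an off-chip weight fetch,
# using the figures quoted above (~3.2 pJ/MAC, ~640 pJ/fetch).
E_MAC_PJ = 3.2      # energy of one on-chip MAC operation, in picojoules
E_FETCH_PJ = 640.0  # energy to fetch one weight from off-chip memory, in picojoules

def layer_energy_pj(num_macs: int, weights_fetched: int) -> float:
    """Total energy (pJ) for one layer: compute plus off-chip weight traffic."""
    return num_macs * E_MAC_PJ + weights_fetched * E_FETCH_PJ

# A single off-chip fetch costs as much as 200 MAC operations:
ratio = E_FETCH_PJ / E_MAC_PJ  # 200.0
```

Keeping weights resident next to the logic drives the second term toward zero, which is the core motivation for merged logic and memory fabrics.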

Examples of projects – and application drivers – associated with this research thread include but are not limited to:

Content Addressable Memories

Content-addressable memories (CAMs) find the best match to a new query in O(1) time. A CAM returns the address of the entry (or entries) that most closely matches the supplied data (and hence the subsequent classification) by performing an XNOR operation in each cell to indicate a match or mismatch.
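The matching rule above can be sketched as a short behavioral model (a software analogy of the per-cell XNOR, not a circuit description; in hardware all rows are compared in parallel, which is what yields the O(1) search):

```python
def cam_search(entries, query):
    """Behavioral model of a CAM lookup.

    Each cell XNORs its stored bit with the corresponding query bit;
    a row matches only if every cell reports agreement (all XNORs are 1).
    Returns the addresses of all matching rows.
    """
    matches = []
    for addr, word in enumerate(entries):
        # XNOR of two bits is 1 exactly when they agree, i.e. not (b ^ q)
        if all(not (b ^ q) for b, q in zip(word, query)):
            matches.append(addr)
    return matches

stored = [(1, 0, 1, 1), (0, 0, 1, 1), (1, 0, 1, 0)]
print(cam_search(stored, (0, 0, 1, 1)))  # -> [1]
```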

We are working to apply these memories to support emerging machine learning models (e.g., from Google DeepMind). More specifically, deep neural networks are efficient at learning from large sets of labelled data, but struggle to adapt to previously unseen data. In pursuit of generalized artificial intelligence, one approach is to augment neural networks with an attentional memory so that they can draw on previously learned knowledge patterns and adapt to new but similar tasks. In current implementations of such memory-augmented neural networks (MANNs), the content of a network’s memory is typically transferred from the memory to the compute unit (a central processing unit or graphics processing unit) to calculate similarity or distance norms. The processing unit hardware incurs substantial energy and latency penalties associated with transferring the data from the memory and updating the data at random memory addresses. We have shown that ternary content-addressable memories (TCAMs) can be used as attentional memories, in which the distance between a query vector and each stored entry is computed within the memory itself, thus avoiding data transfer. We have developed compact and energy-efficient TCAM cells based on two ferroelectric field-effect transistors. We evaluated the performance of our ferroelectric TCAM array prototype for one- and few-shot learning applications. When compared with a MANN in which cosine distance calculations are performed on a graphics processing unit, the ferroelectric TCAM approach provides a 60-fold reduction in energy and a 2,700-fold reduction in latency for a single memory search operation.
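The in-memory distance computation described above can be illustrated with a behavioral model of a ternary match. A TCAM cell stores 0, 1, or "don't care" (X); the sketch below (an illustration of best-match search by mismatch count, not a model of the ferroelectric circuit itself) counts, per row, how many specified cells disagree with the query:

```python
def tcam_distances(entries, query):
    """Per-row mismatch counts for a TCAM search.

    Cells storing 'X' ("don't care") match either query bit; the distance
    for a row is the number of specified cells that disagree with the query.
    """
    return [
        sum(1 for cell, q in zip(word, query) if cell != "X" and cell != q)
        for word in entries
    ]

def best_match(entries, query):
    """Address of the stored entry closest to the query (smallest distance)."""
    dists = tcam_distances(entries, query)
    return min(range(len(entries)), key=dists.__getitem__)

stored = ["10X1", "0XX1", "1100"]
print(tcam_distances(stored, "1011"))  # -> [0, 1, 3]
print(best_match(stored, "1011"))      # -> 0
```

In a MANN, rows would hold stored feature vectors and the row with the smallest distance identifies the most similar remembered pattern, all without moving memory contents to a CPU or GPU.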

Representative work: Kai Ni, Xunzhao Yin, Ann Franchesa Laguna, Siddharth Joshi, Stefan Dunkel, Martin Trentzsch, Johannes Mueller, Sven Beyer, William Taylor, Michael T. Niemier, Xiaobo Sharon Hu, and Suman Datta, “Ferroelectric Ternary Content Addressable Memory for One-Shot Learning,” Nature Electronics, 2(11), p. 521-529, 2019.

Compute-in-Memory Architectures

Homomorphic encryption (HE) allows direct computation on encrypted data. Despite numerous research efforts, the practicality of HE schemes remains to be demonstrated. In particular, the enormous size of the ciphertexts involved in HE computations necessitates a high volume of data movement, which degrades computational efficiency. Near-memory processing (NMP) and computing-in-memory (CiM), paradigms in which computation is performed within the memory boundaries, represent architectural solutions for reducing the latency and energy associated with data transfers in data-intensive applications such as HE. Prior research shows that these paradigms can support searches on encrypted data with lower runtime and energy consumption than CPU or ASIC solutions, but no CiM or NMP designs exist for HE schemes that support general computation. We are working to develop CiM-HE, an architecture that can support operations for the B/FV scheme, a somewhat homomorphic encryption scheme for general computation. The CiM-HE hardware consists of customized peripherals such as sense amplifiers, adders, bit-shifters, and sequencing circuits. These peripherals are based on CMOS technology and could support computations with memory cells of different technologies.
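B/FV itself is a lattice-based scheme and too involved for a short sketch, but the core idea of computing directly on ciphertexts can be shown with textbook Paillier encryption (a different, additively homomorphic scheme, used here purely as an illustration; the tiny key size below is of course insecure):

```python
import math
import random

# Toy Paillier keypair (textbook construction, NOT the B/FV scheme).
p, q = 61, 53
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

# Modular inverse via Python 3.8+ three-argument pow with exponent -1.
mu = pow(L(pow(g, lam, n2)), -1, n)

def encrypt(m):
    """Encrypt plaintext m (0 <= m < n) with fresh randomness r."""
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Homomorphic property: multiplying ciphertexts adds the plaintexts,
# without ever decrypting the operands.
c = (encrypt(12) * encrypt(30)) % n2
print(decrypt(c))  # -> 42
```

A CiM engine exploits exactly this kind of structure: the ciphertext arithmetic (modular additions, multiplications, shifts) is performed by peripherals at the memory array rather than after shipping large ciphertexts to a CPU.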

Representative work: Dayane Reis, Michael Niemier, and X. Sharon Hu, “A Computing-in-Memory Engine for Searching on Homomorphically Encrypted Data,” IEEE JxCDC, 2019. DOI: 10.1109/JXCDC.2019.2931889.

Fine-grained Logic-in-Memory

As we approach the limits of CMOS scaling, researchers are developing “beyond-CMOS” technologies to sustain the technological benefits associated with device scaling. Spintronic technologies have emerged as a promising beyond-CMOS candidate due to their inherent benefits over CMOS such as high integration density, low leakage power, radiation hardness, and non-volatility. These benefits make spintronic devices an attractive successor to CMOS, especially for memory circuits. However, spintronic devices generally suffer from slower switching speeds and higher write energy, which limits their usability. In an effort to close the energy-delay gap between CMOS and spintronics, device concepts such as CoMET (Composite-Input Magnetoelectric-based Logic Technology) have been introduced, which collectively leverage material phenomena such as the spin-Hall effect and the magnetoelectric effect to enable fast, energy-efficient device operation. We have developed structures such as a non-volatile flip-flop (NVFF) based on CoMET technology that can achieve up to two orders of magnitude less write energy than CMOS. This low write energy (~2 aJ) makes our CoMET NVFF especially attractive for architectures that require frequent backup operations, e.g., energy-harvesting non-volatile processors.
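The backup-energy argument can be put into numbers. The ~2 aJ per NVFF write comes from the text above; the CMOS baseline below is derived from the "up to two orders of magnitude" claim, and the 2,048-bit state size is an arbitrary illustrative assumption:

```python
E_COMET_WRITE_AJ = 2.0                    # ~2 aJ per CoMET NVFF write (quoted above)
E_CMOS_WRITE_AJ = E_COMET_WRITE_AJ * 100  # assumed 100x baseline (two orders of magnitude)

def checkpoint_energy_aj(num_flip_flops: int, e_write_aj: float) -> float:
    """Energy (aJ) to back up every flip-flop of the architectural state once."""
    return num_flip_flops * e_write_aj

# Backing up a hypothetical 2,048-bit architectural state on each power loss:
comet = checkpoint_energy_aj(2048, E_COMET_WRITE_AJ)  # 4,096 aJ
cmos = checkpoint_energy_aj(2048, E_CMOS_WRITE_AJ)    # 409,600 aJ
```

For an energy-harvesting processor that checkpoints on every brownout, that 100x gap directly determines how little harvested energy must be banked before a safe backup is possible.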

Representative work: Robert Perricone, Zhaoxin Liang, Meghna Mankalale, Michael Niemier, Sachin S. Sapatnekar, Jian-Ping Wang, and X. Sharon Hu, “An Energy Efficient Non-Volatile Flip-Flop based on CoMET Technology,” in Design, Automation and Test in Europe (DATE), 2019.