DOI: 10.1145/3822173 ISSN: 1544-3566

RevaMp3D: Architecting the Processor Core and Cache Hierarchy for Systems with Monolithically-Integrated Logic and Memory

Nika Mansouri Ghiasi, Mohammad Sadrosadati, Geraldo Oliveira, Konstantinos Kanellopoulos, Rachata Ausavarungnirun, Juan Gómez Luna, João Ferreira, Jeremie S. Kim, Christina Giannoula, Nandita Vijaykumar, Jisung Park, Onur Mutlu

Recent nano-technological advances enable the Monolithic 3D (M3D) integration of multiple memory and logic layers in a single chip. Such integration enables high-bandwidth connections between layers, which significantly alleviates main memory bottlenecks. We show for a variety of workloads, on a state-of-the-art M3D-based system, that the performance and energy bottlenecks shift from main memory to the processor core and cache hierarchy. Therefore, to effectively utilize the chip’s area, given the applications’ shifted requirements, there is a need to revisit current processor core and cache hierarchy designs that have been conventionally tailored to tackle the memory bottleneck. Based on the insights from our design space exploration, we propose RevaMp3D, which introduces five key changes to the state-of-the-art M3D-based system. First, we propose removing the shared last-level cache, based on our observation that doing so achieves speedups on par with, or even exceeding, speedups achieved by increasing its size or reducing its latency, across all our workloads. Second, since our analysis shows that improving L1 cache latency has a large impact on performance in M3D, we reduce L1 cache latency by leveraging an M3D layout that reduces wire lengths. Third, we leverage the area reclaimed from removing large caches to widen and scale up various structures in the processor pipeline that enable greater instruction-level parallelism. To avoid latency penalties from these larger structures, we leverage M3D layouts that keep their wire lengths short. Fourth, to facilitate high thread-level parallelism, we propose a new fine-grained synchronization technique, using M3D’s dense inter-layer connectivity. Fifth, we leverage the M3D main memory to mitigate the performance and energy bottlenecks of the processor core. To this end, we propose a processor frontend design that memoizes the repetitive fetched, decoded, and reordered instructions, stores them in main memory at low cost, and turns off the relevant parts of the core when possible. The high-bandwidth, energy-efficient M3D memory enables storing and loading the memoized instructions efficiently, eliminating the need for large SRAM for storing the instructions. Our evaluation using a wide range of 20 real-world workloads and 7 multi-programmed mixes shows that RevaMp3D provides 1.2×–2.9× speedup and 1.2×–1.4× energy reduction, while achieving 12.3% smaller area, compared to the state-of-the-art M3D-based system. Compared to the state-of-the-art 3D and 2D systems, RevaMp3D provides 4.96× and 7.14× average speedup, respectively. We also analyze the impact of RevaMp3D’s design decisions for M3D systems with various main memory latency values since latency can vary depending on the design decisions made to meet certain requirements of the target system. This analysis facilitates making the appropriate design decisions based on latency, thereby benefiting a wide range of workloads.
