Application | Pulp |
Technology | 12 nm |
Manufacturer | GF |
Type | Research |
Package | Custom |
Dimensions | 10500μm x 6950μm |
Gates | 600 MGE |
Voltage | 1.2 V |
Power | 10 W @1GHz |
Clock | 1 GHz |
Occamy is a research prototype that demonstrates and explores the scalability, performance, and efficiency of our RISC-V-based architecture in a 2.5D integrated chiplet system. It showcases GlobalFoundries' technologies and IP ecosystem, as well as the IP ecosystems of Rambus and Micron.
The Occamy project started as a serendipitous outcome of the Manticore high-performance architecture concept we presented at the Hot Chips conference in 2020. After Hot Chips 2020, GlobalFoundries approached the PULP Platform team with an exciting proposal to turn the concept architecture into a real silicon design. The project was made possible by the generous contribution and strong support of GlobalFoundries (technology access, expert advice, ecosystem enablement, and silicon budget), Rambus (HBM2e controller IP and integration support), Micron (HBM2e DRAM supply and integration support), Synopsys (EDA tool licenses and support), and Avery (HBM2e DRAM verification model). We kick-started the Occamy project on the 20th of April 2021 and taped out the Occamy compute chiplet in GlobalFoundries 12 nm FinFET technology in July 2022, after less than 15 months of hard work by a team of fewer than 25 people, mostly doctoral students.
In this work, we combine a small and super-efficient, in-order, 32-bit RISC-V integer core called Snitch with a large multi-precision floating-point unit (FPU) enhanced with single instruction, multiple data (SIMD) capabilities. The FPU supports the following FP formats, given as (exponent bits, mantissa bits): FP64 (11,52), FP32 (8,23), FP16 (5,10), FP16alt (8,7), FP8 (5,2), and FP8alt (4,3). In addition to the standard RISC-V fused multiply-accumulate (FMA) instructions, the two 8-bit and two 16-bit FP formats support new expanding sum-dot-product and three-addend summation instructions (exsdotp, exvsum, and vsum).
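The (exponent, mantissa) pairs listed above determine each format's storage width and, for a given datapath width, how many SIMD lanes fit in one operand. The Python sketch below derives these values. It rests on two assumptions not stated in the text: a 64-bit FPU datapath (suggested by FP64 support), and IEEE 754-style encoding for all six formats when computing the largest finite value.

```python
# (exponent bits, mantissa bits) exactly as listed in the text.
FORMATS = {
    "FP64": (11, 52),
    "FP32": (8, 23),
    "FP16": (5, 10),
    "FP16alt": (8, 7),
    "FP8": (5, 2),
    "FP8alt": (4, 3),
}

DATAPATH_BITS = 64  # assumption: SIMD operands are packed into 64 bits


def describe(exp_bits, man_bits):
    """Derive basic properties of a binary FP format (IEEE 754-style assumed)."""
    width = 1 + exp_bits + man_bits          # sign + exponent + mantissa
    bias = 2 ** (exp_bits - 1) - 1           # IEEE 754-style exponent bias
    return {
        "width": width,
        "lanes": DATAPATH_BITS // width,     # SIMD elements per 64-bit operand
        "epsilon": 2.0 ** -man_bits,         # spacing of values just above 1.0
        "max_finite": (2 - 2.0 ** -man_bits) * 2.0 ** bias,
    }


TABLE = {name: describe(e, m) for name, (e, m) in FORMATS.items()}

for name, row in TABLE.items():
    print(f"{name:8s} {row['width']:2d} bits  {row['lanes']} lane(s)  "
          f"eps={row['epsilon']:.3g}  max={row['max_finite']:.4g}")
```

The two "alt" formats trade mantissa bits against exponent bits at the same width: FP16alt gives up precision for a larger dynamic range, while FP8alt gives up range for one extra bit of precision.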
To achieve ultra-efficient computation on data-parallel FP workloads, two custom architectural extensions are exploited: data-prefetchable register file entries and repetition buffers. The corresponding RISC-V ISA extensions, stream semantic registers (SSRs) and FP repetition instructions (FREP), enable the Snitch core to achieve FPU utilization higher than 90% for compute-bound kernels.
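To see why removing loads and loop bookkeeping matters on a single-issue in-order core, consider a toy instruction-count model of a dot product. This is a rough sketch, not a cycle-accurate model of Snitch: the per-element instruction counts and setup costs below are illustrative assumptions, and `fpu_utilization` is a hypothetical helper.

```python
def fpu_utilization(n_elems, instrs_per_elem, setup_instrs=0):
    """Fraction of issued instructions that are FMAs, one FMA per element.

    Assumes a single-issue core where every instruction takes one issue slot.
    """
    total_instrs = n_elems * instrs_per_elem + setup_instrs
    return n_elems / total_instrs


N = 1024  # vector length (illustrative)

# Baseline (assumed loop body): 2 loads + 1 FMA + 1 pointer update + 1 branch.
baseline = fpu_utilization(N, instrs_per_elem=5)

# SSRs only (assumed): operands stream in through registers, so the loads
# disappear; loop bookkeeping remains. A few instructions configure streams.
ssr_only = fpu_utilization(N, instrs_per_elem=3, setup_instrs=10)

# SSRs + FREP (assumed): the FMA is replayed from the repetition buffer, so
# the loop body is the FMA alone, plus a one-time setup cost.
ssr_frep = fpu_utilization(N, instrs_per_elem=1, setup_instrs=20)

for label, u in [("baseline", baseline), ("SSR", ssr_only), ("SSR+FREP", ssr_frep)]:
    print(f"{label:9s} {u:.1%}")
```

Under these assumptions, only the SSR+FREP configuration clears the 90% mark. The model ignores memory stalls and bank conflicts, so it gives an upper bound on utilization for each configuration.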
Each Occamy chiplet contains more than 216 Snitch cores organized in groups of four compute clusters. Each cluster shares a tightly-coupled memory among eight compute cores and a high-bandwidth (512-bit) DMA-enhanced core orchestrating the data flow. An AXI-based wide, multi-stage interconnect and dedicated DMA engines help manage the massive on-chip bandwidth. A CVA6 Linux-capable RISC-V core manages all compute clusters and system peripherals. Each chiplet has a private 16 GB high-bandwidth memory (HBM2e) and can communicate with a neighboring chiplet over a 19.5 GB/s source-synchronous, technology-independent die-to-die DDR link. The dual-chiplet Occamy system achieves an estimated peak performance of 0.768 TFLOp/s for FP64, 1.536 TFLOp/s for FP32, 3.072 TFLOp/s for FP16/FP16alt, and 6.144 TFLOp/s for FP8/FP8alt.
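The peak-performance figures can be reproduced with simple arithmetic. The sketch below makes several assumptions that the text does not spell out: exactly 216 Snitch cores per chiplet, nine cores per cluster (eight compute cores plus the DMA core), one SIMD FMA per compute core per cycle counted as two FLOPs per lane, and a 64-bit FPU datapath.

```python
CHIPLETS = 2
SNITCH_CORES_PER_CHIPLET = 216       # assumption: exactly 216
COMPUTE_CORES_PER_CLUSTER = 8        # from the text
CORES_PER_CLUSTER = COMPUTE_CORES_PER_CLUSTER + 1  # plus one DMA core
CLOCK_MHZ = 1000                     # 1 GHz, from the spec table
FLOPS_PER_FMA = 2                    # one multiply + one add
DATAPATH_BITS = 64                   # assumption: 64-bit SIMD datapath

FORMAT_BITS = {"FP64": 64, "FP32": 32, "FP16/FP16alt": 16, "FP8/FP8alt": 8}


def peak_gflops(format_bits):
    """Peak throughput of the dual-chiplet system in GFLOp/s."""
    clusters = SNITCH_CORES_PER_CHIPLET // CORES_PER_CLUSTER      # 24 per chiplet
    compute_cores = CHIPLETS * clusters * COMPUTE_CORES_PER_CLUSTER
    lanes = DATAPATH_BITS // format_bits                          # SIMD width
    return compute_cores * lanes * FLOPS_PER_FMA * CLOCK_MHZ // 1000


PEAK = {name: peak_gflops(bits) for name, bits in FORMAT_BITS.items()}

for name, gflops in PEAK.items():
    print(f"{name}: {gflops / 1000:.3f} TFLOp/s")
```

With these assumptions, the model yields 384 compute cores across both chiplets and matches the quoted values of 0.768, 1.536, 3.072, and 6.144 TFLOp/s, with throughput doubling each time the format width halves.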
Read more about Occamy on the PULP Platform website.