A four-core PULP implementation using third-generation or10n cores, with 192 Kbytes of L2 memory and 64 Kbytes of TCDM. The chip was manufactured in the same technology as Mia Wallace and has a similar size, but adds many improvements:
- New instructions for vector processing and fixed-point arithmetic using Q15 and Q31
  - Dot product and accumulate between two vectors
  - Multiply, accumulate and shift
  - Multiply and subtract
  - Clip, used for saturation
  - Add/subtract with normalization
  - Bit set, clear, extract
  - Shuffle
- Improvements to the DMA
  - Added support for multiple transfer IDs on the same private queue.
  - Added separate queues for linear and 2D transfers to optimize the area of the command queue.
  - Added support for non-incrementing bursts.
  - Added support for 2D transfers.
- Improvements to the HW Convolution Engine (HWCE)
  - Support for multiple input/output features
  - Vectorized convolutions with reduced-precision weights. For example, the HWCE can compute 1 pixel/cycle for four features with 4-bit weights, two features with 8-bit weights, or a single feature with 16-bit weights.
  - Optimized bandwidth utilization
  - Optimized power consumption by fine-grained architectural clock gating
- Added a cryptographic accelerator
- Developed an I/O DMA to enable direct memory transfers from peripherals to L2 memory when the cluster is idle. On the peripheral side, the I/O DMA connects to the I2S (master and slave), I2C, SPI (master and slave) and UART interfaces. The I/O DMA is connected to the L2 memory through a high-priority port, avoiding the need for large internal FIFOs.
- Improved power management architecture. Several power modes have been implemented for cluster and SoC.
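
To make the fixed-point instruction list above more concrete, the following is a minimal portable C sketch of the arithmetic patterns that such instructions typically accelerate. It is a software model only: the names `v2s`, `clip`, `q15_mul`, `mac_shift` and `sdot_acc` are hypothetical, and the exact semantics, operand widths and rounding of the or10n instructions are not stated here and may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 2-lane vector of 16-bit values (two Q15 numbers in one word). */
typedef struct { int16_t lane[2]; } v2s;

/* Clip: saturate x to the signed range representable in `bits` bits. */
static int32_t clip(int32_t x, int bits) {
    assert(bits >= 1 && bits <= 32);
    int32_t hi = (int32_t)((1u << (bits - 1)) - 1u);
    int32_t lo = -hi - 1;
    if (x > hi) return hi;
    if (x < lo) return lo;
    return x;
}

/* Q15 multiply: a Q15 x Q15 product is Q30, so shift right by 15 to get
 * back to Q15, then saturate. Right-shifting a negative value is
 * implementation-defined in C; this sketch assumes an arithmetic shift. */
static int16_t q15_mul(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;
    return (int16_t)clip(p >> 15, 16);
}

/* Multiply, accumulate and shift: (acc + a*b) >> shift, computed in 64 bits
 * so the intermediate sum cannot overflow. */
static int32_t mac_shift(int32_t acc, int16_t a, int16_t b, int shift) {
    assert(shift >= 0 && shift < 32);
    int64_t sum = (int64_t)acc + (int64_t)a * (int64_t)b;
    return (int32_t)(sum >> shift);
}

/* Dot product and accumulate between two 2-lane vectors. */
static int32_t sdot_acc(int32_t acc, v2s a, v2s b) {
    return acc + (int32_t)a.lane[0] * b.lane[0]
               + (int32_t)a.lane[1] * b.lane[1];
}
```

In Q15, the value 16384 represents 0.5, so `q15_mul(16384, 16384)` yields 8192 (0.25). The only Q15 product that needs saturation is -1.0 times -1.0, which `clip` limits to 32767.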
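
The 2D transfers mentioned in the DMA improvements can be pictured as copying a rectangular tile out of a larger strided buffer into a densely packed one. The sketch below models that behavior in plain C with `memcpy`. The `dma2d_desc` structure and its field names are illustrative assumptions and do not describe the actual command format or register layout of the DMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor for a 2D transfer (not the real command format). */
typedef struct {
    const uint8_t *src;    /* first byte of the tile inside the strided buffer */
    uint8_t *dst;          /* densely packed destination buffer */
    size_t line_bytes;     /* bytes copied per line */
    size_t stride_bytes;   /* distance between consecutive line starts in src */
    size_t num_lines;      /* number of lines in the tile */
} dma2d_desc;

/* Software model: one linear copy per line, stepping the source by the stride. */
static void dma2d_model(const dma2d_desc *d) {
    assert(d != NULL && d->src != NULL && d->dst != NULL);
    for (size_t i = 0; i < d->num_lines; i++) {
        memcpy(d->dst + i * d->line_bytes,
               d->src + i * d->stride_bytes,
               d->line_bytes);
    }
}
```

For a 4x4 byte image stored row by row, extracting the 2x2 tile whose top-left element is at row 1, column 1 would use `src = image + 5`, `line_bytes = 2`, `stride_bytes = 4` and `num_lines = 2`. Without 2D support, software has to issue one linear transfer per line.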
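
The HWCE throughput figures quoted above (four features with 4-bit weights, two with 8-bit, one with 16-bit, each at 1 pixel/cycle) can be turned into a rough cycle estimate. The helper below is a back-of-the-envelope sketch: the function names are hypothetical, and it assumes the peak rate is sustained, ignoring memory stalls, setup time and border handling.

```c
#include <assert.h>
#include <stdint.h>

/* Features computed in parallel at 1 pixel/cycle, per the figures quoted
 * in the text. Returns 0 for weight widths the text does not mention. */
static int hwce_features_per_cycle(int weight_bits) {
    switch (weight_bits) {
    case 4:  return 4;
    case 8:  return 2;
    case 16: return 1;
    default: return 0;
    }
}

/* Idealized cycle count: one pass over all pixels per group of features
 * that can be processed in parallel. Assumes the peak rate is sustained. */
static uint64_t hwce_ideal_cycles(uint64_t pixels, unsigned features,
                                  int weight_bits) {
    int par = hwce_features_per_cycle(weight_bits);
    assert(par > 0);
    uint64_t passes = (features + (unsigned)par - 1u) / (unsigned)par;
    return pixels * passes;
}
```

Under these assumptions, producing 8 features over 1024 pixels takes 8192 cycles with 16-bit weights but only 2048 cycles with 4-bit weights, which is the motivation for trading weight precision against throughput.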