Meta’s new AI accelerator, MTIA 2, revealed

Meta has a new in-house AI accelerator for the company’s growing AI workloads. The new chip, called MTIA 2, or Meta Training and Inference Accelerator 2, has a different architecture than many of the other new accelerators we have seen, with some clear optimizations for scalability.

The chip is designed with an 8×8 array of processing elements, or PEs, in the center. Around the edge are other features such as the host interface, the memory controllers for the LPDDR5 memory that sits around the accelerator, and the fabric. In the diagram we can also see a control core and a decompression engine. At STH we often talk about how much of a chip is devoted to non-compute tasks, and this is a great example since the PE and non-PE areas are easy to pick out. It also shows how heavily AI-focused designs lean on memory and data movement.

The MTIA 2 accelerators are integrated into chassis in a way that is very different from what NVIDIA does with its accelerators. Each card has two 90W accelerators, so it can be air-cooled. Each accelerator has a PCIe Gen5 x8 connection to the host, so the two accelerators on a card share a x16 edge connector. There are twelve cards in each chassis, or 24 accelerators per chassis. Meta says it deploys the chassis in groups of three, for 72 accelerators, with an option for an RDMA NIC. While we generally credit AWS Nitro with sparking the interest in DPUs, Facebook/Meta made a major innovation years ago by deploying multi-host adapters at scale to reduce networking costs. In this case, the NIC-to-accelerator ratio is much lower than in NVIDIA systems.
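To put the deployment numbers in perspective, here is a quick back-of-the-envelope sketch of what one three-chassis group adds up to, using only the figures above. It counts accelerator TDP only; host CPUs, NICs, fans, and other chassis overhead are not included.

# Back-of-the-envelope totals for a three-chassis MTIA 2 group,
# using only the per-accelerator figures quoted above.
ACCEL_TDP_W = 90          # per-accelerator TDP
ACCEL_LPDDR5_GB = 128     # off-chip memory per accelerator
ACCELS_PER_CARD = 2       # two accelerators share a PCIe Gen5 x16 edge connector
CARDS_PER_CHASSIS = 12
CHASSIS_PER_GROUP = 3

accels = ACCELS_PER_CARD * CARDS_PER_CHASSIS * CHASSIS_PER_GROUP
print(f"Accelerators per group: {accels}")                                      # 72
print(f"Accelerator TDP per group: {accels * ACCEL_TDP_W / 1000:.1f} kW")       # ~6.5 kW
print(f"LPDDR5 capacity per group: {accels * ACCEL_LPDDR5_GB / 1024:.1f} TB")   # ~9 TB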

Here are Meta’s key accelerator performance specifications:

GEMM TOPS

708 TFLOPS/s (INT8) (sparsity)
354 TFLOPS/s (INT8)
354 TFLOPS/s (FP16/BF16) (sparsity)
177 TFLOPS/s (FP16/BF16)

SIMD TOPS

Vector Core:
11.06 TFLOPS/s (INT8),
5.53 TFLOPS/s (FP16/BF16),
2.76 TFLOPS/s (FP32)
SIMD:
5.53 TFLOPS/s (INT8/FP16/BF16),
2.76 TFLOPS/s (FP32)

Memory Capacity

Local Memory: 384 KB per PE
On-chip Memory: 256 MB
Off-chip LPDDR5: 128 GB

Memory Bandwidth

Local Memory: 1 TB/s per PE
On-chip Memory: 2.7 TB/s
Off-chip LPDDR5: 204.8 GB/s (Source: Meta)
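Those figures also make the earlier point about memory and data movement concrete. As a rough, roofline-style estimate (illustrative only, not a number Meta publishes), we can divide the dense INT8 throughput by the two bandwidth figures to see how much arithmetic each byte has to support before the compute units stall:

# Rough roofline-style estimate from the quoted specs; illustrative only,
# not a Meta-published figure.
dense_int8_ops_per_s = 354e12    # 354 TOPS, dense INT8
onchip_bw_bytes_per_s = 2.7e12   # 2.7 TB/s on-chip memory bandwidth
lpddr5_bw_bytes_per_s = 204.8e9  # 204.8 GB/s off-chip LPDDR5 bandwidth

print(f"Ops per byte from on-chip memory:  {dense_int8_ops_per_s / onchip_bw_bytes_per_s:.0f}")  # ~131
print(f"Ops per byte from off-chip LPDDR5: {dense_int8_ops_per_s / lpddr5_bw_bytes_per_s:.0f}")  # ~1729

In other words, data streamed from LPDDR5 needs roughly 13x more reuse per byte than data already sitting in the 256 MB on-chip memory, which fits the emphasis on PE-local memory and data movement noted above.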

Something that stands out is the memory capacity per watt of the accelerator. LPDDR5 may not offer huge bandwidth compared to HBM-based accelerators, but it does offer relatively high capacity. Meta gets 128GB of memory on a 90W TDP accelerator, or about 1.42GB/W. Compare that to the Intel Gaudi 3 we showed off this week, with 128GB of HBM2E at a 900W TDP, or about 0.142GB/W, and it is clear that Meta is targeting a different ratio of memory capacity to compute performance than many of the other chips we have seen.
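That capacity-per-watt comparison is easy to reproduce from the numbers quoted above (128GB at 90W for MTIA 2, 128GB at 900W for Gaudi 3):

# Memory capacity per watt, using the figures quoted in this article.
accelerators = {
    "Meta MTIA 2 (LPDDR5)":  (128, 90),   # (capacity in GB, TDP in W)
    "Intel Gaudi 3 (HBM2E)": (128, 900),
}
for name, (capacity_gb, tdp_w) in accelerators.items():
    print(f"{name}: {capacity_gb / tdp_w:.2f} GB/W")
# Meta MTIA 2 (LPDDR5): 1.42 GB/W
# Intel Gaudi 3 (HBM2E): 0.14 GB/W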

Final Words

Meta’s post on MTIA 2 talks about hardware and software co-design. The company has workloads large enough to justify dedicated accelerators. It also feels a bit like showing the chip is less about production impact and more about serving as a recruiting tool. Meta has been a leader in AI and buys a lot of compute and memory, so it makes sense that it is exploring different architectures. The MTIA 2 chip measures about 421mm2 and is built on TSMC 5nm, and Meta says it has about 2.35 billion gates.
