A very unusual chip design provides a distinct approach to cutting AI’s energy cost.
As the utility of AI systems has grown dramatically, so has their energy demand. Training new systems is extremely energy intensive, as it generally requires massive data sets and lots of processor time. Executing a trained system tends to be much less involved—smartphones can easily manage it in some cases. But, because you execute them so many times, that energy use also tends to add up.
Fortunately, there are lots of ideas on how to bring the latter energy use back down. IBM and Intel have experimented with processors designed to mimic the behavior of actual neurons. IBM has also tested executing neural network calculations in phase change memory to avoid making repeated trips to RAM.
Now, IBM is back with yet another approach, one that’s a bit of “none of the above.” The company’s new NorthPole processor has taken some of the ideas behind all of these approaches and merged them with a very stripped-down approach to running calculations to create a highly power-efficient chip that can efficiently execute inference-based neural networks. For things like image classification or audio transcription, the chip can be up to 35 times more efficient than relying on a GPU.
A very unusual processor
It’s worth clarifying a few things early here. First, NorthPole does nothing to help the energy demand in training a neural network; it’s purely designed for execution. Second, it is not a general AI processor; it’s specifically designed for inference-focused neural networks. As noted above, inferences include things like figuring out the contents of an image or audio clip so they have a large range of uses, but this chip won’t do you any good if your needs include running a large language model.
Finally, while NorthPole takes some ideas from neuromorphic computing chips, including IBM’s earlier TrueNorth, this is not neuromorphic hardware, in that its processing units perform calculations rather than attempt to emulate the spiking communications that actual neurons use.Advertisement
That’s what it’s not. What actually is NorthPole? Some of the ideas do carry forward from IBM’s earlier efforts. These include the recognition that a lot of the energy costs of AI come from the separation between memory and execution units. Since a key component of neural networks—the weight of connections between different layers of “neurons”—is held in memory, any execution on a traditional processor or GPU burns a lot of energy simply getting those weights from memory to where they can be used during execution.
So NorthPole, like TrueNorth before it, consists of a large array (16×16) of computational units, each of which includes both local memory and code execution capacity. So, all of the weights of various connections in the neural network can be stored exactly where they’re needed.
Another feature is extensive on-chip networking, with at least four distinct networks. Some of these carry information from completed calculations to the compute units where they’re needed next. Others are used to reconfigure the entire array of compute units, providing the neural weights and code needed to execute one layer of the neural network while the calculations of the previous layer are still in progress. Finally, communications among neighboring compute units is optimized. This can be useful for things like finding the edge of an object in an image. If the image is entered so that neighboring pixels go to neighboring compute units, they can more easily cooperate to identify features that extend across neighboring pixels.
The computing resources are unusual as well. Each unit is optimized for performing lower-precision calculations, ranging from two- to eight-bit precision. While higher precision is often required for training, the values needed during execution generally don’t require that level of exactitude. To keep those execution units in use, they are incapable of performing conditional branches based on the value of variables—meaning your code cannot contain an “if” statement. This eliminates the need for the hardware needed for speculative branch execution, and it ensures that the wrong code will be executed whenever that speculation turns out to be wrong.
This simplicity in execution makes each compute unit capable of massively parallel execution. At two-bit precision, each unit can perform over 8,000 calculations in parallel.
Software, too
Because of all these distinctive design choices, the team behind NorthPole had to develop its own training software that figures out things like the minimum level of precision that’s necessary at each layer to operate successfully.
Executing neural networks on the chip is also a relatively unusual process. Once the weights and connections of the neural network are placed in buffers on the chip, execution simply requires an external controller—typically a CPU—to upload the data it’s meant to operate on (such as an image) and tell it to start. Everything else runs to completion without the CPU’s involvement, which should also limit the system-level power consumption.
The NorthPole test chips were built on a 12 nm process, which is well behind the cutting edge. Still, they managed to fit 256 computational units, each with 768 kilobytes of memory, onto a 22 billion transistor chip. When the system was run against an Nvidia V100 Tensor Core GPU that was fabricated using a similar process, they found that NorthPole managed to perform 25 times the calculations for the same amount of power. And it could outperform a cutting-edge GPU by about fivefold using the same measure. Tests with the system showed it could perform a range of widely used neural network tasks efficiently, as well.Advertisement
While the tests were run with the NorthPole processor installed on a PCIe card, IBM told Ars that the chip is still viewed as a research prototype, and additional work would be needed to convert it into a commercial product. The company did not indicate whether it would be pursuing commercialization, though.
One of the potential limitations of the system is that it can only run neural networks that fit within its hardware. Put too many nodes in a single layer, and NorthPole cannot deal with it. But there is the possibility of splitting up layers and executing segments of them on multiple NorthPole chips in parallel. The hardware has the capacity to handle this, but it hasn’t been tested as of yet.
Perhaps the biggest limitation, however, is that this is specialized for a single category of AI task. While it’s a commonly used one, the efficiency here comes largely from designing hardware that’s a good match to the type of execution needed by inference tasks. So, while it’s good to see the effort put into dropping the power demands of some AI workloads, we’re not at the point yet where we can have a single accelerator that works for all cases.