The new system combines a traditional camera with one that registers only changes in the scene.
Elon Musk, back in October 2021, tweeted that “humans drive with eyes and biological neural nets, so cameras and silicon neural nets are only way to achieve generalized solution to self-driving.” The problem with his logic has been that human eyes are way better than RGB cameras at detecting fast-moving objects and estimating distances. Our brains have also surpassed all artificial neural nets by a wide margin at general processing of visual inputs.
To bridge this gap, a team of scientists at the University of Zurich developed a new automotive object-detection system that brings digital camera performance much closer to that of human eyes. “Unofficial sources say Tesla uses multiple Sony IMX490 cameras with 5.4-megapixel resolution that [capture] up to 45 frames per second, which translates to perceptual latency of 22 milliseconds. Comparing [these] cameras alone to our solution, we already see a 100-fold reduction in perceptual latency,” says Daniel Gehrig, a researcher at the University of Zurich and lead author of the study.
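For a rough sense of where those numbers come from, here is a quick back-of-envelope check in Python. It assumes the perceptual latency of a frame camera is roughly one frame interval, which is an approximation for illustration rather than the paper's own accounting:

```python
# Rough sanity check of the figures Gehrig quotes, assuming a frame camera's
# perceptual latency is about one frame interval. The 0.2 ms event latency
# comes up later in the article; this is an illustration, not the paper's math.
frame_interval_ms = 1000 / 45        # ~22 ms between frames at 45 fps
event_latency_ms = 0.2               # time for an event camera to register a change

print(f"frame camera perceptual latency: ~{frame_interval_ms:.0f} ms")
print(f"reduction: ~{frame_interval_ms / event_latency_ms:.0f}x")  # on the order of 100-fold
```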
Replicating human vision
When a pedestrian suddenly jumps in front of your car, multiple things have to happen before a driver-assistance system initiates emergency braking. First, the pedestrian must be captured in images taken by a camera. The time this takes is called perceptual latency—it’s the delay between the existence of a visual stimulus and its appearance in the readout from a sensor. Then, the readout needs to get to a processing unit, which adds a network latency of around 4 milliseconds.
The processing to classify the image of a pedestrian takes further precious milliseconds. Once that is done, the detection goes to a decision-making algorithm, which takes some time to decide to hit the brakes—all this processing is known as computational latency. Overall, the reaction time is anywhere from 0.1 to 0.5 seconds. If the pedestrian is running at 12 km/h, they would travel between 0.3 and 1.7 meters in this time. Your car, if you’re driving 50 km/h, would cover 1.4 to 6.9 meters. In a close-range encounter, this means you’d most likely hit them.
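Those figures follow from simple arithmetic; here is a quick sketch, assuming constant speeds over the whole reaction time:

```python
# Back-of-envelope check of the distances quoted above, assuming constant
# speeds over the entire reaction time.
def distance_m(speed_kmh: float, time_s: float) -> float:
    """Distance in meters covered at speed_kmh (km/h) over time_s (seconds)."""
    return speed_kmh / 3.6 * time_s

for reaction_s in (0.1, 0.5):
    pedestrian = distance_m(12, reaction_s)   # pedestrian running at 12 km/h
    car = distance_m(50, reaction_s)          # car traveling at 50 km/h
    print(f"{reaction_s:.1f} s reaction: pedestrian {pedestrian:.1f} m, car {car:.1f} m")

# 0.1 s reaction: pedestrian 0.3 m, car 1.4 m
# 0.5 s reaction: pedestrian 1.7 m, car 6.9 m
```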
Gehrig and Davide Scaramuzza, a professor at the University of Zurich and a co-author on the study, aimed to shorten those reaction times by bringing the perceptual and computational latencies down.
The most straightforward way to lower the former would be to use standard high-speed cameras that simply register more frames per second. But even with a 30-45 fps camera, a self-driving car would generate nearly 40 terabytes of data per hour. Fitting something that would significantly cut the perceptual latency, like a 5,000 fps camera, would overwhelm a car’s onboard computer in an instant—the computational latency would go through the roof.
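To get a feel for that trade-off, here is a rough estimate of how raw camera bandwidth scales with frame rate. It assumes uncompressed 8-bit RGB frames from a single 5.4-megapixel sensor; real vehicles carry several cameras and compress their data, so treat these as order-of-magnitude figures only:

```python
# Rough sketch of how raw camera bandwidth scales with frame rate.
# Assumes uncompressed 8-bit RGB frames (3 bytes/pixel) from a 5.4-megapixel
# sensor; real pipelines compress, and cars carry several cameras, so these
# are order-of-magnitude numbers only.
PIXELS = 5.4e6          # sensor resolution (pixels)
BYTES_PER_PIXEL = 3     # uncompressed 8-bit RGB

def terabytes_per_hour(fps: float) -> float:
    bytes_per_second = PIXELS * BYTES_PER_PIXEL * fps
    return bytes_per_second * 3600 / 1e12

for fps in (45, 5000):
    print(f"{fps:>5} fps -> ~{terabytes_per_hour(fps):.0f} TB/hour per camera")

#    45 fps -> ~3 TB/hour per camera
#  5000 fps -> ~292 TB/hour per camera
```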
So, the Swiss team used something called an “event camera,” which mimics the way biological eyes work. “Compared to a frame-based video camera, which records dense images at a fixed frequency—frames per second—event cameras contain independent smart pixels that only measure brightness changes,” explains Gehrig. Each of these pixels starts with a set brightness level. When the change in brightness exceeds a certain threshold, the pixel registers an event and sets a new baseline brightness level. All the pixels in the event camera are doing that continuously, with each registered event manifesting as a point in an image.
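Here is a minimal sketch of the per-pixel logic Gehrig describes. A real event camera runs this asynchronously in silicon, so the frame-by-frame simulation and the threshold value below are simplifications for illustration:

```python
import numpy as np

# Minimal sketch of the per-pixel event logic: each pixel keeps a baseline
# brightness, emits an event when the change relative to that baseline exceeds
# a threshold, and then resets its baseline. A real event camera does this
# asynchronously in hardware; here it is simulated frame by frame, and the
# threshold value is an assumption for illustration.
CONTRAST_THRESHOLD = 0.15  # assumed log-brightness change needed to trigger an event

def generate_events(baseline: np.ndarray, frame: np.ndarray, timestamp: float):
    """Return updated per-pixel baselines and a list of (x, y, t, polarity) events."""
    log_frame = np.log(frame.astype(np.float64) + 1e-6)
    diff = log_frame - baseline
    events = []
    ys, xs = np.where(np.abs(diff) > CONTRAST_THRESHOLD)
    for y, x in zip(ys, xs):
        polarity = 1 if diff[y, x] > 0 else -1       # brighter (+1) or darker (-1)
        events.append((int(x), int(y), timestamp, polarity))
        baseline[y, x] = log_frame[y, x]             # new baseline for this pixel
    return baseline, events
```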
This makes event cameras particularly good at detecting high-speed movement and allows them to do so using far less data. The problem with putting them in cars has been that they had trouble detecting things that moved slowly or didn’t move at all relative to the camera. To solve that, Gehrig and Scaramuzza went for a hybrid system, where an event camera was combined with a traditional one.
Building hybrid vision
“Our system is a deep learning-based object detector that uses both images and events as input and generates object detections,” says Gehrig. It works like any other automotive object-detection system, marking objects in the images with so-called bounding boxes and identifying them as cars, pedestrians, and so on. The novelty is that there are actually two detectors: one working with the images fed by the standard camera and the other with events coming from the event camera.
“Each time an image is recorded, the image-based detector will detect traffic participants, i.e. generate 2D bounding boxes. Then, with events, the event-based detector refines these bounding boxes at a high frequency, and new detections are made when pedestrians become suddenly visible. The event-based detector builds on the result already obtained by the image detector, and also finds new objects,” Gehrig explains. Basically, the event camera detects movement happening in between frames registered by the standard camera. And this makes the whole thing astoundingly fast.
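To make the division of labor concrete, here is a hypothetical sketch of how the two detectors might be interleaved, based on Gehrig's description. The class and function names are illustrative, not the authors' actual code:

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical sketch of the hybrid scheme described above: the slow
# image-based detector produces fresh bounding boxes at each frame, while the
# fast event-based detector refines them and picks up new objects in between
# frames. All names are illustrative, not the authors' actual code.

@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x, y, width, height)
    label: str                              # e.g. "pedestrian" or "car"
    score: float

def run_hybrid_detector(packets, image_detector: Callable, event_detector: Callable):
    """Yield (timestamp, detections) as frames and event batches arrive.

    `packets` is an iterable of (timestamp, kind, payload) tuples, where kind
    is "frame" (a full image) or "events" (events since the last update).
    """
    detections: List[Detection] = []
    for timestamp, kind, payload in sorted(packets, key=lambda p: p[0]):
        if kind == "frame":
            # Low-frequency pass: re-detect all traffic participants in the image.
            detections = image_detector(payload)
        else:
            # High-frequency pass: refine existing boxes and add objects that
            # became visible since the last frame, using only the sparse events.
            detections = event_detector(payload, prior=detections)
        yield timestamp, detections
```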
In the pedestrian scenario, the event camera would be the first to notice something dangerous is going on. “It will generate ‘events’ associated with the visible parts of the pedestrian, and this will happen within 0.2 milliseconds of the pedestrian appearing,” says Gehrig. Network latency between the camera and the processing unit stays the same. “While running on conventional hardware, [identifying an object] takes around 10 milliseconds,” Gehrig explains. This means a pedestrian should be recognized by the system within 14.2 milliseconds—up to 7.8 milliseconds before the Tesla’s Sony IMX490 camera would have generated the first frame with them present. Of course, the decision-making algorithm would still need a few additional milliseconds to compute a reaction.
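Summing the quoted figures gives the 14.2-millisecond total and the head start over the frame camera:

```python
# Adding up the latency figures quoted above (all values in milliseconds).
event_perceptual = 0.2    # events appear within ~0.2 ms of the pedestrian showing up
network = 4.0             # camera-to-processor transfer
computational = 10.0      # event-based detection on conventional hardware

hybrid_total = event_perceptual + network + computational        # 14.2 ms
frame_perceptual = 22.0   # quoted perceptual latency of the 45 fps frame camera

print(f"hybrid detection after {hybrid_total:.1f} ms")
print(f"head start over the first frame: {frame_perceptual - hybrid_total:.1f} ms")  # 7.8 ms
```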
Gehrig estimates that the latency of this hybrid system is comparable to a 5,000 fps high-speed camera but only needs the bandwidth of a 45 fps camera. And, according to him, this is going to get even faster.
Vision aids and neuromorphic chips
“Future work will focus on deploying this on specialized hardware to speed up the event-based detector,” Gehrig says. Because the data stream generated by event cameras is continuous and asynchronous—every pixel reports events independently—traditional processors have trouble dealing with it efficiently. But the feed from event cameras aligns particularly well with neuromorphic computing.
“A current limitation is that our system requires specialized hardware to run efficiently and fast. However, we believe that with suitable FPGA-based solutions, or neuromorphic chips like the Kraken, Speck, or Intel Loihi, we could make this a reality,” Gehrig claims.
But does this mean Musk was right, all the other big players in the industry were wrong, and vision-only autonomous driving can be pulled off? Not exactly.
Other sensors, particularly lidar, are still very much useful. According to Gehrig’s team, lidar can be integrated into their system to increase performance and reduce complexity, since the algorithms would not need to bother with estimating distances to detected objects. The same goes for lidar alternatives like the Clarity system covered by Ars three years ago, which used trigonometry to precisely calculate the distance to every pixel a camera saw. “I think our solution would be compatible with that. This would greatly enhance the accuracy of our method, since calculating distances can also give us concrete information about the shape and size of traffic participants, which is difficult to learn from events and images alone,” Gehrig says. “So our plans could include the extension with more sensors like lidars,” he adds.
Nature, 2024. DOI: 10.1038/s41586-024-07409-w