We think we'll probably have a prototype sometime next year. It basically looks like this, and it's intended to be friendly, of course, and navigate through a world built for humans, and eliminate dangerous, repetitive and boring tasks. We're setting it such that, at a mechanical level, at a physical level, you can run away from it and most likely overpower it. So hopefully that doesn't ever happen, but you never know. It's around 5'8". It has sort of a screen where the head is, for useful information, but it's otherwise basically got the Autopilot system in it, so it's got cameras, eight cameras.

What we want to show today is that Tesla is much more than an electric car company, that we have deep AI activity in hardware, on the inference level and on the training level, and basically, I think we're arguably the leaders in real-world AI as it applies to the real world. Those of you who have seen the Full Self-Driving beta can appreciate the rate at which the Tesla neural net is learning to drive.

So here I'm showing a video of the raw inputs that come into the stack, and the neural net processes that into the vector space, and you are seeing parts of that vector space rendered in the instrument cluster on the car. Now, what I find kind of fascinating about this is that we are effectively building a synthetic animal from the ground up, so the car can be thought of as an animal.

It moves around, it senses the environment and, you know, acts autonomously and intelligently, and we are building all the components from scratch in-house. So we are building, of course, all the mechanical components of the body, the nervous system, which is all the electrical components, and, for our purposes, the brain of the Autopilot, and specifically, for this section, the synthetic visual cortex. We are processing just individual images and making a large number of predictions about these images. So, for example, here you can see predictions of the stop signs, the stop lines, the lane lines, the edges, the cars, the traffic lights, the curbs, whether or not a car is parked, all of the static objects like trash cans, cones and so on, and everything here is coming out of the net, in this case out of the HydraNet. So that was all fine and great, but as we worked towards FSD, we quickly found that this is not enough. Where this first started to break was when we started to work on Smart Summon. Here I am showing some of the predictions of only the curb detection task, and I'm showing it now for every one of the cameras. We'd like to wind our way around the parking lot to find the person who is summoning the car. Now the problem is that you can't just directly drive on image-space predictions. You actually need to cast them out and form some kind of a vector space around you, so we attempted to do this using C++ and developed what we call the occupancy tracker at the time.
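Before getting to the occupancy tracker's problems, here is a minimal, hypothetical sketch of the per-image, multi-task setup described above: one shared trunk computes features once, and many small task heads branch off it. The module names, head set and shapes are illustrative assumptions, not the actual HydraNet.

```python
# Minimal sketch of a HydraNet-style multi-task network: one shared backbone
# ("trunk") processes a single camera image, and many small task heads branch
# off the shared features. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class HydraNetSketch(nn.Module):
    def __init__(self, tasks=("stop_lines", "lane_edges", "curbs", "traffic_lights")):
        super().__init__()
        # Shared trunk: stands in for a real backbone (e.g. a RegNet).
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # One lightweight head per task; each predicts a per-pixel map here.
        self.heads = nn.ModuleDict({t: nn.Conv2d(64, 1, 1) for t in tasks})

    def forward(self, image):
        features = self.trunk(image)                   # shared features, computed once
        return {t: head(features) for t, head in self.heads.items()}

model = HydraNetSketch()
preds = model(torch.randn(1, 3, 256, 512))             # one camera frame
print({t: tuple(p.shape) for t, p in preds.items()})
```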

So here we see that the curb detections from the images are being stitched up across camera seams, across camera boundaries, and over time. Now there are two major problems, I would say, with this setup. Number one: we very quickly discovered that tuning the occupancy tracker and all of its hyperparameters was extremely complicated. You don't want to do this explicitly by hand in C++; you want this to be inside the neural network and train that end to end. Number two: we very quickly discovered that the image space is not the correct output space. You don't want to make predictions in the image space; you really want to make them directly in the vector space. So, for example, here in this video I'm showing single-camera predictions in orange and multi-camera predictions in blue, and basically, if you can't predict these cars, if you are only seeing a tiny sliver of a car, then your detections are not going to be very good and their positions are not going to be good, but a multi-camera network does not have an issue. Here's another video from a more nominal sort of situation, and we see that as these cars in this tight space cross camera boundaries, there's a lot of jank that enters into the predictions, and basically the whole setup just doesn't make sense, especially for very large vehicles like this one. We can see that the multi-camera networks struggle significantly less with these kinds of predictions.
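To make "casting out" image-space predictions concrete, here is a toy sketch (not the occupancy tracker itself) that intersects the camera ray through a detected curb pixel with the ground plane to place it in an ego-centric vector space. The intrinsics, extrinsics and pixel values are made-up example numbers, and it ignores distortion, timing and everything else the real tracker had to handle, which hints at why hand-tuning that C++ pipeline got complicated.

```python
# Cast one image-space detection into the ego "vector space" by intersecting
# its camera ray with the z = 0 ground plane. All calibration values are
# assumed examples, not real camera parameters.
import numpy as np

def pixel_to_ground(u, v, K, R, t):
    """Intersect the ray through pixel (u, v) with the z = 0 ground plane.

    K: 3x3 intrinsics, R: 3x3 camera-to-ego rotation, t: camera position in the ego frame.
    Returns (x, y) in the ego frame, or None if the ray never hits the ground.
    """
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction in camera coords
    ray_ego = R @ ray_cam                               # rotate into the ego frame
    if ray_ego[2] >= -1e-6:                             # pointing at or above the horizon
        return None
    s = -t[2] / ray_ego[2]                              # scale so the ray reaches z = 0
    p = t + s * ray_ego
    return float(p[0]), float(p[1])

# Assumed forward camera: 1000 px focal length, mounted 1.4 m above the ground.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.array([[0.0, 0.0, 1.0],    # camera forward -> ego x
              [-1.0, 0.0, 0.0],   # camera right   -> ego -y
              [0.0, -1.0, 0.0]])  # camera down    -> ego -z
t = np.array([1.5, 0.0, 1.4])
print(pixel_to_ground(640, 500, K, R, t))  # a curb pixel below the horizon -> ~11.5 m ahead
```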

So here we are making predictions about the road boundaries in red, intersection areas in blue, road centers and so on. We're only showing a few of the predictions here just to keep the visualization clean. This is done by the spatial RNN, and this is only showing a single clip, a single traversal, but you can imagine there could be multiple trips through here, and basically a number of cars, a number of clips could be collaborating to build this map, effectively an HD map, except it's not in the space of explicit items, it's in the space of features of a recurrent neural network, which is kind of cool; I haven't seen that before. So here's putting everything together. This is roughly what our architecture looks like today. We have raw images feeding in at the bottom. They go through a rectification layer to correct for camera calibration and put everything into a common virtual camera. We pass them through RegNets, residual networks, to process them into a number of features at different scales. We fuse the multi-scale information with a BiFPN. This goes through a transformer module to re-represent it into the vector space, the output space. This feeds into a feature queue in time or space that gets processed by a video module like the spatial RNN, and then continues into the branching structure of the HydraNet, with trunks and heads for all the different tasks.
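To make that data flow concrete, here is a heavily simplified, hypothetical skeleton of the stack in PyTorch: rectified camera images go through a per-camera backbone, the multi-scale features are fused, a transformer re-represents them as a bird's-eye vector-space grid, a video module folds a short queue of those grids over time, and task heads branch off the result. Every module, shape and name below is an assumed stand-in for the real RegNet, BiFPN, feature queue and spatial RNN.

```python
# Hypothetical end-to-end skeleton of the described pipeline; all modules are
# crude stand-ins, and the 8-camera / 2-frame setup is purely for illustration.
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Per-camera multi-scale features (stand-in for a RegNet)."""
    def __init__(self, ch=32):
        super().__init__()
        self.s1 = nn.Conv2d(3, ch, 3, stride=4, padding=1)
        self.s2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
    def forward(self, x):
        f1 = torch.relu(self.s1(x))
        return [f1, torch.relu(self.s2(f1))]           # two feature scales

class Fusion(nn.Module):
    """Fuse multi-scale features into one map (stand-in for a BiFPN)."""
    def __init__(self, ch=32):
        super().__init__()
        self.mix = nn.Conv2d(2 * ch, ch, 1)
    def forward(self, feats):
        up = nn.functional.interpolate(feats[1], size=feats[0].shape[-2:])
        return torch.relu(self.mix(torch.cat([feats[0], up], dim=1)))

class BEVTransformer(nn.Module):
    """Learned vector-space (bird's-eye) queries cross-attend to all camera features."""
    def __init__(self, ch=32, grid=20):
        super().__init__()
        self.grid = grid
        self.queries = nn.Parameter(torch.randn(grid * grid, ch))
        self.attn = nn.MultiheadAttention(ch, num_heads=4, batch_first=True)
    def forward(self, cam_feats):                      # list of (B, C, H, W), one per camera
        tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in cam_feats], dim=1)
        q = self.queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        out, _ = self.attn(q, tokens, tokens)
        return out.transpose(1, 2).reshape(-1, out.shape[-1], self.grid, self.grid)

class VideoModule(nn.Module):
    """Fold a short queue of BEV features over time (stand-in for the spatial RNN)."""
    def __init__(self, ch=32):
        super().__init__()
        self.cell = nn.Conv2d(2 * ch, ch, 3, padding=1)
    def forward(self, bev_seq):                        # (B, T, C, H, W)
        state = torch.zeros_like(bev_seq[:, 0])
        for t in range(bev_seq.shape[1]):
            state = torch.tanh(self.cell(torch.cat([state, bev_seq[:, t]], dim=1)))
        return state

backbone, fusion = Backbone(), Fusion()
bev_xform, video = BEVTransformer(), VideoModule()
heads = nn.ModuleDict({t: nn.Conv2d(32, 1, 1) for t in ("road_edges", "vehicles")})

frames = torch.randn(2, 8, 3, 128, 256)                # (time, cameras, rectified RGB)
bev_per_frame = []
for t in range(frames.shape[0]):
    cam_feats = [fusion(backbone(frames[t, c:c + 1])) for c in range(frames.shape[1])]
    bev_per_frame.append(bev_xform(cam_feats))
trunk = video(torch.stack(bev_per_frame, dim=1))       # temporally fused vector space
outputs = {name: head(trunk) for name, head in heads.items()}
print({k: tuple(v.shape) for k, v in outputs.items()})
```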

So here we're trying to do a lane change. In this case, the car needs to do two back-to-back lane changes to make the left turn up ahead. For this, the car searches over different maneuvers. The first one it searches is a lane change that's close by, but the car brakes pretty harshly, so it's pretty uncomfortable. The next maneuver it tries is the lane change a bit later, so it speeds up, goes behind the other car, goes in front of the other cars, and finally does the lane change, but now it risks missing the left turn. We do thousands of such searches in a very short time span; because these are all physics-based models, these futures are very easy to simulate, and in the end we have a set of candidates and we finally choose one based on the optimality conditions of safety, comfort and easily making the turn. So now the car has chosen this path, and you can see that as the car executes this trajectory, it pretty much matches what we had planned. The cyan plot on the right side here is the actual velocity of the car, and the white line underneath it was the plan, so we are able to plan for 10 seconds here and able to match that when we see it in hindsight, so this is a well-made plan.
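As a toy illustration of that maneuver search, the sketch below rolls out a handful of cheap constant-acceleration candidates ("wait a bit, then adjust speed"), scores each on comfort, a crude gap check against a lead car, and whether the ego overshoots the turn, then keeps the cheapest. All constants and cost terms are assumptions for illustration, not the planner's actual models.

```python
# Toy maneuver search: simulate cheap physics-based candidates and pick the
# lowest-cost one. Numbers and cost terms are made up.
import itertools

V0 = 20.0                       # ego speed, m/s (assumed)
LEAD_X0, LEAD_V = 15.0, 18.0    # lead car in the target lane (assumed)
TURN_X = 180.0                  # distance to the left turn, m (assumed)
DT, HORIZON = 0.1, 10.0

def rollout(delay, accel):
    """Ego positions under a simple constant-acceleration model applied after `delay`."""
    x, v, xs = 0.0, V0, []
    for i in range(int(HORIZON / DT)):
        a = accel if i * DT >= delay else 0.0
        v = max(0.0, v + a * DT)
        x += v * DT
        xs.append(x)
    return xs

def cost(delay, accel):
    xs = rollout(delay, accel)
    lead = [LEAD_X0 + LEAD_V * i * DT for i in range(len(xs))]
    comfort = abs(accel)                                        # harsh braking/accel hurts comfort
    safety = 100.0 if any(lx - x < 5.0 for x, lx in zip(xs, lead)) else 0.0
    missed_turn = 10.0 if xs[-1] > TURN_X else 0.0              # sped right past the turn
    return comfort + safety + missed_turn

candidates = list(itertools.product([0.0, 1.0, 2.0, 3.0],              # delay before acting, s
                                    [-3.0, -2.0, -1.0, 0.0, 1.0]))     # acceleration, m/s^2
best_delay, best_accel = min(candidates, key=lambda c: cost(*c))
print(f"chosen maneuver: wait {best_delay}s, then accelerate at {best_accel} m/s^2")
```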

So a single car driving through some location can sweep out some patch around the trajectory using this technique, but we don't have to stop there. Here we collected different clips from the same location, from different cars maybe, and each of them sweeps out some part of the road. The cool thing is we can bring them all together into a single giant optimization. So here these 16 different trips are aligned using various features such as road edges and lane lines; all of them should agree with each other and also agree with all of their image-space observations. Together, this produces an effective way to label the road surface, not just where the car drove, but also in other locations that it hasn't driven. We don't have to stop at just the road surface. We can also arbitrarily reconstruct 3D static obstacles. Here, this is a reconstructed 3D point cloud from our cameras. The main innovation here is the density of the point cloud. Typically, these points require texture to form associations from one frame to the next frame, but here we are able to produce these points even on textureless surfaces like the road surface or walls, and this is really useful to annotate arbitrary obstacles that we can see in the scene, in the world. Combining everything together, we can produce these amazing datasets that annotate all of the road texture, the static objects and all of the moving objects, even through occlusions, producing excellent kinematic labels.
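The sketch below gives a toy version of the alignment idea: two clips that observed the same lane-line points in their own ego frames are brought into agreement by solving for a rigid transform between them (here a closed-form 2D Kabsch fit on made-up, pre-matched points). The real auto-labeling jointly optimizes many trips' poses and features against their image-space observations; this is only a minimal stand-in.

```python
# Toy multi-trip alignment: fit a rigid 2D transform that makes one clip's
# observed lane-line points agree with a reference clip's. Data is synthetic.
import numpy as np

def align_2d(src, dst):
    """Least-squares rotation R and translation t with R @ src + t ≈ dst (Kabsch)."""
    cs, cd = src.mean(axis=1, keepdims=True), dst.mean(axis=1, keepdims=True)
    U, _, Vt = np.linalg.svd((dst - cd) @ (src - cs).T)
    R = U @ np.diag([1.0, np.linalg.det(U @ Vt)]) @ Vt   # guard against reflections
    return R, cd - R @ cs

# Made-up "lane line" points seen by a reference clip...
ref = np.array([[0.0, 5.0, 10.0, 15.0, 20.0],
                [3.5, 3.5, 3.6, 3.7, 3.9]])
# ...and the same points seen by another clip whose ego frame is shifted and rotated.
theta = 0.05
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
other = R_true.T @ (ref - np.array([[2.0], [-0.5]])) + np.random.normal(0, 0.05, ref.shape)

R, t = align_2d(other, ref)
print("mean residual after alignment (m):", float(np.abs(R @ other + t - ref).mean()))
```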

If we put all of it together, we get our training-optimized chip, the D1 chip. This was entirely designed by a Tesla team internally, all the way from the architecture to GDS-out and the package. This chip has GPU-level compute with CPU-level flexibility and twice the network-chip-level I/O bandwidth. But we didn't stop here. We integrated the entire electrical, thermal and mechanical pieces to form our training tile, fully integrated, interfacing with a 52-volt DC input. It's unprecedented. This is an amazing piece of engineering. Our compute plane is completely orthogonal to power supply and cooling; that makes high-bandwidth compute planes possible. What it is, is a nine-petaflop training tile.

https://www.youtube.com/watch?v=p_ubwnA9428