Collaborate on Hardware Bill Format for Education with Ease Using airSlate SignNow

See your invoicing process become fast and effortless. With just a few clicks, you can complete all the necessary actions on your hardware bill format for Education and other important files from any device with internet access.

Award-winning eSignature solution

Send my document for signature
Get your document eSigned by multiple recipients.

Sign my own document
Add your eSignature to a document in a few clicks.

Move your business forward with the airSlate SignNow eSignature solution

Add your legally binding signature

Create your signature in seconds on any desktop computer or mobile device, even while offline. Type, draw, or upload an image of your signature.

Integrate via API

Deliver a seamless eSignature experience from any website, CRM, or custom app — anywhere and anytime.

Send conditional documents

Organize multiple documents in groups and automatically route them to recipients in a role-based order.

Share documents via an invite link

Collect signatures faster by sharing your documents with multiple recipients via a link — no need to add recipient email addresses.

Save time with reusable templates

Create unlimited templates of your most-used documents. Make your templates easy to complete by adding customizable fillable fields.

Improve team collaboration

Create teams within airSlate SignNow to securely collaborate on documents and templates. Send the approved version to every signer.

See airSlate SignNow eSignatures in action

Create secure and intuitive eSignature workflows on any device, track the status of documents right in your account, build online fillable forms – all within a single solution.

Try airSlate SignNow with a sample document

Complete a sample document online. Experience airSlate SignNow's intuitive interface and easy-to-use tools in action. Open a sample document to add a signature, date, text, upload attachments, and test other useful functionality.

  • Checkboxes and radio buttons
  • Request an attachment
  • Set up data validation

airSlate SignNow solutions for better efficiency

Keep contracts protected
Enhance your document security and keep contracts safe from unauthorized access with dual-factor authentication options. Ask your recipients to prove their identity before opening a contract related to your hardware bill format for Education.
Stay mobile while eSigning
Install the airSlate SignNow app on your iOS or Android device and close deals from anywhere, 24/7. Work with forms and contracts even offline and eSign your hardware bill format for Education later when your internet connection is restored.
Integrate eSignatures into your business apps
Incorporate airSlate SignNow into your business applications to quickly work on your hardware bill format for Education without switching between windows and tabs. Benefit from airSlate SignNow integrations to save time and effort while eSigning forms in just a few clicks.
Generate fillable forms with smart fields
Update any document with fillable fields, make them required or optional, or add conditions for them to appear. Make sure signers complete your form correctly by assigning roles to fields.
Close deals and get paid promptly
Collect documents from clients and partners in minutes instead of weeks. Ask your signers to eSign your hardware bill format for Education and add a charge request field to your sample to automatically collect payments during the contract signing.
Collect signatures 24x faster
Reduce costs by $30 per document
Save up to 40 hours per employee per month

Our user reviews speak for themselves

Kodi-Marie Evans
Director of NetSuite Operations at Xerox
airSlate SignNow provides us with the flexibility needed to get the right signatures on the right documents, in the right formats, based on our integration with NetSuite.
Samantha Jo
Enterprise Client Partner at Yelp
airSlate SignNow has made life easier for me. It has been huge to have the ability to sign contracts on-the-go! It is now less stressful to get things done efficiently and promptly.
Megan Bond
Digital marketing management at Electrolux
This software has added to our business value. I have got rid of the repetitive tasks. I am capable of creating the mobile native web forms. Now I can easily make payment contracts through a fair channel and their management is very easy.

Why choose airSlate SignNow

  • Free 7-day trial. Choose the plan you need and try it risk-free.
  • Honest pricing for full-featured plans. airSlate SignNow offers subscription plans with no overages or hidden fees at renewal.
  • Enterprise-grade security. airSlate SignNow helps you comply with global security standards.

Explore how to simplify your process on the hardware bill format for Education with airSlate SignNow.

Looking for a way to optimize your invoicing process? Look no further, and adhere to these simple guidelines to easily collaborate on the hardware bill format for Education or ask for signatures on it with our user-friendly platform:

  1. Create an account by starting a free trial and log in with your email credentials.
  2. Upload a file of up to 10MB that you need to eSign from your device or cloud storage.
  3. Proceed by opening your uploaded invoice in the editor.
  4. Take all the required actions with the file using the tools from the toolbar.
  5. Click on Save and Close to keep all the modifications performed.
  6. Send or share your file for signing with all the necessary addressees.

The hardware bill format for Education workflow has just become more straightforward! With airSlate SignNow’s user-friendly platform, you can easily upload and send invoices for electronic signatures. No more generating a printout, signing manually, and scanning. Start our platform’s free trial and see how it enhances the whole process for you.

How it works

Upload a document
Edit & sign it from anywhere
Save your changes and share

airSlate SignNow features that users love

Speed up your paper-based processes with an easy-to-use eSignature solution.

Edit PDFs online
Generate templates of your most used documents for signing and completion.
Create a signing link
Share a document via a link without the need to add recipient emails.
Assign roles to signers
Organize complex signing workflows by adding multiple signers and assigning roles.
Create a document template
Create teams to collaborate on documents and templates in real time.
Add Signature fields
Get accurate signatures exactly where you need them using signature fields.
Archive documents in bulk
Save time by archiving multiple documents at once.

Get legally-binding signatures now!

FAQs

Here is a list of the most common customer questions. If you can’t find an answer to your question, please don’t hesitate to reach out to us.

Need help? Contact support

What active users are saying — hardware bill format for education

Get access to airSlate SignNow’s reviews, our customers’ advice, and their stories. Hear from real users and what they say about features for generating and signing docs.

Everything has been great, really easy to incorporate...
5
Liam R

Everything has been great, really easy to incorporate into my business. And the clients who have used your software so far have said it is very easy to complete the necessary signatures.

Read full review
I couldn't conduct my business without contracts and...
5
Dani P

I couldn't conduct my business without contracts and this makes the hassle of downloading, printing, scanning, and reuploading docs virtually seamless. I don't have to worry about whether or not my clients have printers or scanners and I don't have to pay the ridiculous drop box fees. Sign now is amazing!!

Read full review
airSlate SignNow
5
Jennifer

My overall experience with this software has been a tremendous help with important documents and even simple task so that I don't have leave the house and waste time and gas to have to go sign the documents in person. I think it is a great software and very convenient.

airSlate SignNow has been a awesome software for electric signatures. This has been a useful tool and has been great and definitely helps time management for important documents. I've used this software for important documents for my college courses for billing documents and even to sign for credit cards or other simple task such as documents for my daughters schooling.

Read full review

Related searches to Collaborate on hardware bill format for Education with ease using airSlate SignNow

Hardware bill format for education pdf
Simple hardware bill format for education
Hardware bill format for education pdf download
Hardware bill format for education excel
Hardware bill format for education free download
Free hardware bill format for education
Education invoice template
Tutoring invoice template free

Hardware bill format for Education

LUIS CEZE: All right, welcome everyone. It's really a pleasure here to host our distinguished lecture series with Bill Dally today. So Bill Dally is a real computer architect. He's built chips in academia and in industry. He has taught at Caltech, MIT, and then Stanford before going to NVIDIA as chief scientist and head of research there. He has done pioneering work in a bunch of techniques in parallel computing that you probably use today. And it's nice to see that with AI today, like this is what's enabled parallel-- what enables AI is parallel computing. So he's a member of the National Academy-- I'm going to have to read here because this is so long so-- National Academy of Engineering, fellow of the AAAS, fellow of the IEEE and the ACM, and also won all of the major computer architecture awards, including the Eckert-Mauchly Award, the highest honor we have in computer architecture, the Maurice Wilkes Award, and the Seymour Cray Award. I'll keep it short so we get to hear from him, which is what you came here for. But one thing I'm not going to forget to mention is that besides being a major computer architect, super accomplished, he's also a pilot, and he flew here in his own airplane. And he did not land in the water this time, right? You landed on land. Anyways, all right, Bill, thank you for coming. It's really great to have you here. BILL DALLY: Thanks, Luis. Yeah, it's a pleasure being here. And it's a really exciting time to be a computer architect and working on deep learning. I'll try to share some things that have happened with you. So like most people these days, when I sat down to do this talk, I started by just asking ChatGPT to do a talk for me. So this is the transcript. So, what would Bill-- actually, first, I started by saying, what should be in a talk about deep learning hardware? And it gave me a whole bunch of really bogus stuff. So I said, OK, this is not right. I had to-- it's all about prompt engineering. I had to say, what would Bill Dally say about deep learning hardware? And this actually would be a plausible talk, but it's not what I'm going to talk about. But I thought that was kind of neat. If you want to sort of dig a little deeper-- and I always like to put current material in these talks-- this was in the Wall Street Journal yesterday, which is sort of an interview I did with the Journal reporter about how we're going to continue scaling deep learning, which is really what this talk is about. But I'm going to go into a little bit more technical detail than what the reporter was able to understand. And, by the way, this is a photo of our Grace Hopper superchip. That's an H100 GPU on the left, and the Grace multi-core CPU on the right. And, actually, together, they're a pretty awesome thing, although being a parallel computing bigot, the major purpose of that CPU is to provide a lot more high bandwidth memory to the GPU. It's a good memory controller. So if we are playing with things like the large language models that everybody is using today, what I like to think is that what they're really doing is distilling data into value, which is really what we-- one of the major goals of computer science is to do that. So we start, typically, with an enormous amount of data. People will typically train these models using everything they can scrape off the internet: 10 to the 12th tokens-- a token is roughly equivalent to a word; you usually wind up with a ratio of 1.5 to 2 tokens per word when you run it through the tokenizer-- and maybe a billion images. That's the kind of dataset you have.
And what you want to do is get value out of this. And value is being able to answer questions or being able to make a person more productive in their job, helping a teacher give more personal attention to students by having a teacher's assistant, helping a doctor make better diagnoses. Large language models are being used for all of these things. And so the way to think about it is, much like a student, you have to sort of send your large language model to undergraduate school. And that's where you take this trillion tokens. And actually, I did this slide some time ago. The big models are now being trained on 10 trillion tokens or more. Your trillion-token dataset-- and you spend a pretty big check for some time on AWS. And by the way, that's the cost if you were to buy the GPUs. On AWS, they capitalize them such that if you rent them for three months, you've paid for them, but they won't give them to you. And you run a training job. And so now you have a trained model. And so this is sort of a large language model that's been to undergraduate school, has a general liberal arts education about all of the tokens on the internet. But typically, that is not what you want. You want a model that's specialized for medicine or specialized for teaching. And so you need to sort of send your model to graduate school. And in graduate school, you typically do a combination of things. One is that you'll do what people will call pretuning, which is just continuing the training process, but on special data. We have a model we use within NVIDIA that makes our hardware designers more productive that we call ChipNeMo. NeMo is our general large language model. And ChipNeMo has been pretrained on 24 billion tokens that we basically scraped from the architecture documents, Verilog code, scripts, and everything that we've been using to design GPUs since the mid-1990s. And probably the most productive use of ChipNeMo was saving senior designer time because the senior designers spend a lot of time answering questions from junior designers, like, how does this texture unit work anyway? And they can ask that question to-- ChipNeMo would actually give them a pretty cogent answer because it's read all of the architecture documents about the texture unit. And so you go through that part with special data, but then you also-- it's really important to make it something that people respond well to. So you like to finetune it further by having it generate a couple of responses to a query and then having humans grade those responses. By giving feedback from those human responses, you can wind up getting the model producing responses that people like better. And the human in the human-feedback reinforcement learning is hugely important to getting good quality out of this. What's interesting is that while starting out and giving the model the liberal arts education is pretty expensive, people can finetune-- and I put 100k to $1 million here. That's probably really high-- people have done very effective jobs finetuning for a few thousand dollars of GPU time. So now you have your specialized model, and now you get to where the rubber meets the road. And in the long term, the bulk of computation for deep learning happens on the inference side. It's interesting-- early on in these explorations, there's a lot of effort in training because people are experimenting a lot.
But once they get the formula right, they'll retrain these models periodically, but they'll train a model and they'll do inference on it for months, if not longer, before they go back and do the whole process again. And so an enormous amount of stuff is running on the inference side. Even though each query takes maybe a second of GPU time, you're doing enough of it that it actually swamps the months you spent training over on the other side. The recent addition to this slide-- and I kind of shoehorned it in here-- is this retrieval button. And this is probably something that's become really commonplace in maybe the last four or five months, which is to do retrieval augmented generation, where, sort of before I run the large language model inference, I take my query. I actually use it to query a database, pull up all the relevant documents. And so for ChipNeMo, we go back to this original 24-billion-token database of documents, pull up all those architecture documents, and then we feed them, along with the original query, into the input window of the transformer and then run the inference code on it. And this prevents your large language model from using psychedelic drugs and hallucinating. It basically grounds what it's doing in actual documents. It can give you references to documents that are real and greatly improves its accuracy. So let me start with some motivation, and we'll talk about how we got where we are. And then the more exciting part of the talk is where we're going from here. So some motivation and a little bit of history-- so the current revolution in deep learning was enabled by hardware. I like to think of deep learning as having three ingredients. They're the algorithms, and being a person who used to play with cars-- actually, I still play with cars-- I like to think of this as fuel, air, and spark. So the fuel here is sort of the algorithms. So most of these algorithms, in at least their basic form, have been around since the 1980s. So deep neural networks, convolutional networks, training them with stochastic gradient descent and backpropagation, all that existed in the '80s-- in some cases, far earlier. The next ingredient, the air, as it were, is data. And large labeled datasets-- for unsupervised data, the data has been around forever. But for supervised data, large labeled datasets like Pascal and ImageNet have been around since the early 2000s, so at least 10 years before the real spark ignited. But the spark that sort of ignited the fuel and air mixture here was enough computing to train a large enough model on a large enough amount of data in a reasonable amount of time. In this case, the original ImageNet dataset was-- I think it was on the order of 10 million images. The large enough model was AlexNet. And I'll show in a minute what the training demands of that were on the next slide. And the reasonable amount of time was two weeks-- it took two weeks, actually, on a pair of Fermi generation GPUs to do that. Now since then, progress in deep learning has been gated by how much compute power we have to apply to it. So if you think about what the training time of AlexNet was, it basically was 1/100 of a petaflop day-- that is, we had a petaflop machine, and you'd spend 1/100 of a day on it. I guess that ends up being about a quarter of an hour-- 15 minutes-- on a petaflop machine. And during this sort of ConvNet period, from 2012 to 2016, we went up about two orders of magnitude. So we're up to about a petaflop day by the time we got to ResNet.
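As a quick sanity check on the petaflop-day arithmetic above, the short sketch below (not part of the talk) converts training budgets into wall-clock time. The training budgets are the talk's rough figures; the sustained-throughput values for the machines and the helper name `wall_clock_days` are illustrative assumptions.

```python
# Back-of-the-envelope check of the petaflop-day figures quoted above.
# Training budgets are the talk's approximate numbers; machine throughputs
# are illustrative assumptions.

def wall_clock_days(train_pflop_days: float, sustained_pflops: float) -> float:
    """Days of wall-clock time to finish `train_pflop_days` of work on a
    machine sustaining `sustained_pflops` petaflops."""
    return train_pflop_days / sustained_pflops

# AlexNet: ~1/100 of a petaflop-day.  On a 1 PFLOP/s machine that is
# 0.01 day ~= 14.4 minutes ("about a quarter of an hour").
alexnet_days = wall_clock_days(0.01, 1.0)
print(f"AlexNet: {alexnet_days * 24 * 60:.1f} minutes on a 1 PFLOP/s machine")

# GPT-4 (talk's estimate): ~1e6 petaflop-days.  Even on a hypothetical machine
# sustaining 10,000 PFLOP/s (10 EFLOP/s), that is ~100 days of training.
gpt4_days = wall_clock_days(1e6, 10_000)
print(f"GPT-4:   {gpt4_days:.0f} days on a 10 EFLOP/s machine")

# The overall jump quoted in the talk: 1/100 PF-day -> 1e6 PF-days.
print(f"Growth in training compute: {1e6 / 0.01:.0e}x")  # ~1e8
```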
Since we started on the large language models with BERT in 2018, we've been going up about an order of magnitude a year. GPT-4-- my estimate is up in the corner. OpenAI has not published the details of that model, but sort of scraping various rumors off the web, I estimated it being about a 10 to the 6th petaflop day-- about a million petaflop days to train GPT-4 in 2023. And we are very hard-pressed-- this is an increase in 10 to the 8th in compute demand of the state-of-the-art deep learning models over about a decade. And we are very pressed to provide that. So what I'm going to talk about for the next phase of the talk is how we provided about 1,000-fold of this by increasing the performance of individual GPUs and then the other 10 to the 6-fold has come from scaling up the number of GPUs and the amount of time spent training. So let me give a little bit of history. So here is the curve I like to call Huang's law after our founder of NVIDIA Jensen Huang, which is that, on deep learning, our inference performance has basically been doubling every year for the last decade, starting with a performance of about four. This is INT8 tops on single chip inference in the Kepler generation and with 4,000 in the Hopper generations. Over 10 years, a 1,000x increase in performance. If it was doubling every year, it'd be 1,024. So how did we do this? And by the way-- actually, I'll explain this in the next slide. The main gains are listed here sort of in order of their contribution. The biggest gain is by using smaller numbers. So when we started out with Kepler, Kepler basically wasn't designed for deep learning. It was designed for scientific computing, where we needed FP64, and graphics where we needed FP32. And so people were doing inference, which you can get by with INT8, in FP32. And so your numbers are four times as big. But the expensive arithmetic operations are the multiplies, and they scale quadratically with the number of bits. And so it was a 16x increase in cost, not a 4x, being four times as big. At ISCA, I think it was in 2017, Dave Patterson gave a talk about the Google TPU V1 and claimed that it was because it was so specialized that it was more efficient than NVIDIA GPUs. But, essentially, the entire performance advantage-- I should say energy efficiency advantage of the TPU V1 can be attributed to it doing INT8 operations, and him comparing it-- despite the fact he was doing this in 2017 when he could have compared it against Volta or even Turing, he was comparing it against Kepler, which was a five-year earlier part in FP32. So it was all about that 16x. The next big gain came from doing complex instructions. And I'll have a slide on this. I'll go into a little bit more detail. But even with the very simplified pipeline of a GPU, where we don't do any branch prediction, we don't do any out-of-order execution, there's still the cost of executing an instruction is about a factor of 20 greater than the cost of the arithmetic in that instruction, especially-- and at these lower precisions, it's worse. That's for FP16. And so to basically amortize that, you want to do a lot of work with one instruction. It turns out, you don't want risk. You want complex instructions to do a lot of work to amortize the cost of that instruction. And so the next factor of 12.5 came from adding complex instructions. We went from an FMA, which is the biggest instruction we had in Kepler, to a four-element dot product, DP4, in our Pascal generation. 
And then our matrix multiply instructions, HMMA for Half-precision Matrix Multiply Accumulate, and IMMA for 8-bit Integer Matrix Multiply Accumulate. I'll give a little more detail on these. And that basically amortized out a bunch of the overhead, gave us another 12.5x. There are four generations of process technology represented here, and they're color-coded-- black, green, blue, and such. So Kepler and Maxwell were 28-nanometer. Pascal, Volta, and Turing are 16. Ampere is 7 and Hopper is 5. But that huge jump down from 28 to 5 on the math operations that matter gave us about 2.5x. And I have my own internal spreadsheet where I've been tracking that. So we're not getting much from where people traditionally have gotten a lot of performance improvement, which is from better process technology. It's been mostly from better architecture. And then the other architectural contribution was sparsity. And this is the one when I talked about where we're going in the future is where I think we have a lot more that we can gain going forward. Right now, we're exploiting 2-to-1 sparsity on weights only. We can export much higher levels of sparsity, and we can exploit it on activations as well. I should also add that the algorithm people have done a good job as well. I think there's another 1,000x, which has come from just more efficient models over the years. And one of the great examples I have of this is during sort of the ConvNet days, when everybody's competing on the ImageNet competition, going from VGGNet to GoogleNet, the GoogleNet was just a much more efficient network. They got rid of the fully-connected backend. They used separable convolutions and bypass, and a lot of things that just made the network orders of magnitude more efficient. So they able to get a big step up in performance without nearly as many increases in operations as they otherwise would need. So there's been a lot of models as well. So let me start by talking about complex instructions and why they're important. So like I said, even for a GPU which has an extremely simple pipeline, the overhead factor is about 20. For a complex, out-of-order CPU doing something like an FP16 operation, that overhead factor is more like 1,000, which is one reason why you don't want to be doing deep learning on CPUs, at least not unless you're doing it all with the MMX instructions. And so if all we're doing is a fused multiply adds, we're doing two arithmetic operations, and we have amortize that out, all of our energy is going into overhead, kind of like a big corporation. If we, instead, can at least do eight operations-- so a dot product 4 is eight operations. It's four multiplies, and then you sum that together into another element. So four multiplies, four adds-- the payload energy is now up to six. The overhead energy is the same. So it's only a 5x overhead. We're getting better, but we're still spending more energy on overhead than on operations. In Volta, we introduced what the marketing people called tensor cores. What it really is an instruction that does matrix multiply. HMMA is Half-precision Matrix Multiply Accumulate. It takes two FP16 4 by 4 matrices, does a matrix multiply. So you go across the rows and columns and every element hits every other [INAUDIBLE] n-cubed operation, so it's 4 times 4 times 4-- 64 multiplies. You have to add all those up, and then you add it into an FP32 4 by 4 matrix. So it's 128 operations. The total energy of the operations is 110. I think I lost a projector. And so the overhead now is only 22%. 
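To make the amortization argument above concrete, here is a small sketch of my own, not the talk's. The per-instruction overhead of 30 units and the FMA payload of 1.5 units are arbitrary relative energies chosen to reproduce the quoted ratios (roughly 20x for an FMA, 5x for DP4A, and about 20% overhead for HMMA); the DP4A and HMMA payload values follow the figures mentioned in the talk.

```python
# Illustrative energy accounting for instruction-overhead amortization.
# All energies are relative units, not measured values.

INSTRUCTION_OVERHEAD = 30.0  # assumed fixed cost of fetch/decode/operand delivery

# (instruction, math ops per instruction, assumed payload energy)
cases = [
    ("FMA  (2 ops, FP16)",     2,   1.5),   # payload chosen so overhead ~20x
    ("DP4A (8 ops, INT8)",     8,   6.0),   # talk: payload energy ~6, ~5x overhead
    ("HMMA (128 ops, FP16)", 128, 110.0),   # talk: payload energy ~110
]

for name, ops, payload in cases:
    ratio = INSTRUCTION_OVERHEAD / payload                     # overhead vs. math
    share = INSTRUCTION_OVERHEAD / (INSTRUCTION_OVERHEAD + payload)
    print(f"{name}: {ops:3d} ops, overhead = {ratio:4.1f}x the math, "
          f"{share:5.1%} of total instruction energy")
```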
And in Turing, we introduced the integer version of that, IMMA, which takes two 8 by 8 INT8 matrices, multiplies them and sums it up. And actually, in Hopper, we have the quarter precision matrix multiply accumulate does FP8. And so the point to be made here, before I go forward, is that once we have these large instructions, our completely programmable GPU is as efficient at deep learning as a hard-wired accelerator, like a TPU or one of the other special-purpose chips people have built because they don't have zero overhead. They still have to move numbers around. Everything is not in their matrix multiply. And their overhead is most likely in the 15% to 20% as well. So, at this point in time, we're as efficient as a dedicated accelerator. We have all the advantages of a programmable engine with a very mature programming system behind it and a set of libraries that has been built up over decades. AUDIENCE: And that's energy overhead, right? So there's also [INAUDIBLE]. BILL DALLY: That's right, yeah. Energy is a good proxy, though. We got our projector back over there. So where are we today? So here, today, we have Hopper. It's a petaflop of TensorFloat-32. It's 1 or 2 petaflops of FP16 or bfloat16 depending on whether you're dense or sparse-- and I'll explain the sparsity in a minute-- and 2 or 4 petaflops of FP8 or INT8, depending on whether you're dense or sparse. It's got over three terabytes per second of memory bandwidth to 94, and that 94 is correct, 96 gigabytes of HBM3 memory. It's got 18 NVLINK ports, which give us 900 gigabits per second of bandwidth off this chip and 700 watts. And, unfortunately, it's illegal to ship this part to China, which is actually the wrong thing to do because all the export restricting this part does is it causes Chinese programmers to write code for the Huawei parts instead, which does not benefit the US in one iota. It's got some neat things. Actually, one thing I'm very proud of is I wrote the proposal for the dynamic programming instructions that are in this part to accelerate bioinformatics code. So if you want to do dynamic programming to do gene sequence matching, it just screams at doing that. The figure of merit-- I'll come back to this when I talk about some experiments we've done on accelerators-- is 9 teraops per watt. And that's either INT8 or FP8 math. And that's a sort of a way of comparing different deep learning solutions to see how efficient they are. So that's how we got 1,000x. But we need 10 to the 8x. Where does the rest of it come from? And the rest of it comes from using multiple GPUs. And, in fact, you have to use multiple GPUs because the models don't fit in 94 gigabytes. GPT-4 is a 1.2 trillion-parameter model. And when we're doing training, it takes us about 20 bytes per parameter because we not just have to hold that parameter, but we have to hold the momentum coefficients for that parameter, and depending on what training algorithm we're using and a bunch of other overhead. And so to hold one copy of GPT-4 takes us about 20 GPUs. And so we take the work, and we divide the parallelism up into what's called model parallelism-- basically, a bunch of GPUs working on one copy of the model. And we split that in two dimensions. We have what we call tensor parallel, where in what I'm showing in the x-dimension here is we take the individual matrices, and we slice them up. And we usually slice them in just one dimension. So we'll slice them, for example, into column strips. 
And then we do the operations and put the results back together. And the other direction is pipeline parallel, where we'll basically take the different layers of the network, and we'll put them on different GPUs and forward the results from one layer to the next. And it isn't that the earlier layers idle. We start them on the next batch of training data while the later layers process. And then, after we've done that, to continue to scale up, we do data parallelism. We run separate copies of the model, and we'll take a batch of training data, split that batch across these different copies of the model, have them all train on that batch, and then they'll exchange the weight updates so everybody has the same set of weights for the next iteration. So we build a bunch of hardware to do this. The starting building block is the HDX server. It's 8 H100s and four of our NV switches, which switch the NVLINK interconnect coming out of the H100. I won't go through the details. But at the end, it's 32 petaflops of compute at 11 kilowatts and 900 gigabytes per second coming out the back panel. In addition to the four NV switches on there, if we want to connect more of those together, we actually make a pizza box with two NV switches in it, and you can use this with active optical cables to connect up larger systems. So what we can tend to what we tend to drop on customer fours is something that looks like this, where it's a DGX superpod. Each one of the little gold front panel plates you see there is an HGX100. It's 8 GPUs. We can connect a bunch of those together with NVLINK or, at some point in time, we split that network and start going with InfiniBand-- those are InfiniBand quantum switches in the middle there-- and cable this all together. And what's nice about this are a couple of things. First of all, the software is preconfigured. So if you buy one of these from NVIDIA, get this thing on your floor, connect it up to all the power, plug all the active optical cables together, you turn it on, you can be training deep learning models in an hour, whereas if you tried to get all the network configuration up and running and everything tuned from scratch, it would probably take you a month. I used to build a lot of supercomputers with Cray and other companies. But from the time we had that thing fully assembled on the floor at Los Alamos or Oak Ridge to when they were running a useful problem, it was typically six months because of that bring up and tuning process. And that's basically by having a preconfigured system where that's already been solved, you can turn it on, and it works because that tuning process was not debugging the hardware. The hardware was working. It was getting the software all together. AUDIENCE: Is it fair to say this machine is more focused on training because of [INAUDIBLE] here, or those are things-- BILL DALLY: It's a useful inference machine as well. And you can configure the amount of bandwidth you want. So the bandwidth in the box is preconfigured. You've got the eight GPUs connected by NV switches. But on the back panel there, you can choose to connect up all of that or none of that. And then you also have PCIe slots in there, where we put in InfiniBand NICs. And so you can then decide how much you want to connect up with the NICs. You can provision the bandwidth above each box to be what you need. But the other neat thing, especially for training, is that both the NVLINK network and the InfiniBand network support network collective. 
So there's all reduce and things like that. And so you're doing the data parallel, where everybody has to exchange their weights. Normally everybody would have to exchange the weights. You'd have to add them, and then you'd have to give back the added weights together. With the all reduce in the network, you just send your weights in and receive the sum. And so it basically essentially doubles the effective network bandwidth for data parallelism. So let's talk about software a little bit. This slide is really dim. So what I like to say sometimes about deep learning is that anybody can build a matrix multiplier, but software makes it useful. And at NVIDIA, we sort of kicked off the process of building deep learning software in 2010 when I actually met my Stanford colleague Andrew Ng over breakfast, and he was telling me about finding cats on the internet. And I said, gee, the GPUs would be better at that than CPUs. And so I assigned a guy at NV research-- actually a programming language researcher called Bryan Catanzaro to go work with Andrew and find cats on the internet. And the software he wrote became cuDNN. Of course, Andrew then stole Bryan and had him working at Baidu for a while. He's since come back to NVIDIA. But we started in 2010, and since then, we've built an enormous amount of software. This is the different layers of it here. We sort of have three main piles of software-- our AI stack or HPC stack and our graphics stack, which we call Omniverse, and then a bunch of verticals built on top of that, things ranging from medical diagnostics with Clara to physics simulations with Modulus, our self-driving car stuff with Drive and so on. And there's probably many tens of thousands of person years of software effort in here. And the way this really manifests itself are in the MLPerf benchmarks. And I think this is a really good measure of-- I like to look at it to see where the competition is but just where the deep learning community is in general. This is great. I mean, in the CPU world, back when I did CPU architecture in the '80s and '90s, there were the Spec Mark benchmarks, and everybody would compete on Spec. And they became a little bit artificial. I think what's really great about the MLPerf benchmarks is they roll new ones out periodically. They now have an LLM benchmark. They've got a good recommender benchmark. They tend to track a rapidly-moving industry pretty rapidly with very little lag. And the point I wanted to make with this slide is not that Hopper is 6.7 times faster than when H100 was first announced. When H100 was first announced, it said Ampere is 2.5 times faster than it was when it was first announced. That's the same hardware. That's improvements in the software on Ampere and building on a big software base to begin with. So people can build really great matrix multipliers. But unless they have this many tens of thousands of person years effort in deep learning software, it's very difficult to be competitive. And this shows up in the MLPerf results. Here's just a bunch of headlines I clipped. The most recent one is also yesterday-- November 8 up there. And unfortunately, you have to read down the article a little ways before you find the part that says how great NVIDIA is. But in the other ones, I highlighted how great NVIDIA is in all of these. And so basically-- I forget what movie it is, where the protagonists are running, and right behind them is, like, the Mongol hordes along with a bunch of wild beasts and stuff that they're running to stay ahead of. 
That's what it's sort of like to be in NVIDIA these days because everybody and their dog is out there trying to build AI hardware because they see it as sort of the next Yukon gold rush. And we're trying to defend a position, not by impeding them any way, but just trying to run faster than them. And it's this awkward thing where most of them are going to trip and fall, but all it takes is for one of them to keep running and us to trip, and it doesn't end well. So where are we going? How are we going to stay ahead of the Mongol horde in having better AI hardware? And so a good way to answer that question is to plot a little chart here of where the energy goes when we're doing inference. And so this big chunk over here is math, datapath and math-- 47%. So half of our energy is going into doing math. We've got to do better math. And the way to do better math is to have better number representations or to exploit sparsity to do less math. Then the next bunch of these are all memories-- the accumulation buffer, the input buffer, the weight buffer, the accumulation collector, and then 6% is moving data around. That's basically mostly staging it from off-chip memory to on-chip memory, and from the big on-chip-memory to the smaller on-chip memory. So what are we going to do? So there's a lot we can do in the number representation. I'm going to talk about several of these here. And probably the most effective thing we've learned is that what you want to do is use the cheapest number representation you can get away with. And the way to do that is to properly scale your numbers to exactly fit into the dynamic range of that number representation. And I'll tell you a little bit about some clever ways we found to do that. And also, I'm a big fan of long numbers, and I'll tell you about some clever ways of how to make adds of log numbers easy because, normally, the thing is that multiplies are easy because they're adds, but adds are hard because they involve having to convert back to integer. And then right now, we're just doing sparsity on weights. We can do sparsity on activations, and there's also a lot of lower-density things we can exploit for sparsity. To reduce that little 6% data movement, we can do better tiling-- basically a better way of scheduling our loops and to minimize the movement and maximize reuse. And then there's a lot of work we're doing on circuits. So half of this energy here is memory. You can do very simple things to make memory more efficient. I'll give you an example of one. So it turns out that these memories are write once, read many. So it's OK to make them energetically expensive when you do the write. And so one way to exploit that is to have a bitline per cell. So you go ahead and you write the cell, and then you energize that bitline, and the output of the memory just [INAUDIBLE]. Right? So when you do a read, all you're doing is selecting from one of a bunch of bit lines that are right in front of the multiplier, and the reads are almost no energy. The writes are where all the energy is. And that way, you don't toggle the bit lines every time you do a read. You do that one write. Things toggle. But then, on the reads, you're toggling only the multiplier input. There's better communication circuits-- if I was to assign a circuit student, design a way of signaling a bit from one end of the chip to the other that uses the most energy possible, they would do what we do today, use logic-- power supply for 1, ground for 0. 
But if you're willing to spend a little bit of transistor area, you don't need 1 volt to signal a 1 or a 0. That's enough energy to fry a cow. You need just enough to get your Eb over N0 up to maybe 20 dB. That's enough-- you probably will not get any measurable errors. And you can probably do that with 50 millivolts. So there's a lot you can do with better communication circuits. The one thing I'm really very excited about and we're pushing very hard on this is to increase our memory bandwidth and simultaneously reduce our memory energy by doing memory by stacking DRAM directly on top of the GPU. But it turns out there's an enormous number of technical problems to be solved with that, and it probably will not happen for a few generations. So I'm going to talk about some, not all of these because we only have an hour or so. Let me start with number representation. So when you ask how good is a number system-- it's getting hot in here. How good is a number system-- I'll just want a chair here. There's really two questions you want to ask. There's really two questions you want to ask. So one is what is the accuracy I'm going to get out of this, basically and what that really is real numbers are real numbers. They cover the whole number line. And so accuracy is really about what is the maximum error I get, taking an arbitrary number on the number line, and converting it into a number system because I have to round up or down to some representable value. And the other is how big a range of numbers can I represent? What is the dynamic range? And there's two-- that's the accuracy side and the cost side. There's two aspects of it. One is really just the number of bits. If I have to read it from storage or move it over a wire, the storage and wire doesn't care what those bits are. It's just how many bits you're referencing. So the fewer bits I can use, the better. But then I have to do an operation, and so how expensive is it to do a multiply accumulate? So here are a bunch of representative numbers I'll talk about briefly. So I can use an integer-- integer, as it turns out-- let me go to the next slide as I talk about these-- integers, it turns out, have really awful accuracy. And that's because their error is independent of value. So it's a half an LSB, regardless of whether I'm dealing with a small number or a big number. And I've got a graph that will illustrate this on the next slide. And so their worst case error is 33%. It's the difference between 1.5 and 1, because I have to round that 1.5 up to 2 or down to 1. So it's a half out of that 1. And that's not good. The floating point is much better, and log is better yet. I should have done log 8 and FP8, but I've got a graph that shows that in a little while. Symbol is great because it's sort of the-- I should say what symbol is. So it turns out you can use backpropagation and stochastic gradient descent to train anything. So you can train a codebook, and Song and I wrote a paper about this in ICLR-- it was, like, 2015, I think. And what we did is we trained a 16-entry codebook to get the best possible 16 weights to represent the distribution of weights that we had. And it was great in the sense that it was-- if you're limited to 16 values, or it could be 64 or 256, depending on how many bits you use, that's the best possible you can do in representing that. It will minimize that storage and movement energy. But then you actually have to do the math. And to do the math, you have to actually represent those points you've picked. 
So you need to do a pretty high precision lookup out of that table. So just that lookup is expensive. And then you're doing-- what we did is 16-bit math. And so we wound up doing 16-bit math with these 4-bit symbols, it just killed any energy advantage we had. It was clever, but it wasn't a very good idea. And then spiking-- I had to put that up here because this is if you assign the student to do the worst possible representation in terms of energy, you would pick spiking because what dissipates energy in CMOS? It's toggling a line. So suppose I have to represent an integer between 1 and 128. So in spiking the average representation, say, over 64, I've got to toggle the line 64 times. Now if, instead, I used a 7-bit integer to represent 0 to 128, I've got seven bits, half of them will flip-- flip, not toggle. So that's 3 and 1/2. So it's 1.75 toggles versus 64. So it's a 32x overhead of spiking versus integer representation-- not a good thing. And then analog-- I actually have a whole talk on why analog is great for doing individual operations but bad at the system level. But I won't go into that. The short story is to either store or move an analog number, you typically have to convert it to digital. There's no easy way to do direct analog storage or movement. And converting it to digital negates any advantage you had from doing the operation in analog. So let's talk a little bit about representation and distribution. So if you prune a network, which you want to do if you exploit sparsity, you'll get a bimodal distribution like this. Is a distribution of weights from-- actually, this is from that ICLR paper in 2015. And if you were to quantize linearly-- basically just use an integer representation-- and you had 16 values, they'd be where the X's are. And you see that I'm sampling a lot out here, where there's nothing happening. And I've got like three X's under that whole lobe of distribution. So I'm going to have very high error. On the other hand, if I train the codebook, I get the little red dots where I'm basically putting all of my dots where they do a lot of good. And, in fact, it's optimal that way. You trained it to be. There's a k-means clustering step, but then, otherwise, you're sort of using stochastic gradient descent to move those codebook entries to the best possible place. And so that's what you want to do. If you don't prune, you tend to get distributions that look like this. And the important thing to understand here is that most of your values are down near zero. It's really important to represent the small values well. Errors in the small values tend to have bigger weight than errors in the larger values because there aren't very many of those. So let's talk about log representation. So for those of us from a certain era, when we went off to college, we did our math calculations on things that looked like. This is a slide rule. And it's basically doing log computation. The little lines scribed on here are spaced logarithmically. And so to do a multiply-- say I wanted to multiply 1.12 by 1.2, I put the one by 1.12 over there. I go over to 1.2, and I read the answer off. What I've done is I've turned the multiply into an add. I'm adding the linear distance and using this logarithmic table that's encoded in the scribe lines on the slide rule to do the math. So compared to INT or FP, it's got great properties. So if I take a logarithmic representation-- and here, I'm doing what I call log 4.3, which is an 8-bit representation. It has one sign bit. 
It has 7 exponent bits, 4 to the left of the binary point, and 3 to the right of the binary point. What that means is if you concatenate them together, you're doing 2 to that number over 8. And you need to have things to the right of the binary point because if you just take powers of 2 they're spaced too far apart. You need more gradations in your number system than that. And so by moving that binary point back and forth, you can trade dynamic range for accuracy. To compare it to FP8, I picked E4M3, so 4-bit exponent, and you take the same three bits there, but rather than being an extended exponent, they are mantissa values. And there, you have the same dynamic range because I basically pick the exponent the same. But my worst case accuracy is 50% higher. And that's because-- and I'll show you the graph in a minute-- I'm basically scaling the error, but I'm scaling it in blocks of values. Each block with the same exponent has the same error, and then it jumps up to a larger error, whereas with the log representation, I'm scaling it with every value. And so the error for the smallest numbers are really small. So let's look a little bit some more things. So here is log versus integer. And so you see the great thing about the log representation is where the numbers are small, which is where you really care, because that's where most of the weights are, the errors are really small. And the max error, which is the one between 1.5, right around 1.5 either up or down is 9%. This is with 4-bit log 2.2, no sign bit. With a 4-bit integer representation, that max error down here is 33% because it's the same absolute value. It's 0.5 everywhere. So it's 0.5 whether you're at 1.5 or whether you're at 15.5. It's the same error. And so where you really care about it, you get really big errors, and it's not proportional. Now floating point is sort of poor man's log. So here is floating point 2.2 compared to the log 2.2. And the error is like half, again, as much. So it went from 9 to 13. And that's because down here, you've got the same step size for the first four steps and then the same step size for the next four, whereas you're increasing the step size in every element with log. So you wind up with a smaller error where it really matters with the smallest values. So what are some properties of the log number system? So multiplies are cheap. You just do an add. That's actually way cheaper than doing a multiply in integer or floating point because there, you actually have to do a multiply, which is a quadratic cost operation, and add is a linear cost operation-- linear in the number of bits. But the problem is that adds are hard because, normally, to do an add in the logarithmic number system, what we have to do is do a lookup. So I should say it depends. It depends on converting the green area here, the EI and the EF are different. The EI is just a shift. You take that exponent, tells you how far to shift it. But the EF is a lookup. You have to look up one of eight values, which are 2 to the-- 2 to the 0 is easy. That's 1. But then 2 to the 1/8, 2 to the 1/4, 2 to the 3/8, 2 to the 1/2 and so on. And you have to look up the binary representations of those and represent them to enough bits to hold accuracy, then shift them by the amount of EI, and then do the add. And if you were to do that for every add, it would be a really expensive operation. But think about what you're doing in a deep learning system. 
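Before the talk moves on to accumulation, here is a small numerical sketch (not from the talk) of the worst-case error comparison just described. The representable-value constructions for log 2.2, a 4-bit integer, and a toy 2.2 floating point are my own assumptions (no sign bit, no exponent bias, no denormals), so the integer and log figures land near the quoted 33% and 9%, while the floating-point figure depends on exactly how the bottom of the range is handled.

```python
import numpy as np

def log_2_2_values():
    # log 2.2: value = 2**(E/4), E = 0..15 (2 integer + 2 fractional exponent bits)
    return np.array([2.0 ** (e / 4.0) for e in range(16)])

def int4_values():
    # 4-bit unsigned integer magnitudes: 0..15
    return np.arange(16, dtype=float)

def fp_2_2_values():
    # toy float: 2 exponent bits (0..3, no bias), 2 mantissa bits, 1.m significand
    vals = [(1.0 + m / 4.0) * 2.0 ** e for e in range(4) for m in range(4)]
    return np.array(sorted(vals))

def worst_relative_error(representable, lo, hi, n=200_000):
    xs = np.linspace(lo, hi, n)
    nearest = representable[np.argmin(np.abs(xs[:, None] - representable[None, :]), axis=1)]
    return np.max(np.abs(xs - nearest) / xs)

for name, vals in [("log 2.2", log_2_2_values()),
                   ("int 4  ", int4_values()),
                   ("fp 2.2 ", fp_2_2_values())]:
    # Evaluate from 1.0 up to the largest representable value, mirroring the
    # talk's "difference between 1.5 and 1" example (relative error near zero
    # is a separate story for the integer format).
    err = worst_relative_error(vals, 1.0, vals.max())
    print(f"{name}: worst-case relative error ~ {err:.1%}")
```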
You're typically doing a whole bunch of multiplies and adding the results all up to compute a new activation-- in fact, often tens of thousands of multiplies, and then adding them all up. So here's a figure from a particular US patent application. The number is down there and what it shows is how to do this inexpensively. And the way you do it is you factor out that table lookup to the outside of that 10,000 elements. So suppose I have 10,000 elements to add up, and I have that EI and EF for every one of them. So what I do is I sort them based on EF, which 2 to the 1/8 or 2/8 or 3/8 or 1/4 do I need to multiply it the end? And I take all of those, and I just add the integer part together, which is easy. So if I take the remainder components, the EF parts, and I pick in the sorting unit which of the-- say it's three bits-- which of the eight outputs I root the quotient components to. And in a typical application, it's not one per cycle. It's like 8 or 16 per cycle that are coming in here. And then I route that sign in EI, and I take the sign bit, and I shift it by the quotient. And then I sum those accumulators up. And by the way, it's a one hot sum that you can exploit for-- because you're only flipping one bit. You can exploit that for some energy efficiency if you're clever. And I sum those all up until I'm done with this whole tensor. 10,000 adds done-- on average, about 1,000 in each of these bins. And then I do one lookup. In fact, it's not a lookup. It's hardwired. I have that constant hardwired into here. I do a final multiply by that hardwired number, take that partial sum out, and add them up, and now I have the value in integer form. And there's a very simple way of converting it back to the log form. So that's log number system. We'll now talk about how to optimally clip your numbers. And this is really about how to pick the right dynamic range. So whatever representation you use, whether it's integer or it's log or it's floating point, the critical operation is deciding how to center your representable range on the bunch of numbers. You actually have to represent the distribution. So let me look at two possible ways of doing this. So on the left here is what most people are doing today, where you basically scan over-- let's say I have a bunch of weights or activations. I scan over my weights of activations, and I can do this over different granularities. And that's something I'll talk about later. It's typically done per layer. So I'll take a layer of my network, scan all the weights, say here's my minimum weight, minus 0.8, and here's my maximum weight, 0.8. I'm going to scale my representable range of numbers-- in this case, it's an integer representation because it's evenly spaced-- so that I can exactly represent the largest value and the smallest value. And what you do there is have no clipping noise. I'm not generating any noise by saturating down larger numbers than I can represent. But I have a really large quantization noise. The space between these red bars is really big. On the other hand, I could choose to clip. I could choose to say, rather than represent 0.8, I'm going to make my largest representable number 0.2. That means any number out here I've got to saturate down to 0.2. And that will introduce clipping noise, reducing the value of that number. But my quantization noise is much smaller. 
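The binning trick described above is easy to emulate in software. The sketch below is a functional model of the idea, not the hardware in the patent figure: it keeps one accumulator per fractional-exponent value, does only shift-and-add work inside the loop, and applies the eight "hardwired" 2^(f/8) scale factors once at the very end. The random test inputs are just there to check against a direct evaluation.

```python
import numpy as np

FRAC_BITS = 3                 # EF has 3 bits -> 8 accumulation bins
NUM_BINS = 1 << FRAC_BITS

def log_accumulate(signs, e_int, e_frac):
    """Sum of s * 2**(e_int + e_frac/8) using one accumulator per fractional exponent.

    signs  : array of +1/-1
    e_int  : integer exponent parts (the cheap shift amounts), assumed >= 0 here
    e_frac : fractional exponent parts, 0..7 (select which accumulator to use)
    """
    bins = np.zeros(NUM_BINS)
    # Inside the loop everything is shift-and-add: no multiplies, no lookups.
    for s, ei, ef in zip(signs, e_int, e_frac):
        bins[ef] += s * (1 << ei)                 # sign shifted left by EI
    # One fixed multiply per bin, done once after the whole tensor is summed.
    scales = 2.0 ** (np.arange(NUM_BINS) / NUM_BINS)
    return float(np.dot(bins, scales))

# Quick check against a direct floating-point evaluation.
rng = np.random.default_rng(0)
n = 10_000
signs  = rng.choice([-1, 1], n)
e_int  = rng.integers(0, 10, n)
e_frac = rng.integers(0, NUM_BINS, n)

direct = np.sum(signs * 2.0 ** (e_int + e_frac / NUM_BINS))
binned = log_accumulate(signs, e_int, e_frac)
print(direct, binned)   # the two should agree to floating-point precision
```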
And so we were actually beating this around during a meeting one day, and I asked an interesting question, which I didn't think was answerable, which is there a way of figuring out what the right place is to set this clipping factor to get the minimum mean squared error, which isn't exactly what you want, but it's a good proxy, and it's something that you can state. What you really want is the minimum error in the neural network, but that's a much harder thing to do. So one of our recent employees came back an hour later with this integral. He says, yeah, you just solve this integral. And I go, what? AUDIENCE: [INAUDIBLE] GPT-4 or-- BILL DALLY: No, this was before GPT-4. He actually did the math. I mean, it was pretty it was pretty impressive. And so it turns out solving that integral is hard, and you actually probably don't want to do it for every-- especially for activations, which change every time. But there's this iterative equation, which approximate the integral, which he came back with like a day later. And if you do, like, 2 or 3 iterations of that, it gives you a really good approximation of the integral. It gives you this upper and lower clipping points. And intuitively, what's going on-- so let's look at, say, this one because it's within reach. This is a particular layer, layer number 13 of this network we're looking at. And say I'm using four bits. If my clipping scaler is out here, it means I'm doing no clipping. I'm representing all the way out to 4.2 or whatever, which is the largest value that can be represented. And so as I start to clip, my noise goes down because I'm reducing the quantization noise a lot. And I'm actually not introducing very much clipping noise because there are almost no weights out there, until I moved down to this point about here, where-- and I've clipped a lot. I've clipped from 4.2 down to, like, 1.2. And at that point, actually, maybe 1.1-- at that point in time, I hit a minimum. And if I go any lower than that, the clipping noise starts to be large enough that it actually starts to make the overall mean squared error go up. But that's where I want to sit. And if I look at sort of the point of error, I don't have the 5-bit line here, but I actually wind up being well below the top of the 5-bit line. In fact, I'm almost down to the 6-bit line if I didn't do any clipping. So it winds up being worth more than a bit to do this in mean squared error. So it's a huge thing not to scale where you think you are, but to actually do optimal scaling to scale to minimize the mean squared error. AUDIENCE: [INAUDIBLE]. So this assumes a pretrained network that you go clip in and you go-- if you have training the loop, you actually assume a certain number of presentation during training, doesn't this answer change a lot? BILL DALLY: That's a really good question. I don't know. I don't know whether it would or not. It would be nice to sort of train the clipping factor. AUDIENCE: Well, to get the simplest one possible. There's a lot of cleverness here. [INTERPOSING VOICES] BILL DALLY: I'm glad you said that because I hadn't thought about this before, but I think you could train the clipping factor just backprop into the clipping factor. And then the rest of the weights would adapt to the clipping factor if you did this. So we're doing this post training. And we're not retraining with the clip, which even there we could do, and it would probably do better. 
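Since the iterative formula itself isn't given in the talk, here is a brute-force stand-in (my own sketch) that makes the same point: sweep the clipping threshold, measure the combined clipping-plus-quantization mean squared error, and note that the minimizer sits well below the maximum weight. The Gaussian stand-in for a layer's weights and the simple 4-bit symmetric quantizer are assumptions.

```python
import numpy as np

def quantize_mse(x, clip, bits=4):
    """Mean-squared error of symmetric uniform quantization of x with the
    representable range scaled to [-clip, +clip]."""
    levels = 2 ** (bits - 1) - 1                 # 7 positive levels for 4 bits
    step = clip / levels
    x_clipped = np.clip(x, -clip, clip)          # clipping noise comes from here
    x_quant = np.round(x_clipped / step) * step  # quantization noise from here
    return np.mean((x - x_quant) ** 2)

# Stand-in for one layer's weights: most values near zero, a few large outliers.
rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.3, 100_000)

clips = np.linspace(0.1, np.abs(w).max(), 200)
mses = np.array([quantize_mse(w, c) for c in clips])
best = clips[np.argmin(mses)]

print(f"max |w|            = {np.abs(w).max():.2f}")
print(f"MSE-optimal clip   = {best:.2f}")                       # well below max |w|
print(f"MSE at max-scaling = {quantize_mse(w, np.abs(w).max()):.2e}")
print(f"MSE at best clip   = {mses.min():.2e}")
```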
AUDIENCE: I guess more generally, could you pick the simplest one, the simplest number representation that has whatever reasonable dynamic range possible, and whatever crazy properties you have in the distribution, and I assume that's the only math you have, so--

BILL DALLY: We still have to pick a scale factor. The key thing is you're not just using the representation itself. You're using that representation plus a scale factor. And that scale factor winds up being hugely important. And so let me talk a little bit more about scaling. Scaling is probably the biggest message of this talk.

So typically, people would either scale-- when we first started scaling, we had two scale factors, one for the forward propagation and one for the back propagation, because they wound up using very different values, especially because in the back propagation you multiply by a learning rate, which makes the numbers much smaller going back. And then we started doing it layer by layer. And then, of course, if you do something and it makes things better, then you ask, can I do that again? So we went to smaller granularity. It made it better. So we said, what if we scaled not by layer but by vector? And vector here means 16 or 32 or 64 elements. And so now we have a really tight distribution for every vector, and we can scale that and just get tremendously better results. It winds up being worth a couple more bits of precision.

So the way to think about it is-- you're sort of typically doing [INAUDIBLE] ConvNet the way it's shown here. I have height by width by channel, H, W, and C. And I'm convolving that by R by S-- that's the size of the convolution. But I'm doing a dot product in the channel dimension with the weights of these channels. And if I take that element, that vector element, that's, say, 32 long in the channel dimension and dot product it with that, I can scale that 32-element vector independently of the rest of the tensor. And to do this, I wind up having to add this little unit to my MAC unit, where on the output, after doing all the weights-times-activations, to convert it back to the prescaled numbers I have to multiply by two scale factors-- one for the weight and one for the activation, sw and sa. So graphically, the way to think about it is: the big blue blob here is the distribution of the whole layer. The little range of the lines here is the range of my vector. And so, basically, now I'm scaling to a smaller vector. And if I actually optimally clip to the smaller vector, I can do better yet. And so this one winds up doing really well. And I'll give some results for an accelerator we built doing this stuff in a minute.

But let me talk about sparsity a bit. So this is a figure from a paper that Song and I had in NeurIPS-- then called NIPS-- in 2015, where what we showed is you could take a neural network-- and this shows a multilayer perceptron-- and lop out a bunch of the weights, retrain it with a mask that held those weights out, and we basically got the same accuracy. In fact, for multilayer perceptrons, we can knock 90% of the weights out and get the same accuracy. For ConvNets, the number is typically more like 30%. You can knock out 60% to 70% of the weights, leaving a density of 30% to 40%, and still have the same accuracy. So we thought this was great. We actually did a little test chip called the Efficient Inference Engine to show that we could actually implement this, and it was efficient. But it was efficient compared to a scalar engine.
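Backing up to the per-vector scaling described just above, here is a minimal NumPy sketch of the idea, an illustration rather than the actual MAC-unit datapath: each short vector (say, 32 elements along the channel dimension) gets its own scale factor, the dot product is done in low-precision integers, and the integer partial sum is rescaled once by sw times sa on the way out. Function names are invented for the example.

```python
# A minimal sketch of per-vector scaling: quantize each 32-element vector with
# its own scale factor, do the dot product in integers, and multiply the
# integer partial sum by s_w * s_a on the way out -- the extra little unit on
# the MAC output mentioned above. Plain NumPy, not the hardware datapath.
import numpy as np

def quantize_per_vector(v, bits=4):
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(v)) / levels + 1e-12     # one scale per vector
    q = np.clip(np.round(v / scale), -levels, levels).astype(np.int32)
    return q, scale

def vector_scaled_dot(weights, acts, bits=4):
    qw, sw = quantize_per_vector(weights, bits)
    qa, sa = quantize_per_vector(acts, bits)
    int_sum = int(np.dot(qw, qa))                  # integer MACs
    return int_sum * sw * sa                       # rescale once per vector

w = np.random.normal(0, 0.05, 32)
a = np.random.normal(0, 1.0, 32)
print("exact:", float(np.dot(w, a)), " per-vector 4-bit:", vector_scaled_dot(w, a))
```

Because each vector's distribution is much tighter than the whole layer's, the same number of bits spans a much smaller range, which is where the "couple more bits of precision" comes from.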
We could show that by having-- we had dedicated hardware to walk the CSR structure that would make it easy for us to do sparse computations at, like, 50% density efficiently. Because, I remember, one of the reviewers of the original paper said, oh, this is of academic curiosity only, because everybody knows that if you use a sparse matrix package, your matrix has to be, like, no more than 0.1% dense or it's faster to just run it dense. And if you do it all in software, that's true, but we basically had separate memory arrays to store the CSR pointers. We had special hardware to walk those memory arrays. And so we could, with almost no overhead, do the sparse computation.

The problem is nobody wants to do a scalar computation. We have to compare against the-- the current best accelerators we build have parallel vector units. So there are basically 16 parallel units that do 16-long vectors simultaneously. So they're doing 256 MACs per cycle, and you have to make that work efficiently sparsely. And that has been a problem I have spent a lot of time banging my head against the wall on since 2015. And I keep coming up with ideas that I think are good ideas, and they look really good until we actually synthesize the logic for them. And then what I find out is that, gee, if I'm trying to do a dot product of two sparse things together, and I'm doing a last-minute computation to set the multiplexers to pick the nonzero ones out and feed them into the MAC units, I generate enormous amounts of glitching, because the multiplexers are trying to figure out on the fly which numbers are 0. And they keep selecting this number and saying, nope, that one's nonzero here, but on the other side, it's 0, so let's slide down the vector a little further. And it winds up toggling the math unit 5 times for every computation and killing our energy. So we've been looking at ways of eliminating glitches as well, because that wound up being the killer of a lot of these sparse things.

The one thing that worked-- and it's in Ampere and Hopper-- is what we call structured sparsity. So it turns out what makes sparsity hard is that it's irregular. And when things are irregular, you spend a lot of energy on bookkeeping and shuffling numbers to get them in the right place. So the answer is: don't make it irregular. You force your sparsity to have this regular pattern. So the recipe is, as with all the sparsity stuff, you densely train your networks. Even for all of our sparsity, it's really for inference only-- most of the training is done dense. But then, after you've trained the network, you find the smallest weights, and you lop those away. But we're going to lop those away in a structured way here, where we insist that no more than two out of every four be nonzero. And if you can't do that, you just have to do the computation dense. But if you can lop out two out of four-- which you can usually do, because, especially if this was an MLP, you could have lopped out 9 out of 10, so two out of four ought to be easy-- we lop those out. We then retrain with a mask to make the other weights kind of compensate for the ones that are gone. And then we compress by storing just the nonzero weights and a little bit of metadata that says which ones they are. So the metadata, then-- which is static, so it's not switching all the time-- feeds the multiplexers that select the input activations and basically says, OK, out of those eight input activations, I take my metadata, four of them are nonzero-- four of them correspond to the nonzeros here.
Let's pick those four out and feed them into the multiplier, and we get a nice regular computation, and this basically goes twice as fast. We're still trying to figure out how to extend this, and in particular to make it work for activations as well. But once you have two sparsity patterns, now you have to intersect them. And now you've created an irregularity-- it's not predictable anymore. What makes this work well is that it's predictable every cycle. It's easy to extend this to do, instead of two out of four, two out of eight or any regular pattern, and we're playing with those as well.

So let me talk a little bit about accelerators and the relation between accelerators and GPUs. I'm running out of time, yakking too much. So we built a bunch of accelerators at NVIDIA: the EIE we did in collaboration with Stanford, Joel Emer at NVIDIA working with [INAUDIBLE] at MIT on [INAUDIBLE]. We did this thing called SCNN for sparse convolutional neural networks. We had this multichip module thing we did. And how do these accelerators get their performance? Well, there are really sort of five ways that they do this.

The first is special data types and operators. And this is what we get out of the GPU as well. We specialize the data type-- in Hopper, it's FP8. And we specialize the operation, with a QMMA doing that FP8 8-by-8 matrix multiply. And we basically do in one cycle what would take tens or hundreds of cycles to do separately. So QMMA and IMMA are both doing 1,024 operations with essentially-- it's not one cycle, but one instruction issue. The next is massive parallelism. We want 1,000x stuff. We just want to get a lot of things operating in parallel.

Probably one of the most important things about building an accelerator is having an optimized memory. Main memory accesses are so expensive. If you have to go out to memory to do anything, you're dead. You're not going to scale. You're not going to get any performance increase. We haven't really seen this with deep learning because deep learning has lots of hierarchical reuse. You can do small scratchpads and get really good reuse out of them. When we did a bioinformatics accelerator, which actually led to the dynamic programming instructions in Hopper, we first took the bioinformatics algorithm-- everybody's using Minimap-- and we asked, if we just specialized hardware for this algorithm, what kind of a speedup could we get? And the answer was, like, 4x, at which point we were about to quit and move on to something else, when we said, well, what if we redid the algorithm? This gets down to the bottom line here of algorithm-architecture codesign. So it turns out the bioinformaticists had assumed dynamic programming was expensive, so they spent an enormous amount of time on the seeding stage, trying to find good candidates to feed into the alignment stage. And it turns out we flipped that upside down. We made alignment really cheap, because we had an alignment engine that was 150,000 times more efficient than doing it on a CPU. And seeding was really expensive, because it required main memory access. So we flipped it around and did a really cheap seeding stage, which had lots of false positives feeding into the alignment stage, which we didn't care about, because the alignment was blindingly fast and could basically filter itself, and we wound up getting about a 4,000-times overall speedup. But the key there was optimizing the memory, because you can't be making main memory references and expect to get anything.
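Going back to the structured-sparsity recipe described above, here is a small NumPy sketch of the 2:4 idea, purely illustrative (the real mechanism lives in the tensor-core datapath, and the mask-and-retrain step is skipped): keep the two largest-magnitude weights in every group of four, store them with per-group position metadata, and use that static metadata to gather the matching activations at compute time.

```python
# A minimal sketch of 2:4 structured sparsity: within every group of 4 weights
# keep the 2 largest-magnitude ones, store only the nonzero values plus their
# positions (the metadata), and use that static metadata to select the matching
# activations during the dot product. Illustrative only.
import numpy as np

def prune_2_of_4(w):
    """Return (values, indices) keeping the 2 largest |w| in each group of 4."""
    w = w.reshape(-1, 4)
    idx = np.argsort(-np.abs(w), axis=1)[:, :2]        # positions to keep
    idx.sort(axis=1)                                    # canonical order
    vals = np.take_along_axis(w, idx, axis=1)           # compressed weights
    return vals, idx                                     # idx is the metadata

def sparse_dot(vals, idx, acts):
    """Dot product using compressed weights + metadata to gather activations."""
    a = acts.reshape(-1, 4)
    gathered = np.take_along_axis(a, idx, axis=1)        # mux the 2-of-4 acts
    return float(np.sum(vals * gathered))

w = np.random.normal(0, 1, 64)
a = np.random.normal(0, 1, 64)
vals, idx = prune_2_of_4(w)

# sanity check: same result as the equivalent masked dense dot product
dense = w.reshape(-1, 4).copy()
mask = np.zeros_like(dense, dtype=bool)
np.put_along_axis(mask, idx, True, axis=1)
dense[~mask] = 0.0
assert np.isclose(sparse_dot(vals, idx, a), float(np.dot(dense.ravel(), a)))
```

Because the metadata is fixed after pruning, the activation multiplexers are driven by static bits rather than on-the-fly zero detection, which is what avoids the glitching problem described earlier.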
And then you reduce your amortized overhead. And so for simple operations, compared to a CPU, it's 10,000 times. So I've been doing this for a while. This is past accelerators, going back to 1985. I started out building simulation accelerators. In the middle, there's a bunch of signal processing stuff, and then, most recently, the neural network stuff and a SAT solver, which is actually kind of an interesting thing.

But anyway, I wanted to drive home the point about amortizing out overhead. So this is data from a paper on a simple ARM out-of-order core-- it's one of the more simple ones. I forget which-- it's an A-something, and I thought I wrote it down on here. But compared to the more aggressive cores we have today, like the Neoverse cores we have in our Grace chip, this is a very efficient CPU. But even so, the cost of doing-- oh, it's an ARM A15. There it is. The cost of doing a CPU instruction, even if that instruction is a no-op-- fetching the instruction, decoding it, doing the branch table lookups, doing the out-of-order stuff in the register file-- is 250 picojoules to do nothing. To do a 16-bit integer add is 32 femtojoules. So it's approximately 99.99% overhead. So I think the FDA would say it's all overhead. They allow you, at some point, to just round up. Some brand of soap says it's 99.97% pure. And so it's important to sort of keep track of the cost of operations to do this.

And one of the things you see here is that, for the multiplies, this is scaling quadratically. A 32-bit multiply is not 4 times as expensive as the 8-bit multiply-- it's 16 times as expensive. The floating point is a little bit more complex, but it's roughly the same. It's got this weird normalization stage. And you see that reading things out of memory is more expensive than all that stuff, even than the 32-bit multiply. Reading a relatively small 8-kilobyte memory is more costly. And as you go further away, reading the memory gets even worse. And I like to just use this sort of order-of-magnitude thing, where if I read a local 8-kilobyte memory, it's 5 picojoules per word. If I have to go across the chip and read, basically, a memory that's hundreds of megabytes, it's 50 picojoules per word, but the memory part of that is still 5. You build big memories out of small memory arrays. And so we basically have one memory array size we tend to use for everything, because you're only allowed to have so many bit cells on a bit line and so many cells on a word line. So that's your basic array size. Anything above that, you put multiple arrays down. So the other 45 picojoules is communication energy. It's getting the address over to that memory array and getting the data back. And then, if you go off-chip, it's way worse. LPDDR, which is about the most energy-efficient DRAM you can get-- it's actually quite comparable to HBM-- these days, it's about--
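To put rough numbers on the overhead argument above, here is a back-of-the-envelope calculation using only the approximate figures quoted in the talk (250 pJ per CPU instruction, roughly 32 fJ for a 16-bit add, and the 5 pJ and 50 pJ per-word read costs); these are order-of-magnitude figures, not measured data.

```python
# Back-of-the-envelope arithmetic with the rough numbers quoted above: the
# energy of issuing a CPU instruction (~250 pJ) versus the 16-bit add it might
# perform (~32 fJ), and rough per-word read costs up the memory hierarchy.
INSTR_OVERHEAD_PJ = 250.0
ADD16_PJ = 0.032                      # 32 femtojoules

overhead_fraction = INSTR_OVERHEAD_PJ / (INSTR_OVERHEAD_PJ + ADD16_PJ)
print(f"overhead per scalar add on a CPU: {overhead_fraction:.2%}")   # ~99.99%

# rough per-word read energies (picojoules)
mem_pj = {"local 8 KB SRAM": 5.0, "far on-chip memory": 50.0}
for name, pj in mem_pj.items():
    print(f"{name}: ~{pj:g} pJ/word ({pj / ADD16_PJ:,.0f}x a 16-bit add)")
```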
