{"id":3409,"date":"2024-03-27T16:45:09","date_gmt":"2024-03-27T16:45:09","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2024\/03\/27\/nvidia-hopper-leaps-ahead-in-generative-ai-at-mlperf\/"},"modified":"2024-03-27T16:45:09","modified_gmt":"2024-03-27T16:45:09","slug":"nvidia-hopper-leaps-ahead-in-generative-ai-at-mlperf","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2024\/03\/27\/nvidia-hopper-leaps-ahead-in-generative-ai-at-mlperf\/","title":{"rendered":"NVIDIA Hopper Leaps Ahead in Generative AI at MLPerf"},"content":{"rendered":"<div id=\"bsf_rt_marker\">\n<p>It\u2019s official: NVIDIA delivered the world\u2019s fastest platform in industry-standard tests for inference on <a href=\"https:\/\/www.nvidia.com\/en-us\/ai-data-science\/generative-ai\/\">generative AI<\/a>.<\/p>\n<p>In the latest MLPerf benchmarks, <a href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus\/\">NVIDIA TensorRT-LLM<\/a> \u2014 software that speeds and simplifies the complex job of inference on <a href=\"https:\/\/www.nvidia.com\/en-us\/glossary\/large-language-models\/\">large language models<\/a> \u2014 boosted the performance of <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/technologies\/hopper-architecture\/\">NVIDIA Hopper architecture GPUs<\/a> on the GPT-J LLM nearly 3x over their results just six months ago.<\/p>\n<p>The dramatic speedup demonstrates the power of NVIDIA\u2019s full-stack platform of chips, systems and software to handle the demanding requirements of running generative AI.<\/p>\n<p>Leading companies <a href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus\/\">are using<\/a> TensorRT-LLM to optimize their models. And <a href=\"https:\/\/www.nvidia.com\/en-us\/launchpad\/ai\/generative-ai-inference-with-nim\/\">NVIDIA NIM<\/a>\u00a0 \u2014 a set of inference microservices that includes inferencing engines like TensorRT-LLM \u2014 makes it easier than ever for businesses to deploy NVIDIA\u2019s inference platform.<\/p>\n<p><a href=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2024\/03\/MLPerf-1-GPTJ-LLM-3x.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-large wp-image-70885\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2024\/03\/MLPerf-1-GPTJ-LLM-3x-672x366.jpg\" alt=\"MLPerf inference results on GPT-J LLM with TensorRT-LLM \" width=\"672\" height=\"366\"><\/a><\/p>\n<h2><b>Raising the Bar in Generative AI<\/b><\/h2>\n<p>TensorRT-LLM running on <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/h200\/\">NVIDIA H200 Tensor Core GPUs<\/a> \u2014 the latest, memory-enhanced Hopper GPUs \u2014 delivered the fastest performance running inference in MLPerf\u2019s biggest test of generative AI to date.<\/p>\n<p>The new benchmark uses the largest version of Llama 2, a state-of-the-art large language model packing 70 billion parameters. The model is more than 10x larger than the GPT-J LLM first used in the <a href=\"https:\/\/blogs.nvidia.com\/blog\/grace-hopper-inference-mlperf\/\">September benchmarks<\/a>.<\/p>\n<p>The memory-enhanced H200 GPUs, in their MLPerf debut, used TensorRT-LLM to produce up to 31,000 tokens\/second, a record on MLPerf\u2019s Llama 2 benchmark.<\/p>\n<p>The H200 GPU results include up to 14% gains from a custom thermal solution. 
The H200 GPU results include up to 14% gains from a custom thermal solution. It’s one example of innovations beyond standard air cooling that system builders are applying to their [NVIDIA MGX](https://www.nvidia.com/en-us/data-center/products/mgx/) designs to take the performance of Hopper GPUs to new heights.

![MLPerf inference results on Llama 2 70B with H200 GPUs running TensorRT-LLM](https://blogs.nvidia.com/wp-content/uploads/2024/03/MLPerf-2-Llama-2-win.jpg)

## Memory Boost for NVIDIA Hopper GPUs

NVIDIA is shipping H200 GPUs today. They’ll be available soon from nearly 20 leading system builders and cloud service providers.

H200 GPUs pack 141GB of HBM3e memory running at 4.8TB/s. That’s 76% more memory flying 43% faster compared to H100 GPUs. These accelerators plug into the same boards and systems and use the same software as H100 GPUs.

With HBM3e memory, a single H200 GPU can run an entire Llama 2 70B model with the highest throughput, simplifying and speeding inference.

## GH200 Packs Even More Memory

Even more memory — up to 624GB of fast memory, including 144GB of HBM3e — is packed in [NVIDIA GH200 Superchips](https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/), which combine a Hopper architecture GPU and a power-efficient [NVIDIA Grace CPU](https://www.nvidia.com/en-us/data-center/grace-cpu/) on one module. NVIDIA accelerators are the first to use HBM3e memory technology.

With nearly 5TB/s of memory bandwidth, GH200 Superchips delivered standout performance, including on memory-intensive MLPerf tests such as [recommender systems](https://blogs.nvidia.com/blog/grace-hopper-recommender-systems/).
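The memory figures above are easy to sanity-check. The short script below recomputes the H200-versus-H100 ratios and estimates the weight-only footprint of a 70-billion-parameter model. The H100 SXM reference specs (80GB of HBM3 at roughly 3.35TB/s) are public figures that are not quoted in this post, and the footprints are back-of-the-envelope estimates, not MLPerf measurements.

```python
# Back-of-the-envelope check of the memory claims above.
h200_mem_gb, h200_bw_tbs = 141, 4.8
h100_mem_gb, h100_bw_tbs = 80, 3.35   # assumed H100 SXM specs, not stated in the post

mem_gain = (h200_mem_gb / h100_mem_gb - 1) * 100
bw_gain = (h200_bw_tbs / h100_bw_tbs - 1) * 100
print(f"H200 vs. H100: {mem_gain:.0f}% more memory, {bw_gain:.0f}% more bandwidth")
# -> about 76% and 43%, matching the figures quoted above

# Rough weight-only footprint of a 70B-parameter model (ignores KV cache and activations).
params = 70e9
for precision, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    footprint_gb = params * bytes_per_param / 1e9
    print(f"{precision} weights: ~{footprint_gb:.0f} GB vs. 141 GB on one H200")
```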
## Sweeping Every MLPerf Test

On a per-accelerator basis, Hopper GPUs swept every test of AI inference in the latest round of the MLPerf industry benchmarks.

The benchmarks cover today’s most popular AI workloads and scenarios, including generative AI, recommendation systems, natural language processing, speech and computer vision. NVIDIA was the only company to submit results on every workload in the latest round and in every round since MLPerf’s data center inference benchmarks began in October 2020.

Continued performance gains translate into lower costs for inference, a large and growing part of the daily work for the millions of NVIDIA GPUs deployed worldwide.

## Advancing What’s Possible

Pushing the boundaries of what’s possible, NVIDIA demonstrated three innovative techniques in a special section of the benchmarks called the open division, created for testing advanced AI methods.

NVIDIA engineers used a technique called [structured sparsity](https://developer.nvidia.com/blog/accelerating-inference-with-sparsity-using-ampere-and-tensorrt/) — a way of reducing calculations, first introduced with [NVIDIA A100 Tensor Core GPUs](https://www.nvidia.com/en-us/data-center/a100/) — to deliver speedups of up to 33% on inference with Llama 2.

A second open division test found inference speedups of up to 40% using pruning, a way of simplifying an AI model — in this case, an LLM — to increase inference throughput.

Finally, an optimization called DeepCache reduced the math required for inference with the Stable Diffusion XL model, accelerating performance by a whopping 74%.

All these results were run on [NVIDIA H100 Tensor Core GPUs](https://www.nvidia.com/en-us/data-center/h100/).
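The structured sparsity result above relies on the 2:4 pattern that Hopper (and, before it, Ampere) Tensor Cores can accelerate: in every group of four consecutive weights, two are zeroed out. The NumPy sketch below illustrates only that pruning pattern; it is not NVIDIA’s actual workflow, which typically pairs pruning with fine-tuning and TensorRT support so accuracy is preserved.

```python
# Toy illustration of 2:4 structured sparsity: zero the two smallest-magnitude
# weights in every group of four. Real deployments use NVIDIA's tooling and
# fine-tuning; this only shows the sparsity pattern itself.
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Return a copy of `weights` with the 2 smallest-magnitude values zeroed in each group of 4."""
    flat = weights.reshape(-1, 4)                          # groups of 4 along the last axis
    order = np.argsort(np.abs(flat), axis=1)               # per-group indices, smallest magnitude first
    mask = np.ones_like(flat, dtype=bool)
    np.put_along_axis(mask, order[:, :2], False, axis=1)   # drop the 2 smallest per group
    return (flat * mask).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
print(prune_2_of_4(w))  # exactly half the entries in every group of 4 are zero
```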
## A Trusted Source for Users

MLPerf’s tests are transparent and objective, so users can rely on the results to make informed buying decisions.

NVIDIA’s partners participate in MLPerf because they know it’s a valuable tool for customers evaluating AI systems and services. Partners submitting results on the NVIDIA AI platform in this round included ASUS, Cisco, Dell Technologies, Fujitsu, GIGABYTE, Google, Hewlett Packard Enterprise, Lenovo, Microsoft Azure, Oracle, QCT, Supermicro, VMware (recently acquired by Broadcom) and Wiwynn.

All the software NVIDIA used in the tests is available in the MLPerf repository. These optimizations are continuously folded into containers available on [NGC](https://ngc.nvidia.com/catalog), NVIDIA’s software hub for GPU applications, as well as [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/) — a secure, supported platform that includes NIM inference microservices.

## The Next Big Thing

The use cases, model sizes and datasets for generative AI continue to expand. That’s why MLPerf continues to evolve, adding real-world tests with popular models like Llama 2 70B and Stable Diffusion XL.

Keeping pace with the explosion in LLM model sizes, NVIDIA founder and CEO Jensen Huang announced last week at GTC that [NVIDIA Blackwell architecture GPUs](https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/) will deliver new levels of performance required for multitrillion-parameter AI models.

Inference for large language models is difficult, requiring both expertise and the full-stack architecture NVIDIA demonstrated on MLPerf with Hopper architecture GPUs and TensorRT-LLM. There’s much more to come.

Learn more about [MLPerf benchmarks](https://www.nvidia.com/en-us/data-center/resources/mlperf-benchmarks/) and the [technical details](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/) of this inference round.