{"id":4305,"date":"2025-10-10T00:43:50","date_gmt":"2025-10-10T00:43:50","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2025\/10\/10\/nvidia-blackwell-raises-bar-in-new-inferencemax-benchmarks-delivering-unmatched-performance-and-efficiency\/"},"modified":"2025-10-10T00:43:50","modified_gmt":"2025-10-10T00:43:50","slug":"nvidia-blackwell-raises-bar-in-new-inferencemax-benchmarks-delivering-unmatched-performance-and-efficiency","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2025\/10\/10\/nvidia-blackwell-raises-bar-in-new-inferencemax-benchmarks-delivering-unmatched-performance-and-efficiency\/","title":{"rendered":"NVIDIA Blackwell Raises Bar in New InferenceMAX Benchmarks, Delivering Unmatched Performance and Efficiency"},"content":{"rendered":"<div>\n\t\t<span class=\"bsf-rt-reading-time\"><span class=\"bsf-rt-display-label\"><\/span> <span class=\"bsf-rt-display-time\"><\/span> <span class=\"bsf-rt-display-postfix\"><\/span><\/span><\/p>\n<ul>\n<li>NVIDIA Blackwell swept the new SemiAnalysis InferenceMAX v1 benchmarks, delivering the highest performance and best overall efficiency.<\/li>\n<li>InferenceMax v1 is the first independent benchmark to measure total cost of compute across diverse models and real-world scenarios.<\/li>\n<li>Best return on investment: NVIDIA GB200 NVL72 delivers unmatched AI factory economics \u2014 a $5 million investment generates $75 million in DSR1 token revenue, a 15x return on investment.<\/li>\n<li>Lowest total cost of ownership: NVIDIA B200 software optimizations achieve two cents per million tokens on gpt-oss, delivering 5x lower cost per token in just 2 months.<\/li>\n<li>Best throughput and interactivity: NVIDIA B200 sets the pace with 60,000 tokens per second per GPU and 1,000 tokens per second per user on gpt-oss with the latest NVIDIA TensorRT-LLM stack.<\/li>\n<\/ul>\n<p>As AI shifts from one-shot answers to complex reasoning, the demand for <a 
target=\"_blank\" href=\"https:\/\/www.nvidia.com\/en-us\/glossary\/ai-inference\/\" rel=\"noopener\">inference<\/a> \u2014 and the economics behind it \u2014 is exploding.<\/p>\n<p>The new independent InferenceMAX v1 benchmarks are the first to measure total cost of compute across real-world scenarios. The results? The <a target=\"_blank\" href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks\/\" rel=\"noopener\">NVIDIA Blackwell platform swept the field<\/a> \u2014 delivering unmatched performance and best overall efficiency for <a target=\"_blank\" href=\"https:\/\/www.nvidia.com\/en-us\/solutions\/ai-factories\/\" rel=\"noopener\">AI factories<\/a>.<\/p>\n<p>\u00a0<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-85689\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2025\/10\/AI-Factory-15x-Perf-ROI-r8-fixed.png\" alt=\"\" width=\"1000\" height=\"571\"><\/p>\n<p><b>A $5 million investment in an NVIDIA GB200 NVL72 system can generate $75 million in token revenue.<\/b> <b>That\u2019s a 15x return on investment (ROI)<\/b> \u2014 the new economics of inference.<\/p>\n<p>\u201cInference is where AI delivers value every day,\u201d said Ian Buck, vice president of hyperscale and high-performance computing at NVIDIA. \u201cThese results show that NVIDIA\u2019s full-stack approach gives customers the performance and efficiency they need to deploy AI at scale.\u201d<\/p>\n<h2>Enter InferenceMAX v1<\/h2>\n<p>InferenceMAX v1, a new benchmark from SemiAnalysis released Monday, is the latest to highlight Blackwell\u2019s inference leadership. It runs popular models across leading platforms, measures performance for a wide range of use cases and publishes results anyone can verify.<\/p>\n<p>Why do benchmarks like this matter?<\/p>\n<p>Because modern AI isn\u2019t just about raw speed \u2014 it\u2019s about efficiency and economics at scale. 
As models shift from one-shot replies to multistep reasoning and tool use, they generate far more <a href=\"https:\/\/blogs.nvidia.com\/blog\/ai-tokens-explained\/\">tokens<\/a> per query, dramatically increasing compute demands.<\/p>\n<p>NVIDIA\u2019s open-source collaborations with OpenAI (<a target=\"_blank\" href=\"https:\/\/build.nvidia.com\/openai\/gpt-oss-120b\" rel=\"noopener\">gpt-oss 120B<\/a>), Meta (<a target=\"_blank\" href=\"https:\/\/build.nvidia.com\/meta\/llama-3_3-70b-instruct\" rel=\"noopener\">Llama 3.3 70B<\/a>), and DeepSeek AI (<a target=\"_blank\" href=\"https:\/\/build.nvidia.com\/deepseek-ai\/deepseek-r1\" rel=\"noopener\">DeepSeek R1<\/a>) highlight how community-driven models are advancing state-of-the-art reasoning and efficiency.<\/p>\n<p>Partnering with these leading model builders and the open-source community, NVIDIA ensures the latest models are optimized for the world\u2019s largest AI inference infrastructure. These efforts reflect a broader commitment to open ecosystems \u2014 where shared innovation accelerates progress for everyone.<\/p>\n<p>Deep collaborations with the FlashInfer, SGLang and vLLM communities enable codeveloped kernel and runtime enhancements that power these models at scale.<\/p>\n<h2>Software Optimizations Deliver Continued Performance Gains<\/h2>\n<p>NVIDIA continuously improves performance through hardware and software codesign optimizations. 
Initial gpt-oss-120b performance on an NVIDIA DGX Blackwell B200 system with the <a target=\"_blank\" href=\"https:\/\/docs.nvidia.com\/tensorrt-llm\/index.html\" rel=\"noopener\">NVIDIA TensorRT LLM<\/a> library was market-leading, but NVIDIA\u2019s teams and the community have significantly optimized TensorRT LLM for open-source large language models.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-85661\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2025\/10\/image2.png\" alt=\"\" width=\"1000\" height=\"563\"><\/p>\n<p>The <a target=\"_blank\" href=\"https:\/\/developer.nvidia.com\/tensorrt-llm\" rel=\"noopener\">TensorRT LLM v1.0 release<\/a> is a major breakthrough in making large AI models faster and more responsive for everyone.<\/p>\n<p>Through advanced parallelization techniques, it uses the B200 system and <a target=\"_blank\" href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/nvlink\/\" rel=\"noopener\">NVIDIA NVLink Switch<\/a>\u2019s 1,800 GB\/s bidirectional bandwidth to dramatically improve the performance of the gpt-oss-120b model.<\/p>\n<p>The innovation doesn\u2019t stop there. 
The newly released gpt-oss-120b-Eagle3-v2 model introduces <a target=\"_blank\" href=\"https:\/\/developer.nvidia.com\/blog\/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference\/\" rel=\"noopener\">speculative decoding<\/a>, a clever method that predicts multiple tokens at a time.<\/p>\n<p>This reduces lag and delivers even quicker results, tripling throughput at 100 tokens per second per user (TPS\/user) \u2014 boosting per-GPU speeds from 6,000 to 30,000 tokens.<\/p>\n<p>For dense AI models like Llama 3.3 70B, which demand significant computational resources due to their large parameter count and the fact that all parameters are utilized simultaneously during inference, NVIDIA Blackwell B200 sets a new performance standard in InferenceMAX v1 benchmarks.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-85658\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2025\/10\/image1.png\" alt=\"\" width=\"1000\" height=\"563\"><\/p>\n<p>Blackwell delivers over 10,000 TPS per GPU at 50 TPS per user interactivity \u2014 4x higher per-GPU throughput compared with the NVIDIA H200 GPU.<\/p>\n<h2>Performance Efficiency Drives Value<\/h2>\n<p>Metrics like tokens per watt, cost per million tokens and TPS\/user matter as much as throughput. In fact, for power-limited AI factories, Blackwell delivers <b>10x throughput per megawatt <\/b>compared with the previous generation, which translates into higher token revenue.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-85664\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2025\/10\/image3.png\" alt=\"\" width=\"1000\" height=\"563\"><\/p>\n<p>The cost per token is crucial for evaluating AI model efficiency, directly impacting operational expenses. 
The NVIDIA Blackwell architecture <b>lowered cost per million tokens by 15x <\/b>versus the previous generation, leading to substantial savings and fostering wider AI deployment and innovation.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-85673\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2025\/10\/image6.png\" alt=\"\" width=\"1000\" height=\"563\"><\/p>\n<h2>Multidimensional Performance<\/h2>\n<p>InferenceMAX uses the Pareto frontier \u2014 a curve that shows the best trade-offs between different factors, such as data center throughput and responsiveness \u2014 to map performance.<\/p>\n<p>But it\u2019s more than a chart. It reflects how NVIDIA Blackwell balances the full spectrum of production priorities: cost, energy efficiency, throughput and responsiveness. That balance enables the highest ROI across real-world workloads.<\/p>\n<p>Systems that optimize for just one mode or scenario may show peak performance in isolation, but those economics don\u2019t scale. Blackwell\u2019s full-stack design delivers efficiency and value where it matters most: in production.<\/p>\n<p>For a deeper look at how these curves are built \u2014 and why they matter for total cost of ownership and service-level agreement planning \u2014 check out this <a target=\"_blank\" href=\"https:\/\/developer.nvidia.com\/blog\/nvidia-blackwell-leads-on-new-semianalysis-inferencemax-benchmarks\/\" rel=\"noopener\">technical deep dive<\/a> for full charts and methodology.<\/p>\n<h2>What Makes It Possible?<\/h2>\n<p>Blackwell\u2019s leadership comes from extreme hardware-software codesign. 
It\u2019s a full-stack architecture built for speed, efficiency and scale:<\/p>\n<ul>\n<li><b>The Blackwell architecture features include:<\/b>\n<ul>\n<li><b>NVFP4<\/b> low-precision format for efficiency without loss of accuracy<\/li>\n<li><b>Fifth-generation<\/b> <b>NVIDIA NVLink <\/b>that connects 72 Blackwell GPUs to act as one giant GPU<\/li>\n<li><b>NVLink Switch<\/b>, which enables high concurrency through advanced tensor, expert and data parallel attention algorithms<\/li>\n<\/ul>\n<\/li>\n<li><b>Annual hardware cadence<\/b> plus continuous software optimization \u2014 NVIDIA has more than doubled Blackwell performance since launch using software alone<\/li>\n<li><b>NVIDIA TensorRT-LLM, <\/b><a target=\"_blank\" href=\"https:\/\/www.nvidia.com\/en-us\/ai\/dynamo\/\" rel=\"noopener\"><b>NVIDIA Dynamo<\/b><\/a><b>, SGLang and vLLM<\/b> open-source inference frameworks optimized for peak performance<\/li>\n<li><b>A massive ecosystem<\/b>, with hundreds of millions of GPUs installed, 7 million CUDA developers and contributions to over 1,000 open-source projects<\/li>\n<\/ul>\n<h2>The Bigger Picture<\/h2>\n<p>AI is moving from pilots to AI factories \u2014 infrastructure that manufactures intelligence by turning data into tokens and decisions in real time.<\/p>\n<p>Open, frequently updated benchmarks help teams make informed platform choices and tune for cost per token, latency service-level agreements and utilization across changing workloads.<\/p>\n<p><a href=\"https:\/\/blogs.nvidia.com\/blog\/think-smart-optimize-ai-factory-inference-performance\/\">NVIDIA\u2019s Think SMART framework helps enterprises navigate this shift<\/a>, spotlighting how NVIDIA\u2019s full-stack inference platform delivers real-world ROI \u2014 turning performance into 
profits.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/blogs.nvidia.com\/blog\/blackwell-inferencemax-benchmark-results\/<\/p>\n","protected":false},"author":0,"featured_media":4306,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/4305"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=4305"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/4305\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/4306"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=4305"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=4305"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=4305"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}