<h1>Striking Performance: Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows</h1>
<p><em>October 17, 2023 | Source: <a href="https://blogs.nvidia.com/blog/2023/10/17/tensorrt-llm-windows-stable-diffusion-rtx/">blogs.nvidia.com</a></em></p>
<p><a href="https://www.nvidia.com/en-us/glossary/data-science/generative-ai/" target="_blank" rel="noopener">Generative AI</a> is one of the most important trends in the history of personal computing, bringing advancements to gaming, creativity, video, productivity, development and more.</p>
<p>And <a href="https://www.nvidia.com/en-us/geforce/rtx/">GeForce RTX</a> and NVIDIA RTX GPUs, which are packed with dedicated AI processors called Tensor Cores, are bringing the power of generative AI natively to more than 100 million Windows PCs and workstations.</p>
<p>Today, generative AI on PC is getting up to 4x faster via <a href="https://developer.nvidia.com/tensorrt">TensorRT-LLM</a> for Windows, an open-source library that accelerates inference performance for the latest AI large language models, like Llama 2 and Code Llama. This follows the announcement of TensorRT-LLM for <a href="https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/">data centers</a> last month.</p>
<p>NVIDIA has also released tools to help developers accelerate their LLMs, including scripts that optimize custom models with TensorRT-LLM, TensorRT-optimized open-source models and a developer reference project that showcases both the speed and quality of LLM responses.</p>
<p>TensorRT acceleration is now available for Stable Diffusion in the popular Automatic1111 Web UI distribution. It speeds up the generative AI diffusion model by up to 2x over the previous fastest implementation.</p>
<p>Plus, <a href="https://blogs.nvidia.com/blog/2023/02/28/rtx-video-super-resolution/">RTX Video Super Resolution</a> (VSR) version 1.5 is available as part of today’s <a href="https://www.nvidia.com/en-us/geforce/news/game-ready-driver-dlss-3-naraka-vermintide-rtx-vsr">Game Ready Driver</a> release — and will be available in the next <a href="https://www.nvidia.com/en-us/studio/resources/">NVIDIA Studio Driver</a>, releasing early next month.</p>
<h2><b>Supercharging LLMs With TensorRT</b></h2>
<p>LLMs are fueling productivity — engaging in chat, summarizing documents and web content, drafting emails and blogs — and are at the core of new pipelines of AI and other software that can automatically analyze data and generate a vast array of content.</p>
<p>TensorRT-LLM, a library for accelerating LLM inference, gives developers and end users the benefit of LLMs that can now operate up to 4x faster on RTX-powered Windows PCs.</p>
<p>At higher batch sizes, this acceleration significantly improves the experience for more sophisticated LLM use — like writing and coding assistants that output multiple, unique auto-complete results at once. The result is accelerated performance and improved quality that lets users select the best of the bunch.</p>
<p>TensorRT-LLM acceleration is also beneficial when integrating LLM capabilities with other technology, such as retrieval-augmented generation (RAG), where an LLM is paired with a vector library or vector database. RAG enables the LLM to deliver responses grounded in a specific dataset, like user emails or articles on a website, to provide more targeted answers.</p>
<p>To show this in practical terms, when the question “How does NVIDIA ACE generate emotional responses?” was asked of the Llama 2 base model, it returned an unhelpful response.</p>
<figure id="attachment_67501" aria-describedby="caption-attachment-67501" class="wp-caption aligncenter"><a href="https://blogs.nvidia.com/wp-content/uploads/2023/10/ChatWithGeForceNewsACE.png"><img src="https://blogs.nvidia.com/wp-content/uploads/2023/10/ChatWithGeForceNewsACE.png" alt="" width="1858" height="919"></a><figcaption id="caption-attachment-67501" class="wp-caption-text">Better responses, faster.</figcaption></figure>
<p>Conversely, using RAG with recent <a href="https://www.nvidia.com/en-us/geforce/news/">GeForce news articles</a> loaded into a vector library and connected to the same Llama 2 model not only returned the correct answer — using NeMo SteerLM — but did so much quicker with TensorRT-LLM acceleration. This combination of speed and proficiency gives users smarter solutions.</p>
<p>TensorRT-LLM will soon be available to download from the <a href="https://developer.nvidia.com/">NVIDIA Developer</a> website. TensorRT-optimized open-source models and the RAG demo with GeForce news as a sample project are available at <a href="https://catalog.ngc.nvidia.com/">ngc.nvidia.com</a> and <a href="https://github.com/nvidia">GitHub.com/NVIDIA</a>.</p>
<h2><b>Automatic Acceleration</b></h2>
<p>Diffusion models, like Stable Diffusion, are used to imagine and create stunning, novel works of art. Image generation is an iterative process that can take hundreds of cycles to achieve the perfect output. When done on an underpowered computer, this iteration can add up to hours of wait time.</p>
<p>TensorRT is designed to accelerate AI models through layer fusion, precision calibration, kernel auto-tuning and other capabilities that significantly boost inference efficiency and speed. This makes it indispensable for real-time applications and resource-intensive tasks.</p>
<p>And now, TensorRT doubles the speed of Stable Diffusion.</p>
<p>Compatible with the most popular distribution, the Web UI from Automatic1111, Stable Diffusion with TensorRT acceleration helps users iterate faster and spend less time waiting on the computer, delivering a final image sooner. On a GeForce RTX 4090, it runs 7x faster than the top implementation on Macs with an Apple M2 Ultra. The extension is <a href="https://github.com/NVIDIA/Stable-Diffusion-WebUI-TensorRT">available for download</a> today.</p>
<p>The <a href="https://github.com/NVIDIA/TensorRT/tree/release/8.6/demo/Diffusion">TensorRT demo of a Stable Diffusion pipeline</a> provides developers with a reference implementation showing how to prepare diffusion models and accelerate them using TensorRT. It is the starting point for developers interested in turbocharging a diffusion pipeline and bringing lightning-fast inferencing to applications.</p>
<h2><b>Video That’s Super</b></h2>
<p>AI is improving everyday PC experiences for all users. Streaming video — from nearly any source, like YouTube, Twitch, Prime Video, Disney+ and countless others — is among the most popular activities on a PC. Thanks to AI and RTX, it’s getting another update in image quality.</p>
<p><a href="https://blogs.nvidia.com/blog/2023/02/28/rtx-video-super-resolution/">RTX VSR</a> is a breakthrough in AI pixel processing that improves the quality of streamed video content by reducing or eliminating artifacts caused by video compression. It also sharpens edges and details.</p>
<p>Available now, RTX VSR version 1.5 further improves visual quality with updated models, de-artifacts content played at its native resolution and adds support for RTX GPUs based on the NVIDIA Turing architecture — both professional RTX and GeForce RTX 20 Series GPUs.</p>
<p>Retraining the VSR AI model helped it learn to accurately distinguish between subtle details and compression artifacts. As a result, AI-enhanced images more accurately preserve details during the upscaling process. Finer details are more visible, and the overall image looks sharper and crisper.</p>
<figure id="attachment_67470" aria-describedby="caption-attachment-67470" class="wp-caption aligncenter"><a href="https://blogs.nvidia.com/wp-content/uploads/2023/10/VSR-v1.5-sbsbs-1.png"><img src="https://blogs.nvidia.com/wp-content/uploads/2023/10/VSR-v1.5-sbsbs-1-672x236.png" alt="" width="672" height="236"></a><figcaption id="caption-attachment-67470" class="wp-caption-text">RTX Video Super Resolution v1.5 improves detail and sharpness.</figcaption></figure>
<p>New with version 1.5 is the ability to de-artifact video played at the display’s native resolution. The original release only enhanced video when it was being upscaled. Now, for example, 1080p video streamed to a 1080p display will look smoother as heavy artifacts are reduced.</p>
<figure id="attachment_67473" aria-describedby="caption-attachment-67473" class="wp-caption aligncenter"><a href="https://blogs.nvidia.com/wp-content/uploads/2023/10/RTX-VSR.png"><img src="https://blogs.nvidia.com/wp-content/uploads/2023/10/RTX-VSR-672x377.png" alt="" width="672" height="377"></a><figcaption id="caption-attachment-67473" class="wp-caption-text">RTX VSR now de-artifacts video played at its native resolution.</figcaption></figure>
<p>RTX VSR 1.5 is available today for all RTX users in the latest Game Ready Driver. It will also be included in the upcoming NVIDIA Studio Driver, scheduled for early next month.</p>
<p>RTX VSR is among the NVIDIA software, tools, libraries and SDKs — like those mentioned above, plus DLSS, Omniverse, AI Workbench and others — that have helped bring over 400 AI-enabled apps and games to consumers.</p>
<p>The AI era is upon us. And RTX is supercharging it at every step in its evolution.</p>
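<p>As a supplement, the RAG workflow discussed above (an LLM paired with a vector library so that answers are grounded in a specific dataset) can be sketched in a few lines. The snippet below is a hypothetical toy illustration, not NVIDIA's demo project: a bag-of-words "embedding" and cosine similarity stand in for a real embedding model and vector database, and in practice the assembled prompt would be handed to a TensorRT-LLM-accelerated model rather than printed.</p>

```python
# Toy retrieval-augmented generation (RAG) sketch.
# Assumptions: embed(), cosine(), retrieve() and build_prompt() are
# illustrative stand-ins for a real embedding model + vector database.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Ground the LLM's answer in the retrieved context."""
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Tiny stand-in corpus, echoing the GeForce-news example in the article.
docs = [
    "NVIDIA ACE generates emotional responses using NeMo SteerLM.",
    "RTX Video Super Resolution reduces compression artifacts.",
]
prompt = build_prompt("How does NVIDIA ACE generate emotional responses?", docs)
print(prompt)
```

<p>The only structural difference from a production setup is scale: a real pipeline swaps the bag-of-words vectors for learned embeddings, the list scan for an approximate-nearest-neighbor index, and the <code>print</code> for a call into the accelerated LLM.</p>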