# NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models

*June 15, 2024*

NVIDIA today announced Nemotron-4 340B, a family of open models that developers can use to generate synthetic data for training large language models (LLMs) for commercial applications across healthcare, finance, manufacturing, retail and every other industry.

High-quality training data plays a critical role in the performance, accuracy and quality of responses from a custom LLM, but robust datasets can be prohibitively expensive and difficult to access.

Through a uniquely permissive [open model license](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf), Nemotron-4 340B gives developers a free, scalable way to generate synthetic data that can help build powerful LLMs.

The Nemotron-4 340B family includes base, instruct and reward models that form a pipeline for generating the synthetic data used to train and refine LLMs.
The models are optimized to work with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo), an open-source framework for end-to-end model training, including data curation, customization and evaluation. They're also optimized for inference with the open-source [NVIDIA TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) library.

Nemotron-4 340B can be downloaded now from [Hugging Face](https://huggingface.co/collections/nvidia/nemotron-4-340b-666b7ebaf1b3867caf2f1911). Developers will soon be able to access the models at [ai.nvidia.com](http://ai.nvidia.com/), where they'll be packaged as an [NVIDIA NIM](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/) microservice with a standard application programming interface that can be deployed anywhere.

## Navigating Nemotron to Generate Synthetic Data

LLMs can help developers generate synthetic training data in scenarios where access to large, diverse labeled datasets is limited.

The [Nemotron-4 340B Instruct](https://huggingface.co/nvidia/Nemotron-4-340B-Instruct) model creates diverse synthetic data that mimics the characteristics of real-world data, helping improve data quality and so increase the performance and robustness of custom LLMs across various domains.

Then, to boost the quality of the AI-generated data, developers can use the [Nemotron-4 340B Reward](https://huggingface.co/nvidia/Nemotron-4-340B-Reward) model to filter for high-quality responses.
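The generate-then-filter loop described above can be sketched in a few lines of plain Python. Note that `generate_response` and `score_response` are hypothetical stand-ins, not the actual NeMo or NIM API; in practice they would call the Instruct and Reward models through whatever serving endpoint you deploy.

```python
# Minimal sketch of a synthetic data pipeline: generate candidate
# responses, score them with a reward model, keep only the best.
# Both model calls are stubbed out here for illustration.

def generate_response(prompt: str) -> str:
    # Placeholder: in practice, call Nemotron-4 340B Instruct here.
    return f"Synthetic answer to: {prompt}"

def score_response(prompt: str, response: str) -> float:
    # Placeholder: the real Reward model returns per-attribute scores
    # (helpfulness, correctness, coherence, complexity, verbosity);
    # we fake a single scalar in [0, 1] for illustration.
    return min(len(response) / 40.0, 1.0)

def build_synthetic_dataset(prompts, threshold=0.5):
    """Keep only (prompt, response) pairs whose reward clears a threshold."""
    kept = []
    for prompt in prompts:
        response = generate_response(prompt)
        if score_response(prompt, response) >= threshold:
            kept.append({"prompt": prompt, "response": response})
    return kept

dataset = build_synthetic_dataset(["What is tensor parallelism?", "Hi"])
```

The threshold (and, with real per-attribute scores, how the five attributes are weighted) is the main quality/quantity trade-off in such a pipeline.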
Nemotron-4 340B Reward grades responses on five attributes: helpfulness, correctness, coherence, complexity and verbosity. It currently holds first place on the [Hugging Face RewardBench leaderboard](https://huggingface.co/spaces/allenai/reward-bench), created by [AI2](https://allenai.org/), which evaluates the capabilities, safety and pitfalls of reward models.

Figure: In this synthetic data generation pipeline, (1) the Nemotron-4 340B Instruct model is first used to produce synthetic text-based output.
An evaluator model, (2) Nemotron-4 340B Reward, then assesses this generated text, providing feedback that guides iterative improvements and ensures the synthetic data is accurate, relevant and aligned with specific requirements.

Researchers can also create their own instruct or reward models by customizing the [Nemotron-4 340B Base](https://huggingface.co/nvidia/Nemotron-4-340B-Base) model with their proprietary data, combined with the included [HelpSteer2 dataset](https://huggingface.co/datasets/nvidia/HelpSteer2).

## Fine-Tuning With NeMo, Optimizing for Inference With TensorRT-LLM

Using open-source NVIDIA NeMo and NVIDIA TensorRT-LLM, developers can optimize the efficiency of their instruct and reward models to generate synthetic data and to score responses.

All Nemotron-4 340B models are optimized with TensorRT-LLM to take advantage of tensor parallelism, a type of model parallelism in which individual weight matrices are split across multiple GPUs and servers, enabling efficient inference at scale.

Nemotron-4 340B Base, trained on 9 trillion tokens, can be customized using the NeMo framework to adapt to specific use cases or domains. This fine-tuning process benefits from extensive pretraining data and yields more accurate outputs for specific downstream tasks.

A variety of customization methods are available through the NeMo framework, including supervised fine-tuning and parameter-efficient fine-tuning methods such as low-rank adaptation (LoRA).

To boost model quality, developers can align their models with [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner) and datasets annotated by Nemotron-4 340B Reward.
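The low-rank adaptation idea mentioned above can be illustrated with a toy NumPy sketch (this is the general LoRA technique, not NeMo's actual implementation): the pretrained weight `W` stays frozen, and only a low-rank update `B @ A` is trained, cutting trainable parameters dramatically.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 1024, 1024, 8             # r << d: the low-rank bottleneck
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection; zero-init
                                           # so the adapted layer starts out
                                           # identical to the pretrained one
alpha = 16.0                               # standard LoRA scaling factor

def lora_forward(x):
    # Only A and B would receive gradients during fine-tuning.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
full_params = W.size            # 1,048,576 if fine-tuning W directly
lora_params = A.size + B.size   # 16,384: 64x fewer trainable parameters
```

Because `B` is zero-initialized, the adapted layer reproduces the pretrained output exactly at step zero, which is part of why LoRA fine-tuning is stable.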
Alignment is a key step in training LLMs, where a model's behavior is fine-tuned using algorithms like reinforcement learning from human feedback (RLHF) to ensure its outputs are safe, accurate, contextually appropriate and consistent with its intended goals.

Businesses seeking enterprise-grade support and security for production environments can also access NeMo and TensorRT-LLM through the cloud-native [NVIDIA AI Enterprise](https://www.nvidia.com/en-us/data-center/products/ai-enterprise/) software platform, which provides accelerated and efficient runtimes for generative AI foundation models.

## Evaluating Model Security and Getting Started

The Nemotron-4 340B Instruct model underwent extensive safety evaluation, including adversarial tests, and performed well across a wide range of risk indicators. Users should still carefully evaluate the model's outputs to ensure the synthetically generated data is suitable, safe and accurate for their use case.

For more information on model security and safety evaluation, read the model card.

Download the Nemotron-4 340B models via [Hugging Face](https://huggingface.co/collections/nvidia/nemotron-4-340b-666b7ebaf1b3867caf2f1911).
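As background on the tensor parallelism mentioned in the fine-tuning section, here is a toy sketch of the core idea, simulated with NumPy arrays on one machine rather than real devices: each "GPU" holds a shard of a weight matrix, computes its share of the output, and the shards are gathered. Real deployments do this across physical GPUs with collective communication (gather or all-reduce), which TensorRT-LLM handles for you.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 64, 128
W = rng.standard_normal((d_out, d_in))  # full weight matrix
x = rng.standard_normal(d_in)

# Row-wise (output-dimension) split across two simulated devices:
# each "GPU" stores half the rows and computes half the output vector.
W_gpu0, W_gpu1 = np.split(W, 2, axis=0)
y_parallel = np.concatenate([W_gpu0 @ x, W_gpu1 @ x])  # gather the halves

y_serial = W @ x  # reference: the unsharded computation
```

Splitting along the input dimension instead would require summing partial results (an all-reduce) rather than concatenating them; production frameworks alternate the two schemes to minimize communication.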
For more details, read the [research papers on the model](https://research.nvidia.com/publication/2024-06_nemotron-4-340b) and [dataset](https://arxiv.org/abs/2406.08673).

*See [notice](https://www.nvidia.com/en-eu/about-nvidia/terms-of-service/) regarding software product information.*