{"id":4437,"date":"2026-02-10T09:40:10","date_gmt":"2026-02-10T09:40:10","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2026\/02\/10\/scale-llm-fine-tuning-with-hugging-face-and-amazon-sagemaker-ai\/"},"modified":"2026-02-10T09:40:10","modified_gmt":"2026-02-10T09:40:10","slug":"scale-llm-fine-tuning-with-hugging-face-and-amazon-sagemaker-ai","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2026\/02\/10\/scale-llm-fine-tuning-with-hugging-face-and-amazon-sagemaker-ai\/","title":{"rendered":"Scale LLM fine-tuning with Hugging Face and Amazon SageMaker AI"},"content":{"rendered":"<div id=\"\">\n<p>Enterprises are increasingly shifting from relying solely on large, general-purpose language models to developing specialized large language models (LLMs) fine-tuned on their own proprietary data. Although foundation models (FMs) offer impressive general capabilities, they often fall short when applied to the complexities of enterprise environments\u2014where accuracy, security, compliance, and domain-specific knowledge are non-negotiable.<\/p>\n<p>To meet these demands, organizations are adopting cost-efficient models tailored to their internal data and workflows. By fine-tuning on proprietary documents and domain-specific terminology, enterprises are building models that understand their unique context\u2014resulting in more relevant outputs, tighter data governance, and simpler deployment across internal tools.<\/p>\n<p>This shift is also a strategic move to reduce operational costs, improve inference latency, and maintain greater control over data privacy. 
As a result, enterprises are redefining their AI strategy around customized, right-sized models aligned to their business needs.<\/p>\n<p>Scaling LLM fine-tuning for enterprise use cases presents real technical and operational hurdles, which are being overcome through the partnership between <a href=\"https:\/\/huggingface.co\/\" target=\"_blank\" rel=\"noopener\">Hugging Face<\/a> and <a href=\"https:\/\/aws.amazon.com\/sagemaker\/ai\/\" target=\"_blank\" rel=\"noopener\">Amazon SageMaker AI<\/a>.<\/p>\n<p>Many organizations face fragmented toolchains and rising complexity when adopting advanced fine-tuning techniques like <a href=\"https:\/\/huggingface.co\/docs\/diffusers\/training\/lora\" target=\"_blank\" rel=\"noopener\">Low-Rank Adaptation (LoRA)<\/a>, <a href=\"https:\/\/huggingface.co\/docs\/diffusers\/v0.36.0\/en\/quantization\/bitsandbytes#4-bit-qlora-algorithm\" target=\"_blank\" rel=\"noopener\">QLoRA<\/a>, and Reinforcement Learning from Human Feedback (RLHF). Additionally, the resource demands of large model training\u2014including memory limitations and distributed infrastructure challenges\u2014often slow down innovation and strain internal teams.<\/p>\n<p>To overcome this, SageMaker AI and Hugging Face have joined forces to simplify and scale model customization. 
By integrating the Hugging Face Transformers libraries into SageMaker\u2019s fully managed infrastructure, enterprises can now:<\/p>\n<ul>\n<li>Run distributed fine-tuning jobs out of the box, with built-in support for parameter-efficient tuning methods<\/li>\n<li>Use optimized compute and storage configurations that reduce training costs and improve GPU utilization<\/li>\n<li>Accelerate time to value by using familiar open source libraries in a production-grade environment<\/li>\n<\/ul>\n<p>This collaboration helps businesses focus on building domain-specific, right-sized LLMs, unlocking AI value faster while maintaining full control over their data and models.<\/p>\n<p>In this post, we show how this integrated approach transforms enterprise LLM fine-tuning from a complex, resource-intensive challenge into a streamlined, scalable solution for achieving better model performance in domain-specific applications. We use the <a href=\"https:\/\/huggingface.co\/meta-llama\/Llama-3.1-8B\" target=\"_blank\" rel=\"noopener\">meta-llama\/Llama-3.1-8B<\/a> model, and execute a Supervised Fine-Tuning (SFT) job to improve the model\u2019s reasoning capabilities on the <a href=\"https:\/\/huggingface.co\/datasets\/UCSC-VLAA\/MedReason\" target=\"_blank\" rel=\"noopener\">MedReason<\/a> dataset by using distributed training and optimization techniques, such as Fully-Sharded Data Parallel (FSDP) and LoRA with the Hugging Face Transformers library, executed with Amazon SageMaker Training Jobs.<\/p>\n<h2>Understanding the core concepts<\/h2>\n<p>The Hugging Face Transformers library is an open-source toolkit designed to fine-tune LLMs by enabling seamless experimentation and deployment with popular transformer models.<\/p>\n<p>The Transformers library supports a variety of methods for aligning LLMs to specific objectives, including:<\/p>\n<ul>\n<li>Thousands of pre-trained models \u2013 Access to a vast collection of models like BERT, Meta Llama, Qwen, T5, and more, which can be 
used for tasks such as text classification, translation, summarization, question answering, object detection, and speech recognition.<\/li>\n<li>Pipelines API \u2013 Simplifies common tasks (such as sentiment analysis, summarization, and image segmentation) by handling tokenization, inference, and output formatting in a single call.<\/li>\n<li>Trainer API \u2013 Provides a high-level interface for training and fine-tuning models, supporting features like mixed precision, distributed training, and integration with popular hardware accelerators.<\/li>\n<li>Tokenization tools \u2013 Efficient and flexible tokenizers for converting raw text into model-ready inputs, supporting multiple languages and formats.<\/li>\n<\/ul>\n<p>SageMaker Training Jobs is a fully managed, on-demand machine learning (ML) service that runs remotely on AWS infrastructure to train a model using your data, code, and chosen compute resources. This service abstracts away the complexities of provisioning and managing the underlying infrastructure, so you can focus on developing and fine-tuning your ML and foundation models. 
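As a concrete illustration of the tokenization and chat-formatting work that the Transformers library automates, the following plain-Python sketch approximates the Llama 3.1 chat layout that `apply_chat_template` emits. The helper name `format_llama31_chat` is ours, and the spacing is simplified; real templates are model-specific, so always prefer the tokenizer's own template in practice.

```python
# Illustrative only: a hand-rolled approximation of the Llama 3.1 chat
# format that tokenizer.apply_chat_template(..., tokenize=False) produces.
def format_llama31_chat(messages):
    parts = ["<|begin_of_text|>"]
    for m in messages:
        # Each turn is wrapped in header tokens and terminated with <|eot_id|>.
        parts.append(f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n")
        parts.append(m["content"])
        parts.append("<|eot_id|>")
    return "".join(parts)

msgs = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "Most sensitive test for H pylori?"},
]
prompt = format_llama31_chat(msgs)
```

This makes explicit why a dataset must be re-formatted per model: the special tokens and turn structure differ between model families, and the tokenizer's template encapsulates those details.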
Key capabilities offered by SageMaker training jobs are:<\/p>\n<ul>\n<li><strong>Fully managed<\/strong> \u2013 SageMaker handles resource provisioning, scaling, and management for your training jobs, so you don\u2019t need to manually set up servers or clusters.<\/li>\n<li><strong>Flexible input<\/strong> \u2013 You can use built-in algorithms, pre-built containers, or bring your own custom training scripts and Docker containers, to execute training workloads with most popular frameworks such as the Hugging Face Transformers library.<\/li>\n<li><strong>Scalable<\/strong> \u2013 It supports single-node or distributed training across multiple instances, making it suitable for both small and large-scale ML workloads.<\/li>\n<li><strong>Integration with multiple data sources<\/strong> \u2013 Training data can be stored in <a href=\"https:\/\/aws.amazon.com\/s3\/\" target=\"_blank\" rel=\"noopener\">Amazon Simple Storage Service<\/a> (Amazon S3), <a href=\"https:\/\/aws.amazon.com\/fsx\/\" target=\"_blank\" rel=\"noopener\">Amazon FSx<\/a>, and <a href=\"https:\/\/aws.amazon.com\/ebs\/\" target=\"_blank\" rel=\"noopener\">Amazon Elastic Block Store<\/a> (Amazon EBS), and output model artifacts are saved back to Amazon S3 after training is complete.<\/li>\n<li><strong>Customizable<\/strong> \u2013 You can specify hyperparameters, resource types (such as GPU or CPU instances), and other settings for each training job.<\/li>\n<li><strong>Cost-efficient options<\/strong> \u2013 Features like <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-managed-spot-training.html\" target=\"_blank\" rel=\"noopener\">managed Spot Instances<\/a>, <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/reserve-capacity-with-training-plans.html\" target=\"_blank\" rel=\"noopener\">flexible training plans<\/a>, and <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/train-heterogeneous-cluster.html\" target=\"_blank\" rel=\"noopener\">heterogeneous 
clusters<\/a> help optimize training costs.<\/li>\n<\/ul>\n<h2>Solution overview<\/h2>\n<p>The following diagram illustrates the solution workflow of using the Hugging Face Transformers library with a SageMaker Training job.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter size-full wp-image-123953\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/02\/04\/Picture1-2.png\" alt=\"\" width=\"936\" height=\"594\"><\/p>\n<p>The workflow consists of the following steps:<\/p>\n<ol>\n<li>The user prepares the dataset by formatting it with the specific prompt style used for the selected model.<\/li>\n<li>The user prepares the training script by using the Hugging Face Transformers library to start the training workload, by specifying the configuration for the distribution option selected, such as Distributed Data Parallel (DDP) or Fully-Sharded Data Parallel (FSDP).<\/li>\n<li>The user submits an API request to SageMaker AI, passing the location of the training script, the Hugging Face Training container URI, and the training configurations required, such as distribution algorithm, instance type, and instance count.<\/li>\n<li>SageMaker AI uses the training job launcher script to run the training workload on a managed compute cluster. Based on the selected configuration, SageMaker AI provisions the required infrastructure, orchestrates distributed training, and upon completion, automatically decommissions the cluster.<\/li>\n<\/ol>\n<p>This streamlined architecture delivers a fully managed user experience, helping you quickly develop your training code, define training parameters, and select your preferred infrastructure. 
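The API request in step 3 ultimately maps to the SageMaker `CreateTrainingJob` API. As a rough sketch (the job name, role ARN, S3 URIs, and image URI below are placeholders, not values from this post), the request a launcher assembles looks approximately like this:

```python
import json

# Illustrative request shape for the SageMaker CreateTrainingJob API.
# All ARNs, URIs, and names are placeholders.
request = {
    "TrainingJobName": "llama-3-1-8b-sft-demo",
    "RoleArn": "arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    "AlgorithmSpecification": {
        "TrainingImage": "<huggingface-training-image-uri>",
        "TrainingInputMode": "File",
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/datasets/train/",
                }
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},
    "ResourceConfig": {
        "InstanceType": "ml.p4d.24xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 100,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 18000},
}

# A boto3 client would submit it as:
# boto3.client("sagemaker").create_training_job(**request)
print(json.dumps(request["ResourceConfig"]))
```

The higher-level SDK classes shown later in this post construct and submit an equivalent request for you, which is why you only ever specify the script location, container URI, and compute configuration.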
SageMaker AI handles the end-to-end infrastructure management with a pay-as-you-go pricing model that bills only for the net training time in seconds.<\/p>\n<h2>Prerequisites<\/h2>\n<p>You must complete the following prerequisites before you can run the Meta Llama 3.1 8B fine-tuning notebook:<\/p>\n<ol>\n<li>Make the following quota increase requests for SageMaker AI. For this use case, you will need to request a minimum of 1 p4d.24xlarge instance (with 8 x NVIDIA A100 GPUs) and scale to more p4d.24xlarge instances (depending on time-to-train and cost-to-train trade-offs for your use case). To help determine the right cluster size for the fine-tuning workload, you can use tools like VRAM Calculator or \u201c<a href=\"https:\/\/huggingface.co\/spaces\/Vokturz\/can-it-run-llm\" target=\"_blank\" rel=\"noopener\">Can it run LLM<\/a>\u201d. On the <a href=\"https:\/\/docs.aws.amazon.com\/servicequotas\/latest\/userguide\/intro.html\" target=\"_blank\" rel=\"noopener\">Service Quotas<\/a> console, request the following SageMaker AI quotas:\n<ul>\n<li>P4D instances (<code>p4d.24xlarge<\/code>) for training job usage: 1<\/li>\n<\/ul>\n<\/li>\n<li>Create an <a href=\"https:\/\/aws.amazon.com\/iam\/\" target=\"_blank\" rel=\"noopener\">AWS Identity and Access Management<\/a> (IAM) <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/sagemaker-roles.html\" target=\"_blank\" rel=\"noopener\">role<\/a> with managed policies <code>AmazonSageMakerFullAccess<\/code> and <code>AmazonS3FullAccess<\/code> to give required access to SageMaker AI to run the examples.<\/li>\n<li>Assign the following policy as a trust relationship to your IAM role:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-json\">{\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Sid\": \"\",\n            \"Effect\": \"Allow\",\n          
  \"Principal\": {\n                \"Service\": [\n                    \"sagemaker.amazonaws.com\"\n                ]\n            },\n            \"Action\": \"sts:AssumeRole\"\n        }\n    ]\n}\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>(Optional) Create an <a href=\"https:\/\/aws.amazon.com\/sagemaker\/ai\/studio\/\" target=\"_blank\" rel=\"noopener\">Amazon SageMaker Studio<\/a> domain (refer to <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/onboard-quick-start.html\" target=\"_blank\" rel=\"noopener\">Use quick setup for Amazon SageMaker AI<\/a>) to access <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/studio-updated-jl-user-guide.html\" target=\"_blank\" rel=\"noopener\">Jupyter notebooks<\/a> with the preceding role. You can also use JupyterLab in your local setup<\/li>\n<\/ol>\n<p>These permissions grant broad access and are not recommended for use in production environments. See the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener\">SageMaker Developer Guide<\/a> for guidance on defining more fine-grained permissions.<\/p>\n<h2>Prepare the dataset<\/h2>\n<p>To prepare the dataset, you must load the <a href=\"https:\/\/huggingface.co\/datasets\/UCSC-VLAA\/MedReason\" target=\"_blank\" rel=\"noopener\">UCSC-VLAA\/MedReason<\/a> dataset. MedReason is a large-scale, high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in LLMs. The following table shows an example of the data.<\/p>\n<table border=\"1\">\n<thead>\n<tr>\n<th>dataset_name<\/th>\n<th>id_in_dataset<\/th>\n<th>question<\/th>\n<th>answer<\/th>\n<th>reasoning<\/th>\n<th>options<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>medmcqa<\/td>\n<td>7131<\/td>\n<td>Urogenital Diaphragm is made up of the following\u2026<\/td>\n<td>Colle\u2019s fascia. Explanation: Colle\u2019s fascia do\u2026<\/td>\n<td>Finding reasoning paths:n1. 
Urogenital diaphr\u2026<\/td>\n<td>Answer Choices:nA. Deep transverse Perineusn\u2026<\/td>\n<\/tr>\n<tr>\n<td>medmcqa<\/td>\n<td>7133<\/td>\n<td>Child with Type I Diabetes. What is the advise\u2026<\/td>\n<td>After 5 years. Explanation: Screening for diab\u2026<\/td>\n<td>**Finding reasoning paths:**nn1. Type 1 Diab\u2026<\/td>\n<td>Answer Choices:nA. After 5 yearsnB. After 2 \u2026<\/td>\n<\/tr>\n<tr>\n<td>medmcqa<\/td>\n<td>7134<\/td>\n<td>Most sensitive test for H pylori is-<\/td>\n<td>\n<p>Biopsy urease test. Explanation:<\/p>\n<p>Davidson&amp;\u2026<\/p>\n<\/td>\n<td>**Finding reasoning paths:**nn1. Consider th\u2026<\/td>\n<td>Answer Choices:nA. Fecal antigen testnB. Bio\u2026<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We want to use the following columns for preparing our dataset:<\/p>\n<ul>\n<li><strong>question<\/strong> \u2013 The question being posed<\/li>\n<li><strong>answer<\/strong> \u2013 The correct answer to the question<\/li>\n<li><strong>reasoning<\/strong> \u2013 A detailed, step-by-step logical explanation of how to arrive at the correct answer<\/li>\n<\/ul>\n<p>We can use the following steps to format the input in the proper style used for Meta Llama 3.1, and configure the data channels for SageMaker training jobs on Amazon S3:<\/p>\n<ol>\n<li>Load the UCSC-VLAA\/MedReason dataset, using the first 10,000 rows of the original dataset:\n<pre><code class=\"lang-python\">from datasets import load_dataset\ndataset = load_dataset(\"UCSC-VLAA\/MedReason\", split=\"train[:10000]\")<\/code><\/pre>\n<\/li>\n<li>Apply the proper chat template to the dataset by using the <a href=\"https:\/\/huggingface.co\/learn\/llm-course\/chapter11\/2\" target=\"_blank\" rel=\"noopener\"><code>apply_chat_template<\/code><\/a> method of the Tokenizer:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from transformers import AutoTokenizer\n\ntokenizer = AutoTokenizer.from_pretrained(model_id)\n\ndef 
prepare_dataset(sample):\n\n    system_text = (\n        \"You are a deep-thinking AI assistant.\\n\\n\"\n        \"For every user question, first write your thoughts and reasoning inside ... tags, then provide your answer.\"\n    )\n\n    messages = []\n\n    messages.append({\"role\": \"system\", \"content\": system_text})\n    messages.append({\"role\": \"user\", \"content\": sample[\"question\"]})\n    messages.append(\n        {\n            \"role\": \"assistant\",\n            \"content\": f\"\\n{sample['reasoning']}\\n\\n{sample['answer']}\",\n        }\n    )\n\n    # Apply chat template\n    sample[\"text\"] = tokenizer.apply_chat_template(\n        messages, tokenize=False\n    )\n\n    return sample\n<\/code><\/pre>\n<\/p><\/div>\n<p>The function <code>prepare_dataset<\/code> iterates over the elements of the dataset, and uses <code>apply_chat_template<\/code> to produce a prompt of the following form:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-yaml\">system\n{{SYSTEM_PROMPT}}\nuser\n{{QUESTION}}\nassistant\n\n{{REASONING}}\n\n\n{{FINAL_ANSWER}}\n<\/code><\/pre>\n<\/p><\/div>\n<p>The following code is an example of the formatted prompt:<\/p>\n<div class=\"hide-language\">\n<pre><code class=\"lang-yaml\">&lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;system&lt;|end_header_id|&gt; \nYou are a deep-thinking AI assistant. \nFor every user question, first write your thoughts and reasoning inside ... tags, then provide your answer.\n&lt;|eot_id|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt; \nA 66-year-old man presents to the emergency room with blurred vision, lightheadedness, and chest pain that started 30 minutes ago. The patient is awake and alert. \nHis history is significant for uncontrolled hypertension, coronary artery disease, and he previously underwent percutaneous coronary intervention. \nHe is afebrile. 
The heart rate is 102\/min, the blood pressure is 240\/135 mm Hg, and the O2 saturation is 100% on room air. \nAn ECG is performed and shows no acute changes. A rapid intravenous infusion of a drug that increases peripheral venous capacitance is started. \nThis drug has an onset of action that is less than 1 minute with rapid serum clearance than necessitates a continuous infusion. What is the most severe side effect of this medication?\n&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt; \n \n### Finding Reasoning Paths: \n1. **Blurred vision, lightheadedness, and chest pain** \u2192 Malignant hypertension \u2192 Rapid IV antihypertensive therapy. \n2. **Uncontrolled hypertension and coronary artery disease** \u2192 Malignant hypertension \u2192 Rapid IV antihypertensive therapy. \n3. **Severe hypertension (BP 240\/135 mm Hg)** \u2192 Risk of end-organ damage \u2192 Malignant hypertension \u2192 Rapid IV antihypertensive therapy. \n4. **Chest pain and history of coronary artery disease** \u2192 Risk of myocardial ischemia \u2192 Malignant hypertension \u2192 Rapid IV antihypertensive therapy. --- \n\n### Reasoning Process: \n1. 
**Clinical Presentation and Diagnosis**:  - The patient presents with blurred vision...\n...\n \n\nCyanide poisoning\n&lt;|eot_id|&gt;&lt;|end_of_text|&gt;\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Split the dataset into train, validation, and test datasets:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from datasets import Dataset, DatasetDict\nfrom random import randint\n\ntrain_dataset = Dataset.from_pandas(train)\nval_dataset = Dataset.from_pandas(val)\ntest_dataset = Dataset.from_pandas(test)\n\ndataset = DatasetDict({\"train\": train_dataset, \"val\": val_dataset})\ntrain_dataset = dataset[\"train\"].map(\n    prepare_dataset, remove_columns=list(train_dataset.features)\n)\n\nval_dataset = dataset[\"val\"].map(\n    prepare_dataset, remove_columns=list(val_dataset.features)\n)\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Prepare the training and validation datasets for the SageMaker training job by saving them as JSON files and constructing the S3 paths where these files will be uploaded:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">...\n \ntrain_dataset.to_json(\".\/data\/train\/dataset.jsonl\")\nval_dataset.to_json(\".\/data\/val\/dataset.jsonl\")\n\n \ns3_client.upload_file(\n    \".\/data\/train\/dataset.jsonl\", bucket_name, f\"{input_path}\/train\/dataset.jsonl\"\n)\ns3_client.upload_file(\n    \".\/data\/val\/dataset.jsonl\", bucket_name, f\"{input_path}\/val\/dataset.jsonl\"\n)\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<\/ol>\n<h2>Prepare the training script<\/h2>\n<p>To fine-tune <a href=\"https:\/\/huggingface.co\/meta-llama\/Llama-3.1-8B\" target=\"_blank\" rel=\"noopener\">meta-llama\/Llama-3.1-8B<\/a> with a SageMaker Training job, we prepared the train.py file, which serves as the entry point of the training job to execute the fine-tuning workload.<\/p>\n<p>The training process can use <code>Trainer<\/code> or <code>SFTTrainer<\/code> classes to 
fine-tune our model, making it efficient to adapt a pre-trained model to specific tasks or domains.<\/p>\n<p>The <a href=\"https:\/\/huggingface.co\/docs\/transformers\/en\/main_classes\/trainer\"><code>Trainer<\/code><\/a> and <a href=\"https:\/\/huggingface.co\/docs\/trl\/sft_trainer\"><code>SFTTrainer<\/code><\/a> classes both facilitate model training with Hugging Face Transformers. The <code>Trainer<\/code> class is the standard high-level API for training and evaluating transformer models on a wide range of tasks, including text classification, sequence labeling, and text generation. The <code>SFTTrainer<\/code> is a subclass built specifically for supervised fine-tuning of LLMs, particularly for instruction-following or conversational tasks.<\/p>\n<p>To accelerate the model fine-tuning, we distribute the training workload by using the FSDP technique. It is an advanced parallelism technique designed to train large models that might not fit in the memory of a single GPU, with the following benefits:<\/p>\n<ul>\n<li><strong>Parameter sharding<\/strong> \u2013 Instead of replicating the entire model on each GPU, FSDP splits (shards) model parameters, optimizer states, and gradients across GPUs<\/li>\n<li><strong>Memory efficiency<\/strong> \u2013 By sharding, FSDP drastically reduces the memory footprint on each device, enabling training of larger models or larger batch sizes<\/li>\n<li><strong>Synchronization<\/strong> \u2013 During training, FSDP gathers only the necessary parameters for each computation step, then releases memory immediately after, further saving resources<\/li>\n<li><strong>CPU offload<\/strong> \u2013 Optionally, FSDP can offload some data to CPUs to save even more GPU memory<\/li>\n<\/ul>\n<ol>\n<li>In our example, we use the <code>Trainer<\/code> class and define the required <code>TrainingArguments<\/code> to execute the FSDP distributed 
workload:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from transformers import (\n    DataCollatorForLanguageModeling,\n    Trainer,\n    TrainingArguments,\n)\n\ntrainer = Trainer(\n    model=model,\n    train_dataset=train_ds,\n    eval_dataset=test_ds,  # may be None\n    args=TrainingArguments(\n        **training_args,\n    ),\n    callbacks=callbacks,\n    data_collator=DataCollatorForLanguageModeling(\n        tokenizer, mlm=False\n    )\n)\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>To further optimize the fine-tuning workload, we use the <a href=\"https:\/\/arxiv.org\/abs\/2305.14314\" target=\"_blank\" rel=\"noopener\">QLoRA<\/a> technique, which quantizes a pre-trained language model to 4 bits and attaches small Low-Rank Adapters, which are fine-tuned:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import torch\nfrom transformers import (\n    AutoModelForCausalLM,\n    AutoTokenizer,\n    BitsAndBytesConfig,\n)\n\n# Load the tokenizer\ntokenizer = AutoTokenizer.from_pretrained(script_args.model_id)\n\n# Define PAD token\ntokenizer.pad_token = tokenizer.eos_token\n\n# Configure quantization\nbnb_config = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_use_double_quant=True,\n    bnb_4bit_quant_type=\"nf4\",\n    bnb_4bit_compute_dtype=torch.bfloat16,\n    bnb_4bit_quant_storage=torch.bfloat16\n)\n\n# Load the model\nmodel = AutoModelForCausalLM.from_pretrained(\n    script_args.model_id,\n    trust_remote_code=True,\n    quantization_config=bnb_config,\n    use_cache=not training_args.gradient_checkpointing,\n    cache_dir=\"\/tmp\/.cache\",\n    **model_configs,\n)\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>The <code>script_args<\/code> and <code>training_args<\/code> are provided as hyperparameters for the SageMaker Training job in a configuration recipe <code>.yaml<\/code> file and parsed in the <code>train.py<\/code> file by using the 
<code>TrlParser<\/code> class provided by Hugging Face TRL:\n<div class=\"hide-language\">\n<pre><code class=\"lang-yaml\">model_id: \"meta-llama\/Llama-3.1-8B-Instruct\"      # Hugging Face model id\n# sagemaker specific parameters\noutput_dir: \"\/opt\/ml\/model\"                       # path to where SageMaker will upload the model \ncheckpoint_dir: \"\/opt\/ml\/checkpoints\/\"            # path to where SageMaker will upload the model checkpoints\ntrain_dataset_path: \"\/opt\/ml\/input\/data\/train\/\"   # path to where S3 saves train dataset\nval_dataset_path: \"\/opt\/ml\/input\/data\/val\/\"       # path to where S3 saves validation dataset\nsave_steps: 100                                   # Save checkpoint every this many steps\ntoken: \"\"\n# training parameters\nlora_r: 32\nlora_alpha: 64\nlora_dropout: 0.05\nlearning_rate: 2e-4                    # initial learning rate\nnum_train_epochs: 2                    # number of training epochs\nper_device_train_batch_size: 4         # batch size per device during training\nper_device_eval_batch_size: 2          # batch size for evaluation\ngradient_accumulation_steps: 4         # number of steps before performing a backward\/update pass\ngradient_checkpointing: true           # use gradient checkpointing\nbf16: true                             # use bfloat16 precision\ntf32: false                            # use tf32 precision\nfsdp: \"full_shard auto_wrap offload\"   # FSDP configurations\nfsdp_config: \n    backward_prefetch: \"backward_pre\"\n    cpu_ram_efficient_loading: true\n    offload_params: true\n    forward_prefetch: false\n    use_orig_params: true\nwarmup_steps: 100\nweight_decay: 0.01\nmerge_weights: true                    # merge weights in the base model\n<\/code><\/pre>\n<\/p><\/div>\n<p>For the implemented use case, we decided to fine-tune the adapter with the following values:<\/p>\n<ul>\n<li><strong>lora_r<\/strong>: 32 \u2013 Allows the adapter to capture more complex 
reasoning transformations.<\/li>\n<li><strong>lora_alpha<\/strong>: 64 \u2013 Given the reasoning task we are trying to improve, this value allows the adapter to have a significant impact on the base model.<\/li>\n<li><strong>lora_dropout<\/strong>: 0.05 \u2013 Keeps dropout low to avoid breaking important reasoning connections.<\/li>\n<li><strong>warmup_steps<\/strong>: 100 \u2013 Gradually increases the learning rate to the specified value. For this reasoning task, we want the model to learn a new structure without forgetting the previous knowledge.<\/li>\n<li><strong>weight_decay<\/strong>: 0.01 \u2013 Maintains model generalization.<\/li>\n<\/ul>\n<\/li>\n<li>Prepare the configuration file for the SageMaker Training job by saving it as a YAML file and constructing the S3 path where it will be uploaded:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import os\n\nif default_prefix:\n    input_path = f\"{default_prefix}\/datasets\/llm-fine-tuning-modeltrainer-sft\"\nelse:\n    input_path = \"datasets\/llm-fine-tuning-modeltrainer-sft\"\n\ntrain_config_s3_path = f\"s3:\/\/{bucket_name}\/{input_path}\/config\/args.yaml\"\n\n# upload the model yaml file to s3\nmodel_yaml = \"args.yaml\"\ns3_client.upload_file(model_yaml, bucket_name, f\"{input_path}\/config\/args.yaml\")\nos.remove(\".\/args.yaml\")\n\nprint(\"Training config uploaded to:\")\nprint(train_config_s3_path)<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<\/ol>\n<h2>SFT training using a SageMaker Training job<\/h2>\n<p>To run a fine-tuning workload using the SFT training script and SageMaker Training jobs, we use the <a href=\"https:\/\/sagemaker.readthedocs.io\/en\/stable\/training\/index.html\" target=\"_blank\" rel=\"noopener\">ModelTrainer<\/a> class.<\/p>\n<p>The <code>ModelTrainer<\/code> class is a newer, more intuitive approach to model training that significantly enhances the user experience and 
supports distributed training, Build Your Own Container (BYOC), and recipes. For additional information, refer to the <a href=\"https:\/\/sagemaker.readthedocs.io\/en\/stable\/index.html\">SageMaker Python SDK documentation<\/a>.<\/p>\n<p>Set up the fine-tuning workload with the following steps:<\/p>\n<ol>\n<li>Specify the instance type and the container image for the training job:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from sagemaker import image_uris\n\ninstance_type = \"ml.p4d.24xlarge\"\ninstance_count = 1\n\nimage_uri = image_uris.retrieve(\n    framework=\"huggingface\",\n    region=sagemaker_session.boto_session.region_name,\n    version=\"4.56.2\",\n    base_framework_version=\"pytorch2.8.0\",\n    instance_type=instance_type,\n    image_scope=\"training\",\n)\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Define the source code configuration by pointing to the created <code>train.py<\/code>:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from sagemaker.train.configs import SourceCode\n\nsource_code = SourceCode(\n    source_dir=\".\/scripts\",\n    requirements=\"requirements.txt\",\n    entry_script=\"train.py\",\n)\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Configure the training compute by optionally providing the parameter <code>keep_alive_period_in_seconds<\/code> to use <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/train-warm-pools.html\" target=\"_blank\" rel=\"noopener\">managed warm pools<\/a>, to retain and reuse the cluster during the experimentation phase:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from sagemaker.train.configs import Compute\n\ncompute_configs = Compute(\n    instance_type=instance_type,\n    instance_count=instance_count,\n    keep_alive_period_in_seconds=0,\n)\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Create the <code>ModelTrainer<\/code> 
function by providing the required training setup, and define the argument <code>distributed=Torchrun()<\/code> to use torchrun as a launcher to execute the training job in a distributed manner across the available GPUs in the selected instance:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from sagemaker.train.configs import (\n    CheckpointConfig,\n    OutputDataConfig,\n    StoppingCondition,\n)\nfrom sagemaker.train.distributed import Torchrun\nfrom sagemaker.train.model_trainer import ModelTrainer\n\n\n# define Training Job Name\njob_name = f\"train-{model_id.split('\/')[-1].replace('.', '-')}-sft\"\n\n# define OutputDataConfig path\noutput_path = f\"s3:\/\/{bucket_name}\/{job_name}\"\n\n# Define the ModelTrainer\nmodel_trainer = ModelTrainer(\n    training_image=image_uri,\n    source_code=source_code,\n    base_job_name=job_name,\n    compute=compute_configs,\n    distributed=Torchrun(),\n    stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),\n    hyperparameters={\n        \"config\": \"\/opt\/ml\/input\/data\/config\/args.yaml\"  # path to TRL config which was uploaded to s3\n    },\n    output_data_config=OutputDataConfig(s3_output_path=output_path),\n    checkpoint_config=CheckpointConfig(\n        s3_uri=output_path + \"\/checkpoint\", local_path=\"\/opt\/ml\/checkpoints\"\n    ),\n) \n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Set up the input channels for the <code>ModelTrainer<\/code> by creating <code>InputData<\/code> objects from the provided S3 bucket paths for the training and validation dataset, and for the configuration parameters:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">from sagemaker.train.configs import InputData\n# Pass the input data\ntrain_input = InputData(\n    channel_name=\"train\",\n    data_source=train_dataset_s3_path, # S3 path where training data is stored\n)\nval_input = InputData(\n    
channel_name=\"val\",\n    data_source=val_dataset_s3_path, # S3 path where validation data is stored\n)\nconfig_input = InputData(\n    channel_name=\"config\",\n    data_source=train_config_s3_path, # S3 path where configurations are stored\n)\n# Collect the configured input channels\ndata = [train_input, val_input, config_input]\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Submit the training job:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">model_trainer.train(input_data_config=data, wait=False)<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<\/ol>\n<p>With Flash Attention 2, the training job takes approximately 18 minutes to complete for one epoch on a dataset of 10,000 samples.<\/p>\n<h2>Deploy and test fine-tuned Meta Llama 3.1 8B on SageMaker AI<\/h2>\n<p>To evaluate your fine-tuned model, you have several options. You can use an additional SageMaker Training job to evaluate the model with <a href=\"https:\/\/www.philschmid.de\/sagemaker-evaluate-llm-lighteval\" target=\"_blank\" rel=\"noopener\">Hugging Face Lighteval<\/a> on SageMaker AI, or you can deploy the model to a <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/realtime-endpoints.html\" target=\"_blank\" rel=\"noopener\">SageMaker real-time endpoint<\/a> and interactively test the model by using techniques like LLM-as-a-judge to compare generated content with ground truth content. 
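As an aside on the snippets above: the job, model, and endpoint names are all derived from the Hugging Face model ID with the same `split`/`replace` expression. A small helper (our own; not part of the SageMaker SDK) avoids repeating it:

```python
def sanitize_model_id(model_id: str, suffix: str = "") -> str:
    """Turn a Hugging Face model ID such as 'meta-llama/Llama-3.1-8B' into a
    SageMaker-safe resource name: keep the part after the slash and replace
    dots (not allowed in job or endpoint names) with dashes."""
    name = model_id.split("/")[-1].replace(".", "-")
    return f"{name}-{suffix}" if suffix else name

# Matches the inline expressions used for job_name and the endpoint name
job_name = f"train-{sanitize_model_id('meta-llama/Llama-3.1-8B')}-sft"
endpoint_name = sanitize_model_id("meta-llama/Llama-3.1-8B", "sft")
```

Centralizing the expression keeps the training job, model, endpoint configuration, and endpoint names consistent if the model ID changes.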
For a more comprehensive evaluation that demonstrates the impact of fine-tuning on model performance, you can use the <a href=\"https:\/\/github.com\/UCSC-VLAA\/MedReason#-evaluation\" target=\"_blank\" rel=\"noopener\">MedReason evaluation script<\/a> to compare the base meta-llama\/Llama-3.1-8B model with your fine-tuned version.<\/p>\n<p>In this example, we use the deployment approach, iterating over the test dataset and evaluating the model on those samples using a simple loop.<\/p>\n<ol>\n<li>Select the instance type and the container image for the endpoint:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import boto3\n\nsm_client = boto3.client(\"sagemaker\", region_name=sess.boto_region_name)\n\nimage_uri = \"763104351884.dkr.ecr.us-east-1.amazonaws.com\/vllm:0.13-gpu-py312\"\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Create the SageMaker Model using the <a href=\"https:\/\/github.com\/aws\/model-hosting-container-standards\/blob\/main\/docs\/sagemaker\/01_quickstart.md\" target=\"_blank\" rel=\"noopener\">container URI for vLLM<\/a> and the S3 path to your model. Set your vLLM configuration, including the number of GPUs and max input tokens. 
For a full list of configuration options, see <a href=\"https:\/\/docs.vllm.ai\/en\/latest\/configuration\/engine_args\/\" target=\"_blank\" rel=\"noopener\">vLLM engine arguments<\/a>.\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">env = {\n    \"SM_VLLM_MODEL\": \"\/opt\/ml\/model\",\n    \"SM_VLLM_DTYPE\": \"bfloat16\",\n    \"SM_VLLM_GPU_MEMORY_UTILIZATION\": \"0.8\",\n    \"SM_VLLM_MAX_MODEL_LEN\": json.dumps(1024 * 16),\n    \"SM_VLLM_MAX_NUM_SEQS\": \"1\",\n    \"SM_VLLM_ENABLE_CHUNKED_PREFILL\": \"true\",\n    \"SM_VLLM_KV_CACHE_DTYPE\": \"auto\",\n    \"SM_VLLM_TENSOR_PARALLEL_SIZE\": \"4\",\n}\n\nmodel_response = sm_client.create_model(\n    ModelName=f\"{model_id.split('\/')[-1].replace('.', '-')}-model\",\n    ExecutionRoleArn=role,\n    PrimaryContainer={\n        \"Image\": image_uri,\n        \"Environment\": env,\n        \"ModelDataSource\": {\n            \"S3DataSource\": {\n                \"S3Uri\": f\"s3:\/\/{bucket_name}\/{job_prefix}\/{job_name}\/output\/model.tar.gz\",\n                \"S3DataType\": \"S3Prefix\",\n                \"CompressionType\": \"Gzip\",\n            }\n        },\n    },\n)\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Create the <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/clarify-online-explainability-create-endpoint.html\" target=\"_blank\" rel=\"noopener\">endpoint configuration<\/a> by specifying the type and number of instances:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">instance_count = 1\ninstance_type = \"ml.g5.12xlarge\"\nhealth_check_timeout = 700\n\nendpoint_config_response = sm_client.create_endpoint_config(\n    EndpointConfigName=f\"{model_id.split('\/')[-1].replace('.', '-')}-config\",\n    ProductionVariants=[\n        {\n            \"VariantName\": \"AllTraffic\",\n            \"ModelName\": f\"{model_id.split('\/')[-1].replace('.', '-')}-model\",\n            \"InstanceType\": 
instance_type,\n            \"InitialInstanceCount\": instance_count,\n            \"ModelDataDownloadTimeoutInSeconds\": health_check_timeout,\n            \"ContainerStartupHealthCheckTimeoutInSeconds\": health_check_timeout,\n            \"InferenceAmiVersion\": \"al2-ami-sagemaker-inference-gpu-3-1\",\n        }\n    ],\n)\n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<li>Deploy the model:\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">endpoint_response = sm_client.create_endpoint(\n    EndpointName=f\"{model_id.split('\/')[-1].replace('.', '-')}-sft\", \n    EndpointConfigName=f\"{model_id.split('\/')[-1].replace('.', '-')}-config\",\n) \n<\/code><\/pre>\n<\/p><\/div>\n<\/li>\n<\/ol>\n<p>SageMaker AI will now create the endpoint and deploy the model to it. This can take 5\u201310 minutes. Afterwards, you can test the model by sending some example inputs to the endpoint. You can use the <code>invoke_endpoint<\/code> method of the <code>sagemaker-runtime<\/code> client to send the input to the model and get the output:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">import json\n\nimport boto3\nimport pandas as pd\n\n# sagemaker-runtime client and endpoint name used in the invocation loop below\npredictor = boto3.client(\"sagemaker-runtime\", region_name=sess.boto_region_name)\nendpoint_name = f\"{model_id.split('\/')[-1].replace('.', '-')}-sft\"\n\neval_dataset = []\n\nfor index, el in enumerate(test_dataset, 1):\n    print(\"Processing item \", index)\n\n    payload = {\n        \"messages\": [\n            {\n                \"role\": \"system\",\n                \"content\": \"You are a deep-thinking AI assistant.\\n\\nFor every user question, first write your thoughts and reasoning inside &lt;think&gt;...&lt;\/think&gt; tags, then provide your answer.\",\n            },\n            {\"role\": \"user\", \"content\": el[\"question\"]},\n        ],\n        \"max_tokens\": 4096,\n        \"stop\": [\"&lt;|eot_id|&gt;\", \"&lt;|end_of_text|&gt;\"],\n        \"temperature\": 0.4,\n        \"top_p\": 0.9,\n        \"repetition_penalty\": 1.15,\n        \"no_repeat_ngram_size\": 3,\n        \"do_sample\": True,\n    
}\n\n    response = predictor.invoke_endpoint(\n        EndpointName=endpoint_name,\n        ContentType=\"application\/json\",\n        Body=json.dumps(payload),\n    )\n\n    result = json.loads(response[\"Body\"].read().decode())\n    eval_dataset.append([el[\"question\"], result[\"choices\"][0][\"message\"][\"content\"]])\n\n    print(\"**********************************************\")\n\neval_dataset_df = pd.DataFrame(\n    eval_dataset, columns=[\"question\", \"answer\"]\n)\n\neval_dataset_df.to_json(\n    \".\/eval_dataset_results.jsonl\", orient=\"records\", lines=True\n)<\/code><\/pre>\n<\/p><\/div>\n<p>The following are some examples of generated answers:<\/p>\n<div class=\"hide-language\">\n<pre class=\"unlimited-height-code\"><code class=\"lang-python\">Question: \"Perl's stain or prussion blue test is for:\"\n\nAnswer Fine-tuned: \"\"\"\n&lt;think&gt;\nThe Perl's stain or Prussian blue test is used to detect the presence of iron in biological samples. \nIt involves adding potassium ferrocyanide (K4[Fe(CN)6]) to the sample, \nwhich reacts with the iron ions present in it to form a dark blue-colored compound known as ferric ferrocyanide. \nThis reaction can be observed visually, allowing researchers to determine if iron is present in the sample.\n&lt;\/think&gt;\n\nIn simpler terms, the Perl's stain or Prussian blue test is used to identify iron in biological samples.\n\"\"\"<\/code><\/pre>\n<\/p><\/div>\n<p>The fine-tuned model shows strong reasoning capabilities by providing structured, detailed explanations with clear thought processes, breaking down the concepts step-by-step before arriving at the final answer. 
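When scoring these responses programmatically, it helps to first separate the <code>&lt;think&gt;<\/code> reasoning from the final answer. The following is a minimal sketch (the helper is ours, not from the post), assuming the tag convention from the system prompt above:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning, answer) using the
    <think>...</think> convention from the fine-tuning system prompt."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        # No reasoning block found; treat the whole response as the answer
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

reasoning, answer = split_reasoning(
    "<think>Prussian blue forms when iron reacts with ferrocyanide.</think>\n"
    "The test detects iron."
)
```

The extracted answer can then be compared against the ground truth, for example with an LLM-as-a-judge prompt or a string-overlap metric, without the reasoning trace skewing the comparison.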
This example showcases the effectiveness of our fine-tuning approach using Hugging Face Transformers and a SageMaker Training job.<\/p>\n<h2>Clean up<\/h2>\n<p>To avoid incurring additional charges, clean up your resources by following these steps:<\/p>\n<ol>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/studio-updated-jl-admin-guide-clean-up.html\" target=\"_blank\" rel=\"noopener\">Delete any unused SageMaker Studio resources<\/a>.<\/li>\n<li>(Optional) <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/gs-studio-delete-domain.html\" target=\"_blank\" rel=\"noopener\">Delete the SageMaker Studio domain<\/a>.<\/li>\n<li>Verify that your training job is no longer running. To do so, on the SageMaker console, under <strong>Training<\/strong> in the navigation pane, choose <strong>Training jobs<\/strong>.<\/li>\n<li><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/realtime-endpoints-delete-resources.html\" target=\"_blank\" rel=\"noopener\">Delete the SageMaker endpoint<\/a>.<\/li>\n<\/ol>\n<h2>Conclusion<\/h2>\n<p>In this post, we demonstrated how enterprises can efficiently scale fine-tuning of both small and large language models by using the integration between the Hugging Face Transformers library and SageMaker Training jobs. 
This powerful combination transforms traditionally complex and resource-intensive processes into streamlined, scalable, and production-ready workflows.<\/p>\n<p>Using a practical example with the meta-llama\/Llama-3.1-8B model and the MedReason dataset, we showed how to apply advanced techniques like FSDP and LoRA to reduce training time and cost\u2014without compromising model quality.<\/p>\n<p>This solution highlights how enterprises can effectively address common LLM fine-tuning challenges such as fragmented toolchains, high memory and compute requirements, multi-node scaling inefficiencies, and GPU underutilization.<\/p>\n<p>By using the integrated Hugging Face and SageMaker architecture, businesses can now build and deploy customized, domain-specific models faster\u2014with greater control, cost-efficiency, and scalability.<\/p>\n<p>To get started with your own LLM fine-tuning project, explore the code samples provided in our <a href=\"https:\/\/github.com\/brunopistone\/amazon-sagemaker-generativeai\/blob\/main\/3_distributed_training\/models\/meta-llama-3.1-8b\/sft_llama_31_8b.ipynb\" target=\"_blank\" rel=\"noopener\">GitHub repository<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-thumbnail wp-image-123958\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/02\/04\/photo1-100x100.png\" alt=\"\" width=\"100\" height=\"100\"><strong>Florent Gbelidji<\/strong> is a Machine Learning Engineer for Customer Success at Hugging Face. Based in Paris, France, Florent joined Hugging Face 3.5 years ago as an ML Engineer in the Expert Acceleration Program, helping companies build solutions with open source AI. 
He is now the Cloud Partnership Tech Lead for the AWS account, driving integrations between the Hugging Face environment and AWS services.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-thumbnail wp-image-123959\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/02\/04\/photo2-100x148.jpg\" alt=\"\" width=\"100\" height=\"148\"><strong>Bruno Pistone<\/strong> is a Senior Worldwide Generative AI\/ML Specialist Solutions Architect at AWS based in Milan, Italy. He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and machine learning solutions that take full advantage of the AWS cloud and Amazon ML stack. His expertise includes distributed training and inference workloads, model customization, generative AI, and end-to-end ML. He enjoys spending time with friends, exploring new places, and traveling to new destinations.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-thumbnail wp-image-123960\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/02\/04\/photo3-100x133.png\" alt=\"\" width=\"100\" height=\"133\"><strong>Louise Ping<\/strong> is a Senior Worldwide GenAI Specialist, where she helps partners build go-to-market strategies and leads cross-functional initiatives to expand opportunities and drive adoption. Drawing from her diverse AWS experience across Storage, APN Partner Marketing, and AWS Marketplace, she works closely with strategic partners like Hugging Face to drive technical collaborations. 
When not working at AWS, she attempts home improvement projects\u2014ideally with limited mishaps.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-thumbnail wp-image-123961\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2026\/02\/04\/photo4-100x86.png\" alt=\"\" width=\"100\" height=\"86\"><strong>Safir Alvi<\/strong> is a Worldwide GenAI\/ML Go-To-Market Specialist at AWS based in New York. He focuses on advising strategic global customers on scaling their model training and inference workloads on AWS, and driving adoption of Amazon SageMaker AI Training Jobs and Amazon SageMaker HyperPod. He specializes in optimizing and fine-tuning generative AI and machine learning models across diverse industries, including financial services, healthcare, automotive, and manufacturing.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/scale-llm-fine-tuning-with-hugging-face-and-amazon-sagemaker-ai\/<\/p>\n","protected":false},"author":0,"featured_media":4438,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/4437"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=4437"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/4437\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/4438"}],"wp:attachment":[{"href":"https:\/\/sa
larydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=4437"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=4437"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=4437"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}