{"id":1872,"date":"2022-03-01T18:40:21","date_gmt":"2022-03-01T18:40:21","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/01\/train-175-billion-parameter-nlp-models-with-model-parallel-additions-and-hugging-face-on-amazon-sagemaker\/"},"modified":"2022-03-01T18:40:21","modified_gmt":"2022-03-01T18:40:21","slug":"train-175-billion-parameter-nlp-models-with-model-parallel-additions-and-hugging-face-on-amazon-sagemaker","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/01\/train-175-billion-parameter-nlp-models-with-model-parallel-additions-and-hugging-face-on-amazon-sagemaker\/","title":{"rendered":"Train 175+ billion parameter NLP models with model parallel additions and Hugging Face on Amazon SageMaker"},"content":{"rendered":"<div id=\"\">\n<p>The last few years have seen rapid development in the field of natural language processing (NLP). While hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly run into issues scaling their large language models across multiple GPUs.<\/p>\n<p>In this blog post, we briefly summarize the rise of large- and small-scale NLP models, primarily through the abstraction provided by Hugging Face and with the modular backend of Amazon SageMaker. In particular, we highlight the launch of four additional features within the SageMaker model parallel library that unlock\u00a0175 billion parameter NLP model pretraining and fine-tuning for customers.<\/p>\n<p>We used this library on the SageMaker training platform and achieved a throughput of 32\u00a0samples per second training a 175-billion-parameter model on 120 ml.p4d.24xlarge instances. 
We anticipate that if we scaled this up to 240 instances, the full model would take 25 days to train.<\/p>\n<p>For more information about model parallelism, see the paper <a href=\"https:\/\/arxiv.org\/pdf\/2111.05972.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training<\/a>.<\/p>\n<p>You can also see the GPT2 notebook we used to generate these performance numbers on our <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/blob\/main\/training\/distributed_training\/pytorch\/model_parallel\/gpt2\/smp-train-gpt-simple.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">GitHub repository<\/a>.<\/p>\n<p>To learn more about how to use the new features within SageMaker model parallel, refer to <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-parallel-extended-features-pytorch.html\" target=\"_blank\" rel=\"noopener noreferrer\">Extended Features of the SageMaker Model Parallel Library for PyTorch<\/a> and <a href=\"https:\/\/sagemaker.readthedocs.io\/en\/stable\/api\/training\/smd_model_parallel_general.html\" target=\"_blank\" rel=\"noopener noreferrer\">Use with the SageMaker Python SDK<\/a>.<\/p>\n<h2><strong>NLP on Amazon SageMaker \u2013 Hugging Face and model parallelism<\/strong><\/h2>\n<p>If you\u2019re new to Hugging Face and NLP, the biggest highlight you need to know is that applications using natural language processing (NLP) are starting to achieve human-level performance. This is largely driven by a learning mechanism called <a href=\"https:\/\/arxiv.org\/pdf\/1706.03762.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">attention<\/a>, which gave rise to a deep learning model called the <em>transformer<\/em>, which is much more scalable than previous sequential deep learning methods. 
The now-famous <a href=\"https:\/\/arxiv.org\/pdf\/1810.04805.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">BERT model<\/a> was developed to capitalize on the transformer, and introduced several useful NLP techniques along the way. Transformers, and the suite of BERT-inspired models both within and outside of NLP, <a href=\"https:\/\/blog.google\/products\/search\/search-language-understanding-bert\/\" target=\"_blank\" rel=\"noopener noreferrer\">are the primary engine behind your Google search results<\/a>, your <a href=\"https:\/\/ai.googleblog.com\/2020\/06\/recent-advances-in-google-translate.html\" target=\"_blank\" rel=\"noopener noreferrer\">Google Translate results<\/a>, and <a href=\"https:\/\/venturebeat.com\/2021\/12\/23\/open-source-nlp-is-fueling-a-new-wave-of-startups\/\" target=\"_blank\" rel=\"noopener noreferrer\">a host of new startups<\/a>.<\/p>\n<p>SageMaker and Hugging Face partnered to make this easier for customers than ever before. We\u2019ve launched Hugging Face deep learning containers (DLCs) for you to train and host pre-trained models directly from Hugging Face\u2019s <a href=\"https:\/\/huggingface.co\/models\" target=\"_blank\" rel=\"noopener noreferrer\">repository of over 26,000 models<\/a>. We\u2019ve launched <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/tree\/master\/sagemaker-training-compiler\/huggingface\" target=\"_blank\" rel=\"noopener noreferrer\">the SageMaker Training Compiler<\/a> for you to speed up the runtime of your Hugging Face training loops by up to 50%. 
We\u2019ve also integrated <a href=\"https:\/\/github.com\/huggingface\/transformers\" target=\"_blank\" rel=\"noopener noreferrer\">the Hugging Face flagship Transformers SDK <\/a>with <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/distributed-training.html\" target=\"_blank\" rel=\"noopener noreferrer\">our distributed training libraries<\/a>\u00a0to make scaling out your NLP models easier than ever before.<\/p>\n<p>For more information about Hugging Face Transformer models on Amazon SageMaker, see <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-parallel-extended-features-pytorch-hugging-face.html\" target=\"_blank\" rel=\"noopener noreferrer\">Support for Hugging Face Transformer Models.<\/a><\/p>\n<h2><strong>New features for large-scale NLP model training with the SageMaker model parallel library\u00a0<\/strong><\/h2>\n<p>At AWS re:Invent 2020, SageMaker launched distributed libraries that provide the best performance on the cloud for training computer vision models like <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/distributed-mask-rcnn-training-with-amazon-sagemakercv\/\" target=\"_blank\" rel=\"noopener noreferrer\">Mask-RCNN<\/a>\u00a0and NLP models like <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/aws-and-nvidia-achieve-the-fastest-training-times-for-mask-r-cnn-and-t5-3b\/\" target=\"_blank\" rel=\"noopener noreferrer\">T5-3B.<\/a>\u00a0This is possible through enhanced communication primitives that are 20-40% faster than NCCL on AWS, and model distribution techniques that enable extremely large language models to scale across tens to hundreds to thousands of GPUs.<\/p>\n<p>The SageMaker model parallel library (SMP) has always given you the ability to take your predefined NLP model in PyTorch, be that through Hugging Face or elsewhere, and partition that model onto multiple GPUs in your cluster. 
Said another way, SMP breaks up your model into smaller chunks so you don\u2019t experience out-of-memory (OOM) errors. We\u2019re pleased to add additional memory-saving techniques that are critical for large-scale models, namely:<\/p>\n<ul>\n<li>Tensor parallelism<\/li>\n<li>Optimizer state sharding<\/li>\n<li>Activation checkpointing<\/li>\n<li>Activation offloading<\/li>\n<\/ul>\n<p>You can combine these four features to utilize memory more efficiently and train the next generation of extreme-scale NLP models.<\/p>\n<h2><strong>Distributed training and tensor parallelism<\/strong><\/h2>\n<p>To understand tensor parallelism, it\u2019s helpful to know that there are many kinds of distributed training, or parallelism. You\u2019re probably already familiar with the most common type, <em>data parallelism.<\/em>\u00a0The core of data parallelism works like this: you add an extra node to your cluster, such as going from one to two EC2 instances in your SageMaker estimator. Then, you use a data parallel framework like Horovod, PyTorch Distributed Data Parallel, or SageMaker Distributed. This creates replicas of your model, one per accelerator, and handles sharding out the data to each node, along with bringing all the results together during the backpropagation step of your neural network. Think distributed gradient descent. Data parallelism is also popular within servers; you\u2019re sharding data into all the GPUs, and occasionally CPUs, on all of your nodes. 
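<\/p>\n<p>The data parallel step can be sketched without any particular framework. In the following minimal illustration, the toy dataset, learning rate, and helper names are ours and not part of any library; each replica computes gradients on its own shard, a stand-in for the all-reduce collective averages them, and every replica applies the identical update:<\/p>

```python
# Minimal, framework-free sketch of data parallelism (illustration only:
# the dataset, learning rate, and helper names here are hypothetical).
# Each replica holds an identical copy of the weight and sees its own shard.

def local_gradient(weight, shard):
    # Toy least-squares gradient for y = weight * x on this replica's shard.
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    # Stand-in for an all-reduce collective (as in NCCL or Horovod).
    return sum(values) / len(values)

def train_step(weights, shards, lr=0.01):
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    avg = allreduce_mean(grads)             # replicas exchange and average gradients
    return [w - lr * avg for w in weights]  # identical update on every replica

# Two replicas, each holding half of the data for the line y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0, 0.0]  # both replicas start from the same initialization
for _ in range(200):
    weights = train_step(weights, shards)
# The replicas stay identical to each other and converge toward 3.0.
```

<p>Because every replica applies the same averaged gradient, the model copies never drift apart; production libraries implement the same idea with fused, overlapping collectives. 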
The following diagram illustrates data parallelism.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image001.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33401\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image001.jpg\" alt=\"\" width=\"2304\" height=\"1056\"><\/a><\/p>\n<p><em>Model parallelism<\/em> is slightly different. Instead of making copies of the same model, we split your model into pieces. Then we manage running it, so your data is still flowing through your neural network in exactly the same way mathematically, but different pieces of your model are sitting on different GPUs. If you\u2019re using an ml.p3.8xlarge, you\u2019ve got four NVIDIA V100s, so you\u2019d probably want to shard your model into four pieces, one piece per GPU. If you jump up to two ml.p4d.24xlarge instances, that\u2019s 16 A100s total in your cluster, so you might break your model into 16 pieces. This is also sometimes called <em>pipeline parallelism.<\/em>\u00a0That\u2019s because the set of layers in the network is partitioned across GPUs, and run in a pipelined manner to maximize GPU utilization. The following diagram illustrates model parallelism.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image003.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33402\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image003.jpg\" alt=\"\" width=\"2304\" height=\"1166\"><\/a><\/p>\n<p>To make model parallelism happen at scale, we need a third type of distribution:<em> tensor parallelism<\/em>. 
Tensor parallelism applies the same concept one step further\u2014we break apart the largest layers of your neural network and place parts of the layers themselves on different devices. This is relevant when you\u2019re working with 175 billion parameters or more, and trying to fit even a few records into RAM, along with parts of your model, to train that transformer. The following diagram illustrates tensor parallelism.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image005.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33403\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image005.jpg\" alt=\"\" width=\"2304\" height=\"1162\"><\/a><\/p>\n<p>To enable <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-parallel-extended-features-pytorch-tensor-parallelism-examples.html\" target=\"_blank\" rel=\"noopener noreferrer\">tensor parallelism, set it within the smp options<\/a> you pass to your estimator.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image007.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33404\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image007.jpg\" alt=\"\" width=\"796\" height=\"269\"><\/a><\/p>\n<p>In the preceding code, <code>pipeline_parallel_degree<\/code>\u00a0describes how many segments your model should be sharded into, based on the pipeline parallelism we discussed above. Another word for this is <em>partitions<\/em>.<\/p>\n<p>To enable tensor parallelism, set <code>tensor_parallel_degree<\/code>\u00a0to your desired level. 
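<\/p>\n<p>As a concrete sketch, the options you pass to the estimator\u2019s <code>distribution<\/code> argument have roughly the following shape. The degree values below are hypothetical, sized for two ml.p4d.24xlarge instances (16 GPUs total); consult the SageMaker documentation for the authoritative parameter list:<\/p>

```python
# Hedged sketch of an SMP distribution configuration for a SageMaker PyTorch
# estimator. The degree values are hypothetical, sized for 2x ml.p4d.24xlarge.
smp_options = {
    "enabled": True,
    "parameters": {
        "pipeline_parallel_degree": 2,  # number of model partitions (pipeline stages)
        "tensor_parallel_degree": 4,    # split individual layers across 4 GPUs
        "ddp": True,                    # SMP relies on DDP underneath
    },
}

mpi_options = {
    "enabled": True,
    "processes_per_host": 8,  # one process per GPU on an ml.p4d.24xlarge
}

# Passed to the estimator, roughly as: PyTorch(..., distribution=distribution)
distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options,
}

# Whatever is left over after pipeline and tensor parallelism becomes the
# data parallel degree: 16 GPUs / (2 * 4) = 2 model replicas.
data_parallel_degree = (2 * 8) // (2 * 4)
```

<p>With this configuration, each model replica spans 8 GPUs (2 pipeline stages, each 4-way tensor parallel), and the two resulting replicas run data parallel. 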
Make sure you\u2019re picking a number equal to or smaller than the number of GPUs per instance, so no greater than 8 for the ml.p4d.24xlarge machines. For additional script changes, refer to <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-parallel-extended-features-pytorch-tensor-parallelism-examples.html\" target=\"_blank\" rel=\"noopener noreferrer\">Run a SageMaker Distributed Model Parallel Training Job with Tensor Parallelism<\/a>.<\/p>\n<p>The <code>ddp<\/code> parameter refers to distributed data parallel. You typically enable this if you\u2019re using data parallelism or tensor parallelism, because the model parallelism library relies on DDP for these features.<\/p>\n<h2><strong>Optimizer state sharding, activation checkpointing, and activation offloading<\/strong><\/h2>\n<p>If you have an extremely large model, you also need an extremely large optimizer state. Prepping your optimizer for SMP is straightforward: simply pick it up from disk in your script and load it into the <code>smp.DistributedOptimizer()<\/code> object.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image009.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33405\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image009.jpg\" alt=\"\" width=\"846\" height=\"135\"><\/a><\/p>\n<p>Make sure you enable this at the estimator by setting <code>shard_optimizer_state<\/code>\u00a0to <code>True<\/code> in the <code>smp_options<\/code> you use to configure SMP:<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image011.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33406\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image011.jpg\" alt=\"\" width=\"382\" 
height=\"39\"><\/a><\/p>\n<p>Similar to tensor and pipeline parallelism, SMP profiles your model and your world size (the total number of GPUs in all of your training nodes) to find the best placement strategies.<\/p>\n<p>In deep learning, the intermediate layer outputs are also called activations, and these need to be stored during the forward pass. This is because they need to be used for gradient computation in the backward pass. In a large model, storing all these activations simultaneously in memory can create significant memory bottlenecks. To address this bottleneck, you can use <em>activation checkpointing<\/em>, the third new feature in the SageMaker model parallelism library.\u00a0Activation checkpointing, or <em>gradient checkpointing<\/em>, is a technique to reduce memory usage by clearing activations of certain layers and recomputing them during a backward pass. This effectively trades extra computation time for reduced memory usage.<\/p>\n<p>Lastly, <em>activation offloading<\/em> builds directly on activation checkpointing. It\u2019s a strategy to keep only a few tensor activations in GPU RAM during model training. Specifically, we move the checkpointed activations to CPU memory during the forward pass and load them back to GPU for the backward pass of a specific micro-batch.<\/p>\n<h2>Micro-batches and placement strategies<\/h2>\n<p>Other topics that sometimes cause confusion for customers are micro-batches and placement strategies. Both of these are hyperparameters you can supply to the SageMaker model parallel library. Specifically, micro-batches are relevant when implementing models that rely on pipeline parallelism, such as those 30 billion parameters in size or larger.<\/p>\n<p>Micro-batches are subsets of minibatches. 
When your model is in its training loop, you define a certain number of records to pick up and pass forward and backward through the layers\u2013this is called a <em>minibatch<\/em>, or sometimes just a\u00a0<em>batch<\/em>. A full pass through your dataset is called an <em>epoch<\/em>. To run forward and backward passes with pipeline parallelism, the SageMaker model parallel library shards each batch into smaller subsets called micro-batches, which are run through the pipeline one at a time to maximize GPU utilization. In our GPT-2 example, <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/blob\/63be05af3354b532e5e401a966685e848202f413\/training\/distributed_training\/pytorch\/model_parallel\/gpt2\/train_gpt_simple.py#L809\" target=\"_blank\" rel=\"noopener noreferrer\">we added a default of 1 microbatch directly to the training script<\/a>.<\/p>\n<p>As you scale up your training configuration, <a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/model-parallel-customize-tips-pitfalls.html\" target=\"_blank\" rel=\"noopener noreferrer\">we strongly recommend that you change your batch size and micro-batch size accordingly<\/a>. This is the only way to ensure good performance: you must consider batch size and micro-batch size as a function of your overall world size when relying on pipeline parallelism.<\/p>\n<p>Placement strategies\u00a0tell SageMaker physically where to place your model partitions. If you\u2019re using both model parallelism and data parallelism, setting <code>placement_strategy<\/code>\u00a0to <code>\u201ccluster\u201d<\/code>\u00a0places model replicas in device IDs (GPUs) that are physically close to each other. 
However, if you really want to be more prescriptive about your parallelism strategy, you can break it down into a single string with different combinations of three letters: <code>D<\/code> for data parallelism, <code>P<\/code> for pipeline parallelism, and <code>T<\/code> for tensor parallelism. We generally recommend keeping the default placement of <code>\"cluster\"<\/code>, because this is most appropriate for large-scale model training. The \u201ccluster\u201d\u00a0placement corresponds to \u201c<code>DPT<\/code>\u201d.<\/p>\n<p>For more information about placement strategies, see <a href=\"https:\/\/sagemaker.readthedocs.io\/en\/stable\/api\/training\/smd_model_parallel_general.html#placement-strategy-with-tensor-parallelism\" target=\"_blank\" rel=\"noopener noreferrer\">Placement Strategy with Tensor Parallelism<\/a>.<\/p>\n<h2>Example use case<\/h2>\n<p>Let\u2019s imagine you have one ml.p3.16xlarge in your training job. That gives you <a href=\"https:\/\/aws.amazon.com\/ec2\/instance-types\/\" target=\"_blank\" rel=\"noopener noreferrer\">8 NVIDIA V100s per node<\/a>. Remember, every time you add an extra instance, you experience additional bandwidth overhead, so it\u2019s always better to have more GPUs on a single node. In this case, you\u2019re better off with one ml.p3.16xlarge than, for example, two ml.p3.8xlarge instances. Even though the number of GPUs is the same, the extra bandwidth overhead of the extra node slows down your throughput.<\/p>\n<p>The following diagram illustrates four-way model parallelism, combined with two-way data parallelism. 
This means you actually have two replicas of your model (think data parallel), with each of them partitioned across four GPUs (model parallel).<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image013.jpg\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33407\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/ML-7958-image013.jpg\" alt=\"\" width=\"691\" height=\"404\"><\/a><\/p>\n<p>If any of those model partitions are too large to fit onto a single GPU, you can add an extra type of distribution\u2013tensor parallelism\u2013to split it and utilize both devices.<\/p>\n<h2><strong>Conclusion<\/strong><\/h2>\n<p>In this blog post, we discussed SageMaker distributed training libraries, especially focusing on model parallelism. We shared performance benchmarks from our latest test, achieving 32 samples per second with 175B parameters across 120 ml.p4d.24xlarge instances on Amazon SageMaker. We anticipate that if we increased this to 240 p4 instances, we could train a 175B parameter model in 25 days.<\/p>\n<p>We also discussed the newest features that enable large-scale training, namely tensor parallelism, optimizer state sharding, activation checkpointing, and activation offloading. We shared some tips and tricks for enabling these features when training on Amazon SageMaker.<\/p>\n<p>Try it out yourself <a href=\"https:\/\/github.com\/aws\/amazon-sagemaker-examples\/blob\/master\/training\/distributed_training\/pytorch\/model_parallel\/gpt2\/smp-train-gpt-simple.ipynb\" target=\"_blank\" rel=\"noopener noreferrer\">using the same notebook that generated our numbers, which is available on GitHub here<\/a>. 
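<\/p>\n<p>Before you launch, it\u2019s worth sanity checking that your cluster\u2019s GPU count factors cleanly into your chosen degrees of parallelism. The small helper below is our own illustration, not part of the SMP library, and makes the arithmetic from the example use case explicit:<\/p>

```python
# Hypothetical helper (not part of the SMP library): check that a cluster's
# GPU count divides evenly by the GPUs needed per model replica, and report
# the implied number of replicas (the data parallel degree).

def data_parallel_degree(instance_count, gpus_per_instance,
                         pipeline_parallel_degree, tensor_parallel_degree=1):
    world_size = instance_count * gpus_per_instance
    gpus_per_replica = pipeline_parallel_degree * tensor_parallel_degree
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be split into replicas of {gpus_per_replica}"
        )
    return world_size // gpus_per_replica

# The example use case above: one ml.p3.16xlarge (8 GPUs) with four-way
# model parallelism yields two model replicas (two-way data parallelism).
replicas = data_parallel_degree(1, 8, pipeline_parallel_degree=4)
```

<p>If the division doesn\u2019t come out even, revisit your instance count or your parallelism degrees before submitting the job. 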
You can also request more GPUs for your AWS account by <a href=\"https:\/\/docs.aws.amazon.com\/general\/latest\/gr\/aws_service_limits.html\" target=\"_blank\" rel=\"noopener noreferrer\">requesting a service limit increase<\/a>.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/20\/Emily-Webber.png\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-31931 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2021\/12\/20\/Emily-Webber.png\" alt=\"\" width=\"100\" height=\"133\"><\/a>Emily Webber<\/strong> joined AWS just after SageMaker launched, and has been trying to tell the world about it ever since! Outside of building new ML experiences for customers, Emily enjoys meditating and studying Tibetan Buddhism.<\/p>\n<p><strong><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-8007 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2019\/03\/20\/aditya-bindal-100.jpg\" alt=\"\" width=\"100\" height=\"134\">Aditya Bindal<\/strong> is a Senior Product Manager for AWS Deep Learning. He works on products that make it easier for customers to train deep learning models on AWS. In his spare time, he enjoys spending time with his daughter, playing tennis, reading historical fiction, and traveling.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/lquintel.png\"><img decoding=\"async\" loading=\"lazy\" class=\"size-full wp-image-33410 alignleft\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/22\/lquintel.png\" alt=\"\" width=\"100\" height=\"133\"><\/a><strong>Luis Quintela<\/strong>\u00a0is the Software Developer Manager for the AWS SageMaker model parallel library. 
In his spare time, he can be found riding his Harley in the SF Bay Area.<\/p>\n<p>       <!-- '\"` -->\n      <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/train-175-billion-parameter-nlp-models-with-model-parallel-additions-and-hugging-face-on-amazon-sagemaker\/<\/p>\n","protected":false},"author":0,"featured_media":1873,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1872"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1872"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1872\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1873"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1872"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1872"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1872"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}