{"id":21,"date":"2020-08-17T07:53:06","date_gmt":"2020-08-17T07:53:06","guid":{"rendered":"https:\/\/machine-learning.webcloning.com\/2020\/08\/17\/ai-of-the-storm-how-we-built-the-most-powerful-industrial-computer-in-the-u-s-in-three-weeks-during-a-pandemic\/"},"modified":"2020-08-17T07:53:06","modified_gmt":"2020-08-17T07:53:06","slug":"ai-of-the-storm-how-we-built-the-most-powerful-industrial-computer-in-the-u-s-in-three-weeks-during-a-pandemic","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2020\/08\/17\/ai-of-the-storm-how-we-built-the-most-powerful-industrial-computer-in-the-u-s-in-three-weeks-during-a-pandemic\/","title":{"rendered":"AI of the Storm: How We Built the Most Powerful Industrial Computer in the U.S. in Three Weeks During a Pandemic"},"content":{"rendered":"<div data-url=\"https:\/\/blogs.nvidia.com\/blog\/2020\/08\/14\/making-selene-pandemic-ai\/\" data-title=\"AI of the Storm: How We Built the Most Powerful Industrial Computer in the U.S. in Three Weeks During a Pandemic\" readability=\"269.90937200128\">\n<p>In under a month amid the global pandemic, a small team assembled the world\u2019s seventh-fastest computer.<\/p>\n<p>Today that mega-system, called <a href=\"https:\/\/blogs.nvidia.com\/blog\/2020\/06\/22\/top500-isc-supercomputing\/\">Selene<\/a>, communicates with its operators on Slack, has its own robot attendant and is driving AI forward in automotive, healthcare and natural-language processing.<\/p>\n<p>While many supercomputers tap exotic, proprietary designs that take months to commission, Selene is based on an open architecture NVIDIA shares with its customers.<\/p>\n<p>The Argonne National Laboratory, outside Chicago, is using a system based on Selene\u2019s <a href=\"https:\/\/blogs.nvidia.com\/blog\/2020\/05\/14\/dgx-superpod-a100\/\">DGX SuperPOD design<\/a> to research ways to stop the coronavirus. 
The University of Florida will use the design to build <a href=\"https:\/\/blogs.nvidia.com\/blog\/2020\/07\/21\/university-of-florida-nvidia-ai-supercomputer\/\">the fastest AI computer in academia<\/a>.<\/p>\n<p>DGX SuperPODs are driving business results for companies like <a href=\"https:\/\/nvidianews.nvidia.com\/news\/continental-and-nvidia-partner-to-enable-worldwide-production-of-ai-self-driving-cars\">Continental in automotive<\/a>, Lockheed Martin in aerospace and Microsoft in cloud-computing services.<\/p>\n<h2><b>Birth of an AI System<\/b><\/h2>\n<p>The story of how and why NVIDIA built Selene starts in 2015.<\/p>\n<p>NVIDIA engineers started their first system-level design with two motivations. They wanted to build something both powerful enough to train the AI models their colleagues were building for autonomous vehicles and general purpose enough to serve the needs of any deep-learning researcher.<\/p>\n<p>The result was the <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-saturnv\/\">SATURNV cluster<\/a>, born in 2016 and based on the NVIDIA Pascal GPU. 
When the more powerful NVIDIA Volta GPU debuted a year later, the budding systems group\u2019s motivation and its designs expanded rapidly.<\/p>\n<p><iframe loading=\"lazy\" title=\"NVIDIA Data Center Tour\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/vY61ExKhnfA?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe><\/p>\n<h2><b>AI Jobs Grow Beyond the Accelerator<\/b><\/h2>\n<p>\u201cWe\u2019re trying to anticipate what\u2019s coming based on what we hear from researchers, building machines that serve multiple uses and have long lifetimes, packing as much processing, memory and storage as possible,\u201d said Michael Houston, a chief architect who leads the systems team.<\/p>\n<p>As early as 2017, \u201cwe were starting to see new apps drive the need for multi-node training, demanding very high-speed communications between systems and access to high-speed storage,\u201d he said.<\/p>\n<p>AI models were growing rapidly, requiring multiple GPUs to handle them. Workloads were demanding new computing styles, like model parallelism, to keep pace.<\/p>\n<p>So, in fast succession, the team crafted ever larger clusters of V100-based NVIDIA DGX-2 systems, called <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-pod\/\">DGX PODs<\/a>. They used 32, then 64 DGX-2 nodes, culminating in a 96-node architecture dubbed the <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/resources\/nvidia-dgx-superpod-reference-architecture\/\">DGX SuperPOD<\/a>.<\/p>\n<p>They christened it Circe for the irresistible Greek goddess. It debuted in June 2019 at No. 22 on the TOP500 list of the world\u2019s fastest supercomputers and currently holds No. 23.<\/p>\n<h2><b>Cutting Cables in a Computing Jungle<\/b><\/h2>\n<p>Along the way, the team learned lessons about networking, storage, power and thermals. 
Those learnings got baked into the latest <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-systems\/\">NVIDIA DGX systems<\/a>, reference architectures and today\u2019s 280-node Selene.<\/p>\n<p>In the race through ever larger clusters to get to Circe, some lessons were hard won.<\/p>\n<p>\u201cWe tore everything out twice, we literally cut the cables out. It was the fastest way forward, but it still had a lot of downtime and cost. So we vowed to never do that again and set ease of expansion and incremental deployment as a fundamental design principle,\u201d said Houston.<\/p>\n<p>The team redesigned the overall network to simplify assembling the system.<\/p>\n<p>They defined modules of 20 nodes connected by relatively simple \u201cthin switches.\u201d Each of these so-called scalable units could be laid down, cookie-cutter style, turned on and tested before the next one was added.<\/p>\n<p>The design let engineers specify set lengths of cables that could be bundled together with Velcro at the factory. Racks could be labeled and mapped, radically simplifying the process of filling them with dozens of systems.<\/p>\n<h2><b>Doubling Up on InfiniBand<\/b><\/h2>\n<p>Early on, the team learned to split up compute, storage and management fabrics into independent planes, spreading them across more and faster network-interface cards.<\/p>\n<p>The ratio of NICs to GPUs doubled to 1:1. So did their speeds, going from 100 Gbit\/s InfiniBand in Circe to <a href=\"https:\/\/www.mellanox.com\/files\/doc-2020\/wp-introducing-200g-hdr-infiniband-solutions.pdf\">200G HDR InfiniBand<\/a> in Selene. The result was a 4x increase in the effective node bandwidth.<\/p>\n<p>Likewise, memory and storage links grew in capacity and throughput to handle jobs with hot, warm and cold storage needs. Four storage tiers spanned 100 TB\/s memory links to 100 GB\/s storage pools.<\/p>\n<p>Power and thermals stayed within air-cooled limits. 
The default designs used 35kW racks typical in leased data centers, but they can stretch beyond 50kW for the most aggressive supercomputer centers and down to 7kW racks some telcos use.<\/p>\n<figure id=\"attachment_46408\" aria-describedby=\"caption-attachment-46408\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart.png\"><picture class=\"size-full wp-image-46408\"><source type=\"image\/webp\" srcset=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart.png.webp 1280w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-400x50.png.webp 400w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-672x84.png.webp 672w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-768x96.png.webp 768w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-842x105.png.webp 842w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-406x51.png.webp 406w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-188x24.png.webp 188w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/source><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart.png\" alt=\"\" width=\"1280\" height=\"160\" srcset=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart.png 1280w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-400x50.png 400w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-672x84.png 672w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-768x96.png 768w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-842x105.png 842w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-406x51.png 406w, 
https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/REDO-Selene-chart-188x24.png 188w\" sizes=\"(max-width: 1280px) 100vw, 1280px\"><\/picture><\/a><figcaption id=\"caption-attachment-46408\" class=\"wp-caption-text\">A portrait of Selene by the numbers.<\/figcaption><\/figure>\n<h2><b>Seeking the Big, Balanced System<\/b><\/h2>\n<p>The net result is a more balanced design that can handle today\u2019s many different workloads. That flexibility also gives researchers the freedom to explore new directions in AI and high performance computing.<\/p>\n<p>\u201cTo some extent HPC and AI both require max performance, but you have to look carefully at how you deliver that performance in terms of power, storage and networking as well as raw processing,\u201d said Julie Bernauer, who leads an advanced development team that\u2019s worked on all of NVIDIA\u2019s large-scale systems.<\/p>\n<h2><b>Skeleton Crews on Strict Protocols<\/b><\/h2>\n<p>The gains paid off in early 2020.<\/p>\n<p>Within days of the pandemic hitting, the first <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/nvidia-ampere-gpu-architecture\/\">NVIDIA Ampere architecture GPUs<\/a> arrived, and engineers faced the job of assembling the 280-node Selene.<\/p>\n<p>In the best of times, it can take dozens of engineers a few months to assemble, test and commission a supercomputer-class system. NVIDIA had to get Selene running in a few weeks to participate in industry benchmarks and fulfill obligations to customers like Argonne.<\/p>\n<p>And engineers had to stay well within public-health guidelines of the pandemic.<\/p>\n<p>\u201cWe had skeleton crews with strict protocols to keep staff healthy,\u201d said Bernauer.<\/p>\n<p>\u201cTo unbox and rack systems, we used two-person teams that didn\u2019t mix with the others \u2014 they even took vacation at the same time. And we did cabling with six-foot distances between people. 
That really changes how you build systems,\u201d she said.<\/p>\n<p>Even with the COVID restrictions, engineers racked up to 60 systems in a day, the maximum their loading dock could handle. Virtual log-ins let administrators validate cabling remotely, testing the 20-node modules as they were deployed.<\/p>\n<p>Bernauer\u2019s team put several layers of automation in place. That cut the need for people at the co-location facility where Selene was built, a block from NVIDIA\u2019s Silicon Valley headquarters.<\/p>\n<h2><b>Slacking with a Supercomputer<\/b><\/h2>\n<p>Selene talks to staff over a Slack channel as if it were a co-worker, reporting loose cables and isolating malfunctioning hardware so the system can keep running.<\/p>\n<p>\u201cWe don\u2019t want to wake up in the night because the cluster has a problem,\u201d Bernauer said.<\/p>\n<p>It\u2019s part of the automation customers can access if they follow the guidance in the DGX POD and SuperPOD architectures.<\/p>\n<p>Thanks to this approach, the University of Florida, for example, is expected to rack and power up a 140-node extension to its HiPerGator system, switching on the most powerful AI supercomputer in academia within as little as 10 days of receiving it.<\/p>\n<p>As an added touch, the NVIDIA team bought a telepresence robot from Double Robotics so non-essential designers sheltering at home could maintain daily contact with Selene. 
Tongue-in-cheek, they dubbed it Trip, given early concerns that essential technicians on site might bump into it.<\/p>\n<p>The fact that Trip is powered by an <a href=\"https:\/\/developer.nvidia.com\/embedded\/jetson-tx2\">NVIDIA Jetson TX2<\/a> module was an added attraction for team members who imagined some day they might tinker with its programming.<\/p>\n<figure id=\"attachment_46377\" aria-describedby=\"caption-attachment-46377\" class=\"wp-caption alignleft\"><a href=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped.jpg\"><picture class=\"wp-image-46377 size-large\"><source type=\"image\/webp\" srcset=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-329x500.jpg.webp 329w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-263x400.jpg.webp 263w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-768x1169.jpg.webp 768w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-1009x1536.jpg.webp 1009w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-296x450.jpg.webp 296w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-141x215.jpg.webp 141w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-66x100.jpg.webp 66w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped.jpg.webp 1200w\" sizes=\"(max-width: 329px) 100vw, 329px\"><\/source><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-329x500.jpg\" alt=\"Trip robot with Selene\" width=\"329\" height=\"500\" srcset=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-329x500.jpg 329w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-263x400.jpg 263w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-768x1169.jpg 768w, 
https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-1009x1536.jpg 1009w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-296x450.jpg 296w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-141x215.jpg 141w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped-66x100.jpg 66w, https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2020\/08\/Trip-1-cropped.jpg 1200w\" sizes=\"(max-width: 329px) 100vw, 329px\"><\/picture><\/a><figcaption id=\"caption-attachment-46377\" class=\"wp-caption-text\">Trip helped engineers inspect Selene while it was under construction.<\/figcaption><\/figure>\n<p>Since late July, Trip\u2019s been used regularly to let them virtually drive through Selene\u2019s aisles, observing the system through the robot\u2019s camera and microphone.<\/p>\n<p>\u201cTrip doesn\u2019t replace a human operator, but if you are worried about something at 2 a.m., you can check it without driving to the data center,\u201d she said.<\/p>\n<h2><b>Delivering HPC, AI Results at Scale<\/b><\/h2>\n<p>In the end, it\u2019s all about the results, and they came fast.<\/p>\n<p>In June, Selene hit No. 7 on the TOP500 list and No. 2 on the Green500 list of the most power-efficient systems. In July, <a href=\"https:\/\/blogs.nvidia.com\/blog\/2020\/07\/29\/mlperf-training-benchmark-records\/\">it broke records in all eight systems tests<\/a> for AI training performance in the latest MLPerf benchmarks.<\/p>\n<p>\u201cThe big surprise for me was how smoothly everything came up given we were using new processors and boards, and I credit all the testing along the way,\u201d said Houston. \u201cTo get this machine up and do a bunch of hard back-to-back benchmarks gave the team a huge lift,\u201d he added.<\/p>\n<p>The work pre-testing <a href=\"https:\/\/ngc.nvidia.com\/catalog\/all\">NGC<\/a> containers and HPC software for Argonne was even more gratifying. 
The lab is already hammering on hard problems in protein docking and quantum chemistry to shine a light on the coronavirus.<\/p>\n<p>Separately, Circe donates many of its free cycles to <a href=\"https:\/\/blogs.nvidia.com\/blog\/2020\/04\/01\/foldingathome-exaflop-coronavirus\/\">the Folding@Home initiative that fights COVID<\/a>.<\/p>\n<p>At the same time, NVIDIA\u2019s own researchers are using Selene to <a href=\"https:\/\/www.nvidia.com\/en-us\/self-driving-cars\/data-center\/\">train autonomous vehicles<\/a> and refine <a href=\"https:\/\/developer.nvidia.com\/conversational-ai\">conversational AI<\/a>, nearing advances they\u2019re expected to report soon. They are among more than a thousand jobs run, often simultaneously, on the system so far.<\/p>\n<p>Meanwhile, the team already has ideas on the whiteboard for what\u2019s next. \u201cGive performance-obsessed engineers enough horsepower and cables and they will figure out amazing things,\u201d said Bernauer.<\/p>\n<p><em>At top: An artist\u2019s rendering of a portion of 
Selene.<\/em><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>http:\/\/feedproxy.google.com\/~r\/nvidiablog\/~3\/o3EAc_lSukU\/<\/p>\n","protected":false},"author":1,"featured_media":22,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/21"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=21"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/21\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/22"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=21"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=21"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=21"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}