{"id":2159,"date":"2022-06-14T15:42:24","date_gmt":"2022-06-14T15:42:24","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/06\/14\/the-data-centers-traffic-cop-ai-clears-digital-gridlock\/"},"modified":"2022-06-14T15:42:24","modified_gmt":"2022-06-14T15:42:24","slug":"the-data-centers-traffic-cop-ai-clears-digital-gridlock","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/06\/14\/the-data-centers-traffic-cop-ai-clears-digital-gridlock\/","title":{"rendered":"The Data Center\u2019s Traffic Cop: AI Clears Digital Gridlock"},"content":{"rendered":"<div data-url=\"https:\/\/blogs.nvidia.com\/blog\/2022\/06\/14\/ai-network-congestion-control\/\" data-title=\"The Data Center\u2019s Traffic Cop: AI Clears Digital Gridlock\" data-hashtags=\"\">\n<p>Gal Dalal wants to ease the commute for those who work from home \u2014 or the office.<\/p>\n<p>The senior research scientist at NVIDIA, who is part of a 10-person lab in Israel, is using AI to reduce congestion on computer networks.<\/p>\n<p>For laptop jockeys, a spinning circle of death \u2014 or worse, a frozen cursor \u2014 is as bad as a sea of red lights on the highway. Like rush hour, it\u2019s caused by a flood of travelers angling to get somewhere fast, crowding and sometimes colliding on the way.<\/p>\n<h2><b>AI at the Intersection<\/b><\/h2>\n<p>Networks use congestion control to manage digital traffic. It\u2019s basically a set of rules embedded into network adapters and switches, but as the number of users on networks grows their conflicts can become too complex to anticipate.<\/p>\n<p>AI promises to be a better traffic cop because it can see and respond to patterns as they develop. 
That\u2019s why Dalal is among many researchers around the world looking for ways to make networks smarter with reinforcement learning, a type of AI that rewards models when they find good solutions.<\/p>\n<p>But until now, for several reasons, no one\u2019s come up with a practical approach.<\/p>\n<h2><b>Racing the Clock<\/b><\/h2>\n<p>Networks need to be both fast and fair so no request gets left behind. That\u2019s a tough balancing act when no one driver on the digital road can see the entire, ever-changing map of other drivers and their intended destinations.<\/p>\n<p>And it\u2019s a race against the clock. To be effective, networks need to respond to situations in about a microsecond, or one-millionth of a second.<\/p>\n<p>To smooth traffic, the NVIDIA team created new reinforcement learning techniques inspired by state-of-the-art computer game AI and adapted them to the networking problem.<\/p>\n<p>Part of their breakthrough, described in a <a href=\"https:\/\/arxiv.org\/pdf\/2102.09337.pdf\">2021 paper<\/a>, was coming up with an algorithm and a corresponding reward function for a balanced network based only on local information available to individual network streams. The algorithm enabled the team to create, train and run an AI model on their <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-systems\/\">NVIDIA DGX system<\/a>.<\/p>\n<h2><b>A Wow Factor<\/b><\/h2>\n<p>Dalal recalls the meeting where a fellow Nvidian, Chen Tessler, showed the first chart plotting the model\u2019s results on a simulated InfiniBand data center network.<\/p>\n<p>\u201cWe were like, wow, ok, it works very nicely,\u201d said Dalal, who wrote his Ph.D. 
thesis on reinforcement learning at Technion, Israel\u2019s prestigious technical university.<\/p>\n<p>\u201cWhat was especially gratifying was we trained the model on just 32 network flows, and it nicely generalized what it learned to manage more than 8,000 flows with all sorts of intricate situations, so the machine was doing a much better job than preset rules,\u201d he added.<\/p>\n<figure id=\"attachment_57650\" aria-describedby=\"caption-attachment-57650\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2022\/06\/RL-CC-results.jpg\"><img decoding=\"async\" loading=\"lazy\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2022\/06\/RL-CC-results-672x432.jpg\" alt=\"Reinforcement learning for congestion control\" width=\"672\" height=\"432\"><\/a><figcaption id=\"caption-attachment-57650\" class=\"wp-caption-text\">Reinforcement learning (purple) outperformed all rule-based congestion control algorithms in NVIDIA\u2019s tests.<\/figcaption><\/figure>\n<p>In fact, the algorithm delivered at least 1.5x better throughput and 4x lower latency than the best rule-based technique.<\/p>\n<p>Since the paper\u2019s release, the work\u2019s won praise as a real-world application that shows the potential of reinforcement learning.<\/p>\n<h2><b>Processing AI in the Network<\/b><\/h2>\n<p>The next big step, still a work in progress, is to design a version of the AI model that can run at microsecond speeds using the limited compute and memory resources in the network. Dalal described two paths forward.<\/p>\n<p>His team is collaborating with the engineers designing <a href=\"https:\/\/www.nvidia.com\/en-us\/networking\/products\/data-processing-unit\/\">NVIDIA BlueField DPUs<\/a> to optimize the AI models for future hardware. 
BlueField DPUs aim to run an expanding set of communications jobs inside the network, offloading tasks from overburdened CPUs.<\/p>\n<p>Separately, Dalal\u2019s team is distilling the essence of its AI model into a machine learning technique called boosting trees, a series of yes\/no decisions that\u2019s nearly as smart but much simpler to run. The team aims to present its work later this year in a form that could be immediately adopted to ease network traffic.<\/p>\n<h2><b>A Timely Traffic Solution<\/b><\/h2>\n<p>To date, Dalal has applied reinforcement learning to everything from autonomous vehicles to data center cooling and chip design. When <a href=\"https:\/\/nvidianews.nvidia.com\/news\/nvidia-completes-acquisition-of-mellanox-creating-major-force-driving-next-gen-data-centers\">NVIDIA acquired Mellanox<\/a> in April 2020, the NVIDIA Israel researcher started collaborating with his new colleagues in the nearby networking group.<\/p>\n<p>\u201cIt made sense to apply our AI algorithms to the work of their congestion control teams, and now, two years later, the research is more mature,\u201d he said.<\/p>\n<p>It\u2019s good timing. 
Recent reports of double-digit increases in Israel\u2019s car traffic since pre-pandemic times could encourage more people to work from home, driving up network congestion.<\/p>\n<p>Luckily, an AI traffic cop is on the way.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/blogs.nvidia.com\/blog\/2022\/06\/14\/ai-network-congestion-control\/<\/p>\n","protected":false},"author":0,"featured_media":2160,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/2159"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=2159"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/2159\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/2160"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=2159"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=2159"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=2159"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}