{"id":2850,"date":"2023-01-23T09:11:30","date_gmt":"2023-01-23T09:11:30","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2023\/01\/23\/booked-for-brilliance-swedens-national-library-turns-page-to-ai-to-parse-centuries-of-data\/"},"modified":"2023-01-23T09:11:30","modified_gmt":"2023-01-23T09:11:30","slug":"booked-for-brilliance-swedens-national-library-turns-page-to-ai-to-parse-centuries-of-data","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2023\/01\/23\/booked-for-brilliance-swedens-national-library-turns-page-to-ai-to-parse-centuries-of-data\/","title":{"rendered":"Booked for Brilliance: Sweden\u2019s National Library Turns Page to AI to Parse Centuries of Data"},"content":{"rendered":"<div data-url=\"https:\/\/blogs.nvidia.com\/blog\/2023\/01\/23\/sweden-library-ai-open-source\/\" data-title=\"Booked for Brilliance: Sweden\u2019s National Library Turns Page to AI to Parse Centuries of Data\" data-hashtags=\"\">\n<p>For the past 500 years, the National Library of Sweden has collected virtually every word published in Swedish, from priceless medieval manuscripts to present-day pizza menus.<\/p>\n<p>Thanks to a centuries-old law that requires a copy of everything published in Swedish to be submitted to the library \u2014 also known as Kungliga biblioteket, or KB \u2014 its collections span from the obvious to the obscure: books, newspapers, radio and TV broadcasts, internet content, Ph.D. dissertations, postcards, menus and video games. 
It\u2019s a wildly diverse collection of nearly 26 petabytes of data, ideal for training state-of-the-art AI.<\/p>\n<p>\u201cWe can build state-of-the-art AI models for the Swedish language since we have the best data,\u201d said Love B\u00f6rjeson, director of KBLab, the library\u2019s data lab.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-61969 size-large\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2023\/01\/KB-vinter-fasad-672x376.jpg\" alt=\"\" width=\"672\" height=\"376\"><\/p>\n<p>Using <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-systems\/\">NVIDIA DGX systems<\/a>, the group has developed more than two dozen open-source <a href=\"https:\/\/blogs.nvidia.com\/blog\/2022\/03\/25\/what-is-a-transformer-model\/\">transformer<\/a> models, <a href=\"https:\/\/huggingface.co\/KBLab\" target=\"_blank\" rel=\"noopener\">available on Hugging Face<\/a>. The models, downloaded by up to 200,000 developers per month, enable research at the library and other academic institutions.<\/p>\n<p>\u201cBefore our lab was created, researchers couldn\u2019t access a dataset at the library \u2014 they\u2019d have to look at a single object at a time,\u201d B\u00f6rjeson said. 
\u201cThere was a need for the library to create datasets that enabled researchers to conduct quantity-oriented research.\u201d<\/p>\n<p>With this, researchers will soon be able to create hyper-specialized datasets \u2014 for example, pulling up every Swedish postcard that depicts a church, every text written in a particular style or every mention of a historical figure across books, newspaper articles and TV broadcasts.<\/p>\n<h2><b>Turning Library Archives Into AI Training Data<\/b><\/h2>\n<p>The library\u2019s datasets represent the full diversity of the Swedish language \u2014 including its formal and informal variations, regional dialects and changes over time.<\/p>\n<p>\u201cOur inflow is continuous and growing \u2014 every month, we see more than 50 terabytes of new data,\u201d said B\u00f6rjeson. \u201cBetween the exponential growth of digital data and ongoing work digitizing physical collections that date back hundreds of years, we\u2019ll never be finished adding to our collections.\u201d<\/p>\n<figure id=\"attachment_61963\" aria-describedby=\"caption-attachment-61963\" class=\"wp-caption aligncenter\"><img decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-61963\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2023\/01\/multimodal-data-672x161.jpg\" alt=\"\" width=\"672\" height=\"161\"><figcaption id=\"caption-attachment-61963\" class=\"wp-caption-text\">The library\u2019s archives include audio, text and video.<\/figcaption><\/figure>\n<p>Soon after KBLab was established in 2019, B\u00f6rjeson saw the potential for training transformer language models on the library\u2019s vast archives. He was inspired by an early, multilingual, natural language processing model by Google that included 5GB of Swedish text.<\/p>\n<p>KBLab\u2019s first model used 4x as much \u2014 and the team now aims to train its models on at least a terabyte of Swedish text. 
The lab began experimenting by adding Dutch, German and Norwegian content to its datasets after finding that a multilingual dataset may improve the AI\u2019s performance.<\/p>\n<h2><b>NVIDIA AI, GPUs Accelerate Model Development<\/b><\/h2>\n<p>The lab started out using consumer-grade NVIDIA GPUs, but B\u00f6rjeson soon discovered his team needed data-center-scale compute to train larger models.<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"alignleft size-large wp-image-61966\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2023\/01\/DSC3002-334x500.jpg\" alt=\"\" width=\"334\" height=\"500\">\u201cWe realized we can\u2019t keep up if we try to do this on small workstations,\u201d said B\u00f6rjeson. \u201cIt was a no-brainer to go for NVIDIA DGX. There\u2019s a lot we wouldn\u2019t be able to do at all without the DGX systems.\u201d<\/p>\n<p>The lab has two <a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/dgx-station-a100\/\">NVIDIA DGX systems<\/a> from Swedish provider <a href=\"https:\/\/www.addpro.se\/\" target=\"_blank\" rel=\"noopener\">AddPro<\/a> for on-premises AI development. The systems are used to handle sensitive data, conduct large-scale experiments and fine-tune models. They\u2019re also used to prepare for even larger runs on massive, GPU-based <a href=\"https:\/\/blogs.nvidia.com\/blog\/2020\/10\/15\/supercomputing-ai-eurohpc\/\">supercomputers across the European Union<\/a> \u2014 including the <a href=\"https:\/\/enccs.se\/news\/2022\/10\/national-library-of-sweden-accesses-meluxina\/\" target=\"_blank\" rel=\"noopener\">MeluXina system in Luxembourg<\/a>.<\/p>\n<p>\u201cOur work on the DGX systems is critically important, because once we\u2019re in a high-performance computing environment, we want to hit the ground running,\u201d said B\u00f6rjeson. 
\u201cWe have to use the supercomputer to its fullest extent.\u201d<\/p>\n<p>The team has also adopted <a href=\"https:\/\/developer.nvidia.com\/nemo\/megatron\">NVIDIA NeMo Megatron<\/a>, a PyTorch-based framework for training large language models, with <a href=\"https:\/\/developer.nvidia.com\/cuda-toolkit\">NVIDIA CUDA<\/a> and the <a href=\"https:\/\/developer.nvidia.com\/nccl\">NVIDIA NCCL<\/a> library under the hood to optimize GPU usage in multi-node systems.<\/p>\n<p>\u201cWe rely to a large extent on the NVIDIA frameworks,\u201d B\u00f6rjeson said. \u201cIt\u2019s one of the big advantages of NVIDIA for us, as a small lab that doesn\u2019t have 50 engineers available to optimize AI training for every project.\u201d<\/p>\n<h2><b>Harnessing Multimodal Data for Humanities Research<\/b><\/h2>\n<p>In addition to transformer models that understand Swedish text, KBLab has an AI tool that transcribes sound to text, enabling the library to transcribe its vast collection of radio broadcasts so that researchers can search the audio records for specific content.<\/p>\n<figure id=\"attachment_61972\" aria-describedby=\"caption-attachment-61972\" class=\"wp-caption alignright\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-61972 size-medium\" src=\"https:\/\/blogs.nvidia.com\/wp-content\/uploads\/2023\/01\/Kortkataloger-400x269.jpg\" alt=\"\" width=\"400\" height=\"269\"><figcaption id=\"caption-attachment-61972\" class=\"wp-caption-text\">AI-enhanced databases are the latest evolution of library records, which were long stored in physical card catalogs.<\/figcaption><\/figure>\n<p>KBLab is also starting to develop generative text models and is working on an AI model that could process videos and create automatic descriptions of their content.<\/p>\n<p>\u201cWe also want to link all the different modalities,\u201d B\u00f6rjeson said. 
\u201cWhen you search the library\u2019s databases for a specific term, we should be able to return results that include text, audio and video.\u201d<\/p>\n<p>KBLab has partnered with researchers at the University of Gothenburg, who are developing downstream apps using the lab\u2019s models to conduct linguistic research \u2014 including a project supporting the Swedish Academy\u2019s work to modernize its data-driven techniques for creating Swedish dictionaries.<\/p>\n<p>\u201cThe societal benefits of these models are much larger than we initially expected,\u201d B\u00f6rjeson said.<\/p>\n<p><em>Images courtesy of Kungliga biblioteket<\/em><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/blogs.nvidia.com\/blog\/2023\/01\/23\/sweden-library-ai-open-source\/<\/p>\n","protected":false},"author":0,"featured_media":2851,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/2850"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=2850"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/2850\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/2851"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=2850"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=2850"},{"taxonomy":"post_tag","embeddable":true,"
href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=2850"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}