{"id":1954,"date":"2022-03-10T18:45:55","date_gmt":"2022-03-10T18:45:55","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/10\/load-and-transform-data-from-delta-lake-using-amazon-sagemaker-studio-and-apache-spark\/"},"modified":"2022-03-10T18:45:55","modified_gmt":"2022-03-10T18:45:55","slug":"load-and-transform-data-from-delta-lake-using-amazon-sagemaker-studio-and-apache-spark","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/10\/load-and-transform-data-from-delta-lake-using-amazon-sagemaker-studio-and-apache-spark\/","title":{"rendered":"Load and transform data from Delta Lake using Amazon SageMaker Studio and Apache Spark"},"content":{"rendered":"<div id=\"\">\n<p>Data lakes have become the norm in the industry for storing critical business data. The primary rationale for a data lake is to land all types of data, from raw data to preprocessed and postprocessed data, and may include both structured and unstructured data formats. Having a centralized data store for all types of data allows modern big data applications to load, transform, and process whatever type of data is needed. Benefits include storing data as is without the need to first structure or transform it. Most importantly, data lakes allow controlled access to data from many different types of analytics and machine learning (ML) processes in order to guide better decision-making.<\/p>\n<p>Multiple vendors have created data lake architectures, including <a href=\"https:\/\/aws.amazon.com\/lake-formation\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Lake Formation<\/a>. In addition, open-source solutions allow companies to access, load, and share data easily. One of the options for storing data in the AWS Cloud is <a href=\"https:\/\/docs.delta.io\/latest\/index.html\" target=\"_blank\" rel=\"noopener noreferrer\">Delta Lake<\/a>. The Delta Lake library enables reads and writes in open-source <a href=\"https:\/\/parquet.apache.org\/documentation\/latest\/\" target=\"_blank\" rel=\"noopener noreferrer\">Apache Parquet<\/a> file format, and provides capabilities like ACID transactions, scalable metadata handling, and unified streaming and batch data processing. Delta Lake offers a storage layer API that you can use to store data on top of an object-layer storage like <a href=\"https:\/\/aws.amazon.com\/s3\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3).<\/p>\n<p>Data is at the heart of ML\u2014training a traditional supervised model is impossible without access to high-quality historical data, which is commonly stored in a data lake. <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> is a fully managed service that provides a versatile workbench for building ML solutions and provides highly tailored tooling for data ingestion, data processing, model training, and model hosting. <a href=\"https:\/\/spark.apache.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Apache Spark<\/a> is a workhorse of modern data processing with an extensive API for loading and manipulating data. SageMaker has the ability to prepare data at petabyte scale using Spark to enable ML workflows that run in a highly distributed manner. 
This post highlights how you can take advantage of the capabilities offered by Delta Lake using [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html).

## Solution overview

In this post, we describe how to use SageMaker Studio notebooks to easily load and transform data stored in the Delta Lake format. We use a standard Jupyter notebook to run Apache Spark commands that read and write table data in CSV and Parquet format. The open-source library [delta-spark](https://github.com/delta-io/delta) allows you to directly access this data in its native format. This library exposes many API operations to apply data transformations, make schema modifications, and use time-travel or as-of-timestamp queries to pull a particular version of the data.

In our sample notebook, we load raw data into a Spark DataFrame, create a Delta table, query it, display audit history, demonstrate schema evolution, and show various methods for updating the table data. We use the [DataFrame API](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html) from the PySpark library to ingest and transform the dataset attributes. We use the `delta-spark` library to read and write data in Delta Lake format and to manipulate the underlying table structure, referred to as the *schema*.

We use SageMaker Studio, the built-in IDE from SageMaker, to create and run Python code from a Jupyter notebook. We have created a [GitHub repository](https://github.com/aws-samples/amazon-sagemaker-delta-lake-integration) that contains [this notebook](https://github.com/aws-samples/amazon-sagemaker-delta-lake-integration/blob/main/notebooks/studio_pyspark_local_delta_spark.ipynb) and other resources to run this sample on your own. The notebook demonstrates exactly how to interact with data stored in Delta Lake format, which permits tables to be accessed in place without replicating data across disparate datastores.

For this example, we use a publicly available dataset from [Lending Club](https://www.kaggle.com/wordsforthewise/lending-club) that represents customer loans data. We downloaded the `accepted` data file (`accepted_2007_to_2018Q4.csv.gz`) and selected a subset of the original attributes. This dataset is available under the [Creative Commons CC0 license](https://creativecommons.org/publicdomain/zero/1.0/).

## Prerequisites

You must install a few prerequisites prior to using the `delta-spark` functionality.
To satisfy required dependencies, we have to install some libraries into our Studio environment, which runs as a Dockerized container and is accessed via a Jupyter Gateway app:

- OpenJDK, for access to Java and associated libraries
- PySpark (Spark for Python)
- delta-spark, the open-source Delta Lake library

We can use either `conda` or `pip` to install these libraries, which are publicly available on `conda-forge`, PyPI, or Maven repositories.

This notebook is designed to run within SageMaker Studio. After you launch the notebook within Studio, make sure you choose the **Python 3 (Data Science)** kernel type. Additionally, we suggest using an instance type with at least 16 GB of RAM (like ml.g4dn.xlarge), which allows PySpark commands to run faster. Use the following commands to install the required dependencies, which make up the first several cells of the notebook:

```python
%conda install openjdk -q -y
%pip install pyspark==3.2.0
%pip install delta-spark==1.1.0
%pip install -U "sagemaker>2.72"
```

After the installation commands are complete, we're ready to run the core logic in the notebook.

## Implement the solution

To run Apache Spark commands, we need to instantiate a `SparkSession` object. After we include the necessary import commands, we configure the `SparkSession` by setting some additional configuration parameters. The parameter with key `spark.jars.packages` passes the names of additional libraries that Spark needs to run `delta` commands. The initial lines of code below assemble a list of these packages, using traditional Maven coordinates (`groupId:artifactId:version`), to pass to the `SparkSession`.

Additionally, the parameters with keys `spark.sql.extensions` and `spark.sql.catalog.spark_catalog` enable Spark to properly handle Delta Lake functionality. The final configuration parameter, with key `fs.s3a.aws.credentials.provider`, adds the `ContainerCredentialsProvider` class, which allows Studio to look up the [AWS Identity and Access Management](http://aws.amazon.com/iam) (IAM) role permissions made available via the container environment.
The code creates a `SparkSession` object that is properly initialized for the SageMaker Studio environment:

```python
from pyspark.sql import SparkSession

# Configure Spark to use additional library packages to satisfy dependencies

# Build list of package entries using Maven coordinates (groupId:artifactId:version)
pkg_list = []
pkg_list.append("io.delta:delta-core_2.12:1.1.0")
pkg_list.append("org.apache.hadoop:hadoop-aws:3.2.2")

packages = ",".join(pkg_list)
print('packages: ' + packages)

# Instantiate Spark via builder
# Note: we use the `ContainerCredentialsProvider` to give us access to underlying IAM role permissions
spark = (SparkSession
    .builder
    .appName("PySparkApp")
    .config("spark.jars.packages", packages)
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.ContainerCredentialsProvider")
    .getOrCreate())

sc = spark.sparkContext

print('Spark version: ' + str(sc.version))
```

In the next section, we upload a file containing a subset of the Lending Club consumer loans dataset to our default S3 bucket. The original dataset is very large (over 600 MB), so we provide a single representative file (2.6 MB) for use by this notebook. PySpark uses the `s3a` protocol to enable additional Hadoop library functionality. Therefore, we modify each native S3 URI from the `s3` protocol to use `s3a` in the cells throughout this notebook.
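As a minimal sketch of that substitution (the bucket and object key below are hypothetical placeholders, not values from the notebook), the change amounts to a one-line string replacement:

```python
# Hypothetical example: rewrite a native s3:// URI into the s3a:// form
# that the Hadoop S3A filesystem connector expects
s3_raw_csv = "s3://my-example-bucket/lending-club/loans-subset.csv"  # assumed path
s3a_raw_csv = s3_raw_csv.replace("s3://", "s3a://", 1)
print(s3a_raw_csv)  # -> s3a://my-example-bucket/lending-club/loans-subset.csv
```

The resulting `s3a_raw_csv` value is what the read command in the next cell consumes.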
We use Spark to read in the raw data (with options for both CSV and Parquet files) with the following code, which returns a Spark DataFrame named `loans_df`:

```python
loans_df = spark.read.csv(s3a_raw_csv, header=True)
```

The following screenshot shows the first 10 rows from the resulting DataFrame.

![First 10 rows of the loans DataFrame](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/03/08/ML-7249-image001.png)

We can now write out this DataFrame as a Delta Lake table with a single line of code by specifying `.format("delta")` and providing the S3 URI location where we want to write the table data:

```python
loans_df.write.format("delta").mode("overwrite").save(s3a_delta_table_uri)
```

The next few notebook cells show an option for querying the Delta Lake table. We can construct a standard SQL query, specify `delta` format and the table location, and submit this command using Spark SQL syntax:

```python
sql_cmd = f'SELECT * FROM delta.`{s3a_delta_table_uri}` ORDER BY loan_amnt'
sql_results = spark.sql(sql_cmd)
```

The following screenshot shows the results of our SQL query, ordered by `loan_amnt`.

![SQL query results ordered by loan_amnt](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/03/08/ML-7249-image003.png)

## Interact with Delta Lake tables

In this section, we showcase the [DeltaTable class](https://docs.delta.io/latest/api/python/index.html) from the `delta-spark` library. `DeltaTable` is the primary class for programmatically interacting with Delta Lake tables. This class includes several static methods for discovering information about a table. For example, the `isDeltaTable` method returns a Boolean value indicating whether the table is stored in `delta` format:

```python
from delta.tables import DeltaTable

# Use static method to determine table type
print(DeltaTable.isDeltaTable(spark, s3a_delta_table_uri))
```

You can create `DeltaTable` instances using the path of the Delta table, which in our case is the S3 URI location. In the following code, we retrieve the complete history of table modifications:

```python
deltaTable = DeltaTable.forPath(spark, s3a_delta_table_uri)
history_df = deltaTable.history()
history_df.head(3)
```

The output indicates the table has six modifications stored in the history, and shows the latest three versions.

![Table history showing the latest three versions](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/03/08/ML-7249-image005.png)
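The solution overview mentioned time-travel and as-of-timestamp queries. With several versions now visible in the history, a minimal sketch of both reads looks like the following; `versionAsOf` and `timestampAsOf` are standard delta-spark reader options, but the timestamp value below is a hypothetical placeholder:

```python
# Time travel: load the table exactly as it existed at version 0
loans_v0_df = (spark.read.format("delta")
               .option("versionAsOf", 0)
               .load(s3a_delta_table_uri))

# As-of-timestamp: load the latest version committed at or before a point in time
# (the date below is a hypothetical placeholder)
loans_asof_df = (spark.read.format("delta")
                 .option("timestampAsOf", "2022-03-01")
                 .load(s3a_delta_table_uri))
```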
## Schema evolution

In this section, we demonstrate how Delta Lake schema evolution works. By default, `delta-spark` forces table writes to abide by the existing schema by enforcing constraints, rejecting writes whose schema doesn't match the table. However, by specifying certain options, we can safely modify the schema of the table.

First, let's read the data back in from the Delta table. Because this data was written out in `delta` format, we need to specify `.format("delta")` when reading it, and then provide the S3 URI where the Delta table is located. Second, we write the DataFrame back out to a different S3 location, where we demonstrate schema evolution. See the following code:

```python
delta_df = spark.read.format("delta").load(s3a_delta_table_uri)
delta_df.write.format("delta").mode("overwrite").save(s3a_delta_update_uri)
```

Now we use the Spark DataFrame API to add two new columns to our dataset. The column names are `funding_type` and `excess_int_rate`, and the column values are set to constants using the DataFrame `withColumn` method. See the following code:

```python
from pyspark.sql.functions import lit

funding_type_col = "funding_type"
excess_int_rate_col = "excess_int_rate"

delta_update_df = (delta_df.withColumn(funding_type_col, lit("NA"))
                           .withColumn(excess_int_rate_col, lit(0.0)))
delta_update_df.dtypes
```

A quick look at the data types (`dtypes`) shows the additional columns are part of the DataFrame.

![DataFrame dtypes including the two new columns](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/03/08/ML-7249-image007.png)
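Before enabling evolution, it's worth seeing the default enforcement at work. As a hedged sketch (the exact exception type and message vary by Delta Lake version), writing `delta_update_df` back without any schema option should fail, because its two new columns no longer match the table's schema:

```python
# Schema enforcement: without the mergeSchema option, Delta Lake rejects a
# write whose schema differs from the existing table schema
try:
    delta_update_df.write.format("delta").mode("overwrite").save(s3a_delta_update_uri)
except Exception as e:  # typically an AnalysisException describing the mismatch
    print(type(e).__name__)
```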
Now we enable the schema modification, thereby changing the underlying schema of the Delta table, by setting the `mergeSchema` option to `true` in the following Spark write command:

```python
(delta_update_df.write.format("delta")
 .mode("overwrite")
 .option("mergeSchema", "true")  # option - evolve schema
 .save(s3a_delta_update_uri)
)
```

Let's check the modification history of our new table, which shows that the table schema has been modified:

```python
deltaTableUpdate = DeltaTable.forPath(spark, s3a_delta_update_uri)

# Retrieve the history, which now includes the schema change
history_update_df = deltaTableUpdate.history()
history_update_df.show(3)
```

The history listing shows each revision to the metadata.

![Table history after the schema change](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/03/08/ML-7249-image009.png)

## Conditional table updates

You can use the `DeltaTable` `update` method to run a predicate and apply a transform wherever the condition evaluates to true. In our case, we write the value `FullyFunded` to the `funding_type` column whenever the `loan_amnt` equals the `funded_amnt`. This is a powerful mechanism for writing conditional updates to your table data.

```python
from pyspark.sql.functions import col

deltaTableUpdate.update(
    condition = col("loan_amnt") == col("funded_amnt"),
    set = { funding_type_col: lit("FullyFunded") })
```

The following screenshot shows our results.

![Rows with funding_type set to FullyFunded](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/03/08/ML-7249-image011.png)

In the final change to the table data, we show the syntax for passing a function to the `update` method, which in our case calculates the excess interest rate by subtracting 10.0% from the loan's `int_rate` attribute. One more SQL command pulls records that meet our condition, using a WHERE clause to locate records with `int_rate` greater than 10.0%:

```python
# Function that calculates the rate overage (amount over 10.0)
def excess_int_rate(rate):
    return rate - 10.0

deltaTableUpdate.update(
    condition = col("int_rate") > 10.0,
    set = { excess_int_rate_col: excess_int_rate(col("int_rate")) })
```

The new `excess_int_rate` column now correctly contains the `int_rate` minus 10.0%.

![excess_int_rate values for loans with int_rate above 10.0](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/03/08/ML-7249-image013.png)
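The WHERE-clause query itself isn't reproduced in this post; a sketch of what it might look like, reusing the `delta.` SQL pattern from earlier (the selected columns are our choice for illustration), is:

```python
# Hypothetical verification query: pull records where int_rate exceeds 10.0
sql_cmd = (f"SELECT loan_amnt, int_rate, excess_int_rate "
           f"FROM delta.`{s3a_delta_update_uri}` "
           f"WHERE int_rate > 10.0")
spark.sql(sql_cmd).show(5)
```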
Our last notebook cell retrieves the history from the Delta table once more, this time showing the modifications made after the schema change:

```python
# Finally, retrieve the table history AFTER the schema and data modifications
history_update_df = deltaTableUpdate.history()
history_update_df.show(3)
```

The following screenshot shows our results.

![Table history after all modifications](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/03/08/ML-7249-image015.png)

## Conclusion

You can use SageMaker Studio notebooks to interact directly with data stored in the open-source Delta Lake format. In this post, we provided sample code that reads and writes this data using the open-source `delta-spark` library, which allows you to create, update, and manage the dataset as a [Delta table](https://docs.delta.io/latest/index.html). We also demonstrated the power of combining these critical technologies to extract value from preexisting data lakes, and showed how to use the capabilities of Delta Lake on SageMaker.

Our notebook sample provides an end-to-end recipe for installing prerequisites, instantiating Spark data structures, reading and writing DataFrames in Delta Lake format, and using functionality like schema evolution. You can integrate these technologies to magnify their power and deliver transformative business outcomes.

---

### About the Authors

**Paul Hargis** has focused his efforts on machine learning at several companies, including AWS, Amazon, and Hortonworks. He enjoys building technology solutions and teaching people how to make the most of them. Prior to his role at AWS, he was lead architect for Amazon Exports and Expansions, helping amazon.com improve the experience of international shoppers. Paul likes to help customers expand their machine learning initiatives to solve real-world problems.

**Vedant Jain** is a Sr. AI/ML Specialist Solutions Architect, helping customers derive value from the machine learning ecosystem at AWS. Prior to joining AWS, Vedant held ML/Data Science Specialty positions at companies such as Databricks, Hortonworks (now Cloudera), and JP Morgan Chase.
Outside of work, Vedant is passionate about making music, using science to lead a meaningful life, and exploring delicious vegetarian cuisine from around the world.