{"id":1942,"date":"2022-03-09T19:56:52","date_gmt":"2022-03-09T19:56:52","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/09\/predict-residential-real-estate-prices-at-immoscout24-with-amazon-sagemaker\/"},"modified":"2022-03-09T19:56:52","modified_gmt":"2022-03-09T19:56:52","slug":"predict-residential-real-estate-prices-at-immoscout24-with-amazon-sagemaker","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2022\/03\/09\/predict-residential-real-estate-prices-at-immoscout24-with-amazon-sagemaker\/","title":{"rendered":"Predict residential real estate prices at ImmoScout24 with Amazon SageMaker"},"content":{"rendered":"<div id=\"\">\n<p><em>This is a guest post by Oliver Frost, data scientist at ImmoScout24, in partnership with Lukas M\u00fcller, AWS Solutions Architect.<\/em><\/p>\n<p>In 2010, <a href=\"https:\/\/www.immobilienscout24.de\/\" target=\"_blank\" rel=\"noopener noreferrer\">ImmoScout24<\/a> released a price index for residential real estate in Germany: the IMX. It was based on ImmoScout24 listings. Besides the price, listings typically contain a lot of specific information such as the construction year, the plot size, or the number of rooms. This information allowed us to build a so-called hedonic price index, which considers the particular features of a real estate property.<\/p>\n<p>When we released the IMX, our goal was to establish it as the standard index for real estate prices in Germany. However, it struggled to capture the price increase in the German property market since the financial crisis of 2008. In addition, like a stock market index, it was an abstract figure that can\u2019t be interpreted directly. The IMX was therefore difficult to grasp for non-experts.<\/p>\n<p>At ImmoScout24, our mission is to make complex decisions easy, and we realized that we needed a new concept to fulfill it. 
Instead of another index, we decided to build a market report that everyone can easily understand: the WohnBarometer. It\u2019s based on our listings data and takes object properties into account. The key difference from the IMX is that the WohnBarometer shows rent and sale prices in Euro per square meter for specific residential real estate types over time. The figures can therefore be directly interpreted and allow our customers to answer questions such as \u201cDo I pay too much rent?\u201d or \u201cIs the apartment I am about to buy reasonably priced?\u201d or \u201cWhich city in my region is the most promising one for investing?\u201d Currently, the WohnBarometer is reported for Germany as a whole, the seven biggest cities, and alternating local markets.<\/p>\n<p>The following graph shows an example of the WohnBarometer, with sale prices for Berlin and their development per quarter.<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/14\/ML-4143-image001-500.png\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-33038 size-full aligncenter\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/14\/ML-4143-image001-500.png\" alt=\"\" width=\"500\" height=\"500\"><\/a><\/p>\n<p>This post discusses how ImmoScout24 used <a href=\"https:\/\/aws.amazon.com\/sagemaker\/\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon SageMaker<\/a> to create the model for the WohnBarometer in order to make it relevant for our customers. It covers the underlying data model, hyperparameter tuning, and technical setup. This post also shows how SageMaker enabled a single data scientist to complete the WohnBarometer within 2 months. It took a whole team 2 years to develop the first version of the IMX. 
Such an investment was not an option for the WohnBarometer.<\/p>\n<h2>About ImmoScout24<\/h2>\n<p>ImmoScout24 is the leading online platform for residential and commercial real estate in Germany. For over 20 years, ImmoScout24 has been revolutionizing the real estate market and supports over 20 million users each month in finding new homes or commercial spaces on its online marketplace or in its app. As a result, 99% of our target customer group knows ImmoScout24. With its digital solutions, the online marketplace coordinates and brings owners, realtors, tenants, and buyers together successfully. ImmoScout24 is working towards the goal of digitizing the process of real estate transactions and thereby making complex decisions easy. Since 2012, ImmoScout24 has also been active in the Austrian real estate market, reaching around 3 million users monthly.<\/p>\n<h2>From on-premises to AWS Data Pipeline to SageMaker<\/h2>\n<p>In this section, we discuss the previous setup and its challenges, and why we decided to use SageMaker for our new model.<\/p>\n<h3>The previous setup<\/h3>\n<p>When the first version of the IMX was published in 2010, the cloud was still a mystery to most businesses, including ImmoScout24. The field of machine learning (ML) was in its infancy and only a handful of experts knew how to code a model (for the sake of illustration, the first public release of Scikit-Learn was in February 2010). It\u2019s no surprise that the development of the IMX took more than 2 years and cost a seven-figure sum.<\/p>\n<p>In 2015, ImmoScout24 started its AWS migration and rebuilt the IMX on AWS infrastructure. 
With the data in our <a href=\"http:\/\/aws.amazon.com\/s3\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon Simple Storage Service<\/a> (Amazon S3) data lake, both the data preprocessing and the model training were now done on <a href=\"http:\/\/aws.amazon.com\/emr\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon EMR<\/a> clusters orchestrated by <a href=\"https:\/\/aws.amazon.com\/datapipeline\/\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Data Pipeline<\/a>. While the former was a PySpark ETL application, the latter consisted of several Python scripts using classical ML packages (such as Scikit-Learn).<\/p>\n<h3>Issues with this setup<\/h3>\n<p>Although this setup proved quite stable, troubleshooting the infrastructure or improving the model wasn\u2019t easy. A key problem with the model was its complexity, because some components had taken on a life of their own: in the end, the code of the outlier detection was almost twice as long as the code of the core IMX model itself.<\/p>\n<p>The core model, in fact, wasn\u2019t one model, but hundreds: one model per residential real estate type and region, with the definition varying from a single neighborhood in a big city to several villages in rural areas. We had, for example, one model for apartments for sale in the middle of Berlin and one model for houses for sale in a suburb of Munich. Because setting up the training of all these models took a lot of time, we omitted the hyperparameter tuning, which likely led to the models underperforming.<\/p>\n<h3>Why we decided on SageMaker<\/h3>\n<p>Given these issues and our ambition of having a market report with practical benefits, we had to decide between rewriting large parts of the existing code or starting from scratch. As you can infer from this post, we opted for the latter. But why SageMaker?<\/p>\n<p>Most of our time spent on the IMX went into troubleshooting the infrastructure, not improving the model. 
For the new market report, we wanted to flip this around, with the focus on the statistical performance of the model. We also wanted to have the flexibility to quickly replace individual components of the model, such as the optimization of the hyperparameters. What if a new superior boosting algorithm comes around (think about how XGBoost hit the stage in 2014)? Of course, we want to be among the first to adopt it!<\/p>\n<p>In SageMaker, the major components of the classical ML workflow\u2014preprocessing, training, hyperparameter tuning, and inference\u2014are neatly separated on the API level and also on the <a href=\"http:\/\/aws.amazon.com\/console\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Management Console<\/a>. Modifying them individually isn\u2019t difficult.<\/p>\n<h2>The new model<\/h2>\n<p>In this section, we discuss the components of the new model, including its input data, algorithm, hyperparameter tuning, and technical setup.<\/p>\n<h3>Input data<\/h3>\n<p>The WohnBarometer is based on a sliding window of 5 years of ImmoScout24 listings of residential real estate located in Germany. After we remove outliers and fraudulent listings, we\u2019re left with approximately 4 million listings that are split into train (60%), validation (20%), and test data (20%). The relationship between listings and objects is not necessarily 1:1; over the course of 5 years, it\u2019s likely that the same object is inserted multiple times (by multiple people).<\/p>\n<p>We use 13 listing attributes, such as the location of the property (WGS84 coordinates), the real estate type (house or apartment, sale or rent), its age (years), its size (square meters), or its condition (for example, new or refurbished). Given that each listing typically comes with dozens of attributes, the question arises: which to include in the model? 
On the one hand, we used domain knowledge; for example, it\u2019s well known that location is a key factor, and in almost all markets new properties are more expensive than existing ones. On the other hand, we relied on our experiences with the IMX and similar models. There we learned that including dozens of attributes doesn\u2019t significantly improve the model.<\/p>\n<p>Depending on the real estate type of the listing, the target variable of our model is either the rent per square meter or the sale price per square meter (we explain later why this choice wasn\u2019t ideal). Unlike the IMX, the WohnBarometer is therefore a number that can be directly interpreted and acted upon by our customers.<\/p>\n<h3>Model description<\/h3>\n<p>When using SageMaker, you can choose between different strategies for implementing your algorithm:<\/p>\n<ul>\n<li>Use one of SageMaker\u2019s built-in algorithms. There are almost 20 and they cover all major ML problem types.<\/li>\n<li>Customize a pre-made Docker image based on a standard ML framework (such as Scikit-Learn or PyTorch).<\/li>\n<li>Build your own algorithm and deploy it as a Docker image.<\/li>\n<\/ul>\n<p>For the WohnBarometer, we wanted a solution that is easy to maintain and allows us to focus on improving the model itself, not the underlying infrastructure. Therefore, we decided on the first option: use a fully-managed algorithm with proper documentation and fast support if needed. Next, we needed to pick the algorithm itself. Again, the decision wasn\u2019t difficult: we went for the XGBoost algorithm because it\u2019s one of the most renowned ML algorithms for regression-type problems, and we have already successfully used it in several projects.<\/p>\n<h3>Hyperparameter tuning<\/h3>\n<p>Most ML algorithms come with a myriad of parameters to tweak. Boosting algorithms, for example, have many parameters specifying how exactly the trees are built: Do the trees have at maximum 20 or 30 leaves? 
Is each tree based on all rows and columns or only samples? How heavily to prune the trees? Finding the optimal values of those parameters (as measured by an evaluation metric of your choice), the so-called hyperparameter tuning, is critical to building a powerful ML model.<\/p>\n<p>A key question in hyperparameter tuning is which parameters to tune and how to set the search ranges. You might ask, why not check all possible combinations? Although in theory this sounds like a good idea, it would result in an enormous hyperparameter space with way too many points to evaluate them all at a reasonable price. That is why ML practitioners typically select a small number of hyperparameters known to have a strong impact on the performance of the chosen algorithm.<\/p>\n<p>After the hyperparameter space is defined, the next task is to find the best combination of values in it. The following techniques are commonly employed:<\/p>\n<ul>\n<li><strong>Grid search<\/strong> \u2013 Divide the space in a discrete grid and then evaluate all points in the grid with cross-validation.<\/li>\n<li><strong>Random search<\/strong> \u2013 Randomly draw combinations from the space. With this approach, you\u2019ll most likely miss the best combination, but it serves as a good benchmark.<\/li>\n<li><strong>Bayesian optimization <\/strong>\u2013 Build a probabilistic model of the objective function and use this model to generate new combinations. The model is updated after each combination, leading quickly to good results.<\/li>\n<\/ul>\n<p>In recent years, thanks to cheap compute power, Bayesian optimization has become the gold standard in hyperparameter tuning, and is the default setting in SageMaker.<\/p>\n<h3>Technical setup<\/h3>\n<p>As with many other AWS services, you can create SageMaker jobs on the console, with the <a href=\"http:\/\/aws.amazon.com\/cli\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Command Line Interface<\/a> (AWS CLI), or via code. 
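<p>Before looking at how the jobs are set up, the first two search strategies described above can be made concrete with a short plain-Python sketch. The objective function, parameter names, and ranges below are invented for illustration only; in the real setup, evaluating one combination means running a full SageMaker training job and reading off the validation error of the model.<\/p>

```python
import itertools
import random

# Toy stand-in for the tuning objective: lower is better. In a real
# setup, evaluating one point would mean training the XGBoost model
# with these hyperparameters and measuring its validation error.
def validation_error(learning_rate, num_round):
    return (learning_rate - 0.1) ** 2 + (num_round - 300) ** 2 / 1e6

# Grid search: divide the space into a discrete grid and evaluate
# every point of the grid.
grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "num_round": [100, 300, 500],
}
best_grid = min(
    itertools.product(grid["learning_rate"], grid["num_round"]),
    key=lambda params: validation_error(*params),
)

# Random search: draw a fixed budget of combinations from continuous
# ranges instead of enumerating a grid.
rng = random.Random(42)
candidates = [(rng.uniform(0.01, 0.3), rng.randint(100, 500)) for _ in range(30)]
best_random = min(candidates, key=lambda params: validation_error(*params))

print(best_grid)  # the grid optimum: (0.1, 300)
```

<p>Bayesian optimization, the SageMaker default, replaces the independent random draws with proposals from a probabilistic model of the objective, so each new combination is informed by all previous results.<\/p>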
We chose the third option, the SageMaker Python SDK to be precise, because it allows for a highly automated setup: the WohnBarometer lives in a Python software project that is command-line executable. For example, all steps of the ML pipeline, such as the preprocessing or the model training, can be triggered via Bash commands. Those Bash commands, in turn, are orchestrated with a Jenkins pipeline powered by <a href=\"https:\/\/aws.amazon.com\/fargate\" target=\"_blank\" rel=\"noopener noreferrer\">AWS Fargate<\/a>.<\/p>\n<p>Let\u2019s look at the steps and the underlying infrastructure:<\/p>\n<ul>\n<li><strong>Preprocessing<\/strong> \u2013 The preprocessing is done with the built-in Scikit-Learn library in SageMaker. Because it involves joining data frames with millions of rows, we need an ml.m5.24xlarge machine here, the largest you can get in the ml.m family. Alternatively, we could have used multiple smaller machines with a distributed framework like Dask, but we wanted to keep it as simple as possible.<\/li>\n<li><strong>Training<\/strong> \u2013 We use the default SageMaker XGBoost algorithm. The training is done with two ml.m5.12xlarge machines. It\u2019s worth mentioning that our train.py containing the code of the model training and the hyperparameter tuning has fewer than 100 lines.<\/li>\n<li><strong>Hyperparameter tuning<\/strong> \u2013 Following the principle of less is more, we only tune 11 hyperparameters (for example, the number of boosting rounds and the learning rate), which gives us time to carefully choose their ranges and inspect how they interact with each other. With only a few hyperparameters, each training job runs relatively fast; in our case the jobs take 10\u201320 minutes. With a maximum of 30 training jobs and 2 concurrent jobs, the total training time is around 3 hours.<\/li>\n<li><strong>Inference<\/strong> \u2013 SageMaker offers multiple options to serve your model. 
We use batch transform jobs because we only need the WohnBarometer numbers once a quarter. We didn\u2019t use an endpoint because it would be idle most of the time. Each batch job (approximately 6.8 million rows) is served by a single ml.m5.4xlarge machine in less than 10 minutes.<\/li>\n<\/ul>\n<p>We can easily debug these steps on the SageMaker console. If, for example, a training job is taking longer than expected, we navigate to the <strong>Training <\/strong>page, locate the training job in question, and review the <a href=\"http:\/\/aws.amazon.com\/cloudwatch\" target=\"_blank\" rel=\"noopener noreferrer\">Amazon CloudWatch<\/a> metrics of the underlying machines.<\/p>\n<p>The following architecture diagram shows the infrastructure of the WohnBarometer:<\/p>\n<p><a href=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/14\/ML-4143-image003.png\"><img decoding=\"async\" loading=\"lazy\" class=\"alignnone size-full wp-image-33034\" src=\"https:\/\/d2908q01vomqb2.cloudfront.net\/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59\/2022\/02\/14\/ML-4143-image003.png\" alt=\"\" width=\"667\" height=\"297\"><\/a><\/p>\n<h2>Challenges and learnings<\/h2>\n<p>In the beginning, everything went smoothly: within a few days we set up the software project and trained a miniature version of our model in SageMaker. With the full dataset and the hyperparameter tuning in place, we had high hopes for the first run. Unfortunately, the results weren\u2019t satisfactory. We had the following key issues:<\/p>\n<ul>\n<li>The predictions of the model were too low, both for rent and sale objects. For Berlin, for example, the sale prices predicted for our reference objects were roughly 50% below the market prices.<\/li>\n<li>According to the model, there was no significant price difference between new and existing buildings. 
The truth is that new buildings are almost always significantly more expensive than existing buildings.<\/li>\n<li>The effect of the location on the price wasn\u2019t captured correctly. We know, for example, that apartments for sale in Frankfurt am Main are, on average, more expensive than in Berlin (although Berlin is catching up); our model, however, predicted it the other way round.<\/li>\n<\/ul>\n<p>What was the problem and how did we solve it?<\/p>\n<h3>Sampling of the features<\/h3>\n<p>At first glance, it looks like the issues aren\u2019t related, but indeed they are. When column sampling is enabled, XGBoost builds each tree with a random sample of the features. Let\u2019s say a model has 10 features F<sub>1<\/sub>, F<sub>2<\/sub>, \u2026 F<sub>10<\/sub>; the algorithm might then use F<sub>1<\/sub>, F<sub>4<\/sub>, and F<sub>7<\/sub> for one tree, and F<sub>3<\/sub>, F<sub>4<\/sub>, and F<sub>8<\/sub> for another. While in general this behavior effectively prevents overfitting, it can be problematic if the number of features is small and some of them have a big effect on the target variable. In this case, many trees will miss the crucial features.<\/p>\n<p>XGBoost\u2019s sampling of our 13 features led to many trees including none of the crucial features\u2014real estate type, location, and new or existing buildings\u2014and as a consequence caused these issues. Luckily, there is a parameter to control the sampling: <code>colsample_bytree<\/code> (in fact, there are two more parameters to control the sampling, but we didn\u2019t touch them). When we checked our code, we saw that <code>colsample_bytree<\/code> was set to 0.5, a value we carried over from past projects. As soon as we set it to the default value of 1, the preceding issues were gone.<\/p>\n<h3>One model vs. multiple models<\/h3>\n<p>Unlike the IMX, the WohnBarometer model really is only one model. Although this minimizes the maintenance effort, it\u2019s not ideal from a statistical point of view. 
Because our training data contains both sale and rent objects, the spread in the target variable is huge: it ranges from below 5 Euro for some rent apartments to well above 10,000 Euro for houses for sale in first-class locations. The big challenge for the model is to understand that an error of 5 Euro is fantastic for sale objects, but disastrous for rent objects.<\/p>\n<p>In hindsight, knowing how easy it is to maintain multiple models in SageMaker, we would have built at least two models: one for rent and one for sale objects. This would make it easier to capture the peculiarities of both markets. For example, the price of unrented apartments for sale is typically 20\u201330% higher than for rented apartments for sale. Therefore, encoding this information as a dummy variable in the sale model makes a lot of sense; for the rent model, on the other hand, you could leave it out.<\/p>\n<h2>Conclusion<\/h2>\n<p>Did the WohnBarometer meet the goal of being relevant to our customers? Taking media coverage as an indication, the answer is a clear yes: as of November 2021, more than 700 newspaper articles and TV or radio reports on the WohnBarometer have been published. The list includes national newspapers such as Frankfurter Allgemeine Zeitung, Tagesspiegel, and Handelsblatt, and local newspapers that often ask for WohnBarometer figures for their region. Because we calculate the figures for all regions of Germany anyway, we\u2019re happy to take such requests. With the old IMX, this level of granularity wasn\u2019t possible.<\/p>\n<p>The WohnBarometer outperforms the IMX with regard to statistical performance, in particular when it comes to the costs: the IMX was generated by an EMR cluster with 10 task nodes running for almost half a day. In contrast, all WohnBarometer steps take less than 5 hours using medium-sized machines. 
This results in cost savings of almost 75%.<\/p>\n<p>Thanks to SageMaker, we were able to bring a complex ML model into production with one data scientist in less than 2 months. This is remarkable: 10 years earlier, when ImmoScout24 built the IMX, reaching the same milestone took more than 2 years and involved a whole team.<\/p>\n<p>How could we be so efficient? SageMaker allowed us to focus on the model instead of the infrastructure, and SageMaker promotes a microservice architecture that is easy to maintain. If we got stuck with something, we could call on AWS support. In the past, when one of our IMX data pipelines failed, we would sometimes spend days debugging it. Since we started publishing WohnBarometer figures in April 2021, the SageMaker infrastructure hasn\u2019t failed a single time.<\/p>\n<p>To learn more about the WohnBarometer, check out <a href=\"https:\/\/www.immobilienscout24.de\/wissen\/kaufen\/wohnbarometer.html\" target=\"_blank\" rel=\"noopener noreferrer\">WohnBarometer<\/a> and <a href=\"https:\/\/www.immobilienscout24.de\/unternehmen\/news-medien\/news\/default-title\/immoscout24-wohnbarometer-angebotsmieten-stiegen-2021-bundesweit-wieder-staerker-an\/\" target=\"_blank\" rel=\"noopener noreferrer\">WohnBarometer: Angebotsmieten stiegen 2021 bundesweit wieder st\u00e4rker an<\/a>. To learn more about using the SageMaker Scikit-Learn library for preprocessing, see <a href=\"https:\/\/aws.amazon.com\/blogs\/machine-learning\/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn\/\" target=\"_blank\" rel=\"noopener noreferrer\">Preprocess input data before making predictions using Amazon SageMaker inference pipelines and Scikit-learn<\/a>. 
Please send us feedback, either on the <a href=\"https:\/\/forums.aws.amazon.com\/forum.jspa?forumID=285\" target=\"_blank\" rel=\"noopener noreferrer\">AWS forum for Amazon SageMaker<\/a>, or through your AWS support contacts.<\/p>\n<p>The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.<\/p>\n<hr>\n<h3>About the Authors<\/h3>\n<p><strong>Oliver Frost<\/strong> joined ImmoScout24 in 2017 as a business analyst. Two years later, he became a data scientist in a team whose job it is to turn ImmoScout24 data into genuine data products. Before building the WohnBarometer model, he ran smaller SageMaker projects. Oliver holds several AWS certifications, including the Machine Learning Specialty.<\/p>\n<p><strong>Lukas M\u00fcller<\/strong> is a Solutions Architect at AWS. He works with customers in the sports, media, and entertainment industries. He is always looking for ways to combine technical enablement with cultural and organizational enablement to help customers achieve business value with cloud technologies.<\/p>\n      
<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/aws.amazon.com\/blogs\/machine-learning\/predict-residential-real-estate-prices-at-immoscout24-with-amazon-sagemaker\/<\/p>\n","protected":false},"author":0,"featured_media":1943,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1942"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1942"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1942\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1943"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?parent=1942"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1942"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1942"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}