{"id":1077,"date":"2021-10-23T08:42:05","date_gmt":"2021-10-23T08:42:05","guid":{"rendered":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/23\/a-gentle-introduction-to-vector-space-models\/"},"modified":"2021-10-23T08:42:05","modified_gmt":"2021-10-23T08:42:05","slug":"a-gentle-introduction-to-vector-space-models","status":"publish","type":"post","link":"https:\/\/salarydistribution.com\/machine-learning\/2021\/10\/23\/a-gentle-introduction-to-vector-space-models\/","title":{"rendered":"A Gentle Introduction to Vector Space Models"},"content":{"rendered":"<div id=\"\">\n<p id=\"last-modified-info\">Last Updated on October 23, 2021<\/p>\n<p>A vector space model considers the relationships between data that are represented by vectors. It is popular in information retrieval systems but is also useful for other purposes. In general, it allows us to compare the similarity of two vectors from a geometric perspective.<\/p>\n<p>In this tutorial, we will see what a vector space model is and what it can do.<\/p>\n<p>After completing this tutorial, you will know:<\/p>\n<ul>\n<li>What a vector space model is and the properties of cosine similarity<\/li>\n<li>How cosine similarity can help you compare two vectors<\/li>\n<li>The difference between cosine similarity and L2 distance<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<div id=\"attachment_4963\" class=\"wp-caption aligncenter\"><img decoding=\"async\" aria-describedby=\"caption-attachment-4963\" loading=\"lazy\" class=\"size-full wp-image-4963\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/29539982252_f2d3e260be_k.jpg\" alt=\"A Gentle Introduction to Sparse 
Matrices for Machine Learning\" width=\"640\" height=\"480\"><\/p>\n<p id=\"caption-attachment-4963\" class=\"wp-caption-text\">A Gentle Introduction to Vector Space Models<br \/>Photo by <a href=\"https:\/\/www.flickr.com\/photos\/49317207@N02\/29539982252\/\">liamfletch<\/a>, some rights reserved.<\/p>\n<\/div>\n<h2>Tutorial overview<\/h2>\n<p>This tutorial is divided into 3 parts; they are:<\/p>\n<ol>\n<li>Vector space and cosine formula<\/li>\n<li>Using vector space model for similarity<\/li>\n<li>Common use of vector space models and cosine distance<\/li>\n<\/ol>\n<h2>Vector space and cosine formula<\/h2>\n<p>A vector space is a mathematical term for a set of vectors together with some vector operations. In layman\u2019s terms, we can imagine it as an $n$-dimensional metric space where each point is represented by an $n$-dimensional vector. In this space, we can do any vector addition or scalar-vector multiplication.<\/p>\n<p>Considering a vector space is useful because many things are conveniently represented as vectors. For example, in machine learning, a data point usually has multiple features. Therefore, it is convenient for us to represent a data point as a vector.<\/p>\n<p>With a vector, we can compute its <strong>norm<\/strong>. The most common one is the L2-norm, which is the length of the vector. With two vectors in the same vector space, we can find their difference. Assume a 3-dimensional vector space in which the two vectors are $(x_1, x_2, x_3)$ and $(y_1, y_2, y_3)$. Their difference is the vector $(y_1-x_1, y_2-x_2, y_3-x_3)$, and the L2-norm of the difference is the <strong>distance<\/strong>, or more precisely the Euclidean distance, between those two vectors:<\/p>\n<p>$$<br \/>\\sqrt{(y_1-x_1)^2+(y_2-x_2)^2+(y_3-x_3)^2}<br \/>$$<\/p>\n<p>Besides distance, we can also consider the <strong>angle<\/strong> between two vectors. 
If we consider the vector $(x_1, x_2, x_3)$ as a line segment from the point $(0,0,0)$ to $(x_1,x_2,x_3)$ in the 3D coordinate system, then there is another line segment from $(0,0,0)$ to $(y_1,y_2, y_3)$. They make an angle at their intersection:<\/p>\n<p><img decoding=\"async\" loading=\"lazy\" class=\"aligncenter wp-image-13008\" src=\"https:\/\/machinelearningmastery.com\/wp-content\/uploads\/2021\/10\/cosine.png\" alt=\"\" width=\"450\" height=\"450\"><\/p>\n<p>The angle between the two line segments can be found using the cosine formula:<\/p>\n<p>$$<br \/>\\cos \\theta = \\frac{a\\cdot b}{\\lVert a\\rVert_2\\lVert b\\rVert_2}<br \/>$$<\/p>\n<p>where $a\\cdot b$ is the vector dot-product and $\\lVert a\\rVert_2$ is the L2-norm of vector $a$. This formula arises from considering the dot-product as the projection of vector $a$ onto the direction pointed to by vector $b$. The nature of cosine tells us that, as the angle $\\theta$ increases from 0 to 90 degrees, cosine decreases from 1 to 0. Sometimes we call $1-\\cos\\theta$ the <strong>cosine distance<\/strong> because, over that range of angles, it runs from 0 to 1 as the two vectors move further away from each other. This is an important property that we are going to exploit in the vector space model.<\/p>\n<h2>Using vector space model for similarity<\/h2>\n<p>Let\u2019s look at an example of how the vector space model is useful.<\/p>\n<p>The World Bank collects various data about countries and regions in the world. While every country is different, we can try to compare countries under the vector space model. For convenience, we will use the <code>pandas_datareader<\/code> module in Python to read data from the World Bank. 
You may install <code>pandas_datareader<\/code> using the <code>pip<\/code> or <code>conda<\/code> command:<\/p>\n<div id=\"urvanov-syntax-highlighter-6173641ca93b4756838352\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\npip install pandas_datareader<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>pip install pandas_datareader<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>Each data series collected by the World Bank is named by an identifier. For example, \u201cSP.URB.TOTL\u201d is the total urban population of a country. Many of the series are yearly. When we download a series, we have to specify the start and end years. Usually the most recent data are not yet available. 
Hence it is best to look at data from a few years back rather than the most recent year, to avoid missing data.<\/p>\n<p>Below, we collect some economic data for <em>every<\/em> country in 2010:<\/p>\n<div id=\"urvanov-syntax-highlighter-6173641ca93bb971361413\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\nfrom pandas_datareader import wb<br \/>\nimport pandas as pd<br \/>\npd.options.display.width = 0<\/p>\n<p>names = [<br \/>\n    \"NE.EXP.GNFS.CD\", # Exports of goods and services (current US$)<br \/>\n    \"NE.IMP.GNFS.CD\", # Imports of goods and services (current US$)<br \/>\n    \"NV.AGR.TOTL.CD\", # Agriculture, forestry, and fishing, value added (current US$)<br \/>\n    \"NY.GDP.MKTP.CD\", # GDP (current US$)<br \/>\n    \"NE.RSB.GNFS.CD\", # External balance on goods and services (current US$)<br \/>\n]<\/p>\n<p>df = wb.download(country=\"all\", indicator=names, start=2010, end=2010).reset_index()<br \/>\ncountries = wb.get_countries()<br \/>\nnon_aggregates = countries[countries[\"region\"] != \"Aggregates\"].name<br \/>\ndf_nonagg = df[df[\"country\"].isin(non_aggregates)].dropna()<br \/>\nprint(df_nonagg)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td 
class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">pandas_datareader <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">wb<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">pandas <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">pd<\/span><\/p>\n<p><span class=\"crayon-v\">pd<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">options<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">display<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">width<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">names<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">\"NE.EXP.GNFS.CD\"<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\"># Exports of goods and services (current US$)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">\"NE.IMP.GNFS.CD\"<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\"># Imports of goods and services (current US$)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">\"NV.AGR.TOTL.CD\"<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\"># Agriculture, forestry, and fishing, value added (current US$)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span 
class=\"crayon-s\">\"NY.GDP.MKTP.CD\"<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\"># GDP (current US$)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">\"NE.RSB.GNFS.CD\"<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\"># External balance on goods and services (current US$)<\/span><\/p>\n<p><span class=\"crayon-sy\">]<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">df<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">wb<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">download<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">country<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">\"all\"<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">indicator<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">start<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">2010<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">end<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">2010<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">reset_index<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">countries<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">wb<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">get_countries<\/span><span class=\"crayon-sy\">(<\/span><span 
class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">non_aggregates<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">countries<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">countries<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">\"region\"<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">!=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">\"Aggregates\"<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">name<\/span><\/p>\n<p><span class=\"crayon-v\">df_nonagg<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">\"country\"<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">isin<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">non_aggregates<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">dropna<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">df_nonagg<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<div id=\"urvanov-syntax-highlighter-6173641ca93bd198843669\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea 
class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n                 country  year  NE.EXP.GNFS.CD  NE.IMP.GNFS.CD  NV.AGR.TOTL.CD  NY.GDP.MKTP.CD  NE.RSB.GNFS.CD<br \/>\n50               Albania  2010    3.337089e+09    5.792189e+09    2.141580e+09    1.192693e+10   -2.455100e+09<br \/>\n51               Algeria  2010    6.197541e+10    5.065473e+10    1.364852e+10    1.612073e+11    1.132067e+10<br \/>\n54                Angola  2010    5.157282e+10    3.568226e+10    5.179055e+09    8.379950e+10    1.589056e+10<br \/>\n55   Antigua and Barbuda  2010    9.142222e+08    8.415185e+08    1.876296e+07    1.148700e+09    7.270370e+07<br \/>\n56             Argentina  2010    8.020887e+10    6.793793e+10    3.021382e+10    4.236274e+11    1.227093e+10<br \/>\n..                   &#8230;   &#8230;             &#8230;             &#8230;             &#8230;             &#8230;             &#8230;<br \/>\n259        Venezuela, RB  2010    1.121794e+11    6.922736e+10    2.113513e+10    3.931924e+11    4.295202e+10<br \/>\n260              Vietnam  2010    8.347359e+10    9.299467e+10    2.130649e+10    1.159317e+11   -9.521076e+09<br \/>\n262   West Bank and Gaza  2010    1.367300e+09    5.264300e+09    8.716000e+08    9.681500e+09   -3.897000e+09<br \/>\n264               Zambia  2010    7.503513e+09    6.256989e+09    1.909207e+09    2.026556e+10    1.246524e+09<br \/>\n265             Zimbabwe  2010    3.569254e+09    6.440274e+09    1.157187e+09    1.204166e+10   -2.871020e+09<\/p>\n<p>[174 rows x 7 columns]<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
country\u00a0\u00a0year\u00a0\u00a0NE.EXP.GNFS.CD\u00a0\u00a0NE.IMP.GNFS.CD\u00a0\u00a0NV.AGR.TOTL.CD\u00a0\u00a0NY.GDP.MKTP.CD\u00a0\u00a0NE.RSB.GNFS.CD<\/p>\n<p>50\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Albania\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a03.337089e+09\u00a0\u00a0\u00a0\u00a05.792189e+09\u00a0\u00a0\u00a0\u00a02.141580e+09\u00a0\u00a0\u00a0\u00a01.192693e+10\u00a0\u00a0 -2.455100e+09<\/p>\n<p>51\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Algeria\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a06.197541e+10\u00a0\u00a0\u00a0\u00a05.065473e+10\u00a0\u00a0\u00a0\u00a01.364852e+10\u00a0\u00a0\u00a0\u00a01.612073e+11\u00a0\u00a0\u00a0\u00a01.132067e+10<\/p>\n<p>54\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Angola\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a05.157282e+10\u00a0\u00a0\u00a0\u00a03.568226e+10\u00a0\u00a0\u00a0\u00a05.179055e+09\u00a0\u00a0\u00a0\u00a08.379950e+10\u00a0\u00a0\u00a0\u00a01.589056e+10<\/p>\n<p>55\u00a0\u00a0 Antigua and Barbuda\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a09.142222e+08\u00a0\u00a0\u00a0\u00a08.415185e+08\u00a0\u00a0\u00a0\u00a01.876296e+07\u00a0\u00a0\u00a0\u00a01.148700e+09\u00a0\u00a0\u00a0\u00a07.270370e+07<\/p>\n<p>56\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Argentina\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a08.020887e+10\u00a0\u00a0\u00a0\u00a06.793793e+10\u00a0\u00a0\u00a0\u00a03.021382e+10\u00a0\u00a0\u00a0\u00a04.236274e+11\u00a0\u00a0\u00a0\u00a01.227093e+10<\/p>\n<p>..\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 &#8230;\u00a0\u00a0 &#8230;\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 &#8230;\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 &#8230;\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
&#8230;\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 &#8230;\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 &#8230;<\/p>\n<p>259\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Venezuela, RB\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a01.121794e+11\u00a0\u00a0\u00a0\u00a06.922736e+10\u00a0\u00a0\u00a0\u00a02.113513e+10\u00a0\u00a0\u00a0\u00a03.931924e+11\u00a0\u00a0\u00a0\u00a04.295202e+10<\/p>\n<p>260\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0Vietnam\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a08.347359e+10\u00a0\u00a0\u00a0\u00a09.299467e+10\u00a0\u00a0\u00a0\u00a02.130649e+10\u00a0\u00a0\u00a0\u00a01.159317e+11\u00a0\u00a0 -9.521076e+09<\/p>\n<p>262\u00a0\u00a0 West Bank and Gaza\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a01.367300e+09\u00a0\u00a0\u00a0\u00a05.264300e+09\u00a0\u00a0\u00a0\u00a08.716000e+08\u00a0\u00a0\u00a0\u00a09.681500e+09\u00a0\u00a0 -3.897000e+09<\/p>\n<p>264\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Zambia\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a07.503513e+09\u00a0\u00a0\u00a0\u00a06.256989e+09\u00a0\u00a0\u00a0\u00a01.909207e+09\u00a0\u00a0\u00a0\u00a02.026556e+10\u00a0\u00a0\u00a0\u00a01.246524e+09<\/p>\n<p>265\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Zimbabwe\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a03.569254e+09\u00a0\u00a0\u00a0\u00a06.440274e+09\u00a0\u00a0\u00a0\u00a01.157187e+09\u00a0\u00a0\u00a0\u00a01.204166e+10\u00a0\u00a0 -2.871020e+09<\/p>\n<p>\u00a0<\/p>\n<p>[174 rows x 7 columns]<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>In the above we obtained some economic metrics of each country in 2010. The function <code>wb.download()<\/code> will download the data from World Bank and return a pandas dataframe. 
Similarly, <code>wb.get_countries()<\/code> gets the names of the countries and regions as identified by the World Bank, which we use to filter out non-country aggregates such as \u201cEast Asia\u201d and \u201cWorld\u201d. Pandas allows filtering rows by boolean indexing: <code>df[\"country\"].isin(non_aggregates)<\/code> gives a boolean vector marking which rows are in the list of <code>non_aggregates<\/code> and, based on that, <code>df[df[\"country\"].isin(non_aggregates)]<\/code> selects only those rows. For various reasons, not all countries have all the data. Hence we use <code>dropna()<\/code> to remove those with missing data. In practice, we may want to apply some imputation techniques instead of merely removing them. But for this example, we proceed with the 174 remaining data points.<\/p>\n<p>To illustrate the idea clearly rather than hiding the actual manipulation inside pandas or numpy functions, we first extract the data for each country as a vector:<\/p>\n<div id=\"urvanov-syntax-highlighter-6173641ca93be874257446\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n...<br \/>\nvectors = {}<br \/>\nfor rowid, row in df_nonagg.iterrows():<br \/>\n    vectors[row[\"country\"]] = row[names].values<\/p>\n<p>print(vectors)<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<p><span class=\"crayon-v\">vectors<\/span><span 
class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-sy\">}<\/span><\/p>\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">rowid<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">row <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">df_nonagg<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">iterrows<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">vectors<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">row<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">\"country\"<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">row<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">values<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">vectors<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<div id=\"urvanov-syntax-highlighter-6173641ca93bf897737787\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n{&#8216;Albania&#8217;: 
array([3337088824.25553, 5792188899.58985, 2141580308.0144,<br \/>\n11926928505.5231, -2455100075.33431], dtype=object),<br \/>\n&#8216;Algeria&#8217;: array([61975405318.205, 50654732073.2396, 13648522571.4516,<br \/>\n161207310515.42, 11320673244.9655], dtype=object),<br \/>\n&#8216;Angola&#8217;: array([51572818660.8665, 35682259098.1843, 5179054574.41704,<br \/>\n83799496611.2004, 15890559562.6822], dtype=object),<br \/>\n&#8230;<br \/>\n&#8216;West Bank and Gaza&#8217;: array([1367300000.0, 5264300000.0, 871600000.0, 9681500000.0,<br \/>\n-3897000000.0], dtype=object),<br \/>\n&#8216;Zambia&#8217;: array([7503512538.82554, 6256988597.27752, 1909207437.82702,<br \/>\n20265559483.8548, 1246523941.54802], dtype=object),<br \/>\n&#8216;Zimbabwe&#8217;: array([3569254400.0, 6440274000.0, 1157186600.0, 12041655200.0,<br \/>\n-2871019600.0], dtype=object)}<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>{&#8216;Albania&#8217;: array([3337088824.25553, 5792188899.58985, 2141580308.0144,<\/p>\n<p>11926928505.5231, -2455100075.33431], dtype=object),<\/p>\n<p>&#8216;Algeria&#8217;: array([61975405318.205, 50654732073.2396, 13648522571.4516,<\/p>\n<p>161207310515.42, 11320673244.9655], dtype=object),<\/p>\n<p>&#8216;Angola&#8217;: array([51572818660.8665, 35682259098.1843, 5179054574.41704,<\/p>\n<p>83799496611.2004, 15890559562.6822], dtype=object),<\/p>\n<p>&#8230;<\/p>\n<p>&#8216;West Bank and Gaza&#8217;: array([1367300000.0, 5264300000.0, 871600000.0, 9681500000.0,<\/p>\n<p>-3897000000.0], dtype=object),<\/p>\n<p>&#8216;Zambia&#8217;: array([7503512538.82554, 6256988597.27752, 1909207437.82702,<\/p>\n<p>20265559483.8548, 1246523941.54802], dtype=object),<\/p>\n<p>&#8216;Zimbabwe&#8217;: array([3569254400.0, 6440274000.0, 1157186600.0, 
12041655200.0,<\/p>\n<p>-2871019600.0], dtype=object)}<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>The Python dictionary we created has the name of each country as the key and its economic metrics as a numpy array. There are 5 metrics, hence each is a 5-dimensional vector.<\/p>\n<p>This lets us use the vector representation of each country to see how similar it is to another. Let\u2019s try both the L2-norm of the difference (the Euclidean distance) and the cosine distance. We pick one country, such as Australia, and compare it to all other countries on the list based on the selected economic metrics.<\/p>\n<div id=\"urvanov-syntax-highlighter-6173641ca93c4531463049\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n...<br \/>\nimport numpy as np<\/p>\n<p>euclid = {}<br \/>\ncosine = {}<br \/>\ntarget = \"Australia\"<\/p>\n<p>for country in vectors:<br \/>\n    vecA = vectors[target]<br \/>\n    vecB = vectors[country]<br \/>\n    dist = np.linalg.norm(vecA - vecB)<br \/>\n    cos = (vecA @ vecB) \/ (np.linalg.norm(vecA) * np.linalg.norm(vecB))<br \/>\n    euclid[country] = dist    # Euclidean distance<br \/>\n    cosine[country] = 1-cos   # cosine distance<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">numpy 
<\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">np<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">euclid<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-sy\">}<\/span><\/p>\n<p><span class=\"crayon-v\">cosine<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-sy\">}<\/span><\/p>\n<p><span class=\"crayon-v\">target<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">\"Australia\"<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">country <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vectors<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">vecA<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vectors<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">target<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">vecB<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vectors<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">country<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">dist<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-v\">np<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">linalg<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">norm<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">vecA<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vecB<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">cos<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">vecA<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">@<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vecB<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">np<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">linalg<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">norm<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">vecA<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">*<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">np<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">linalg<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">norm<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">vecB<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">euclid<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">country<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">dist<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Euclidean distance<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">cosine<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">country<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-i\">cos<\/span><span class=\"crayon-h\">\u00a0\u00a0 <\/span><span class=\"crayon-p\"># cosine distance<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>In the for-loop above, we set <code>vecA<\/code> to the vector of the target country (i.e., Australia) and <code>vecB<\/code> to that of the country being compared. Then we compute the L2-norm of their difference as the Euclidean distance between the two vectors. We also compute the cosine similarity using the formula and subtract it from 1 to get the cosine distance. 
With more than a hundred countries, we can see which one has the shortest Euclidean distance to Australia:<\/p>\n<div id=\"urvanov-syntax-highlighter-6173641ca93cc409801577\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&#8230;<br \/>\nimport pandas as pd<\/p>\n<p>df_distance = pd.DataFrame({&quot;euclid&quot;: euclid, &quot;cos&quot;: cosine})<br \/>\nprint(df_distance.sort_values(by=&quot;euclid&quot;).head())<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">pandas <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">pd<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">df_distance<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">pd<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">DataFrame<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-s\">&quot;euclid&quot;<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">euclid<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;cos&quot;<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-v\">cosine<\/span><span class=\"crayon-sy\">}<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">df_distance<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">sort_values<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">by<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&quot;euclid&quot;<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">head<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<div id=\"urvanov-syntax-highlighter-6173641ca93cd201049844\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n                 euclid           cos<br \/>\nAustralia  0.000000e+00 -2.220446e-16<br \/>\nMexico     1.533802e+11  7.949549e-03<br \/>\nSpain      3.411901e+11  3.057903e-03<br \/>\nTurkey     3.798221e+11  3.502849e-03<br \/>\nIndonesia  4.083531e+11  7.417614e-03<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 euclid\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cos<\/p>\n<p>Australia\u00a0\u00a00.000000e+00 -2.220446e-16<\/p>\n<p>Mexico\u00a0\u00a0\u00a0\u00a0 
1.533802e+11\u00a0\u00a07.949549e-03<\/p>\n<p>Spain\u00a0\u00a0\u00a0\u00a0\u00a0\u00a03.411901e+11\u00a0\u00a03.057903e-03<\/p>\n<p>Turkey\u00a0\u00a0\u00a0\u00a0 3.798221e+11\u00a0\u00a03.502849e-03<\/p>\n<p>Indonesia\u00a0\u00a04.083531e+11\u00a0\u00a07.417614e-03<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>Sorting the result shows that, apart from Australia itself, Mexico is the closest to Australia under Euclidean distance. Under cosine distance, however, Colombia is the closest to Australia.<\/p>\n<div id=\"urvanov-syntax-highlighter-6173641ca93ce825772459\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&#8230;<br \/>\ndf_distance.sort_values(by=&quot;cos&quot;).head()<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<p><span class=\"crayon-v\">df_distance<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">sort_values<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">by<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&quot;cos&quot;<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">head<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<div id=\"urvanov-syntax-highlighter-6173641ca93cf326900757\" class=\"urvanov-syntax-highlighter-syntax 
crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n                 euclid           cos<br \/>\nAustralia  0.000000e+00 -2.220446e-16<br \/>\nColombia   8.981118e+11  1.720644e-03<br \/>\nCuba       1.126039e+12  2.483993e-03<br \/>\nItaly      1.088369e+12  2.677707e-03<br \/>\nArgentina  7.572323e+11  2.930187e-03<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 euclid\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 cos<\/p>\n<p>Australia\u00a0\u00a00.000000e+00 -2.220446e-16<\/p>\n<p>Colombia\u00a0\u00a0 8.981118e+11\u00a0\u00a01.720644e-03<\/p>\n<p>Cuba\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 1.126039e+12\u00a0\u00a02.483993e-03<\/p>\n<p>Italy\u00a0\u00a0\u00a0\u00a0\u00a0\u00a01.088369e+12\u00a0\u00a02.677707e-03<\/p>\n<p>Argentina\u00a0\u00a07.572323e+11\u00a0\u00a02.930187e-03<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>To understand why the two distances give different results, we can compare the metrics of the three countries:<\/p>\n<div id=\"urvanov-syntax-highlighter-6173641ca93d0223561768\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n&#8230;<br 
\/>\nprint(df_nonagg[df_nonagg.country.isin([&quot;Mexico&quot;, &quot;Colombia&quot;, &quot;Australia&quot;])])<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-sy\">.<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">df_nonagg<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">df_nonagg<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">country<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">isin<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&quot;Mexico&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;Colombia&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;Australia&quot;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<div id=\"urvanov-syntax-highlighter-6173641ca93d1084424313\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\n       country  year  NE.EXP.GNFS.CD  NE.IMP.GNFS.CD  NV.AGR.TOTL.CD  NY.GDP.MKTP.CD  NE.RSB.GNFS.CD<br \/>\n59   Australia  2010    2.270501e+11    2.388514e+11    
2.518718e+10    1.146138e+12   -1.180129e+10<br \/>\n91    Colombia  2010    4.682683e+10    5.136288e+10    1.812470e+10    2.865631e+11   -4.536047e+09<br \/>\n176     Mexico  2010    3.141423e+11    3.285812e+11    3.405226e+10    1.057801e+12   -1.443887e+10<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 country\u00a0\u00a0year\u00a0\u00a0NE.EXP.GNFS.CD\u00a0\u00a0NE.IMP.GNFS.CD\u00a0\u00a0NV.AGR.TOTL.CD\u00a0\u00a0NY.GDP.MKTP.CD\u00a0\u00a0NE.RSB.GNFS.CD<\/p>\n<p>59\u00a0\u00a0 Australia\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a02.270501e+11\u00a0\u00a0\u00a0\u00a02.388514e+11\u00a0\u00a0\u00a0\u00a02.518718e+10\u00a0\u00a0\u00a0\u00a01.146138e+12\u00a0\u00a0 -1.180129e+10<\/p>\n<p>91\u00a0\u00a0\u00a0\u00a0Colombia\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a04.682683e+10\u00a0\u00a0\u00a0\u00a05.136288e+10\u00a0\u00a0\u00a0\u00a01.812470e+10\u00a0\u00a0\u00a0\u00a02.865631e+11\u00a0\u00a0 -4.536047e+09<\/p>\n<p>176\u00a0\u00a0\u00a0\u00a0 Mexico\u00a0\u00a02010\u00a0\u00a0\u00a0\u00a03.141423e+11\u00a0\u00a0\u00a0\u00a03.285812e+11\u00a0\u00a0\u00a0\u00a03.405226e+10\u00a0\u00a0\u00a0\u00a01.057801e+12\u00a0\u00a0 -1.443887e+10<\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<p>From this table, we see that the metrics of Australia and Mexico are very close to each other in magnitude. However, if you compare the ratios between the metrics within each country, it is Colombia that matches Australia better. 
In fact, from the cosine formula, we can see that<\/p>\n<p>$$<br \/>\\cos \\theta = \\frac{a\\cdot b}{\\lVert a\\rVert_2 \\lVert b\\rVert_2} = \\frac{a}{\\lVert a\\rVert_2} \\cdot \\frac{b}{\\lVert b\\rVert_2}<br \/>$$<\/p>\n<p>which means the cosine of the angle between the two vectors is the dot product of the corresponding vectors after each is normalized to a length of 1. Hence cosine distance effectively rescales each vector to unit length before computing the distance.<\/p>\n<p>Putting it all together, the following is the complete code:<\/p>\n<div id=\"urvanov-syntax-highlighter-6173641ca93d2552541133\" class=\"urvanov-syntax-highlighter-syntax crayon-theme-classic urvanov-syntax-highlighter-font-monaco urvanov-syntax-highlighter-os-pc print-yes notranslate\" data-settings=\" minimize scroll-mouseover disable-anim\">\n<p><textarea class=\"urvanov-syntax-highlighter-plain print-no\" data-settings=\"dblclick\" readonly><br \/>\nfrom pandas_datareader import wb<br \/>\nimport numpy as np<br \/>\nimport pandas as pd<br \/>\npd.options.display.width = 0<\/p>\n<p># Download data from World Bank<br \/>\nnames = [<br \/>\n    &quot;NE.EXP.GNFS.CD&quot;, # Exports of goods and services (current US$)<br \/>\n    &quot;NE.IMP.GNFS.CD&quot;, # Imports of goods and services (current US$)<br \/>\n    &quot;NV.AGR.TOTL.CD&quot;, # Agriculture, forestry, and fishing, value added (current US$)<br \/>\n    &quot;NY.GDP.MKTP.CD&quot;, # GDP (current US$)<br \/>\n    &quot;NE.RSB.GNFS.CD&quot;, # External balance on goods and services (current US$)<br \/>\n]<br \/>\ndf = wb.download(country=&quot;all&quot;, indicator=names, start=2010, end=2010).reset_index()<\/p>\n<p># We remove aggregates and keep only countries with no missing data<br \/>\ncountries = wb.get_countries()<br \/>\nnon_aggregates = countries[countries[&quot;region&quot;] != &quot;Aggregates&quot;].name<br \/>\ndf_nonagg = df[df[&quot;country&quot;].isin(non_aggregates)].dropna()<\/p>\n<p># Extract vector for each country<br 
\/>\nvectors = {}<br \/>\nfor rowid, row in df_nonagg.iterrows():<br \/>\n    vectors[row[&quot;country&quot;]] = row[names].values<\/p>\n<p># Compute the Euclidean and cosine distances<br \/>\neuclid = {}<br \/>\ncosine = {}<\/p>\n<p>target = &quot;Australia&quot;<br \/>\nfor country in vectors:<br \/>\n    vecA = vectors[target]<br \/>\n    vecB = vectors[country]<br \/>\n    dist = np.linalg.norm(vecA - vecB)<br \/>\n    cos = (vecA @ vecB) \/ (np.linalg.norm(vecA) * np.linalg.norm(vecB))<br \/>\n    euclid[country] = dist    # Euclidean distance<br \/>\n    cosine[country] = 1-cos   # cosine distance<\/p>\n<p># Print the results<br \/>\ndf_distance = pd.DataFrame({&quot;euclid&quot;: euclid, &quot;cos&quot;: cosine})<br \/>\nprint(&quot;Closest by Euclidean distance:&quot;)<br \/>\nprint(df_distance.sort_values(by=&quot;euclid&quot;).head())<br \/>\nprint()<br \/>\nprint(&quot;Closest by Cosine distance:&quot;)<br \/>\nprint(df_distance.sort_values(by=&quot;cos&quot;).head())<\/p>\n<p># Print the detail metrics<br \/>\nprint()<br \/>\nprint(&quot;Detail metrics:&quot;)<br \/>\nprint(df_nonagg[df_nonagg.country.isin([&quot;Mexico&quot;, &quot;Colombia&quot;, &quot;Australia&quot;])])<\/textarea><\/p>\n<div class=\"urvanov-syntax-highlighter-main\">\n<table class=\"crayon-table\">\n<tr class=\"urvanov-syntax-highlighter-row\">\n<td class=\"crayon-nums \" data-settings=\"show\">\n<div 
class=\"urvanov-syntax-highlighter-nums-content\">\n<\/div>\n<\/td>\n<td class=\"urvanov-syntax-highlighter-code\">\n<div class=\"crayon-pre\">\n<p><span class=\"crayon-e\">from <\/span><span class=\"crayon-e\">pandas_datareader <\/span><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">wb<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">numpy <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">np<\/span><\/p>\n<p><span class=\"crayon-e\">import <\/span><span class=\"crayon-e\">pandas <\/span><span class=\"crayon-st\">as<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">pd<\/span><\/p>\n<p><span class=\"crayon-v\">pd<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">options<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">display<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">width<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">0<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Download data from World Bank<\/span><\/p>\n<p><span class=\"crayon-v\">names<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">[<\/span><\/p>\n<p><span 
class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&quot;NE.EXP.GNFS.CD&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\"># Exports of goods and services (current US$)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&quot;NE.IMP.GNFS.CD&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\"># Imports of goods and services (current US$)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&quot;NV.AGR.TOTL.CD&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\"># Agriculture, forestry, and fishing, value added (current US$)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&quot;NY.GDP.MKTP.CD&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\"># GDP (current US$)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-s\">&quot;NE.RSB.GNFS.CD&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-p\"># External balance on goods and services (current US$)<\/span><\/p>\n<p><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-v\">df<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">wb<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">download<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">country<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&quot;all&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">indicator<\/span><span class=\"crayon-o\">=<\/span><span 
class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">start<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">2010<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-st\">end<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-cn\">2010<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">reset_index<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># We remove aggregates and keep only countries with no missing data<\/span><\/p>\n<p><span class=\"crayon-v\">countries<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">wb<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">get_countries<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-v\">non_aggregates<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">countries<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">countries<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&quot;region&quot;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">!=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;Aggregates&quot;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">name<\/span><\/p>\n<p><span class=\"crayon-v\">df_nonagg<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">df<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">df<\/span><span 
class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&quot;country&quot;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">isin<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">non_aggregates<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">dropna<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Extract vector for each country<\/span><\/p>\n<p><span class=\"crayon-v\">vectors<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-sy\">}<\/span><\/p>\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">rowid<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">row <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">df_nonagg<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">iterrows<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">vectors<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">row<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&quot;country&quot;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">row<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">names<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">.<\/span><span 
class=\"crayon-i\">values<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Compute the Euclidean and cosine distances<\/span><\/p>\n<p><span class=\"crayon-v\">euclid<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-sy\">}<\/span><\/p>\n<p><span class=\"crayon-v\">cosine<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-sy\">}<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-v\">target<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;Australia&quot;<\/span><\/p>\n<p><span class=\"crayon-st\">for<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-e\">country <\/span><span class=\"crayon-st\">in<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vectors<\/span><span class=\"crayon-o\">:<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">vecA<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vectors<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">target<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">vecB<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vectors<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">country<\/span><span class=\"crayon-sy\">]<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">dist<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-v\">np<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">linalg<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">norm<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">vecA<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vecB<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">cos<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-i\">vecA<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">@<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">vecB<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">\/<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">np<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">linalg<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">norm<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">vecA<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">*<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">np<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">linalg<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">norm<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">vecB<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">euclid<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">country<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span 
class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-i\">dist<\/span><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-p\"># Euclidean distance<\/span><\/p>\n<p><span class=\"crayon-h\">\u00a0\u00a0\u00a0\u00a0<\/span><span class=\"crayon-v\">cosine<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">country<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-cn\">1<\/span><span class=\"crayon-o\">-<\/span><span class=\"crayon-i\">cos<\/span><span class=\"crayon-h\">\u00a0\u00a0 <\/span><span class=\"crayon-p\"># cosine distance<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Print the results<\/span><\/p>\n<p><span class=\"crayon-v\">df_distance<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">pd<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">DataFrame<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">{<\/span><span class=\"crayon-s\">&quot;euclid&quot;<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">euclid<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;cos&quot;<\/span><span class=\"crayon-o\">:<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-v\">cosine<\/span><span class=\"crayon-sy\">}<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&quot;Closest by Euclidean distance:&quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">df_distance<\/span><span class=\"crayon-sy\">.<\/span><span 
class=\"crayon-e\">sort_values<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">by<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&quot;euclid&quot;<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">head<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&quot;Closest by Cosine distance:&quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">df_distance<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">sort_values<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">by<\/span><span class=\"crayon-o\">=<\/span><span class=\"crayon-s\">&quot;cos&quot;<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-e\">head<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p>\u00a0<\/p>\n<p><span class=\"crayon-p\"># Print the detail metrics<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-s\">&quot;Detail metrics:&quot;<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<p><span class=\"crayon-e\">print<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-v\">df_nonagg<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-v\">df_nonagg<\/span><span class=\"crayon-sy\">.<\/span><span class=\"crayon-v\">country<\/span><span class=\"crayon-sy\">.<\/span><span 
class=\"crayon-e\">isin<\/span><span class=\"crayon-sy\">(<\/span><span class=\"crayon-sy\">[<\/span><span class=\"crayon-s\">&quot;Mexico&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;Colombia&quot;<\/span><span class=\"crayon-sy\">,<\/span><span class=\"crayon-h\"> <\/span><span class=\"crayon-s\">&quot;Australia&quot;<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><span class=\"crayon-sy\">]<\/span><span class=\"crayon-sy\">)<\/span><\/p>\n<\/div>\n<\/td>\n<\/tr>\n<\/table><\/div>\n<\/p><\/div>\n<h2>Common use of vector space models and cosine distance<\/h2>\n<p>Vector space models are common in information retrieval systems. We can represent documents (e.g., a paragraph, a long passage, a book, or even a sentence) as vectors. The vector can be as simple as a count of the words the document contains (i.e., a bag-of-words model) or as elaborate as an embedding vector (e.g., Doc2Vec). A query for the most relevant document can then be answered by ranking all documents by their cosine distance to the query. Cosine distance is a good fit here because we do not want to favor longer or shorter documents, but to focus on what each document contains. The normalization built into cosine distance lets us consider how relevant a document is to the query rather than how many times the query words are mentioned in it.<\/p>\n<p>If we treat each word in a document as a feature and compute the cosine distance, we get the \u201chard\u201d distance, because words with similar meanings are treated as unrelated (e.g., \u201cdocument\u201d and \u201cpassage\u201d have similar meanings, but \u201cdistance\u201d does not). Embedding vectors such as word2vec allow us to consider the ontology, that is, how words relate in meaning. Computing the cosine distance with the meaning of words considered gives the \u201c<strong>soft cosine distance<\/strong>\u201d. 
Libraries such as gensim provide a way to do this.<\/p>\n<p>Another use case for cosine distance and the vector space model is computer vision. Imagine the task of recognizing hand gestures: we can take certain parts of the hand (e.g., the five fingers) as key points. With the (x,y) coordinates of the key points laid out as a vector, we can compare it against an existing database to find the entry with the smallest cosine distance and so determine which gesture it is. We use cosine distance because everyone\u2019s hand has a different size, and we do not want that to affect our decision about which gesture is being shown.<\/p>\n<p>As you may imagine, there are many more examples where you can use this technique.<\/p>\n<h2>Summary<\/h2>\n<p>In this tutorial, you discovered the vector space model for measuring the similarity of vectors.<\/p>\n<p>Specifically, you learned:<\/p>\n<ul>\n<li>How to construct a vector space model<\/li>\n<li>How to compute the cosine similarity, and hence the cosine distance, between two vectors in the vector space model<\/li>\n<li>How to interpret the difference between cosine distance and other distance metrics such as Euclidean distance<\/li>\n<li>What the vector space model is used for<\/li>\n<\/ul>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/machinelearningmastery.com\/a-gentle-introduction-to-vector-space-models\/<\/p>\n","protected":false},"author":0,"featured_media":1078,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[3],"tags":[],"_links":{"self":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1077"}],"collection":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/comments?post=1077"}],"version-history":[{"count":0,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/posts\/1077\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media\/1078"}],"wp:attachment":[{"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/media?p
arent=1077"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/categories?post=1077"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/salarydistribution.com\/machine-learning\/wp-json\/wp\/v2\/tags?post=1077"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}