Ford Johnson

Euclidean Distance

Trends
Stats
Published

November 1, 2023

Understanding the similarity between data points is crucial for effective analysis. Euclidean distance measures the separation between two sets of values as the square root of the sum of the squared differences between corresponding elements. The smaller the distance, the more alike the two sets are. This straightforward method helps reveal patterns and supports informed decision-making.
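As a minimal sketch of the definition, the calculation can be written in one line of base R. The two vectors below are made-up values chosen so the arithmetic is easy to check by hand: the differences are -3, -4, and 0, the squares sum to 25, and the square root is 5.

```r
# Two hypothetical series of equal length (illustrative values only)
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Square the element-wise differences, sum them, then take the square root
euclidean_distance <- sqrt(sum((x - y)^2))
euclidean_distance
# [1] 5
```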

Example

In this demo, I applied Euclidean distance to compare electricity generation across countries using the TidyTuesday 06/06/2023 dataset. By selecting Germany as the target country, I identified which countries exhibit the most similar and most different trends.

Code
library(tidyverse)

owid_energy <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-06-06/owid-energy.csv')

countries <- c(
  "CAN", "MEX", "BLZ", "CRI", "SLV", "GTM", "HND",
  "NIC", "PAN", "BHS", "BRB", "CUB", "DOM", "HTI",
  "JAM", "TTO","ALB", "AND", "ARM", "AUT", "AZE",
  "BLR", "BEL", "BIH", "BGR", "HRV", "CYP", "CZE",
  "DNK", "EST", "FIN", "FRA", "GEO", "USA", "GRC",
  "HUN", "ISL", "IRL", "ITA", "KAZ", "LVA", "LIE",
  "LTU", "LUX", "MLT", "MDA", "MCO", "MNE", "NLD",
  "MKD", "NOR", "POL", "PRT", "ROU", "RUS", "SMR",
  "SRB", "SVK", "SVN", "ESP", "SWE", "CHE", "UKR",
  "GBR", "VAT"
)

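# Candidate countries. Germany is not in `countries`, so it is never compared with itself.
electricity_generation <- owid_energy %>%
  select(country, iso_code, year, electricity_generation) %>%
  filter(year > 2000 & iso_code %in% countries)

target_country <- "DEU" # Germany's 3 letter ISO country code

# Germany's yearly generation for 2001-2020, renamed so it can sit next to each candidate's values
target_country_generation <- owid_energy %>%
  filter(year > 2000 & year < 2021 & iso_code == target_country) %>%
  select(year, electricity_generation) %>%
  rename(target_generation = electricity_generation)

# Match each candidate year to Germany's value, then compute one distance per country.
# Years after 2020 have no target value after the join and are dropped by na.rm = TRUE.
countries_with_similarity_score <- electricity_generation %>%
  left_join(target_country_generation, by = "year") %>%
  group_by(country, iso_code) %>%
  summarize(euclidean_distance = sqrt(sum((electricity_generation - target_generation)^2, na.rm = TRUE))) %>%
  arrange(euclidean_distance)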
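One caveat with `na.rm = TRUE` in the scoring step: a country with no usable years sums to zero and would appear to be a perfect match. A possible safeguard, sketched here on a made-up toy table rather than the real dataset, is to count the years that actually contribute and drop countries with none. The `toy`, `toy_scores`, and `n_years` names are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Toy data (hypothetical values): "A" has complete data, "B" has none
toy <- tibble::tibble(
  iso_code               = c("A", "A", "B", "B"),
  year                   = c(2001, 2002, 2001, 2002),
  electricity_generation = c(10, 20, NA, NA),
  target_generation      = c(13, 24, 13, 24)
)

toy_scores <- toy %>%
  group_by(iso_code) %>%
  summarize(
    # Number of years where both series have a value
    n_years = sum(!is.na(electricity_generation) & !is.na(target_generation)),
    euclidean_distance = sqrt(sum((electricity_generation - target_generation)^2, na.rm = TRUE)),
    .groups = "drop"
  ) %>%
  # Without this filter, "B" would score 0 and rank as the closest match
  filter(n_years > 0) %>%
  arrange(euclidean_distance)
```

In this toy case only "A" remains, with a distance of 5. Keeping `n_years` in the result also makes it easy to spot countries whose score rests on only a few years of data.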

Why Compare Similar Entities?

Comparing similar entities by analyzing a metric across a dimension is a versatile technique with broad real-world applications. It’s a useful way to establish a baseline and explore the impact of changes. Here’s why this method is valuable:

  • Establishing Baselines for Experimental Testing: Comparing similar entities helps set a clear baseline of normal or expected performance. This baseline is essential when testing new interventions or experimental changes, as it allows for a precise measurement of their impact.
  • Assessing Market & Policy Impact: Whether in business, public policy, or other fields, comparing key metrics across similar groups reveals how well certain interventions or strategies perform against a comparable standard.
  • Tracking Trends: By analyzing whether variations in a metric are isolated or part of a broader pattern, this approach distinguishes between widespread trends and localized anomalies, supporting more accurate forecasting and planning.
  • Broad Applicability: Although this demo focuses on comparing countries, the methodology can be applied across various domains, such as benchmarking, evaluating customer segments, or comparing environmental indicators to identify similarities and differences.

Germany’s Most and Least Similar Countries

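The analysis shows that the countries closest to Germany in electricity generation are Canada, France, and the United Kingdom, followed by Italy, Spain, and Mexico. At the other end, the countries that differ the most include Malta, the Bahamas, and the United States, which is by far the most distant. The small countries cluster near 2,750, likely because their generation is close to zero, so their distance is roughly the magnitude of Germany's own series.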

# A tibble: 6 × 3
# Groups:   country [6]
  country        iso_code euclidean_distance
  <chr>          <chr>                 <dbl>
1 Canada         CAN                    101.
2 France         FRA                    253.
3 United Kingdom GBR                   1141.
4 Italy          ITA                   1454.
5 Spain          ESP                   1513.
6 Mexico         MEX                   1571.

# A tibble: 6 × 3
# Groups:   country [6]
  country       iso_code euclidean_distance
  <chr>         <chr>                 <dbl>
1 Malta         MLT                   2744.
2 Bahamas       BHS                   2745.
3 Barbados      BRB                   2749.
4 Haiti         HTI                   2750.
5 Belize        BLZ                   2751.
6 United States USA                  15297.
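Because the distance is computed on raw terawatt-hours, it mostly reflects how large each country's output is rather than the shape of its trend. One possible variation, which is not what the analysis above does, is to standardize each series before measuring the distance. The sketch below uses two made-up series with the same shape at different scales; `euclidean()` is a hypothetical helper defined only for this example.

```r
# Hypothetical series: b follows the same shape as a at ten times the scale
a <- c(100, 110, 120, 130)
b <- c(1000, 1100, 1200, 1300)

# Helper for this example only
euclidean <- function(x, y) sqrt(sum((x - y)^2))

# On raw values, the difference in scale dominates the distance
raw_distance <- euclidean(a, b)

# After centering and scaling each series (z-scores), only the shape is compared
scaled_distance <- euclidean(as.numeric(scale(a)), as.numeric(scale(b)))
```

Here the raw distance is large while the standardized distance is effectively zero, since the two series rise in exactly the same pattern. Which version is appropriate depends on whether size or shape is the similarity of interest.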

Visualizing the Data

The accompanying chart highlights the electricity generation trends of Germany, Canada, and France using distinct colors, while other countries are represented in gray. The United States is left out of the chart, since its far larger output would otherwise compress the other lines. This view makes it easier to see how closely the highlighted trends align.

Code
library(tidyverse)
library(showtext)
library(htmltools)
library(gghighlight)

showtext_auto()
showtext_opts(dpi = 600)

font_add_google(name = "Roboto", family = "Roboto")
font <- "Roboto"

title <- paste0(
  "<span>Highlighting Countries Similar to<span style='color:#6929c4;'> Germany</span></span>"
)

subtitle <- paste0(
  "<span>Electricity Generation - Terawatt hours (2001-2020)</span>"
)

similar_countries_highlighted_plot <- owid_energy %>%
  filter(iso_code %in% countries | iso_code == 'DEU') %>%
  filter(year > 2000 & year < 2021 & iso_code != 'USA') %>%
  ggplot(aes(x = year, y = electricity_generation, group = iso_code, color = iso_code)) +
  geom_line() +
  gghighlight(iso_code %in% c('DEU', 'FRA', 'CAN'), 
              use_direct_label = FALSE) +
  labs(
    title = title,
    subtitle = subtitle,
    y = "Terawatt hours (TWh)",
    x = "Year"
  ) +
  scale_color_manual(
    values = c('DEU' = '#6929c4', 'FRA' = '#1192e8', 'CAN' = '#198038', 'Other' = '#D3D3D3'),
    labels = c('DEU' = 'Germany', 'FRA' = 'France', 'CAN' = 'Canada')
  ) +
  scale_y_continuous(breaks = seq(0, 1100, by = 250)) +
  theme_void() +
  theme(
    legend.position = "right",
    legend.title = element_blank(),
    axis.text = element_text(
      family = font,
      size = 13
    ),
    axis.title = element_text(
      family = font,
      size = 13
    ),
    axis.title.x = element_text(
      margin = margin(7,0,0,0,"mm")
    ),
    axis.title.y = element_text(
      angle = 90,
      margin = margin(0,7,0,0,"mm")
    ),
    panel.grid.major = element_line(colour = "#e0e0e0", linewidth = 0.1),
    legend.text = element_text(
      family = font,
      size = 15
    ),
    plot.title = ggtext::element_textbox_simple(
      family = font,
      size = 20,
      margin = margin(10,0,0,0)
    ),
    plot.subtitle = ggtext::element_textbox_simple(
      family = font,
      size = 15,
      margin = margin(10,0,0,0)
    ),
    plot.margin = margin(5,5,5,5, "mm")
  )
