Analysis Methods

A collection of notes

Euclidean Distance

Understanding how similar data points are is central to many analyses. Euclidean distance measures the separation between two equal-length sets of values: take the differences between corresponding elements, square them, sum the squares, and take the square root of the total. The smaller the distance, the more alike the two sets are, which makes it a simple way to rank candidates by similarity to a target.
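
As a small worked example of the calculation described above, the sketch below computes the distance for two made-up vectors. The function name euclidean_distance and the vectors a and b are illustrative only and do not come from the dataset.

```r
# Euclidean distance between two equal-length numeric vectors
euclidean_distance <- function(x, y) {
  stopifnot(length(x) == length(y))
  sqrt(sum((x - y)^2))
}

a <- c(1, 2, 3)
b <- c(4, 6, 3)

# differences: -3, -4, 0 -> squares: 9, 16, 0 -> sum: 25 -> square root: 5
euclidean_distance(a, b)

# Base R's dist() returns the same value for the rows of a matrix
as.numeric(dist(rbind(a, b)))
```

In the country comparison, each vector holds one country's yearly electricity generation, so the two vectors must cover the same years in the same order.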

Example

In this demo, I applied Euclidean distance to compare electricity generation across countries using the TidyTuesday dataset from 2023-06-06. With Germany as the target country, I ranked a set of North American, Caribbean, and European countries by how close their annual generation from 2001 to 2020 is to Germany’s.

Code
library(tidyverse)

owid_energy <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-06-06/owid-energy.csv')

countries <- c(
  "CAN", "MEX", "BLZ", "CRI", "SLV", "GTM", "HND",
  "NIC", "PAN", "BHS", "BRB", "CUB", "DOM", "HTI",
  "JAM", "TTO","ALB", "AND", "ARM", "AUT", "AZE",
  "BLR", "BEL", "BIH", "BGR", "HRV", "CYP", "CZE",
  "DNK", "EST", "FIN", "FRA", "GEO", "USA", "GRC",
  "HUN", "ISL", "IRL", "ITA", "KAZ", "LVA", "LIE",
  "LTU", "LUX", "MLT", "MDA", "MCO", "MNE", "NLD",
  "MKD", "NOR", "POL", "PRT", "ROU", "RUS", "SMR",
  "SRB", "SVK", "SVN", "ESP", "SWE", "CHE", "UKR",
  "GBR", "VAT"
)

# Keep the comparison countries and the same 2001-2020 window used for the target
electricity_generation <- owid_energy %>%
  select(country, iso_code, year, electricity_generation) %>%
  filter(year > 2000 & year < 2021 & iso_code %in% countries)

target_country <- "DEU" # Germany's 3 letter ISO country code

target_country_generation <- owid_energy %>%
  filter(year > 2000 & year < 2021 & iso_code == target_country) %>%
  select(year, electricity_generation) %>%
  rename(target_generation = electricity_generation)

countries_with_similarity_score <- electricity_generation %>%
  left_join(target_country_generation, by = "year") %>%
  group_by(country, iso_code) %>%
  summarize(euclidean_distance = sqrt(sum((electricity_generation - target_generation)^2, na.rm = TRUE))) %>%
  arrange(euclidean_distance)
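
One thing to watch with na.rm = TRUE in the summary above: the sum of an empty set is 0, so a country with no years that overlap the target would get a distance of 0 and sort to the top as a perfect match. The sketch below shows the effect on made-up numbers and one possible guard. The helper safe_euclidean is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: Euclidean distance over the positions where both
# series are observed; returns NA when there are no such positions.
safe_euclidean <- function(x, y) {
  complete <- !is.na(x) & !is.na(y)
  if (!any(complete)) {
    return(NA_real_)
  }
  sqrt(sum((x[complete] - y[complete])^2))
}

target  <- c(10, 12, 14)
partial <- c(10, NA, 17)  # one missing year
empty   <- c(NA, NA, NA)  # no data at all

sqrt(sum((empty - target)^2, na.rm = TRUE))  # 0: looks like a perfect match
safe_euclidean(empty, target)                # NA: flagged as not comparable
safe_euclidean(partial, target)              # 3: uses the two observed years
```

Inside summarize(), the same helper could be called as safe_euclidean(electricity_generation, target_generation); arrange() places NA values last, so countries without data would fall to the bottom instead of the top.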

Why Compare Similar Entities?

Comparing similar entities on one metric across a shared dimension (here, electricity generation by year) is a versatile technique with broad real-world applications. It is a useful way to establish a baseline and to explore the impact of changes. Here’s why this method is valuable:

  • Establishing Baselines for Experimental Testing: Comparing similar entities helps set a clear baseline of normal or expected performance. This baseline is essential when testing new interventions or experimental changes, as it allows for a precise measurement of their impact.
  • Assessing Market & Policy Impact: Whether in business, public policy, or other fields, comparing key metrics across similar groups reveals how well certain interventions or strategies perform against a comparable standard.
  • Tracking Trends: By analyzing whether variations in a metric are isolated or part of a broader pattern, this approach distinguishes between widespread trends and localized anomalies, supporting more accurate forecasting and planning.
  • Broad Applicability: Although this demo focuses on comparing countries, the methodology can be applied across various domains, such as benchmarking, evaluating customer segments, or comparing environmental indicators to identify similarities and differences.

Germany’s Most and Least Similar Countries

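Of the countries compared, Canada, France, and the United Kingdom are the closest to Germany in annual electricity generation, followed by Italy, Spain, and Mexico. At the other end, the United States is by far the most distant, with a distance more than five times that of any other country, followed by small producers such as Belize, Haiti, Barbados, the Bahamas, and Malta. Because the distance is computed on raw terawatt-hours, it reflects the size of a country’s output as much as the shape of its trend. The first table lists the six closest countries and the second the six most distant.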

# A tibble: 6 × 3
# Groups:   country [6]
  country        iso_code euclidean_distance
  <chr>          <chr>                 <dbl>
1 Canada         CAN                    101.
2 France         FRA                    253.
3 United Kingdom GBR                   1141.
4 Italy          ITA                   1454.
5 Spain          ESP                   1513.
6 Mexico         MEX                   1571.

# A tibble: 6 × 3
# Groups:   country [6]
  country       iso_code euclidean_distance
  <chr>         <chr>                 <dbl>
1 Malta         MLT                   2744.
2 Bahamas       BHS                   2745.
3 Barbados      BRB                   2749.
4 Haiti         HTI                   2750.
5 Belize        BLZ                   2751.
6 United States USA                  15297.

Visualizing the Data

The accompanying chart highlights the electricity generation of Germany, Canada, and France in distinct colors and draws the other countries in gray, which makes it easy to see how closely the highlighted series track one another. The United States is left out of the chart.

Code
library(tidyverse)
library(showtext)
library(htmltools)
library(gghighlight)

showtext_auto()
showtext_opts(dpi = 600)

font_add_google(name = "Roboto", family = "Roboto")
font <- "Roboto"

title <- paste0(
  "<span>Highlighting Countries Similar to<span style='color:#6929c4;'> Germany</span></span>"
)

subtitle <- paste0(
  "<span>Electricity Generation - Terawatt hours (2001-2020)</span>"
)

similar_countries_highlighted_plot <- owid_energy %>%
  filter(iso_code %in% countries | iso_code == 'DEU') %>%
  filter(year > 2000 & year < 2021 & iso_code != 'USA') %>%
  ggplot(aes(x = year, y = electricity_generation, group = iso_code, color = iso_code)) +
  geom_line() +
  gghighlight(iso_code %in% c('DEU', 'FRA', 'CAN'), 
              use_direct_label = FALSE) +
  labs(
    title = title,
    subtitle = subtitle,
    y = "Terawatt hours (TWh)",
    x = "Year"
  ) +
  scale_color_manual(
    values = c('DEU' = '#6929c4', 'FRA' = '#1192e8', 'CAN' = '#198038', 'Other' = '#D3D3D3'),
    labels = c('DEU' = 'Germany', 'FRA' = 'France', 'CAN' = 'Canada')
  ) +
  scale_y_continuous(breaks = seq(0, 1100, by = 250)) +
  theme_void() +
  theme(
    legend.position = "right",
    legend.title = element_blank(),
    axis.text = element_text(
      family = font,
      size = 13
    ),
    axis.title = element_text(
      family = font,
      size = 13
    ),
    axis.title.x = element_text(
      margin = margin(7,0,0,0,"mm")
    ),
    axis.title.y = element_text(
      angle = 90,
      margin = margin(0,7,0,0,"mm")
    ),
    panel.grid.major = element_line(colour = "#e0e0e0", linewidth = 0.1),
    legend.text = element_text(
      family = font,
      size = 15
    ),
    plot.title = ggtext::element_textbox_simple(
      family = font,
      size = 20,
      margin = margin(10,0,0,0)
    ),
    plot.subtitle = ggtext::element_textbox_simple(
      family = font,
      size = 15,
      margin = margin(10,0,0,0)
    ),
    plot.margin = margin(5,5,5,5, "mm")
  )

Binomial Trend Detection

Overview

The binomial trend detection method offers an alternative to traditional rolling averages by using week-over-week (WoW) comparisons to detect significant changes quickly.

Methodology

This approach involves:

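  • Calculating Week-over-Week Deltas: Comparing each day’s value with the value from the same weekday one week earlier.
  • Using a 14-Day Rolling Window: Looking at the 14 most recent deltas so that a single unusual day cannot trigger a signal.
  • Counting Positive and Negative Changes: Tallying how many of the 14 deltas are positive and how many are negative.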
  • Applying Statistical Analysis: Using the binomial distribution (for example, requiring 12 out of 14 days to exhibit the same trend) to confirm significant changes.
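
The steps above can be sketched in base R as follows. This is a minimal illustration on made-up data, not the author’s implementation: the function detect_trend, its arguments, and the two example series are all assumptions, and the 12-of-14 threshold is simply the example value from the list.

```r
# Hypothetical helper: flag days where at least `threshold` of the last
# `window` week-over-week deltas point in the same direction.
detect_trend <- function(values, window = 14, threshold = 12, lag = 7) {
  n <- length(values)
  stopifnot(n > lag)
  # Week-over-week delta: each day minus the same weekday one week earlier
  delta <- c(rep(NA_real_, lag), diff(values, lag = lag))
  trend <- rep(NA_character_, n)
  for (i in seq_len(n)) {
    if (i < lag + window) next  # not enough history for a full window yet
    recent <- delta[(i - window + 1):i]
    up <- sum(recent > 0, na.rm = TRUE)
    down <- sum(recent < 0, na.rm = TRUE)
    trend[i] <- if (up >= threshold) {
      "up"
    } else if (down >= threshold) {
      "down"
    } else {
      "none"
    }
  }
  data.frame(day = seq_len(n), value = values, wow_delta = delta, trend = trend)
}

# Made-up daily metric with a strong weekday pattern (six weeks)
weekly_pattern <- rep(c(0, 5, 3, 4, 2, -8, -10), times = 6)
flat   <- 100 + weekly_pattern                      # no underlying trend
rising <- 100 + weekly_pattern + 0.5 * seq_len(42)  # slow upward drift

table(detect_trend(flat)$trend)
table(detect_trend(rising)$trend)
```

The weekday pattern cancels out in the deltas, so the flat series is never flagged while the rising series is flagged as soon as a full window is available. The first 20 days stay NA because 7 days are needed for the first delta and 14 deltas for the first window.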

Process Details

Step                          Details
Calculate WoW Deltas          Compare each day’s value with the same weekday one week earlier.
Rolling 14-Day Window         Apply a two-week window to balance sensitivity and stability.
Count Positive/Negative Days  Tally the days in the window with positive and with negative deltas.
Binomial Distribution         Under the no-trend assumption, treat each day as a 50/50 chance of an increase or a decrease.
Trend Threshold               Flag a trend if the count meets a predefined threshold (e.g., 12 out of 14 days).

The 12-of-14 threshold can be checked with an exact binomial test:

binom.test(12, 14, 1/2, alternative = "greater")

    Exact binomial test

data:  12 and 14
number of successes = 12, number of trials = 14, p-value = 0.00647
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.6146103 1.0000000
sample estimates:
probability of success 
             0.8571429 
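
The threshold itself can also be derived from a chosen false-alarm rate instead of being fixed in advance. The sketch below assumes a rate of 1% per direction; that value and the variable names are illustrative and not taken from the original analysis.

```r
n <- 14        # days in the rolling window
p <- 0.5       # chance of an up day when there is no trend
alpha <- 0.01  # assumed false-alarm rate per direction

# P(X >= k) for every candidate threshold k = 0..n
k <- 0:n
tail_prob <- pbinom(k - 1, size = n, prob = p, lower.tail = FALSE)

# Smallest k whose upper-tail probability does not exceed alpha
threshold <- min(k[tail_prob <= alpha])
threshold  # 12

# Tail probabilities for thresholds of 11 and 12 days
round(tail_prob[k %in% 11:12], 5)
```

With these settings, 11 of 14 days is not strict enough (the tail probability is about 0.029), while 12 of 14 gives about 0.0065, matching the p-value above. Checking for upward and downward trends at the same time doubles the chance of a false alarm in a given window to roughly 0.013, and because consecutive windows overlap, flags on neighboring days are not independent.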

The code below plots this binomial distribution and shades the outcomes at or above the threshold.

Code
library(ggplot2)
library(showtext)
library(ggtext)

showtext_auto()
showtext_opts(dpi = 300)

font_add_google(name = "Roboto", family = "Roboto")
font_1 <- "Roboto"

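n <- 14      # days in the rolling window (number of trials)
p <- 1/2     # probability of an "up" day under the no-trend assumption
x_obs <- 12  # threshold: days that must move in the same direction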

binom_test <- binom.test(x_obs, n, p, alternative = "greater")

p_value <- binom_test$p.value
conf_int <- binom_test$conf.int

p_ge_x_obs <- sum(dbinom(x_obs:n, size = n, prob = p))

data <- data.frame(
  x = 0:n,
  probability = dbinom(0:n, size = n, prob = p),
  color = ifelse(0:n >= x_obs, "#C0392B", "#30394F")  
)

binom_visual <- ggplot(data, aes(x = x, y = probability, fill = color)) +
  geom_bar(stat = "identity") +
  scale_fill_identity() +  
  scale_x_continuous(breaks = 0:n) +  
  labs(
    title = "Binomial Distribution (n = 14, p = 0.5)",
    x = "Number of Successes",
    y = "Probability"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    plot.margin = margin(10, 10, 10, 10, "mm"),
    axis.text = element_text(family = font_1, size = 7),
    axis.title = element_text(family = font_1, size = 7),
    axis.title.x = element_text(margin = margin(5, 0, 0, 0, 'mm')),
    axis.title.y = element_text(margin = margin(0, 5, 0, 0, 'mm')),
    plot.title = element_text(family = font_1, size = 10)
  ) +
  annotate(
    geom = 'richtext',
    x = n+1,
    y = max(data$probability) * 0.9,
    label = paste0(
      "<span style='color:#C0392B; font-size:8pt;font-family:Roboto;'>",
      "Threshold: at least ", x_obs, " of ", n, " days", "<br>",
      "P(X ≥ ", x_obs, ") = ", round(p_ge_x_obs, 4), "<br>",
      "95% CI = [", round(conf_int[1], 4), ", ", round(conf_int[2], 4), "]</span>"),
    hjust = 1, fill = NA, label.color = NA
  )

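Because each day is compared with the same weekday one week earlier, this binomial approach is less sensitive to day-of-week patterns than a plain rolling average. It also counts the direction of recent changes instead of averaging their size, so a single extreme day has limited influence and a sustained shift can be flagged as soon as enough days agree.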