Understanding the similarity between data points is crucial for effective analysis. Euclidean distance measures the separation between two equal-length sets of values: square the differences between corresponding elements, sum the squares, and take the square root of the total. This straightforward method helps reveal patterns and supports informed decision-making.
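As a formula, the distance between two vectors x and y of length n is sqrt((x1 - y1)^2 + ... + (xn - yn)^2). The following is a minimal base R sketch of that calculation. The function name `euclidean_distance` and the two example vectors are made up for illustration and are not taken from the dataset.

```r
# Euclidean distance between two equal-length numeric vectors.
euclidean_distance <- function(x, y) {
  stopifnot(length(x) == length(y))
  sqrt(sum((x - y)^2))
}

# Illustrative values only (not from the dataset).
a <- c(1, 2, 3)
b <- c(4, 6, 3)

# Differences are (-3, -4, 0), so the distance is sqrt(9 + 16 + 0) = 5.
euclidean_distance(a, b)

# Base R's dist() gives the same result when the vectors are rows of a matrix.
as.numeric(dist(rbind(a, b), method = "euclidean"))
```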
Example
In this demo, I applied Euclidean distance to compare electricity generation across countries using the TidyTuesday 06/06/2023 dataset. By selecting Germany as the target country, I identified which countries exhibit the most similar and most different trends.
Comparing similar entities by analyzing a metric across a dimension is a versatile technique with broad real-world applications. It’s a useful way to establish a baseline and explore the impact of changes. Here’s why this method is valuable:
Establishing Baselines for Experimental Testing: Comparing similar entities helps set a clear baseline of normal or expected performance. This baseline is essential when testing new interventions or experimental changes, as it allows for a precise measurement of their impact.
Assessing Market & Policy Impact: Whether in business, public policy, or other fields, comparing key metrics across similar groups reveals how well certain interventions or strategies perform against a comparable standard.
Tracking Trends: By analyzing whether variations in a metric are isolated or part of a broader pattern, this approach distinguishes between widespread trends and localized anomalies, supporting more accurate forecasting and planning.
Broad Applicability: Although this demo focuses on comparing countries, the methodology can be applied across various domains, such as benchmarking, evaluating customer segments, or comparing environmental indicators to identify similarities and differences.
Germany’s Most and Least Similar Countries
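The analysis shows that the countries most similar to Germany in terms of electricity generation trends include Canada, France, and the UK, with Canada (a distance of about 101) and France (about 253) far closer than the rest. At the other extreme, the United States is by far the most different (about 15,297), while Malta, the Bahamas, Barbados, Haiti, and Belize cluster at a distance of roughly 2,750.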
Most similar to Germany (smallest distance first):
# A tibble: 6 × 3
# Groups:   country [6]
  country        iso_code euclidean_distance
  <chr>          <chr>                 <dbl>
1 Canada         CAN                    101.
2 France         FRA                    253.
3 United Kingdom GBR                   1141.
4 Italy          ITA                   1454.
5 Spain          ESP                   1513.
6 Mexico         MEX                   1571.
Least similar to Germany (largest distance last):
# A tibble: 6 × 3
# Groups:   country [6]
  country       iso_code euclidean_distance
  <chr>         <chr>                 <dbl>
1 Malta         MLT                   2744.
2 Bahamas       BHS                   2745.
3 Barbados      BRB                   2749.
4 Haiti         HTI                   2750.
5 Belize        BLZ                   2751.
6 United States USA                  15297.
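A ranking of this kind could be produced along the following lines. This is a sketch under stated assumptions, not the code behind the tables above: the data frame `energy`, its column names, and all of its values are hypothetical, and it assumes every country has a value for every year.

```r
# Toy long-format data: one row per country and year.
# Column names and values are hypothetical, chosen only for illustration.
energy <- data.frame(
  country    = rep(c("Target", "A", "B", "C"), each = 3),
  year       = rep(2000:2002, times = 4),
  generation = c(10, 12, 14,   # Target
                 11, 12, 13,   # A: close to Target
                 50, 55, 60,   # B: much larger
                  1,  1,  2)   # C: much smaller
)

# Reshape to a matrix with one row per country and one column per year.
wide <- tapply(energy$generation, list(energy$country, energy$year), sum)

# Distance from every country to the target, computed row by row.
target <- wide["Target", ]
distances <- apply(wide, 1, function(row) sqrt(sum((row - target)^2)))

# Drop the target itself and sort from most to least similar.
ranked <- sort(distances[names(distances) != "Target"])
ranked
```

In this toy data, A ranks closest and B farthest. Note that when the distance is computed on raw values, as in this sketch, it reflects differences in scale as well as in shape, so a much larger or much smaller producer will rank as dissimilar even if its series moves in the same direction. Standardizing each series first would be one way to compare shape alone.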
Visualizing the Data
The accompanying chart highlights the electricity generation trends of Germany, Canada, and France using distinct colors, while other countries are represented in gray. This visual representation makes it easier to grasp how closely the trends align, and it serves as a practical tool for monitoring strategic changes.
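The highlighting idea can be sketched in base R as follows. The helper `highlight_colours`, the colour choices, and the plotted series are all assumptions made for illustration; the original chart may have been built differently.

```r
# Hypothetical helper: a distinct colour for highlighted series, gray otherwise.
highlight_colours <- function(countries, highlight, other = "gray80") {
  cols <- rep(other, length(countries))
  names(cols) <- countries
  keep <- intersect(names(highlight), countries)
  cols[keep] <- highlight[keep]
  cols
}

countries <- c("Germany", "Canada", "France", "Italy", "Spain")
highlight <- c(Germany = "black", Canada = "red", France = "blue")
cols <- highlight_colours(countries, highlight)

# Made-up series, one column per country, used only to drive the plot.
years  <- 2000:2004
series <- sapply(seq_along(countries), function(i) 100 + i * 10 + seq_along(years))
colnames(series) <- countries

# Draw only in an interactive session.
if (interactive()) {
  matplot(years, series, type = "l", lty = 1, col = cols,
          xlab = "Year", ylab = "Electricity generation")
  legend("topleft", legend = names(highlight), col = highlight, lty = 1)
}
```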
The binomial trend detection method offers an alternative to traditional rolling averages: instead of averaging the metric, it uses week-over-week (WoW) comparisons to detect significant changes quickly. It has three parts:
Using a 14-Day Rolling Window: Smoothing out short-term fluctuations.
Counting Positive and Negative Changes: Tallying the days with consistent directional shifts.
Applying Statistical Analysis: Using the binomial distribution (for example, requiring 12 out of 14 days to exhibit the same trend) to confirm significant changes.
Process Details
Calculate WoW Deltas: For each day, compute the change in the metric relative to the same day one week earlier.
Rolling 14-Day Window: Apply a two-week window to balance sensitivity and stability.
Count Positive/Negative Days: Tally the days in the window with positive changes and the days with negative changes.
Binomial Distribution: Under the assumption of no trend, treat each day's change as equally likely to be positive or negative (a 50% chance), so the count of positive days in the window follows a binomial distribution with 14 trials.
Trend Threshold: Flag a trend if the count in either direction meets a predefined threshold (e.g., 12 out of 14 days).
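The steps above can be sketched in base R as follows. This is a minimal illustration, not a definitive implementation: the function `detect_trend` and its argument names are hypothetical, the input is assumed to be a daily series with no missing values, and a WoW delta is taken to mean each day minus the same weekday one week earlier (`lag = 7`).

```r
# Flag days where a 14-day window of week-over-week (WoW) changes is
# dominated by one direction. Function and argument names are hypothetical.
detect_trend <- function(metric, window = 14, threshold = 12, lag = 7) {
  n <- length(metric)
  # WoW delta: each day minus the same weekday one week earlier.
  delta <- c(rep(NA, lag), diff(metric, lag = lag))
  trend <- rep(NA_character_, n)
  if (n < lag + window) return(trend)
  for (i in seq(lag + window, n)) {
    d <- delta[(i - window + 1):i]   # the most recent `window` deltas
    up   <- sum(d > 0)
    down <- sum(d < 0)
    trend[i] <- if (up >= threshold) "up" else if (down >= threshold) "down" else "none"
  }
  trend
}

# Made-up daily series with a repeating weekday pattern.
days <- 1:42
weekday_effect <- rep(c(0, 5, 3, 4, 2, -6, -8), length.out = 42)
flat   <- 100 + weekday_effect               # seasonality only, no trend
rising <- 100 + weekday_effect + 0.5 * days  # seasonality plus steady growth

tail(detect_trend(flat), 1)    # "none"
tail(detect_trend(rising), 1)  # "up"
```

The first 20 entries are `NA` because a full window needs 7 days of history for the first delta plus 14 deltas. In the flat series every WoW delta is zero, so neither direction reaches the threshold despite the strong weekday pattern.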
binom.test(12, 14, p = 1/2, alternative = "greater")

	Exact binomial test

data:  12 and 14
number of successes = 12, number of trials = 14, p-value = 0.00647
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.6146103 1.0000000
sample estimates:
probability of success
             0.8571429
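The p-value above can be reproduced directly from the binomial distribution, which also shows how a threshold such as 12 out of 14 might be chosen. The helper names `p_at_least` and `min_threshold` are hypothetical, and the significance levels are examples rather than values stated in the analysis.

```r
n <- 14

# One-sided p-value for "at least k days in the same direction" under p = 0.5.
p_at_least <- function(k, n, p = 0.5) pbinom(k - 1, n, p, lower.tail = FALSE)

# (choose(14, 12) + choose(14, 13) + choose(14, 14)) / 2^14 = 106 / 16384,
# which is about 0.00647 and matches the binom.test() output.
p_at_least(12, n)

# Tail probability for every possible threshold from 0 to 14.
tail_probs <- sapply(0:n, p_at_least, n = n)
names(tail_probs) <- 0:n

# Smallest threshold whose tail probability falls below a chosen alpha.
min_threshold <- function(alpha) min(which(tail_probs < alpha)) - 1
min_threshold(0.05)  # 11
min_threshold(0.01)  # 12
```

This is a one-directional calculation. If both upward and downward trends are flagged, the chance of a false alarm in either direction is roughly double the one-sided value.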
Compared with a rolling average, this binomial approach responds more quickly to recent changes, and it is less sensitive to day-of-week variation because each WoW delta compares a day with the same weekday one week earlier. Together, these properties lead to more reliable trend detection.