read_csv("derived_data/task1_results.csv")
## Rows: 50 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): dimension, side_length, estimated_clusters
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 50 × 3
##    dimension side_length estimated_clusters
##        <dbl>       <dbl>              <dbl>
##  1         6          10                  6
##  2         6           9                  6
##  3         6           8                  6
##  4         6           7                  6
##  5         6           6                  6
##  6         6           5                  6
##  7         6           4                  6
##  8         6           3                  1
##  9         6           2                  1
## 10         6           1                  1
## # ℹ 40 more rows
knitr::include_graphics("figures/task1_plot.png")
For each dimension n, the gap statistic reliably estimates the correct number of clusters when the cluster centers are sufficiently far apart: the estimate remains close to the true number n when the side length is large. The estimate begins to drop below the true number of clusters between side lengths 4 and 3. In the rows shown above for n = 6, for example, the estimate is 6 for side lengths 4 through 10 and 1 for side lengths 3 and below.
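The experiment described above can be sketched roughly as follows. This is a minimal illustration, not the code that produced `task1_results.csv`: the helper names `generate_hypercube_clusters` and `estimate_clusters`, the placement of the i-th center at `side_length` along the i-th axis, the number of points per cluster, the noise level, and the use of k-means inside `cluster::clusGap()` are all assumptions made for the example.

```r
library(cluster)  # provides clusGap() and maxSE()

# Hypothetical generator: n Gaussian clusters in n dimensions, with the i-th
# center placed at side_length along the i-th axis (an assumption).
generate_hypercube_clusters <- function(n, side_length, k = 100, noise_sd = 1) {
  centers <- diag(side_length, n)                       # n x n matrix of centers
  points <- centers[rep(seq_len(n), each = k), , drop = FALSE]
  points + matrix(rnorm(n * k * n, sd = noise_sd), ncol = n)
}

# Estimate the number of clusters with the gap statistic, using k-means as
# the clustering function (an assumption) and the "firstSEmax" selection rule.
estimate_clusters <- function(x, k_max) {
  gap <- clusGap(x, FUNcluster = kmeans, K.max = k_max, B = 20,
                 nstart = 20, iter.max = 50, verbose = FALSE)
  maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
}

set.seed(42)
x <- generate_hypercube_clusters(n = 3, side_length = 10)
estimate_clusters(x, k_max = 5)
```

Repeating the last two lines over a grid of `n` and `side_length` values would give a table of the same shape as the one above; the exact estimates depend on the seed, the noise level, and the selection rule.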
# Test the function
read_csv("derived_data/task2_test.csv")
## Rows: 400 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): x, y, z, shell
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 400 × 4
##         x      y       z shell
##     <dbl>  <dbl>   <dbl> <dbl>
##  1 -0.270  1.12   0.236      1
##  2  0.287 -1.17  -0.427      1
##  3 -1.03   0.663 -0.0279     1
##  4  0.376 -0.340  1.10       1
##  5  1.07  -0.422 -0.0395     1
##  6  0.747  0.220  0.972      1
##  7 -0.645 -0.115  0.971      1
##  8  0.825 -0.661  0.236      1
##  9 -1.13  -0.379 -0.216      1
## 10 -0.916  0.256 -0.947      1
## # ℹ 390 more rows
# Interactive 3D scatter plot with plotly
knitr::include_url("figures/task2_testplot.html")
read_csv("derived_data/task2_simulation.csv")
## Rows: 11 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): max_radius, est_clusters
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 11 × 2
##    max_radius est_clusters
##         <dbl>        <dbl>
##  1         10            2
##  2          9            2
##  3          8            2
##  4          7            5
##  5          6            4
##  6          5            1
##  7          4            1
##  8          3            1
##  9          2            1
## 10          1            1
## 11 0 1
knitr::include_graphics("figures/task2_plot.png")
The algorithm doesn’t consistently detect the correct number of
clusters. As the maximum radius decreases, the number of clusters
identified generally decreases as well, though not monotonically: the
estimate is 2 for maximum radii of 8 to 10, rises to 5 and 4 at radii 7
and 6, and collapses to 1 from radius 5 downward. The failure point is
around a radius of 4 to 5, where the shells start overlapping and the
algorithm can’t distinguish between them anymore. If we were to decrease
the threshold, the algorithm would fail earlier because the adjacency
graph would become disconnected sooner. If we were to increase the
threshold, the algorithm could do a better job of detecting the correct
number of clusters, but it risks merging shells if the threshold is set
too high.
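
The effect of the threshold on the adjacency graph can be illustrated with a small base-R sketch. It is not the clustering code used for the simulation above: `make_shells` and `count_components` are hypothetical helpers, the shell radii and noise level are arbitrary, and the sketch only counts connected components of the thresholded graph (the number of zero eigenvalues of the graph Laplacian) rather than running the full cluster estimation.

```r
# Hypothetical generator: noisy points on concentric spherical shells in 3D.
make_shells <- function(radii, k_per_shell, noise_sd = 0.1) {
  shells <- lapply(seq_along(radii), function(i) {
    dirs <- matrix(rnorm(k_per_shell * 3), ncol = 3)
    dirs <- dirs / sqrt(rowSums(dirs^2))        # unit vectors, one per row
    r <- radii[i] + rnorm(k_per_shell, sd = noise_sd)
    cbind(dirs * r, shell = i)                  # x, y, z, shell label
  })
  do.call(rbind, shells)
}

# Connect two points when their distance is below the threshold, then count
# connected components as the number of (numerically) zero eigenvalues of
# the unnormalized graph Laplacian.
count_components <- function(x, threshold) {
  d <- as.matrix(dist(x))
  adj <- (d < threshold) * 1
  diag(adj) <- 0
  laplacian <- diag(rowSums(adj)) - adj
  ev <- eigen(laplacian, symmetric = TRUE, only.values = TRUE)$values
  sum(abs(ev) < 1e-8)
}

set.seed(1)
pts <- make_shells(radii = c(2, 6), k_per_shell = 100)
for (thr in c(0.5, 1, 3, 5)) {
  cat("threshold", thr, "->", count_components(pts[, 1:3], thr), "components\n")
}
```

A small threshold leaves many points isolated, so the graph splits into far more components than there are shells; a threshold larger than the gap between shells links them into a single component. The useful range lies between those two extremes and narrows as the shells move closer together.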