Task 1

# Load readr for read_csv(), then read the Task 1 simulation results
library(readr)
read_csv("derived_data/task1_results.csv")

## Rows: 50 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): dimension, side_length, estimated_clusters
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 50 × 3
##    dimension side_length estimated_clusters
##        <dbl>       <dbl>              <dbl>
##  1         6          10                  6
##  2         6           9                  6
##  3         6           8                  6
##  4         6           7                  6
##  5         6           6                  6
##  6         6           5                  6
##  7         6           4                  6
##  8         6           3                  1
##  9         6           2                  1
## 10         6           1                  1
## # ℹ 40 more rows

knitr::include_graphics("figures/task1_plot.png")

For each dimension n, the gap statistic reliably recovers the true number of clusters, n, when the cluster centers are sufficiently far apart, that is, when the side length is large. The estimate begins to drop below n between side lengths 4 and 3. In the rows shown above for n = 6, for example, the estimate is 6 for side lengths 4 through 10 and collapses to 1 at side length 3 and below.
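The kind of experiment summarized above can be sketched as follows. This is a minimal illustration, not the code behind the report: the helper names `generate_hypercube_clusters` and `estimate_clusters`, the cluster layout (cluster i centered at `side_length` along axis i), the noise level, and the choice of k-means as the clustering function are all assumptions. Only `clusGap()` and `maxSE()` from the cluster package are real library calls.

```r
library(cluster)  # provides clusGap() and maxSE()

set.seed(611)

# Hypothetical generator (assumed layout): n Gaussian clusters in n dimensions,
# with cluster i centered at side_length along axis i and k points per cluster.
generate_hypercube_clusters <- function(n, k = 50, side_length = 10, noise_sd = 1) {
  centers <- diag(side_length, n)          # row i is the center of cluster i
  idx <- rep(seq_len(n), each = k)         # cluster membership of each point
  noise <- matrix(rnorm(n * k * n, sd = noise_sd), ncol = n)
  centers[idx, , drop = FALSE] + noise
}

# Estimate the number of clusters with the gap statistic, using k-means as the
# clustering function (an assumption; the report does not say which was used).
estimate_clusters <- function(x, k_max = 8, B = 20) {
  gap <- clusGap(x, FUNcluster = kmeans, K.max = k_max, B = B,
                 verbose = FALSE, nstart = 20, iter.max = 50)
  maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
}

# Example: three clusters in three dimensions at two side lengths
far  <- generate_hypercube_clusters(n = 3, side_length = 10)
near <- generate_hypercube_clusters(n = 3, side_length = 1)
estimate_clusters(far,  k_max = 6)
estimate_clusters(near, k_max = 6)
```

Repeating the last step over a grid of dimensions and side lengths would produce a table with the same columns as `task1_results.csv`. The exact side length at which the estimate collapses depends on the noise level and the number of points per cluster assumed here.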

Task 2

# Test the shell-generating function: load a sample of its output (x, y, z coordinates plus shell label)
read_csv("derived_data/task2_test.csv")

## Rows: 400 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (4): x, y, z, shell
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 400 × 4
##         x      y       z shell
##     <dbl>  <dbl>   <dbl> <dbl>
##  1 -0.270  1.12   0.236      1
##  2  0.287 -1.17  -0.427      1
##  3 -1.03   0.663 -0.0279     1
##  4  0.376 -0.340  1.10       1
##  5  1.07  -0.422 -0.0395     1
##  6  0.747  0.220  0.972      1
##  7 -0.645 -0.115  0.971      1
##  8  0.825 -0.661  0.236      1
##  9 -1.13  -0.379 -0.216      1
## 10 -0.916  0.256 -0.947      1
## # ℹ 390 more rows

# Interactive 3D scatter plot with plotly
knitr::include_url("figures/task2_testplot.html")

read_csv("derived_data/task2_simulation.csv")

## Rows: 11 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): max_radius, est_clusters
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 11 × 2
##    max_radius est_clusters
##         <dbl>        <dbl>
##  1         10            2
##  2          9            2
##  3          8            2
##  4          7            5
##  5          6            4
##  6          5            1
##  7          4            1
##  8          3            1
##  9          2            1
## 10          1            1
## 11          0            1

knitr::include_graphics("figures/task2_plot.png")

The algorithm doesn’t consistently detect the correct number of clusters. As the maximum radius decreases, the estimate is unstable at first (2 for radii 8 to 10, then 5 and 4 at radii 7 and 6) and then collapses to a single cluster at a radius of 5 and below. The failure point is therefore around a radius of 4 to 5, where the shells start overlapping and the algorithm can no longer distinguish between them. Decreasing the threshold would make the algorithm fail earlier, because the adjacency graph would become disconnected sooner. Increasing the threshold could help the algorithm detect the correct number of clusters, but a threshold set too high risks merging neighboring shells.
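The sensitivity to the threshold can be explored with a small base-R sketch. It is an illustration under stated assumptions, not the report's implementation: the helper names `generate_shells`, `threshold_adjacency`, and `count_components`, the evenly spaced shell radii, the noise level, and the parameter name `d_threshold` are all hypothetical. It builds only the thresholded adjacency graph and counts its connected components; any spectral clustering step applied to that graph is omitted.

```r
set.seed(611)

# Hypothetical generator: points spread uniformly over concentric spherical
# shells with radii evenly spaced up to max_radius, plus small radial noise.
generate_shells <- function(n_shells = 4, k_per_shell = 100,
                            max_radius = 10, noise_sd = 0.1) {
  radii <- seq(max_radius / n_shells, max_radius, length.out = n_shells)
  do.call(rbind, lapply(seq_len(n_shells), function(s) {
    dirs <- matrix(rnorm(k_per_shell * 3), ncol = 3)
    dirs <- dirs / sqrt(rowSums(dirs^2))   # unit vectors, uniform on the sphere
    r <- radii[s] + rnorm(k_per_shell, sd = noise_sd)
    data.frame(x = r * dirs[, 1], y = r * dirs[, 2], z = r * dirs[, 3], shell = s)
  }))
}

# Adjacency matrix: two points are connected when closer than d_threshold.
threshold_adjacency <- function(xyz, d_threshold) {
  a <- (as.matrix(dist(xyz)) < d_threshold) * 1
  diag(a) <- 0
  a
}

# Count connected components with a breadth-first search over the adjacency matrix.
count_components <- function(a) {
  label <- integer(nrow(a))                # 0 means "not visited yet"
  current <- 0L
  for (start in seq_len(nrow(a))) {
    if (label[start] != 0L) next
    current <- current + 1L
    label[start] <- current
    frontier <- start
    while (length(frontier) > 0) {
      # unvisited neighbours of any node in the current frontier
      nb <- which(colSums(a[frontier, , drop = FALSE]) > 0 & label == 0L)
      label[nb] <- current
      frontier <- nb
    }
  }
  current
}

pts <- generate_shells(max_radius = 10)
xyz <- as.matrix(pts[, c("x", "y", "z")])
sapply(c(0.5, 1, 2, 4), function(th) count_components(threshold_adjacency(xyz, th)))
```

Because raising the threshold can only add edges, the component count never increases as the threshold grows. A very small threshold fragments each shell into many pieces, and a very large one joins all shells into a single component, which mirrors the trade-off described above.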

BIOS 611: Clustering HW

Nathalie Blasco

2025-10-22
