# 清风的技术小屋

0%

## 介绍

• 按需选择
• 观察法
• 手肘法
• Gap Statistic方法

## 手肘法

### Trick

Note: 以下为公式推导内容，可略过。将 $$D_k$$ 表达式分解可得： \begin{align} & \sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert x_i - x_j \Vert^2 \\ = & \sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert (x_i - \mu_k) - (x_j - \mu_k) \Vert^2 \\ = & \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left( \Vert x_i - \mu_k \Vert^2 + \Vert x_j - \mu_k \Vert^2 - 2 \left< x_i - \mu_k, x_j - \mu_k \right> \right) \\ = & \sum_{x_i \in C_k} n_k \Vert x_i - \mu_k \Vert^2 + \sum_{x_j \in C_k} n_k \Vert x_j - \mu_k \Vert^2 - 2 \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right> \\ = & 2 \sum_{x_i \in C_k} n_k \Vert x_i - \mu_k \Vert^2 - 2 \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right> \end{align} 其中 \begin{align} & \sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right> \\ = & \sum_{x_i \in C_k} \left< x_i - \mu_k, \sum_{x_j \in C_k} (x_j - \mu_k) \right> \end{align} 由于 $\sum_{x_j \in C_k} (x_j - \mu_k) = \sum_{x_j \in C_k} x_j - n_k \mu_k = \; n_k \mu_k - n_k \mu_k = 0$$\sum_{x_i \in C_k} \sum_{x_j \in C_k} \left< x_i - \mu_k, x_j - \mu_k \right> = 0$ 因此可以得到 $\sum_{x_i \in C_k} \sum_{x_j \in C_k} \Vert x_i - x_j \Vert^2 = 2 n_k \sum_{x_i \in C_k} \Vert x_i - \mu_k \Vert^2$

## Gap Statistic方法

1. cluster the observed data, varying the total number of clusters from $$K=1, 2, \ldots, K_{\text{max}}$$, and giving within-dispersion measures $$W_K, K=1, 2, \ldots, K_{\text{max}}$$

2. generate $$B$$ reference data sets and cluster each one giving within-dispersion measures $$W_{Kb}$$, $$b=1,2, \ldots, B$$, $$K=1, 2, \ldots, K_{\text{max}}$$. Compute the gap statistic $\text{Gap}(K) =\frac{1}{B} \sum_{b=1}^B \log(W_{Kb}) - \log(W_K)$

3. let $$\bar{w} = \frac{1}{B} \sum_{b=1}^B \log(W_{Kb})$$, compute the standard deviation $\text{sd}_K = \left[ \frac{1}{B} \sum_{b=1}^B (\log(W_{Kb}) - \bar{w})^2 \right]^{\frac{1}{2}}$ and define $$s_K = \text{sd}_K \sqrt{1 + \frac{1}{B}}$$

4. choose the number of clusters as the smallest $$K$$ such that $$\text{Gap(K)} \ge \text{Gap(K+1)} - s_{K+1}$$

## 参考

1. K-means怎么选K?
2. k-means的k值该如何确定？
3. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic[J]. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2001, 63(2): 411-423.