
Issue with cluster assignment after clustering


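I have a question about understanding cluster assignment in k-means clustering. Specifically, I know that a point is assigned to the nearest cluster (the shortest distance to a cluster center), but I cannot reproduce the results. Details follow.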

Suppose I have a data frame df1:

set.seed(16)
df1 = data.frame(matrix(sample(1:50, replace = T), ncol=10, nrow=10000))
head(df1, n=4)

  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 35 35 35 35 35 35 35 35 35  35
2 13 13 13 13 13 13 13 13 13  13
3 23 23 23 23 23 23 23 23 23  23
4 12 12 12 12 12 12 12 12 12  12

On that data frame, I want to run k-means clustering (with scaling):

for_clst_km = scale(df1, center=F) #standardization with z-scores

kclust = 6 #number of clusters
Clusters <- kmeans(for_clst_km, kclust)

Once the clustering is done, I can attach the cluster assignments to the original data frame:

df1$cluster = Clusters$cluster

For testing purposes, let's pick cluster 3.

library(dplyr)
cluster3 = df1 %>% filter(cluster == 3)

Because I want to scale cluster3 first, I need to drop the cluster column and then apply z-standardization:

cluster3$cluster = NULL

cluster3_1 = (cluster3-colMeans(df1))/apply(df1,2,sd)

Now that the values in cluster3_1 are scaled, I can compute the distance to each cluster's center:

centroids = data.matrix(Clusters$centers)

dist_to_clust1 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[1,])^2)))
dist_to_clust2 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[2,])^2)))
dist_to_clust3 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[3,])^2)))
dist_to_clust4 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[4,])^2)))
dist_to_clust5 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[5,])^2)))
dist_to_clust6 = apply(cluster3_1, 1, function(x) sqrt(sum((x-centroids[6,])^2)))

dist_to_clust = cbind(dist_to_clust1, dist_to_clust2, dist_to_clust3, dist_to_clust4, dist_to_clust5, dist_to_clust6)

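Finally, looking at the distances to each cluster, it is obvious that I am doing something wrong. For example, in the fifth row the point is closest to cluster 4 (that is the smallest value), even though every one of these points was assigned to cluster 3.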

head(dist_to_clust)

     dist_to_clust1 dist_to_clust2 dist_to_clust3 dist_to_clust4 dist_to_clust5 dist_to_clust6
[1,]      11.015929      11.116591      10.946547      11.173597      11.034535      10.968986
[2,]      13.136060      12.848511      12.967084      13.379930      12.840414      12.861085
[3,]      13.681588      13.314994      13.492713      13.942535      13.322293      13.360695
[4,]      10.506083      10.725233      10.467843      10.636465      10.621233      10.529714
[5,]       2.157906       5.392285       3.120574       1.168265       4.855553       4.197457
[6,]      11.015929      11.116591      10.946547      11.173597      11.034535      10.968986

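I believe the error is in the scaling step. I am not sure whether I can use the mean and standard deviation of the whole data frame to scale the cluster 3 points.

Could you share your thoughts on what I am doing wrong? Thank you very much!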

2 Answers

  • 0

    From my answer on Cross Validated:


    This is because df-colMeans(df) does not do what you think it does.

    Let's try it in code:

    a=matrix(1:9,nrow=3)
    
         [,1] [,2] [,3]
    [1,]    1    4    7
    [2,]    2    5    8
    [3,]    3    6    9
    
    colMeans(a)
    
    [1] 2 5 8
    
    a-colMeans(a)
    
         [,1] [,2] [,3]
    [1,]   -1    2    5
    [2,]   -3    0    3
    [3,]   -5   -2    1
    
    apply(a,2,function(x) x-mean(x))
    
         [,1] [,2] [,3]
    [1,]   -1   -1   -1
    [2,]    0    0    0
    [3,]    1    1    1
    

    You will see that a-colMeans(a) does something different from apply(a,2,function(x) x-mean(x)), and the latter is what you want for centering.
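The difference comes from how R recycles a vector against a matrix: it walks down the columns, so the three means are matched to rows rather than to columns. A minimal sketch of that, reusing the matrix a from above (the names m, recycled and centered are only for illustration), with sweep() as one base R way to subtract per column:

```r
a <- matrix(1:9, nrow = 3)
m <- colMeans(a)                      # 2 5 8

# What a - m really subtracts: m is recycled down each column,
# so every column of `recycled` is 2, 5, 8
recycled <- matrix(m, nrow = nrow(a), ncol = ncol(a))
same_as_recycled <- isTRUE(all.equal(a - m, a - recycled))

# Subtracting the j-th mean from the j-th column instead
centered <- sweep(a, 2, m)            # MARGIN = 2 means "per column"
same_as_apply <- isTRUE(all.equal(centered, apply(a, 2, function(x) x - mean(x))))
```

Both flags should come out TRUE: the first confirms the recycling pattern, the second that sweep() matches the apply() version.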

    You can write an apply call that does the full autoscaling for you:

    apply(a,2,function(x) (x-mean(x))/sd(x))
    
         [,1] [,2] [,3]
    [1,]   -1   -1   -1
    [2,]    0    0    0
    [3,]    1    1    1
    
    scale(a)
    
         [,1] [,2] [,3]
    [1,]   -1   -1   -1
    [2,]    0    0    0
    [3,]    1    1    1
    attr(,"scaled:center")
    [1] 2 5 8
    attr(,"scaled:scale")
    [1] 1 1 1
    

    But there is no point in doing that, because scale does it for you. :)
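If you ever do need to scale a subset of rows with the statistics of the full data (which is what the question attempts by hand), one option is to reuse the attributes that scale() stores on its result. This is only a sketch on the toy matrix a; the names a_sc, ctr, scl and sub are made up for the example:

```r
a <- matrix(1:9, nrow = 3)
a_sc <- scale(a)

# scale() records the centers and scales it used as attributes
ctr <- attr(a_sc, "scaled:center")
scl <- attr(a_sc, "scaled:scale")

# Scale a subset of rows with the full-data statistics
sub <- a[2:3, , drop = FALSE]
sub_sc <- scale(sub, center = ctr, scale = scl)

# The result should agree with the matching rows of the fully scaled matrix
matches <- isTRUE(all.equal(as.vector(sub_sc), as.vector(a_sc[2:3, ])))
```

In the question's setting this is unnecessary, since the scaled rows can be taken straight from the scaled matrix, but it shows that the subset and the full data end up on the same scale.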


    Also, trying out the clustering:

    set.seed(16)
    nc=10
    nr=10000
    # Make sure you draw enough samples: There was extreme periodicity in your sampling
    df1 = matrix(sample(1:50, size=nr*nc,replace = T), ncol=nc, nrow=nr)
    head(df1, n=4)
    
    for_clst_km = scale(df1) #standardization with z-scores
    nclust = 4 #number of clusters
    Clusters <- kmeans(for_clst_km, nclust)
    
    # For extracting scaled values: They are already available in for_clst_km
    cluster3_sc=for_clst_km[Clusters$cluster==3,]
    
    # Simplify code by putting distance in function
    distFun=function(mat,centre) apply(mat, 1, function(x) sqrt(sum((x-centre)^2)))
    
    centroids=Clusters$centers
    dists=matrix(nrow=nrow(cluster3_sc),ncol=nclust) # Allocate matrix
    for(d in 1:nclust) dists[,d]=distFun(cluster3_sc,centroids[d,])  # Calculate observation distances to centroid d=1..nclust
    
    whichMins=apply(dists,1,which.min) # Calculate the closest centroid per observation
    table(whichMins) # Tabularize
    
    > table(whichMins)
    whichMins
       3 
    2532
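The same check can be run over all observations at once instead of one cluster at a time. The following is a sketch on small toy data (two well-separated blobs, an assumption chosen so that k-means converges cleanly), not on the 10000-row matrix above; x, x_sc, km, dists and nearest are illustrative names:

```r
set.seed(1)
# Toy data (assumption for illustration): two well-separated blobs
x <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.1), ncol = 2),
           matrix(rnorm(40, mean = 10, sd = 0.1), ncol = 2))
x_sc <- scale(x)
km <- kmeans(x_sc, centers = 2, nstart = 5)

# Euclidean distance from every row to every centroid: an n x k matrix
dists <- sapply(seq_len(nrow(km$centers)), function(j) {
  sqrt(rowSums(sweep(x_sc, 2, km$centers[j, ])^2))
})
nearest <- apply(dists, 1, which.min)

# Cross-tabulate the k-means labels against the nearest centroid
table(assigned = km$cluster, nearest = nearest)
```

When the distances are computed on the same scaled data that was passed to kmeans, the table should be diagonal, meaning each point sits in the cluster whose centroid is closest.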
    

    HTH HAND,
    卡尔

  • 1

    Your hand-written scaling code is broken. Check the standard deviation of the resulting data; it is not 1.
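A quick way to run that check is to take the per-column standard deviation of the scaled result. The sketch below uses made-up data (x, z_full and z_nocenter are illustrative names) and also compares scale(x, center = FALSE), as used in the question; with center = FALSE, scale() divides by the root mean square rather than the standard deviation, so that result is not a z-score either:

```r
set.seed(1)
x <- matrix(rnorm(200, mean = 5, sd = 2), ncol = 2)

z_full     <- scale(x)                  # center, then divide by column sd
z_nocenter <- scale(x, center = FALSE)  # divide by column root mean square only

sd_full     <- apply(z_full, 2, sd)     # equal to 1 up to rounding
sd_nocenter <- apply(z_nocenter, 2, sd) # not 1 when the column means are far from 0
```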

    Why not simply use

    # for_clst_km is a matrix without a cluster column, so index its rows by the kmeans assignment
    cluster3 = for_clst_km[Clusters$cluster == 3, ]
    
