
Large distance matrix in clustering


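I am running R 3.2.3 on a machine with 16 GB of RAM. I have a large matrix of 300,000 rows × 12 columns. I want to use a hierarchical clustering algorithm in R, so before doing that I am trying to create a distance matrix. Since the data is of mixed type, I use a different distance matrix for each type. I get errors about memory allocation: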

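library(geosphere)  # distm (presumably the source of distm used here)
library(e1071)      # hamming.distance (presumably)
library(magrittr)   # %>%

df <- as.data.frame(matrix(rnorm(36*10^5), nrow = 3*10^5))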
d1=as.dist(distm(df[,c(1:2)])/10^5)
d2=dist(df[,c(3:8)], method = "euclidean") 
d3= hamming.distance(df[,c(9:12)]%>%as.matrix(.))%>%as.dist(.)

I get the following errors:

> d1=as.dist(distm(df1[,c(1:2)])/10^5)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, ncol = n, nrow = n) :
Reached total allocation of 16070Mb: see help(memory.size)
> d2=dist(df1[,c(3:8)], method = "euclidean") 
Error: cannot allocate vector of size 335.3 Gb
In addition: Warning messages:
1: In dist(df1[, c(3:8)], method = "euclidean") :
 Reached total allocation of 16070Mb: see help(memory.size)
2: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
3: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
4: In dist(df1[, c(3:8)], method = "euclidean") :
Reached total allocation of 16070Mb: see help(memory.size)
> d3= hamming.distance(df1[,c(9:12)]%>%as.matrix(.))%>%as.dist(.)
Error: cannot allocate vector of size 670.6 Gb
In addition: Warning messages:
1: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
2: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
3: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)
4: In matrix(0, nrow = nrow(x), ncol = nrow(x)) :
Reached total allocation of 16070Mb: see help(memory.size)

1 Answer

  • 3

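    Put simply, suppose you want the minimum distance from one row (A) to the rows of a matrix (B) with 3 * 10^8 rows.

    The original approach is:

    1. load A and B
    2. compute the distance between A and each row of B
    3. select the smallest of the results (reduction)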
    

    But since B is very large, it either cannot be loaded into memory or the computation fails during execution.
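
    The same limit applies to the question's own data. As a rough check, assuming 8 bytes per double, the sizes reported in the question's error messages follow directly from n = 300,000 rows (the variable names below are only illustrative):

    ```r
    n <- 3e5                 # number of rows in the question's matrix
    bytes_per_double <- 8

    # Full n x n matrix, as allocated by matrix(0, ncol = n, nrow = n)
    full_gib <- n * n * bytes_per_double / 1024^3

    # Lower triangle only, n * (n - 1) / 2 entries, as stored by dist()
    dist_gib <- n * (n - 1) / 2 * bytes_per_double / 1024^3

    print(round(c(full = full_gib, dist = dist_gib), 1))
    # full: 670.6, dist: 335.3
    ```

    Both figures are far beyond 16 GB, so no complete distance matrix for this data fits in memory, whatever the distance function.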

    The batch approach would look like this:

    1. load A (assume it is small)
    2. load B.partial with rows 1 to 10^5 of B
    3. compute the distance between A and each row of B.partial
    4. select the minimum of the partial results and save it as res[i]
    5. go back to step 2 and load the next 10^5 rows of B
    6. at the end you have 3000 partial results saved in res[1:3000]
    7. reduction: select the minimum of res[1:3000]
       note: if you need all distances, as the `dist` function returns, skip the reduction and keep the full array of distances.
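
    The steps above can be sketched in R as follows. This is only a minimal illustration, not the final batch code: B is a small simulated matrix held in memory, and `chunk_size`, `B_partial` and `res` are made-up names. With real data, each chunk would be read from disk instead of being sliced from B.

    ```r
    set.seed(1)
    A <- rnorm(6)                            # the single query row
    B <- matrix(rnorm(1000 * 6), ncol = 6)   # small stand-in for the large matrix
    chunk_size <- 100                        # 10^5 in the steps above

    starts <- seq(1, nrow(B), by = chunk_size)
    res <- numeric(length(starts))           # one partial minimum per chunk

    for (i in seq_along(starts)) {
      rows <- starts[i]:min(starts[i] + chunk_size - 1, nrow(B))
      B_partial <- B[rows, , drop = FALSE]   # in practice: load this chunk from disk
      # Euclidean distance between A and every row of the chunk
      d <- sqrt(rowSums(sweep(B_partial, 2, A)^2))
      res[i] <- min(d)                       # partial reduction
    }

    min_dist <- min(res)                     # final reduction over all chunks
    ```

    Only one chunk of distances exists at any time, so peak memory depends on `chunk_size` rather than on the number of rows in B.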
    

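    The code will be a little more complex than the original, but this is a very common trick when dealing with big-data problems. For the computation part, you can refer to one of my previous answers here.

    I would appreciate it if you could paste your final batch-mode code here, so that others can learn from it too.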


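    Another interesting thing about dist is that it is one of the few functions in R's packages that support OpenMP. See the source code here, and how to compile with OpenMP here.

    So if you set OMP_NUM_THREADS to 4 or 8, depending on your machine, and run it again, you may see a substantial performance improvement!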

    void R_distance(double *x, int *nr, int *nc, double *d, int *diag,
        int *method, double *p)
    {
         int dc, i, j;
         size_t  ij;  /* can exceed 2^31 - 1 */
         double (*distfun)(double*, int, int, int, int) = NULL;
         #ifdef _OPENMP
            int nthreads;
         #endif
         .....
     }
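
    To check whether the thread setting makes any difference on a particular machine, a small timing script like the one below could be run once per value of OMP_NUM_THREADS, set in the shell before starting R. It is only a sketch: the matrix size is arbitrary, and whether dist runs in parallel at all depends on how the R installation was built.

    ```r
    # Thread setting visible to this R session (empty string if unset)
    threads <- Sys.getenv("OMP_NUM_THREADS")

    set.seed(1)
    x <- matrix(rnorm(2000 * 6), ncol = 6)   # small enough to fit in memory

    # Time a single call to dist()
    elapsed <- system.time(d <- dist(x, method = "euclidean"))["elapsed"]
    cat("OMP_NUM_THREADS =", threads, "| elapsed:", elapsed, "s\n")
    ```

    Comparing the elapsed times across runs shows directly whether more threads help for this build.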
    

    In addition, if you want to accelerate dist with a GPU, you can refer to the talk section on ParallelR.
