
Calculating the distance between each point and all other points in a matrix


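I'm new to Python and I need to implement a clustering algorithm. For that, I need to calculate the distances between the given input data points.

Consider the following input data -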

[[1,2,8],
     [7,4,2],
     [9,1,7],
     [0,1,5],
     [6,4,3]]

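What I'd like to achieve here is to calculate the distance from [1,2,8] to all the other points and find the point where that distance is smallest.

And I have to repeat this for every other point.

I tried to implement this with a FOR loop, but I'm sure SciPy/NumPy must have a function that can help me get this result efficiently.

I looked online, but the 'pdist' command could not get my work done.

Can someone guide me?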

TIA

4 Answers

  • 6

    Using np.linalg.norm combined with broadcasting (numpy outer subtraction), you can do:

    np.linalg.norm(a - a[:,None], axis=-1)
    

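    a[:,None] inserts a new axis into a, so a - a[:,None] subtracts every row from every other row thanks to broadcasting. np.linalg.norm then calculates np.sqrt(np.sum(np.square(...))) over the last axis.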


    import numpy as np

    a = np.array([[1,2,8],
         [7,4,2],
         [9,1,7],
         [0,1,5],
         [6,4,3]])
    
    np.linalg.norm(a - a[:,None], axis=-1)
    #array([[ 0.        ,  8.71779789,  8.1240384 ,  3.31662479,  7.34846923],
    #       [ 8.71779789,  0.        ,  6.164414  ,  8.18535277,  1.41421356],
    #       [ 8.1240384 ,  6.164414  ,  0.        ,  9.21954446,  5.83095189],
    #       [ 3.31662479,  8.18535277,  9.21954446,  0.        ,  7.        ],
    #       [ 7.34846923,  1.41421356,  5.83095189,  7.        ,  0.        ]])
    

    For example, the elements [0,1] and [0,2] correspond to:

    np.sqrt(np.sum((a[0] - a[1]) ** 2))
    # 8.717797887081348
    
    np.sqrt(np.sum((a[0] - a[2]) ** 2))
    # 8.1240384046359608
    

    respectively.
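The question also asks for the closest point to each row. One way to get there from the broadcasting distance matrix is sketched below; masking the diagonal with np.inf and the variable names nearest_idx and nearest_dist are my own illustrative choices, not part of the answer above.

```python
import numpy as np

a = np.array([[1, 2, 8],
              [7, 4, 2],
              [9, 1, 7],
              [0, 1, 5],
              [6, 4, 3]])

# Full pairwise Euclidean distance matrix via broadcasting, shape (5, 5).
dist = np.linalg.norm(a - a[:, None], axis=-1)

# Mask the zero self-distances so a point is never its own nearest neighbour.
np.fill_diagonal(dist, np.inf)

nearest_idx = dist.argmin(axis=1)   # index of the closest other row, per row
nearest_dist = dist.min(axis=1)     # the corresponding distance

print(nearest_idx)   # [3 4 4 0 1]
```

Indexing with a[nearest_idx] then gives the closest rows themselves.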

  • 3

    Here's one approach using SciPy's cdist -

    from scipy.spatial.distance import cdist
    def closest_rows(a):
        # Get squared euclidean distances as 2D array (the square root is
        # not needed, since it does not change which entry is the minimum)
        dists = cdist(a, a, 'sqeuclidean')
    
        # Fill diagonals with something greater than all elements as we intend
        # to get argmin indices later on and then index into input array with those
        # indices to get the closest rows
        dists.ravel()[::dists.shape[1]+1] = dists.max()+1
        return a[dists.argmin(1)]
    

    Sample run -

    In [72]: a
    Out[72]: 
    array([[1, 2, 8],
           [7, 4, 2],
           [9, 1, 7],
           [0, 1, 5],
           [6, 4, 3]])
    
    In [73]: closest_rows(a)
    Out[73]: 
    array([[0, 1, 5],
           [6, 4, 3],
           [6, 4, 3],
           [1, 2, 8],
           [7, 4, 2]])
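The diagonal assignment in closest_rows can look cryptic, so here is a small sketch of what it does. It assumes the distance matrix is C-contiguous (so that ravel() returns a view rather than a copy) and compares the stride trick with np.fill_diagonal, which is my suggested alternative spelling rather than something the answer uses.

```python
import numpy as np

# A small stand-in for a square distance matrix (values are arbitrary).
dists = np.arange(16, dtype=float).reshape(4, 4)

# Stride trick from the answer: in the flattened view of an n x n array,
# the diagonal entries sit at positions 0, n+1, 2*(n+1), ...
strided = dists.copy()
strided.ravel()[::strided.shape[1] + 1] = strided.max() + 1

# Equivalent and arguably more readable spelling.
filled = dists.copy()
np.fill_diagonal(filled, dists.max() + 1)

print(np.array_equal(strided, filled))   # True
```

Either way, the diagonal ends up larger than every real distance, so argmin can never pick a point as its own nearest neighbour.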
    

    Runtime test

    Other working approach -

    def norm_app(a): # @Psidom's soln
        dist = np.linalg.norm(a - a[:,None], axis=-1)
        dist[np.arange(dist.shape[0]), np.arange(dist.shape[0])] = np.nan
        return a[np.nanargmin(dist, axis=0)]
    

    Timings on 10,000 points -

    In [79]: a = np.random.randint(0,9,(10000,3))
    
    In [80]: %timeit norm_app(a) # @Psidom's soln
    1 loop, best of 3: 3.83 s per loop
    
    In [81]: %timeit closest_rows(a)
    1 loop, best of 3: 392 ms per loop
    

    Further performance boost

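    There's the eucl_dist package (disclaimer: I am its author), which contains various methods for computing euclidean distances that are more efficient than SciPy's cdist, especially for large arrays.

    Thus, making use of it, we would have a more performant version, like so -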

    from eucl_dist.cpu_dist import dist
    def closest_rows_v2(a):
        dists = dist(a,a, matmul="gemm", method="ext") 
        dists.ravel()[::dists.shape[1]+1] = dists.max()+1
        return a[dists.argmin(1)]
    

    Timings -

    In [162]: a = np.random.randint(0,9,(10000,3))
    
    In [163]: %timeit closest_rows(a)
    1 loop, best of 3: 394 ms per loop
    
    In [164]: %timeit closest_rows_v2(a)
    1 loop, best of 3: 229 ms per loop
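The matmul="gemm" argument suggests a matrix-multiplication-based computation. I can't confirm how eucl_dist works internally, but a common technique of that kind uses the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y. The plain NumPy sketch below illustrates that general idea only; sq_dists_matmul is a hypothetical helper name, not part of eucl_dist.

```python
import numpy as np

def sq_dists_matmul(x, y):
    """Squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_sq = np.einsum('ij,ij->i', x, x)[:, None]   # row norms squared, shape (n, 1)
    y_sq = np.einsum('ij,ij->i', y, y)[None, :]   # row norms squared, shape (1, m)
    d = x_sq + y_sq - 2.0 * (x @ y.T)             # one matrix product does the heavy work
    # Round-off can leave tiny negative values; clip them to zero.
    return np.maximum(d, 0.0)

a = np.array([[1, 2, 8], [7, 4, 2], [9, 1, 7], [0, 1, 5], [6, 4, 3]])
d2 = sq_dists_matmul(a, a)
print(d2[0])   # [ 0. 76. 66. 11. 54.]
```

Unlike the broadcasting approach, this never materialises an (n, n, 3) intermediate array, which is one plausible reason such methods scale better to large inputs.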
    
  • 1

    I suggest using pdist and squareform from scipy.spatial.distance

    Consider the following array of points:

    from scipy.spatial.distance import pdist, squareform
    a = np.array([[1,2,8], [7,4,2], [9,1,7], [0,1,5], [6,4,3]])
    

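    If you want to display all distances between the point [1,2,8] and the other points (row 0 of the matrix below; the remaining rows cover the other points):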

    squareform(pdist(a))
    
    Out[1]: array([[ 0.        ,  8.71779789,  8.1240384 ,  3.31662479,  7.34846923],
                   [ 8.71779789,  0.        ,  6.164414  ,  8.18535277,  1.41421356],
                   [ 8.1240384 ,  6.164414  ,  0.        ,  9.21954446,  5.83095189],
                   [ 3.31662479,  8.18535277,  9.21954446,  0.        ,  7.        ],
                   [ 7.34846923,  1.41421356,  5.83095189,  7.        ,  0.        ]])
    

    If you want to display the shortest distance between the point [1,2,8] and its closest point:

    sorted(squareform(pdist(a))[0])[1]
    
    Out[2]: 3.3166247903553998
    

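    [0] is the index of your first point ([1,2,8]).

    [1] is the index of the second-smallest value (to skip the zero self-distance).

    If you want to display the index of the point closest to [1,2,8]: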

    np.argsort(squareform(pdist(a))[0])[1]
    
    Out[3]: 3
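The question mentions that pdist alone did not do the job. A likely reason is that pdist returns a condensed 1-D vector with one entry per unordered pair, which is why squareform is needed to get the full matrix. The sketch below shows the two shapes and extends the argsort idea above to every row at once; it assumes there are no duplicate points, since a duplicate would tie with the zero self-distance.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

a = np.array([[1, 2, 8], [7, 4, 2], [9, 1, 7], [0, 1, 5], [6, 4, 3]])

condensed = pdist(a)          # shape (10,): n*(n-1)/2 pairwise distances
dm = squareform(condensed)    # shape (5, 5): symmetric, zeros on the diagonal

# Column 0 of the row-wise argsort is each point itself (distance 0),
# so column 1 holds the index of its nearest other point.
nearest_idx = np.argsort(dm, axis=1)[:, 1]

print(condensed.shape, dm.shape)   # (10,) (5, 5)
print(nearest_idx)                 # [3 4 4 0 1]
```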
    
  • 2

    From this thread, you can use the e_dist function given there and obtain the same results.

    Addendum

    Timing: on my memory-starved laptop, I can only do a comparison on a smaller sample than @Psidom's, using his norm_app function.

    a = np.random.randint(0,9,(5000,3))

    %timeit norm_app(a)
    1.91 s ± 13.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit e_dist(a,a)
    631 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    a 
    array([[1, 2, 8],
           [7, 4, 2],
           [9, 1, 7],
           [0, 1, 5],
           [6, 4, 3]])
    
    dm = e_dist(a, a)  # get the def from the link
    
    dm
    Out[7]: 
    array([[ 0.  ,  8.72,  8.12,  3.32,  7.35],
           [ 8.72,  0.  ,  6.16,  8.19,  1.41],
           [ 8.12,  6.16,  0.  ,  9.22,  5.83],
           [ 3.32,  8.19,  9.22,  0.  ,  7.  ],
           [ 7.35,  1.41,  5.83,  7.  ,  0.  ]])
    
    idx = np.argsort(dm)
    
    closest = a[idx]
    
    closest
    Out[10]: 
    array([[[1, 2, 8],
            [0, 1, 5],
            [6, 4, 3],
            [9, 1, 7],
            [7, 4, 2]],
    
           [[7, 4, 2],
            [6, 4, 3],
            [9, 1, 7],
            [0, 1, 5],
            [1, 2, 8]],
    
           [[9, 1, 7],
            [6, 4, 3],
            [7, 4, 2],
            [1, 2, 8],
            [0, 1, 5]],
    
           [[0, 1, 5],
            [1, 2, 8],
            [6, 4, 3],
            [7, 4, 2],
            [9, 1, 7]],
    
           [[6, 4, 3],
            [7, 4, 2],
            [9, 1, 7],
            [0, 1, 5],
            [1, 2, 8]]])
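The 3-D closest array above holds, for each point, all points ordered from nearest to farthest, with the point itself first. Pulling out just the nearest neighbour (or the k nearest) is a matter of slicing, as sketched below. Since the definition of e_dist is not shown here, the broadcasting expression is used as a stand-in for it, and the names nearest and two_nearest are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 8], [7, 4, 2], [9, 1, 7], [0, 1, 5], [6, 4, 3]])

# Stand-in for e_dist(a, a), whose definition is not reproduced in this thread.
dm = np.linalg.norm(a - a[:, None], axis=-1)

idx = np.argsort(dm, axis=1)    # per row: neighbour indices, nearest first
closest = a[idx]                # shape (5, 5, 3)

nearest = closest[:, 1]         # skip column 0, which is the point itself
two_nearest = closest[:, 1:3]   # the two nearest neighbours, shape (5, 2, 3)

print(nearest)
# [[0 1 5]
#  [6 4 3]
#  [6 4 3]
#  [1 2 8]
#  [7 4 2]]
```

This relies on every point being distinct; with duplicate rows, a tie at distance 0 means column 0 is not guaranteed to be the point itself.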
    
