首页 文章

模糊分组,对相似词进行分组

提问于
浏览
4

这个问题在此之前被问到过

What is a good strategy to group similar words?

但没有明确的答案如何“分组”项目 . 基于difflib的解决方案基本上是搜索,对于给定项目,difflib可以从列表中返回最相似的单词 . 但是如何将它用于分组呢?

我想减少

['ape', 'appel', 'apple', 'peach', 'puppy']

['ape', 'appel', 'peach', 'puppy']

要么

['ape', 'apple', 'peach', 'puppy']

我尝试过的一个想法是,对于每个项目,遍历列表,如果get_close_matches返回多个匹配,则使用它,如果不保持单词的原样 . 这部分工作,但它可以建议苹果的上诉,然后申请苹果,这些话只会切换位置,没有任何改变 .

我会很感激任何指针,图书馆名称等 .

注意:同样在性能方面,我们有300,000个项目的列表,get_close_matches似乎有点慢 . 有谁知道那里有基于C /的解决方案?

谢谢,

注意:进一步调查显示kmedoid是正确的算法(以及层次聚类),因为kmedoid不需要“中心”,它需要/使用数据点本身作为中心(这些点称为medoids,因此名称) . 在单词分组的情况下,medoid将是该组/集群的代表元素 .

5 回答

  • 3

    这是使用Affinity Propagation算法的另一个版本 .

    import numpy as np
    import scipy.linalg as lin
    import Levenshtein as leven
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.cluster import AffinityPropagation
    import itertools
    
    words = np.array(
        ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
         'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
         'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
         'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
         'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
         'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
         'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
         'people', 'into', 'year', 'your', 'good', 'some', 'could',
         'them', 'see', 'other', 'than', 'then', 'now', 'look',
         'only', 'come', 'its', 'over', 'think', 'also', 'back',
         'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
         'way', 'even', 'new', 'want', 'because', 'any', 'these',
         'give', 'day', 'most', 'us'])
    
    print "calculating distances..."
    
    (dim,) = words.shape
    
    f = lambda (x,y): -leven.distance(x,y)
    
    res=np.fromiter(itertools.imap(f, itertools.product(words, words)), dtype=np.uint8)
    A = np.reshape(res,(dim,dim))
    
    af = AffinityPropagation().fit(A)
    cluster_centers_indices = af.cluster_centers_indices_
    labels = af.labels_
    
    unique_labels = set(labels)
    for i in unique_labels:
        print words[labels==i]
    

    距离必须转换为相似之处,我通过消除距离来做到这一点 . 输出是

    ['to' 'you' 'do' 'by' 'so' 'who' 'go' 'into' 'also' 'two']
    ['it' 'with' 'at' 'if' 'get' 'its' 'first']
    ['of' 'for' 'from' 'or' 'your' 'look' 'after' 'work']
    ['the' 'be' 'have' 'I' 'he' 'we' 'her' 'she' 'me' 'give']
    ['this' 'his' 'which' 'him']
    ['and' 'a' 'in' 'an' 'my' 'all' 'can' 'any']
    ['on' 'one' 'good' 'some' 'see' 'only' 'come' 'over']
    ['would' 'could']
    ['but' 'out' 'about' 'our' 'most']
    ['make' 'like' 'time' 'take' 'back']
    ['that' 'they' 'there' 'their' 'when' 'them' 'other' 'than' 'then' 'think'
     'even' 'these']
    ['not' 'no' 'know' 'now' 'how' 'new']
    ['will' 'people' 'year' 'well']
    ['say' 'what' 'way' 'want' 'day']
    ['because']
    ['as' 'up' 'just' 'use' 'us']
    
  • 5

    您需要规范化组 . 在每个组中,选择一个单词或代表该组的编码 . 然后按其代表分组 .

    一些可能的方法:

    • 选择第一个遇到的单词 .

    • 选择词典第一个单词 .

    • 为所有单词导出模式 .

    • 选择一个唯一索引 .

    • 使用soundex作为模式 .

    但是,对单词进行分组可能很困难 . 如果A类似于B,并且B类似于C,则A和C不一定彼此相似 . 如果B是代表,则A和C都可以包括在该组中 . 但如果A或C是代表,则另一个不能包括在内 .


    通过第一个替代(第一个遇到的单词):

    class Seeder:
        def __init__(self):
            self.seeds = set()
            self.cache = dict()
    
        def get_seed(self, word):
            LIMIT = 2
            seed = self.cache.get(word,None)
            if seed is not None:
                return seed
            for seed in self.seeds:
                if self.distance(seed, word) <= LIMIT:
                    self.cache[word] = seed
                    return seed
            self.seeds.add(word)
            self.cache[word] = word
            return word
    
        def distance(self, s1, s2):
            l1 = len(s1)
            l2 = len(s2)
            matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
            for zz in xrange(0,l2):
                for sz in xrange(0,l1):
                    if s1[sz] == s2[zz]:
                        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                    else:
                        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
            return matrix[l2][l1]
    
    import itertools
    
    def group_similar(words):
        seeder = Seeder()
        words = sorted(words, key=seeder.get_seed)
        groups = itertools.groupby(words, key=seeder.get_seed)
        return [list(v) for k,v in groups]
    

    Example:

    import pprint
    
    print pprint.pprint(group_similar([
        'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
        'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
        'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
        'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
        'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
        'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
        'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
        'people', 'into', 'year', 'your', 'good', 'some', 'could',
        'them', 'see', 'other', 'than', 'then', 'now', 'look',
        'only', 'come', 'its', 'over', 'think', 'also', 'back',
        'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
        'way', 'even', 'new', 'want', 'because', 'any', 'these',
        'give', 'day', 'most', 'us'
    ]), width=120)
    

    Output:

    [['after'],
     ['also'],
     ['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
     ['back'],
     ['because'],
     ['but', 'about', 'get', 'just'],
     ['first'],
     ['from'],
     ['good', 'look'],
     ['have', 'make', 'give'],
     ['his', 'her', 'if', 'him', 'its', 'how', 'us'],
     ['into'],
     ['know', 'new'],
     ['like', 'time', 'take'],
     ['most'],
     ['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
     ['only'],
     ['over', 'our', 'even'],
     ['people'],
     ['say', 'she', 'way', 'day'],
     ['some', 'see', 'come'],
     ['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
     ['think'],
     ['well'],
     ['what', 'who', 'when', 'than'],
     ['with', 'will', 'which'],
     ['work'],
     ['would', 'could'],
     ['year', 'your']]
    
  • 3

    这是一种基于medoids的方法 . 首先安装MlPy . 在Ubuntu上

    sudo apt-get install python-mlpy
    

    然后

    import numpy as np
    import mlpy
    
    class distance:    
        def compute(self, s1, s2):
            l1 = len(s1)
            l2 = len(s2)
            matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
            for zz in xrange(0,l2):
                for sz in xrange(0,l1):
                    if s1[sz] == s2[zz]:
                        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                    else:
                        matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
            return matrix[l2][l1]
    
    x =  np.array(['ape', 'appel', 'apple', 'peach', 'puppy'])
    
    km = mlpy.Kmedoids(k=3, dist=distance())
    medoids,clusters,a,b = km.compute(x)
    
    print medoids
    print clusters
    print a
    
    print x[medoids] 
    for i,c in enumerate(x[medoids]):
        print "medoid", c
        print x[clusters[a==i]]
    

    输出是

    [4 3 1]
    [0 2]
    [2 2]
    ['puppy' 'peach' 'appel']
    medoid puppy
    []
    medoid peach
    []
    medoid appel
    ['ape' 'apple']
    

    更大的单词列表并使用k = 10

    medoid he
    ['or' 'his' 'my' 'have' 'if' 'year' 'of' 'who' 'us' 'use' 'people' 'see'
     'make' 'be' 'up' 'we' 'the' 'one' 'her' 'by' 'it' 'him' 'she' 'me' 'over'
     'after' 'get' 'what' 'I']
    medoid out
    ['just' 'only' 'your' 'you' 'could' 'our' 'most' 'first' 'would' 'but'
     'about']
    medoid to
    ['from' 'go' 'its' 'do' 'into' 'so' 'for' 'also' 'no' 'two']
    medoid now
    ['new' 'how' 'know' 'not']
    medoid time
    ['like' 'take' 'come' 'some' 'give']
    medoid because
    []
    medoid an
    ['want' 'on' 'in' 'back' 'say' 'and' 'a' 'all' 'can' 'as' 'way' 'at' 'day'
     'any']
    medoid look
    ['work' 'good']
    medoid will
    ['with' 'well' 'which']
    medoid then
    ['think' 'that' 'these' 'even' 'their' 'when' 'other' 'this' 'they' 'there'
     'than' 'them']
    
  • 1

    您必须在封闭的匹配单词中决定要使用的单词 . 可以从列表中获取get_close_matches返回的第一个元素,或者只是在该列表上使用随机函数并从闭合匹配中获取一个元素 .

    必须有某种规则,因为它..

    In [19]: import difflib
    
    In [20]: a = ['ape', 'appel', 'apple', 'peach', 'puppy']
    
    In [21]: a = ['appel', 'apple', 'peach', 'puppy']
    
    In [22]: b = difflib.get_close_matches('ape',a)
    
    In [23]: b
    Out[23]: ['apple', 'appel']
    
    In [24]: import random
    
    In [25]: c = random.choice(b)
    
    In [26]: c
    Out[26]: 'apple'
    
    In [27]:
    

    现在从初始列表中删除c,就是这样...对于c,你可以使用Levenshtein_distance

  • 0

    另一种方法可以是使用SVD的矩阵分解 . 首先我们创建单词距离矩阵,对于100个单词,这将是100 x 100矩阵,表示从每个单词到所有其他单词的距离 . 然后,在这个矩阵上运行SVD,得到的u,s,v中的u可以看作每个簇的隶属度 .

    import numpy as np
    import scipy.linalg as lin
    import Levenshtein as leven
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    import itertools
    
    words = np.array(
        ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
         'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
         'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
         'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
         'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
         'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
         'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
         'people', 'into', 'year', 'your', 'good', 'some', 'could',
         'them', 'see', 'other', 'than', 'then', 'now', 'look',
         'only', 'come', 'its', 'over', 'think', 'also', 'back',
         'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
         'way', 'even', 'new', 'want', 'because', 'any', 'these',
         'give', 'day', 'most', 'us'])
    
    print "calculating distances..."
    
    (dim,) = words.shape
    
    f = lambda (x,y): leven.distance(x,y)
    res=np.fromiter(itertools.imap(f, itertools.product(words, words)),
                    dtype=np.uint8)
    A = np.reshape(res,(dim,dim))
    
    print "svd..."
    
    u,s,v = lin.svd(A, full_matrices=False)
    
    print u.shape
    print s.shape
    print s
    print v.shape
    
    data = u[:,0:10]
    k=KMeans(init='k-means++', k=25, n_init=10)
    k.fit(data)
    centroids = k.cluster_centers_
    labels = k.labels_
    print labels
    
    for i in range(np.max(labels)):
        print words[labels==i]
    
    def dist(x,y):   
        return np.sqrt(np.sum((x-y)**2, axis=1))
    
    print "centroid points.."
    for i,c in enumerate(centroids):
        idx = np.argmin(dist(c,data[labels==i]))
        print words[labels==i][idx]
        print words[labels==i]
    
    plt.plot(centroids[:,0],centroids[:,1],'x')
    plt.hold(True)
    plt.plot(u[:,0], u[:,1], '.')
    plt.show()
    
    from mpl_toolkits.mplot3d import Axes3D
    fig = plt.figure()
    ax = Axes3D(fig)
    ax.plot(u[:,0], u[:,1], u[:,2],'.', zs=0,
            zdir='z', label='zs=0, zdir=z')
    plt.show()
    

    结果

    any
    ['and' 'an' 'can' 'any']
    do
    ['to' 'you' 'do' 'so' 'go' 'no' 'two' 'how']
    when
    ['who' 'when' 'well']
    my
    ['be' 'I' 'by' 'we' 'my' 'up' 'me' 'use']
    your
    ['for' 'or' 'out' 'about' 'your' 'our']
    its
    ['it' 'his' 'if' 'him' 'its']
    could
    ['would' 'people' 'could']
    this
    ['this' 'think' 'these']
    she
    ['the' 'he' 'she' 'see']
    back
    ['all' 'back' 'want']
    one
    ['of' 'on' 'one' 'only' 'even' 'new']
    just
    ['but' 'just' 'first' 'most']
    come
    ['some' 'come']
    that
    ['that' 'than']
    way
    ['say' 'what' 'way' 'day']
    like
    ['like' 'time' 'give']
    in
    ['in' 'into']
    get
    ['her' 'get' 'year']
    because
    ['because']
    will
    ['with' 'will' 'which']
    over
    ['other' 'over' 'after']
    as
    ['a' 'as' 'at' 'also' 'us']
    them
    ['they' 'there' 'their' 'them' 'then']
    good
    ['not' 'from' 'know' 'good' 'now' 'look' 'work']
    have
    ['have' 'make' 'take']
    

    对于簇数量的k的选择是重要的,例如,k = 25比k = 20给出更好的结果 .

    该代码还通过选择其u [..]坐标最接近聚类质心的单词为每个聚类选择代表性单词 .

相关问题