首页 文章

查找所有常见的非重叠子串

提问于
浏览
4

给定两个字符串,我想识别从最长到最短的所有常见子字符串 .

我想删除任何“子”子字符串 . 例如,'1234'的任何子串都不会包含在'12345'和'51234'之间的匹配中 .

string1 = '51234'

string2 = '12345'

result = ['1234', '5']

我想找到longest common substring,然后递归地找到左/右最长的子串 . 但是,我不希望在找到之后删除一个公共子字符串 . 例如,下面的结果在中间共享一个6:

string1 = '12345623456'

string2 = '623456'

result = ['623456', '23456']

最后,我需要针对数千个字符串的固定列表检查一个字符串 . 我不确定是否有一个聪明的步骤可以用来散列这些字符串中的所有子串 .

Previous Answers:

在这个thread中,发现了一个动态编程解决方案需要O(nm)时间,其中n和m是字符串的长度 . 我对更有效的方法感兴趣,它将使用后缀树 .

Background:

我正在用旋律片段创作歌曲旋律 . 有时,组合设法生成在现有的一行中匹配太多音符的旋律 .

我可以使用字符串相似性度量,例如编辑距离,但相信与旋律有很小差异的曲调是独特且有趣的 . 不幸的是,这些曲调与连续复制旋律的许多音符的歌曲具有相似的相似度 .

1 回答

  • 1

    让我们从树开始吧

    from collections import defaultdict
    def identity(x):
        return x
    
    
    class TreeReprMixin(object):
        def __repr__(self):
            base = dict(self)
            return repr(base)
    
    
    class PrefixTree(TreeReprMixin, defaultdict):
        '''
        A hash-based Prefix or Suffix Tree for testing for
        sequence inclusion. This implementation works for any
        slice-able sequence of hashable objects, not just strings.
        '''
        def __init__(self):
            defaultdict.__init__(self, PrefixTree)
            self.labels = set()
    
        def add(self, sequence, label=None):
            layer = self
            if label is None:
                label = sequence
            if label:
                layer.labels.add(label)
            for i in range(len(sequence)):
                layer = layer[sequence[i]]
                if label:
                    layer.labels.add(label)
    
            return self
    
        def add_ngram(self, sequence, label=None):
            if label is None:
                label = sequence
            for i in range(1, len(sequence) + 1):
                self.add(sequence[:i], label)
    
        def __contains__(self, sequence):
            layer = self
            j = 0
            for i in sequence:
                j += 1
                if not dict.__contains__(layer, i):
                    break
                layer = layer[i]
            return len(sequence) == j
    
        def depth_in(self, sequence):
            layer = self
            count = 0
            for i in sequence:
                if not dict.__contains__(layer, i):
                    print "Breaking"
                    break
                else:
                    layer = layer[i]
                count += 1
            return count
    
        def subsequences_of(self, sequence):
            layer = self
            for i in sequence:
                layer = layer[i]
            return layer.labels
    
        def __iter__(self):
            return iter(self.labels)
    
    
    class SuffixTree(PrefixTree):
        '''
        A hash-based Prefix or Suffix Tree for testing for
        sequence inclusion. This implementation works for any
        slice-able sequence of hashable objects, not just strings.
        '''
        def __init__(self):
            defaultdict.__init__(self, SuffixTree)
            self.labels = set()
    
        def add_ngram(self, sequence, label=None):
            if label is None:
                label = sequence
            for i in range(len(sequence)):
                self.add(sequence[i:], label=label)
    

    要填充树,您将使用 .add_ngram 方法 .

    下一部分有点棘手,因为您正在寻找并发遍历字符串同时跟踪树坐标 . 为了解决所有这些问题,我们需要一些在树上运行的函数和一个查询字符串

    def overlapping_substrings(string, tree, solved=None):
        if solved is None:
            solved = PrefixTree()
        i = 1
        last = 0
        matching = True
        solutions = []
        while i < len(string) + 1:
            if string[last:i] in tree:
                if not matching:
                    matching = True
                else:
                    i += 1
                    continue
            else:
                if matching:
                    matching = False
                    solutions.append(string[last:i - 1])
                    last = i - 1
                    i -= 1
            i += 1
        if matching:
            solutions.append(string[last:i])
        for solution in solutions:
            if solution in solved:
                continue
            else:
                solved.add_ngram(solution)
                yield solution
    
    def slide_start(string):
        for i in range(len(string)):
            yield string[i:]
    
    def seek_subtree(tree, sequence):
        # Find the node of the search tree which
        # is found by this sequence of items
        node = tree
        for i in sequence:
            if i in node:
                node = node[i]
            else:
                raise KeyError(i)
        return node
    
    def find_all_common_spans(string, tree):
        # We can keep track of solutions to avoid duplicates
        # and incomplete prefixes using a Prefix Tree
        seen = PrefixTree()
        for substring in slide_start(string):
            # Drive generator forward
            list(overlapping_substrings(substring, tree, seen))
        # Some substrings are suffixes of other substrings which you do not
        # want
        compress = SuffixTree()
        for solution in sorted(seen.labels, key=len, reverse=True):
            # A substrings may be a suffix of another substrings, but that substrings
            # is actually a repeating pattern. If a solution is
            # a repeating pattern, `not solution in seek_subtree(tree, solution)` will tell us.
            # Otherwise, discard the solution
            if solution in compress and not solution in seek_subtree(tree, solution):
                continue
            else:
                compress.add_ngram(solution)
        return compress.labels
    
    def search(query, corpus):
        tree = SuffixTree()
        if isinstance(corpus, SuffixTree):
            tree = corpus
        else:
            for elem in corpus:
                tree.add_ngram(elem)
        return list(find_all_common_spans(query, tree))
    

    所以现在做你想做的事,做到这一点:

    search("12345", ["51234"])
    search("623456", ["12345623456"])
    

    如果有什么不清楚的地方,请告诉我,我会尽力澄清 .

相关问题