
Removing a large list of strings from text


Suppose

txt='Daniel Johnson and Ana Hickman are friends. They know each other for a long time. Daniel Johnson is a professor and Ana Hickman is writer.'

is a large text, and I want to remove a large list of strings such as

removalLists=['Daniel Johnson','Ana Hickman']

from it. That is, I want to replace every element of the list with a single space:

' '

I know I can easily do this with a loop:

import re

for string in removalLists:
    txt=re.sub(string,' ',txt)

Is there a faster way to do this?

1 Answer

  • 3

    One approach is to build a single regex pattern that is an alternation of the terms to be replaced. I would therefore suggest a pattern such as:

    \bDaniel Johnson\b|\bAna Hickman\b
    

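    To generate it, we first wrap each term in word boundaries ( \b ). Then we join the list into a single string, using | as the separator. Finally, we use re.sub to replace every occurrence of any term with a single space.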

    import re

    txt = 'Daniel Johnson and Ana Hickman are friends. They know each other for a long time. Daniel Johnson is a professor and Ana Hickman is writer.'
    removalLists = ['Daniel Johnson','Ana Hickman']
    
    # Wrap each term in word boundaries and join the terms with '|'.
    regex = '|'.join([r'\b' + s + r'\b' for s in removalLists])
    # Replace every occurrence of any term with a single space.
    output = re.sub(regex, " ", txt)
    
    print(output)
    
      and   are friends. They know each other for a long time.   is a professor and   is writer.
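
    For a genuinely large list, the same idea can be wrapped in a helper that escapes each term and compiles the pattern once. This is only a sketch under some assumptions: the names build_removal_pattern and remove_terms are hypothetical, the terms are assumed to be literal strings rather than regex fragments, and each term is assumed to start and end with a word character so that \b behaves as expected.

```python
import re

def build_removal_pattern(terms):
    """Compile one alternation pattern matching any term as a whole phrase."""
    # Put longer terms first so a longer term wins over a shorter prefix of it.
    ordered = sorted(set(terms), key=len, reverse=True)
    # re.escape keeps characters such as '.' or '+' literal inside the pattern.
    alternation = '|'.join(re.escape(t) for t in ordered)
    # One shared pair of word boundaries around a non-capturing group.
    return re.compile(r'\b(?:' + alternation + r')\b')

def remove_terms(text, terms, replacement=' '):
    """Replace every occurrence of any term with `replacement`."""
    if not terms:
        # An empty alternation would match at every word boundary, so skip it.
        return text
    return build_removal_pattern(terms).sub(replacement, text)

txt = 'Daniel Johnson and Ana Hickman are friends.'
print(remove_terms(txt, ['Daniel Johnson', 'Ana Hickman']))
```

    If the same list is applied to many texts, it is probably worth calling build_removal_pattern once and reusing the compiled pattern, rather than rebuilding it on every call.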
    
