How to extract a list of unique email addresses from a file in Python - Java 学习之路

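I am trying to extract a list of unique email addresses from a .txt file (https://www.py4e.com/code3/mbox.txt) that contains multiple emails. With the program below I can extract the address lines by narrowing the search to 'From:' and 'To:':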

import re
in_file = open('dummy_text_file.txt')
countFromEmail = 0  # initialize the counter before the loop
for line in in_file:
    if re.findall(r'^From:.+@([^\.]*)\.', line):
        countFromEmail = countFromEmail + 1
        print(line)
    if re.findall(r'^To:.+@([^\.]*)\.', line):
        print(line)

However, because many of the email addresses are repeated, this does not give me a unique list. In addition, the printed output looks like this:

To: java-user@lucene.apache.org

From: Adrien Grand <jpountz@gmail.com>

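I would like to list only the actual email addresses, without 'To', 'From' or the angle brackets (<>).

I am not familiar with Python, but my first idea was to extract the bare email addresses, store them somewhere, and then use a for loop to add them to a list.

Any help or a pointer in the right direction would be greatly appreciated.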

3 Answers


To get a list of unique emails, I would look at the following two articles:

https://www.peterbe.com/plog/uniqifiers-benchmark

How do you remove duplicates from a list whilst preserving order?

To parse Adrien Grand <jpountz@gmail.com> into another format, the following link should contain all the information you need.

https://docs.python.org/3.7/library/email.parser.html#module-email.parser

Unfortunately I don't have time to write you an example, but I hope this helps.
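For reference, here is a minimal sketch of how the two ideas in this answer might fit together. It assumes the header values have already been read into a list. The sample values and the helper name unique_in_order are made up for illustration. It uses email.utils.parseaddr from the standard library, which is a simpler relative of the email.parser module linked above, and dict.fromkeys to drop duplicates while keeping first-seen order.

```python
from email.utils import parseaddr


def unique_in_order(items):
    """Drop duplicates while keeping first-seen order.

    Relies on dicts preserving insertion order (Python 3.7+).
    """
    return list(dict.fromkeys(items))


# Hypothetical sample header values, not read from mbox.txt
headers = [
    "Adrien Grand <jpountz@gmail.com>",
    "java-user@lucene.apache.org",
    "Adrien Grand <jpountz@gmail.com>",
]

# parseaddr returns a (display_name, address) tuple; keep the address part
addresses = [parseaddr(header)[1] for header in headers]
print(unique_in_order(addresses))
```

Unlike a plain set, this keeps the addresses in the order in which they first appear in the file.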

Answered on 2024-04-28T19:48:01+08:00

The simplest way is set().

A set contains only unique values.

array = [1, 2, 3, 4, 5, 5, 5]
unique_array = set(array)
print(unique_array)  # {1, 2, 3, 4, 5}
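
The same idea applied to strings, as a small sketch with made-up addresses. A set has no guaranteed order, so sorted() is used here only to get a stable, readable result.

```python
# Hypothetical sample addresses for illustration
emails = ["a@example.com", "b@example.com", "a@example.com"]

# set() removes the duplicate; sorted() turns the set back into an ordered list
unique_emails = sorted(set(emails))
print(unique_emails)  # ['a@example.com', 'b@example.com']
```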

Answered on 2024-04-28T19:48:01+08:00

import re

countFromEmail = 0
unique_emails = set()  # a set keeps only one copy of each value
with open('mbox.txt') as in_file:  # the file is closed automatically
    for line in in_file:
        if re.findall(r'^From:.+@([^\.]*)\.', line):
            countFromEmail += 1
            line = line.replace("From:", "")  # remove the header name
            line = line.strip()  # then trim surrounding whitespace
            unique_emails.add(line)  # add to the set
        elif re.findall(r'^To:.+@([^\.]*)\.', line):
            line = line.replace("To:", "")  # remove the header name
            line = line.strip()  # then trim surrounding whitespace
            unique_emails.add(line)  # add to the set
for email in unique_emails:
    print(email)  # print() is a function in Python 3

You can achieve this result in several different ways. Using a set is one of them, because the elements of a set are unique (any duplicate is discarded on insertion).
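
Note that stripping only the header name leaves display names and angle brackets in place, so "Adrien Grand <jpountz@gmail.com>" is stored as a whole. One possible variant, sketched below, pulls out just the address with a regular expression before adding it to the set. The pattern EMAIL_RE and the function name extract_unique_emails are illustrative assumptions. The pattern is deliberately simplified and does not cover every valid address form.

```python
import re

# Simplified pattern (an assumption): local part, '@', then a dotted domain.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_unique_emails(lines):
    """Collect bare addresses from lines that start with 'From:' or 'To:'."""
    unique = set()
    for line in lines:
        if line.startswith(("From:", "To:")):
            # findall returns every full match on the line
            unique.update(EMAIL_RE.findall(line))
    return unique


# Hypothetical sample lines in the style of mbox.txt
sample = [
    "From: Adrien Grand <jpountz@gmail.com>\n",
    "To: java-user@lucene.apache.org\n",
    "From: Adrien Grand <jpountz@gmail.com>\n",
    "Subject: not an address line\n",
]
print(sorted(extract_unique_emails(sample)))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly in place of the sample list.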

I have edited and commented your code for you. Hope this helps. Cheers! :)

-Sunjun

Answered on 2024-04-28T19:48:01+08:00
