Matching messages from a WhatsApp log in Python

I want to extract every message from a WhatsApp log by matching a pattern. The messages have the following form:

A single-line message:

[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem

A long, multi-line message:

[19.09.17, 19:54:59] Joe: > mean(aging$Population)
[1] 1593.577
Is what I get as solution

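I was able to split a message into date, time, sender and text, but only by first reading the text file line by line and then splitting those lines on different delimiters. That does not work for messages that span several lines, though. Now I am trying regular expressions instead: I can get the date and the time, but I am still struggling to extend the message pattern across multiple lines.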

import re

# regular expressions to match different parts of the log
date = r'\[\d+\.\d+\.\d+,'
time = r'\d+:\d+:\d+\]'
# my attempt at the message: everything from a colon up to a "["
message = r':\s+.+\['
message = re.compile(message, re.DOTALL)

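Note that my log comes from a German WhatsApp, which is why the date format looks a bit different. I also end the date pattern on , and the time pattern on ] to make sure I don't accidentally get matches from inside a message.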

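I want to do the same for the message pattern and end it on [, which is usually the start of the next line (but that may not be very robust if a [ can show up on a new line inside a message).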
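To make that concern concrete, here is a small sketch. The first three lines of the sample string are the multi-line message from above; the trailing "Marc: ok" message and all variable names are made up for illustration. It suggests that a lazy pattern ending on a bare [ stops at the [1] inside the message, whereas requiring a full timestamp after the [ does not.

```python
import re

log = ("[19.09.17, 19:54:59] Joe: > mean(aging$Population)\n"
       "[1] 1593.577\n"
       "Is what I get as solution\n"
       "[19.09.17, 19:55:10] Marc: ok")  # hypothetical follow-up message

# Stop at the first "[" after the sender: cut short by the "[1]" in the R output.
naive = re.search(r"Joe: (.+?)\[", log, re.DOTALL).group(1)

# Stop only where a full "[dd.mm.yy, hh:mm:ss]" header follows.
strict = re.search(
    r"Joe: (.+?)(?=\[\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2}\])", log, re.DOTALL
).group(1)

print(repr(naive))
print(repr(strict))
```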

There is probably a much simpler solution, but (as you can see) I am really bad at regular expressions.

1 Answer

Here is a general regex and a solution using re.findall:

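import re

msg = ("[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem\n"
       "[19.09.17, 19:54:59] Joe: > mean(aging$Population)\n"
       "[1] 1593.577\nIs what I get as solution")

# DOTALL lets . cross newlines; without MULTILINE, $ only matches at the end of the input
results = re.findall(r"\[(\d{2}\.\d{2}\.\d{2}), (\d{2}:\d{2}:\d{2})\] ([^:]+): (.*?)(?=\[\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2}:\d{2}\]|$)", msg, re.DOTALL)

for item in results:
    print("date: " + item[0])
    print("time: " + item[1])
    print("sender: " + item[2])
    # strip the newline that separates one message from the next
    print("message: " + item[3].strip())

date: 19.09.17
time: 19:54:48
sender: Marc
message: If the mean is not in the thousands, there's the problem
date: 19.09.17
time: 19:54:59
sender: Joe
message: > mean(aging$Population)
[1] 1593.577
Is what I get as solution

The pattern looks long and bloated, but it simply mirrors the structure of the WhatsApp messages you expect. Note that it uses DOTALL mode, which is required for messages that may span several lines because it lets . match newlines. Do not combine it with MULTILINE mode here: that makes $ match at the end of every line, so the lazy message group would stop at the first line break and multi-line messages would be truncated. The pattern stops consuming a given message when it sees the start of the next message (specifically, its full timestamp, so a stray [1] inside a message is not mistaken for one) or the end of the input.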
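For large logs, an alternative to running one regex over the whole text is to walk the file line by line and treat any line that does not start with a timestamp header as a continuation of the previous message. The following is only a sketch of that idea, not part of the answer above: the names HEADER, Message and parse_log are invented for illustration, and the format string %d.%m.%y assumes the German-style timestamps shown in the question.

```python
import re
from dataclasses import dataclass
from datetime import datetime

# Header of a new message: "[dd.mm.yy, hh:mm:ss] sender: first line of text"
HEADER = re.compile(
    r"\[(?P<date>\d{2}\.\d{2}\.\d{2}), (?P<time>\d{2}:\d{2}:\d{2})\] "
    r"(?P<sender>[^:]+): (?P<text>.*)"
)

@dataclass
class Message:
    timestamp: datetime
    sender: str
    text: str

def parse_log(lines):
    """Group log lines into messages (hypothetical helper)."""
    messages = []
    for line in lines:
        line = line.rstrip("\n")
        m = HEADER.match(line)
        if m:
            # Assumed date format: day.month.two-digit year
            stamp = datetime.strptime(m["date"] + " " + m["time"], "%d.%m.%y %H:%M:%S")
            messages.append(Message(stamp, m["sender"], m["text"]))
        elif messages:
            # No header: this line continues the previous message.
            # Lines that appear before the first header are ignored.
            messages[-1].text += "\n" + line
    return messages

log = """[19.09.17, 19:54:48] Marc: If the mean is not in the thousands, there's the problem
[19.09.17, 19:54:59] Joe: > mean(aging$Population)
[1] 1593.577
Is what I get as solution"""

for msg in parse_log(log.splitlines()):
    print(msg.timestamp, msg.sender, repr(msg.text))
```

Because parse_log accepts any iterable of lines, an open file object can be passed in directly, so the whole log never has to be held in a single string.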

Answered 2024-05-17T17:28:03+08:00
