Question

I'm trying to run some analysis on my WhatsApp chat history using Python 3. The file I'm working with is a UTF-8 .txt file in which every line has the following format:

dd/mm/yyyy, hh:mm - Sender: Message Contents

A sample line from the file:

18/02/2018, 16:47 - John Smith: Hello World!

I have a regular expression that correctly extracts the date-time, the sender and the message:

import re
import pandas as pd

# Open the WhatsApp .txt export and read its contents
with open(file, 'r', encoding='utf-8') as chat:
    chatText = chat.read()

# Extract (date_time, sender, msg) from each line
messages = re.findall(r'(\d+/\d+/\d+, \d+:\d+) - (.*): (.*)\n', chatText)
conversation = pd.DataFrame(messages, columns=['date_time', 'sender', 'msg'])

However, this approach breaks when Message Contents contains a colon (:). For example:

18/02/2018, 16:47 - John Smith: You should take one of the following buses: 33, 159 or 263.

In this case, the regular expression captures the Sender group as John Smith: You should take one of the following buses.
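The behaviour can be reproduced in isolation with the example line alone. This is only a minimal sketch: the variable names `line` and `greedy` are made up for illustration, and the pattern is the same one as above written as a raw string.

```python
import re

# The problematic example line, with the trailing newline the pattern expects
line = ("18/02/2018, 16:47 - John Smith: You should take one of the "
        "following buses: 33, 159 or 263.\n")

# Same pattern as in the question; the sender group (.*) is greedy,
# so it runs up to the LAST ': ' on the line
greedy = re.findall(r'(\d+/\d+/\d+, \d+:\d+) - (.*): (.*)\n', line)

date_time, sender, msg = greedy[0]
print(sender)  # John Smith: You should take one of the following buses
print(msg)     # 33, 159 or 263.
```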

Can I change my regular expression so that it handles messages that contain colons?

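Important: I don't want to restrict the format of Sender. In some conversations the Sender may not be in the form firstname lastname; it could be firstname, firstname middlename lastname, or some other format, including digits or special characters, such as an international mobile number.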
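One direction that might work without restricting the Sender format is to make the sender group non-greedy, so that it stops at the first ": " after the dash instead of the last. The sketch below is not a confirmed solution. It assumes the Sender itself never contains a colon followed by a space and that each message sits on a single line. The names `LINE_RE` and `chat_text`, the second sample line and the phone number in it are invented for illustration.

```python
import re

# Non-greedy sender group (.*?) stops at the FIRST ': ' after ' - ';
# the message group (.*) then takes the rest of the line, colons included.
# re.MULTILINE lets ^ and $ anchor at every line of the chat text.
LINE_RE = re.compile(r'^(\d+/\d+/\d+, \d+:\d+) - (.*?): (.*)$', re.MULTILINE)

# Hypothetical sample data: one ordinary line and one with a colon in the message
chat_text = (
    "18/02/2018, 16:47 - John Smith: Hello World!\n"
    "18/02/2018, 16:48 - +44 7700 900123: You should take one of the "
    "following buses: 33, 159 or 263.\n"
)

messages = LINE_RE.findall(chat_text)
for date_time, sender, msg in messages:
    print(repr(sender), '->', repr(msg))
```

The resulting list of tuples has the same shape as before, so it could be passed to pd.DataFrame with the same columns argument.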

Find everything after a colon, including other colons, in a match group - Regex, Python3

0 answers: