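I am trying to export a WhatsApp chat into a pandas DataFrame so that I can analyse it, but I end up with an empty DataFrame.
Below is a small sample from the chat log file chat.txt: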
07/02/19, 3:08 pm - Messages to this group are now secured with end-to-end encryption. Tap for more info.
22/01/19, 3:27 pm - kai Sir created group "Weekday batch 201901"
07/02/19, 3:08 pm - kai Sir added you
07/02/19, 3:08 pm - kai Sir removed +91 85949 03087
07/02/19, 3:08 pm - kai Sir changed the subject from "Weekday batch 201901" to "Weekday batch 201902"
07/02/19, 3:09 pm - kai Sir: Hi All this is weekday batch staring from 11th Feb from morning 7.30 am to 10 am
My code:
import pandas as pd
import re
import itertools

def parse_file(text_file):
    # Convert WhatsApp chat log text file to a Pandas dataframe.
    # some regex to account for messages taking up multiple lines
    pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)
    with open(text_file, encoding='latin1') as f:
        data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(f.read())]

    sender = []; message = []; datetime = []
    for row in data:
        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])
        # sender is between am/pm, dash and colon
        try:
            s = re.search('m - (.*?):', row).group(1)
            sender.append(s)
        except:
            sender.append('')
        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except:
            message.append('')

    df = pd.DataFrame(zip(datetime, sender, message), columns=['timestamp', 'sender', 'message'])
    df['timestamp'] = pd.to_datetime(df.timestamp, format='%d/%m/%Y, %I:%M %p')
    # remove events not associated with a sender
    df = df[df.sender != ''].reset_index(drop=True)
    return df

df = parse_file(r"C:\Users\RASHMI\Desktop\python_full\chat.txt")
My output is below:
In [17]: df
Out[17]:
Empty DataFrame
Columns: [timestamp, sender, message]
Index: []
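
One thing that stands out when comparing the sample with the code: the sample dates use a two-digit year (07/02/19), while the regex and the `%Y` format both expect four digits, so the pattern may simply never match. The sketch below is only an illustration of that suspicion, not a confirmed fix. `SAMPLE`, `two_digit` and `parse_rows` are hypothetical names, and it assumes every message starts with dd/mm/yy followed by a comma and that system events contain no ": ". It uses only the standard library so that the matching step can be checked in isolation.

```python
import re
from datetime import datetime

# Two lines from the chat sample, the second one wrapped onto a new line
# to mimic a multi-line message (the wrapping is an assumption for the demo).
SAMPLE = """07/02/19, 3:08 pm - kai Sir added you
07/02/19, 3:09 pm - kai Sir: Hi All this is weekday batch staring from 11th Feb
from morning 7.30 am to 10 am"""

# Pattern equivalent to the one in the question: it needs a four-digit year,
# so "07/02/19" is never matched.
four_digit = re.compile(r'^(\d\d/\d\d/\d\d\d\d.*?)(?=^\d\d/\d\d/\d\d\d\d|\Z)', re.S | re.M)

# Adjusted pattern (assumption: messages start with dd/mm/yy and a comma).
two_digit = re.compile(r'^(\d\d/\d\d/\d\d,.*?)(?=^\d\d/\d\d/\d\d,|\Z)', re.S | re.M)


def parse_rows(text):
    """Return (timestamp, sender, message) tuples for messages that have a sender."""
    rows = []
    for m in two_digit.finditer(text):
        # Join continuation lines of a multi-line message into one row.
        row = m.group(1).strip().replace('\n', ' ')
        stamp, _, rest = row.partition(' - ')
        sender, sep, message = rest.partition(': ')
        if not sep:
            continue  # no "sender: " part, so treat it as a system event
        # %y (lowercase) parses the two-digit year.
        rows.append((datetime.strptime(stamp, '%d/%m/%y, %I:%M %p'), sender, message))
    return rows


print(len(list(four_digit.finditer(SAMPLE))))  # matches found by the four-digit pattern
print(parse_rows(SAMPLE))
```

If this is indeed the cause, the rows could then be passed to `pd.DataFrame(rows, columns=['timestamp', 'sender', 'message'])`, or the original function could keep its structure and only switch to the two-digit pattern and `format='%d/%m/%y, %I:%M %p'`.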