I'm trying to parse a WhatsApp chat log and organize it into a DataFrame. I managed to split it into date, time, name, and message. However, when I create the DataFrame, some messages with line breaks are continuations of the previous message and end up in the "Date" column. I'd like to append them to the previous message's cell.
Here's what the raw .txt file looks like (I made up some text to hide the real messages):
11/28/17, 10:00 AM - Bob: Lorem ipsum dolor sit amet
11/28/17, 10:00 AM - Marley: Yes!
11/28/17, 10:00 AM - Marley: consectetur adipiscing elit
11/28/17, 10:00 AM - Bob: Barely dude. BARELY
11/28/17, 10:01 AM - Bob: sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
11/28/17, 10:14 AM - Marley: Ut enim ad minim veniam
11/28/17, 10:20 AM - Marley: quis nostrud exercitation
Duis aute irure dolor in
11/28/17, 10:31 AM - Bob: Hahaha proud
11/28/17, 10:31 AM - Bob: Can't imagine
As you can see, the message at 11/28/17, 10:20 AM spans two lines. I want to append the extra message line to the previous line's message, so that when I convert to a DataFrame everything lands in the correct column. Here's my code so far:
import pandas as pd

with open('whatsapp.txt', encoding="utf8") as f:
    mylist = list(f)

df = pd.DataFrame(mylist)
df = df[0].str.split(r'[,-]', 2, expand=True)
df = df.rename(columns={0:"Date",1:"Time",2:"Name"})
df = df.replace('\n','', regex=True)
df[['Name','Message']] = df['Name'].str.split(':',1,expand=True)
My logic would be to look for list elements that don't start with '\d{1,2}/' before creating the DataFrame, and append them to the end of the previous element. Any ideas on how to do this?
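That merging idea can be sketched before the DataFrame step by appending continuation lines to the last collected element (a minimal sketch on an in-memory sample rather than the full file):

```python
import re

# Sample lines as they would come out of list(f), continuation included
raw = [
    "11/28/17, 10:20 AM - Marley: quis nostrud exercitation\n",
    "Duis aute irure dolor in\n",
    "11/28/17, 10:31 AM - Bob: Hahaha proud\n",
]

merged = []
for line in raw:
    if re.match(r"\d{1,2}/", line):      # a new message starts with a date
        merged.append(line.rstrip("\n"))
    else:                                # continuation: append to previous message
        merged[-1] += " " + line.rstrip("\n")

print(merged)
```

Indexing with merged[-1] sidesteps the bookkeeping of matching enumerate indices against a list that grows more slowly than the file.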
Answer 0 (score: 1)
Try this, it works for me.
from io import StringIO
import itertools
import re
import pandas as pd
a ="""11/28/17, 10:00 AM - Bob: Lorem ipsum dolor sit amet
11/28/17, 10:00 AM - Marley: Yes!
11/28/17, 10:00 AM - Marley: consectetur adipiscing elit
11/28/17, 10:00 AM - Bob: Barely dude. BARELY
11/28/17, 10:01 AM - Bob: sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
11/28/17, 10:14 AM - Marley: Ut enim ad minim veniam
11/28/17, 10:20 AM - Marley: quis nostrud exercitation
Duis aute irure dolor in
11/28/17, 10:31 AM - Bob: Hahaha proud
11/28/17, 10:31 AM - Bob: Can't imagine"""
text = StringIO(a)

lines = []
for i, line in enumerate(text):
    if re.match(r"^\d+.*$", line):
        lines.append(line.strip('\n'))
    else:
        lines[i-1] = lines[i-1] + ' ' + line.strip('\n')
date, time, name, message = [], [], [], []
for item in lines:
    x = list(itertools.chain.from_iterable(i.split(",") for i in item.split("-")))
    date.append(x[0])
    time.append(x[1])
    x2 = x[2].split(':')
    name.append(x2[0])
    message.append(x2[1])
df = pd.DataFrame({'date': date, 'time':time, 'name': name, 'message': message})
pd.options.display.max_colwidth = 200
df
date time name message
0 11/28/17 10:00 AM Bob Lorem ipsum dolor sit amet
1 11/28/17 10:00 AM Marley Yes!
2 11/28/17 10:00 AM Marley consectetur adipiscing elit
3 11/28/17 10:00 AM Bob Barely dude. BARELY
4 11/28/17 10:01 AM Bob sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
5 11/28/17 10:14 AM Marley Ut enim ad minim veniam
6 11/28/17 10:20 AM Marley quis nostrud exercitation Duis aute irure dolor in
7 11/28/17 10:31 AM Bob Hahaha proud
8 11/28/17 10:31 AM Bob Can't imagine
If you read the file with open, the for loop changes to:
with open('whatsapp.txt') as f:
    for i, line in enumerate(f):
        if re.match(r"^\d+.*$", line):
            lines.append(line.strip('\n'))
        else:
            lines[i-1] = lines[i-1] + line.strip('\n')
Thanks to jDo's comment. It's better to use try/except with dateutil to check whether the first characters of the line are in a date format.
from dateutil.parser import parse

lines = []
with open('whatsapp.txt') as f:
    for i, line in enumerate(f):
        try:
            parse(line[:8])
            lines.append(line.strip('\n'))
        except:
            lines[i-1] = lines[i-1] + ' ' + line.strip('\n')
Answer 1 (score: 1)
Here is a short solution using more-itertools.
from fnmatch import fnmatch
from more_itertools import split_before
with open('whatever_file.txt', 'rt') as infile:
    for group in split_before(infile, lambda s: fnmatch(s, '*/*, *:* * - *')):
        print(group)
The output is:
['11/28/17, 10:00 AM - Bob: Lorem ipsum dolor sit amet\n']
['11/28/17, 10:00 AM - Marley: Yes!\n']
['11/28/17, 10:00 AM - Marley: consectetur adipiscing elit\n']
['11/28/17, 10:00 AM - Bob: Barely dude. BARELY\n']
['11/28/17, 10:01 AM - Bob: sed do eiusmod tempor incididunt ut labore et dolore magna aliqua\n']
['11/28/17, 10:14 AM - Marley: Ut enim ad minim veniam\n']
['11/28/17, 10:20 AM - Marley: quis nostrud exercitation\n', 'Duis aute irure dolor in\n']
['11/28/17, 10:31 AM - Bob: Hahaha proud\n']
["11/28/17, 10:31 AM - Bob: Can't imagine\n"]
This works by splitting the iterable (here, the lines of the file) at every line that starts with something that looks like a date. I used fnmatch for this, but you could also use a function based on datetime.strptime.
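A strptime-based predicate for split_before might look like this (the slice length and format string are assumptions based on the sample timestamps, which all use two-digit fields):

```python
from datetime import datetime

def starts_with_timestamp(line):
    # True if the line begins with an "MM/DD/YY, HH:MM AM" style stamp
    try:
        datetime.strptime(line[:18].rstrip(), "%m/%d/%y, %I:%M %p")
        return True
    except ValueError:
        return False

print(starts_with_timestamp("11/28/17, 10:20 AM - Marley: quis nostrud exercitation"))  # True
print(starts_with_timestamp("Duis aute irure dolor in"))  # False
```

Unlike the fnmatch glob, this rejects lines whose prefix merely has the right punctuation but is not a parseable date and time.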
Answer 2 (score: 0)
Khalil's post would be perfect if there were only one broken line in the data, but I assume the OP's data is longer than the example. When you index with lines[i-1], you're assuming only one message was ever truncated. You have to count how many breaks have occurred to get the correct index into the lines list, so that each broken message is appended to its original message. I count the number of breaks with a counter j; see the edit below. I also used the OP's method of splitting the data into columns, since I find it more pythonic.
import pandas as pd
import re

lines = []  # raw text lines
j = 0
with open('whatsapp.txt', encoding="utf8", mode="r") as f:
    for i, line in enumerate(f):
        if re.match(r"^\d{1,2}/.*$", line):
            lines.append(line.strip('\n'))
        else:
            j += 1
            lines[i-j] = lines[i-j] + ' ' + line.strip('\n')

#for line in lines:
#    print(line)

df = pd.DataFrame(lines)
df = df[0].str.split(r'[,-]', 2, expand=True)
df = df.rename(columns={0:"Date",1:"Time",2:"Name"})
df = df.replace('\n','', regex=True)
df[['Name','Message']] = df['Name'].str.split(':',1,expand=True)
#df[34:45]
print(df)
Answer 3 (score: 0)
I tested most of the other solutions posted here and found that none of them can handle continuation lines that start with the same pattern as an actual log line, e.g. continuation lines that begin with a timestamp. I think this can be improved by matching everything after a line break that does not match the log-line format:

(\d{2}/\d{2}/\d{2}, \d{2}:\d{2} (?:AM|PM)) - (.*): (.*)
I added this horrible line to your data to confirm that the solution below works:
11/28/17, 10:14 AM - Marley: THIS LINE CONTAINS A LINE BREAK FOLLOWED BY A TIMESTAMP AND SOME ARBITRARY TEXT BUT SHOULD *NOT* BE SEEN AS TWO LOG ENTRIES
01/01/01, 11:11 AM - CONTINUATION OF THE LINE ABOVE.
Code:
import re
import pandas as pd
log = """11/28/17, 10:00 AM - Bob: Lorem ipsum dolor sit amet
11/28/17, 10:00 AM - Marley: Yes!
11/28/17, 10:00 AM - Marley: consectetur adipiscing elit
11/28/17, 10:00 AM - Bob: Barely dude. BARELY
11/28/17, 10:01 AM - Bob: sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
11/28/17, 10:14 AM - Marley: Ut enim ad minim veniam
11/28/17, 10:14 AM - Marley: THIS LINE CONTAINS A LINE BREAK FOLLOWED BY A TIMESTAMP AND SOME ARBITRARY TEXT BUT SHOULD *NOT* BE SEEN AS TWO LOG ENTRIES
01/01/01, 11:11 AM - CONTINUATION OF THE LINE ABOVE.
11/28/17, 10:20 AM - Marley: quis nostrud exercitation
Duis aute irure dolor in
11/28/17, 10:31 AM - Bob: Hahaha proud
11/28/17, 10:31 AM - Bob: Can't imagine"""
pat = re.compile(r"(?P<timestamp>\d{2}/\d{2}/\d{2}, \d{2}:\d{2} (?:AM|PM)) - (?P<author>.*): (?P<message>(.*\n(?!(\d{2}/\d{2}/\d{2}, \d{2}:\d{2} (?:AM|PM)) - (.*): (.*)).*)|.*)", re.MULTILINE)
df = pd.DataFrame([match.groupdict() for match in pat.finditer(log)])
Python 3.5 output:
>>> df
author message \
0 Bob Lorem ipsum dolor sit amet
1 Marley Yes!
2 Marley consectetur adipiscing elit
3 Bob Barely dude. BARELY
4 Bob sed do eiusmod tempor incididunt ut labore et ...
5 Marley Ut enim ad minim veniam
6 Marley THIS LINE CONTAINS A LINE BREAK FOLLOWED BY A ...
7 Marley quis nostrud exercitation\nDuis aute irure dol...
8 Bob Hahaha proud
9 Bob Can't imagine
timestamp
0 11/28/17, 10:00 AM
1 11/28/17, 10:00 AM
2 11/28/17, 10:00 AM
3 11/28/17, 10:00 AM
4 11/28/17, 10:01 AM
5 11/28/17, 10:14 AM
6 11/28/17, 10:14 AM
7 11/28/17, 10:20 AM
8 11/28/17, 10:31 AM
9 11/28/17, 10:31 AM
>>>
If some log lines contain more than one line break, you can use this iterative approach instead:
import re
import pandas as pd
log = """11/28/17, 10:00 AM - Bob: Lorem ipsum dolor sit amet
11/28/17, 10:00 AM - Marley: Yes!
11/28/17, 10:00 AM - Marley: consectetur adipiscing elit
11/28/17, 10:00 AM - Bob: Barely dude. BARELY
11/28/17, 10:01 AM - Bob: sed do eiusmod tempor incididunt ut labore et dolore magna aliqua
11/28/17, 10:14 AM - Marley: Ut enim ad minim veniam
11/28/17, 10:14 AM - Marley: THIS LINE CONTAINS A LINE BREAK FOLLOWED BY A TIMESTAMP AND SOME ARBITRARY TEXT BUT SHOULD *NOT* BE SEEN AS TWO LOG ENTRIES
01/01/01, 11:11 AM - CONTINUATION OF THE LINE ABOVE.
01/01/01, 11:11 AM - CONTINUATION OF THE LINE ABOVE.
01/01/01, 11:11 AM - CONTINUATION OF THE LINE ABOVE.
11/28/17, 10:20 AM - Marley: quis nostrud exercitation
Duis aute irure dolor in
11/28/17, 10:31 AM - Bob: Hahaha proud
11/28/17, 10:31 AM - Bob: Can't imagine"""
pat = re.compile(r"(?P<timestamp>\d{2}/\d{2}/\d{2}, \d{2}:\d{2} (?:AM|PM)) - (?P<author>.*): (?P<message>.*)")

lines = []
last_line = {}
for line in log.split("\n"):
    line = line.strip()
    match = pat.match(line)
    if match:
        last_line = match.groupdict()
        lines.append(last_line)
    elif last_line:
        last_line["message"] += " {}".format(line)
df = pd.DataFrame(lines)
You could improve the regex by restricting some of its parts further. For example, \d{2}/\d{2}/\d{2} matches 99/99/99, which is obviously not a valid date.
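A sketch of such a tightened pattern (the exact character ranges are my assumption; it still accepts impossible dates like 02/31, so full validation would need datetime):

```python
import re

# Months restricted to 01-12, days to 01-31, and a non-capturing
# AM/PM group so that a bare "PM" can no longer match on its own
ts = re.compile(r"(?:0[1-9]|1[0-2])/(?:0[1-9]|[12]\d|3[01])/\d{2}, \d{1,2}:\d{2} (?:AM|PM)")

print(bool(ts.match("11/28/17, 10:20 AM")))  # True
print(bool(ts.match("99/99/99, 10:20 AM")))  # False
```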