I am trying to write a regular expression for the following data:
12/07/16, 2:18 AM - ABC1: Anyway... this is ... abc: !?
:) Yea, this is next line - Multi line statements
12/07/16, 2:19 AM - User27: John, Bob, Him, I, May,2 ,3 100... multiple values
10/07/16, 2:41 PM - ABC1: Singe line statements
10/07/16, 2:41 PM - ABC1: Good
10/07/16, 2:45 PM - ABC1: Emojis statements, multiline, different languages
My regex:
(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s
The regex above matches correctly up to:
12/07/16, 2:18 AM -
I then tried to handle the last part (username and message):
(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s(^[A-Z][0-9]$)
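It fails to capture either the username or the message.
I am struggling to write a regex for the message part, because it can contain line breaks, spaces, emojis, and different languages, and I do not know the length of the USERNAME or the MESSAGE in advance.
I am using Debugger to validate my regex, along with this cheatsheet.
I am open to any improvements and suggestions. Thanks!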
Answer 0 (score: 0)
This is a modification of your regex:
(?s)(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s(User\d+):\s*(.*?)(?=(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s|\Z)
Regex breakdown
(?s) # Dot also matches newlines
(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s # Same as your regex
(User\d+):\s* # Username followed by a colon (matches only names of the form User<digits>)
(.*?) # Lazily match the message until one of the conditions below holds
(?=
(\d{1,2}\/\d{2}/\d{2})\,\s(\d{1,2}\:\d{2}\s\w{2})\s\-\s # the next date-time header is found
|
\Z # or the end of the string is reached
)
Edit: as mentioned in the comments, this requires the whole file to be read into memory as a single string.
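As a rough sketch of how this could be applied in Python, assuming the whole chat log is held in one string: the names `text`, `HEADER` and `ENTRY_RE` are illustrative only, and the username group is widened from `User\d+` to `[^:\n]+?` so that names such as ABC1 also match. That substitution is my assumption about the log format, not part of the regex above.

```python
import re

# Hypothetical sample; in practice `text` would be the whole file read into one string.
text = (
    "12/07/16, 2:18 AM - ABC1: Anyway... this is ... abc: !?\n"
    ":) Yea, this is next line - Multi line statements\n"
    "12/07/16, 2:19 AM - User27: John, Bob, Him, I, May,2 ,3 100... multiple values\n"
    "10/07/16, 2:41 PM - ABC1: Singe line statements\n"
)

# Date-time prefix that starts every entry, e.g. "12/07/16, 2:18 AM - ".
HEADER = r"\d{1,2}/\d{2}/\d{2},\s\d{1,2}:\d{2}\s\w{2}\s-\s"

ENTRY_RE = re.compile(
    r"(\d{1,2}/\d{2}/\d{2}),\s(\d{1,2}:\d{2}\s\w{2})\s-\s"  # date and time
    r"([^:\n]+?):\s*"                # username (assumed: any text up to the first colon)
    r"(.*?)"                         # message, matched lazily
    r"(?=" + HEADER + r"|\Z)",       # stop before the next header or at end of string
    re.DOTALL,                       # same effect as the inline (?s) flag
)

entries = [m.groups() for m in ENTRY_RE.finditer(text)]
for date, time, user, message in entries:
    print(date, time, user, repr(message.rstrip()))
```

Each tuple in `entries` holds the date, time, username and raw message. The message keeps its trailing newline, hence the `rstrip()` when printing. A message that itself contains text shaped like a date-time header would be split at that point, so this is only as reliable as the header pattern.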
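Answer 1 (score: 0)
You don't have to read the whole file into memory. You can read the file line by line and check whether each line starts with the date-time pattern. Lines that do not start with the pattern are appended to a temporary string; when another line matching the date-time pattern is found, or the end of the file is reached, the temporary string is appended to the result (or written to another file, a data frame, etc.):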
import re

values = []             # completed entries
start_matching = False  # becomes True once the first date-time line is seen
val = ""                # entry currently being collected
r = re.compile(r"\d{1,2}/\d{2}/\d{2},\s\d{1,2}:\d{2}\s\w{2}\s-\s")
with open('path/to/file', 'r') as f:
    for line in f:
        if r.match(line.strip()):
            start_matching = True
            if val:
                values.append(val.rstrip())  # strip trailing whitespace and store the finished entry
                val = ""
            val += line
        else:
            if start_matching:
                val += line
if val:
    values.append(val.rstrip())  # strip trailing whitespace and store the last entry
If you then run
for v in values:
    print(v)
    print("-------")
the output will be
12/07/16, 2:18 AM - ABC1: Anyway... this is ... abc: !?
:) Yea, this is next line - Multi line statements
-------
12/07/16, 2:19 AM - User27: John, Bob, Him, I, May,2 ,3 100... multiple values
-------
10/07/16, 2:41 PM - ABC1: Singe line statements
-------
10/07/16, 2:41 PM - ABC1: Good
-------
10/07/16, 2:45 PM - ABC1: Emojis statements, multiline, different languages
-------
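If the individual fields are needed as well, the line-by-line approach can be extended to split each collected entry into date, time, username and message. The following is only a sketch under assumptions: `iter_entries`, `split_entry` and `ENTRY_RE` are hypothetical names, the username is taken to be any text up to the first colon, and an `io.StringIO` object stands in for the open file.

```python
import io
import re

# Date-time header plus groups for the fields. The username group
# ([^:\n]+?) is an assumption: any text up to the first colon.
ENTRY_RE = re.compile(
    r"(\d{1,2}/\d{2}/\d{2}),\s(\d{1,2}:\d{2}\s\w{2})\s-\s([^:\n]+?):\s?(.*)",
    re.DOTALL,  # let the message group span continuation lines
)


def split_entry(entry):
    """Split one complete entry into (date, time, user, message)."""
    date, time, user, message = ENTRY_RE.match(entry).groups()
    return date, time, user, message.rstrip()


def iter_entries(lines):
    """Group lines into entries and yield them one at a time."""
    current = None  # None until the first header line is seen
    for line in lines:
        if ENTRY_RE.match(line):
            if current is not None:
                yield split_entry(current)
            current = line
        elif current is not None:
            current += line  # continuation of a multi-line message
    if current is not None:
        yield split_entry(current)


# Hypothetical sample standing in for an open file object.
sample = io.StringIO(
    "12/07/16, 2:18 AM - ABC1: Anyway... this is ... abc: !?\n"
    ":) Yea, this is next line - Multi line statements\n"
    "12/07/16, 2:19 AM - User27: John, Bob, Him, I, May,2 ,3 100... multiple values\n"
    "10/07/16, 2:41 PM - ABC1: Good\n"
)
parsed = list(iter_entries(sample))
for row in parsed:
    print(row)
```

Because `iter_entries` is a generator, only the entry currently being collected is held in memory, which keeps the advantage of not loading the whole file. Lines that carry a date-time prefix but no colon-terminated username are treated as continuation lines here.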