I have a string of unknown length in which a pattern of interest can repeat many times. The string looks like this:
blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
blahblahblahblahblahblahblahblahblahblah
JOHNNYSMITH has entered the above notes on 12/05/2017 14:18 blahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
JOHNNYSMITH has entered the above notes on 12/05/2017 14:19
SARAHJOHNSON has entered the above notes on 12/05/2017 17:45 blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
SARAHJOHNSON has entered the above notes on 12/05/2017 17:46
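I am trying to separate the comments, usernames and dates so that I can build a nicer-looking comment box (with some CSS). Here is what I have so far to split out the username: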
before_keyword, keyword, after_keyword = stringg.partition("has entered the above notes on ")
namedate = before_keyword.split()[-1] + "--" + after_keyword.split()[0] + after_keyword.split()[1]
comment = before_keyword.replace(before_keyword.split()[-1], '').rstrip()
print(comment)
print(namedate)
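This works for the first occurrence of a username entering the above notes. How can I iterate over the string to collect every comment/username/date in it and print each one separately?
Thanks.
Edit: replaced USERNAME2389 with made-up names to show how the names actually appear.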
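Answer 0: (score: 0)
You can loop over the lines, accumulate the text in a placeholder, and add a row to a dataframe each time a username line is hit, so that you end up with a clean, workable dataset. You can also convert the timestamps to datetime objects directly, which makes it easier to analyze by time, date, and so on.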
import re
import pandas as pd
from datetime import datetime
string = """
blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
blahblahblahblahblahblahblahblahblahblah
USERNAME2398 has entered the above notes on 12/05/2017 14:18
blahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
USERNAME2839 has entered the above notes on 12/05/2017 14:19
USERNAME7348 has entered the above notes on 12/05/2017 17:45
blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
USERNAME857 has entered the above notes on 12/05/2017 17:46
"""
# define regex for username matching (unused below; the loop takes the first word instead)
username = re.compile(r'USERNAME.*?\s', re.IGNORECASE)
# define regex for datetime matching
datetime_re = re.compile(r'[0-9]{1,2}/[0-9]{1,2}/(20|19)[0-9]{2}\s[0-9]{1,2}:[0-9]{1,2}')
# collect one dict per comment; DataFrame.append was removed in pandas 2.0
rows = []
# define text placeholder
cur_text = ''
for line in string.split('\n'):
    words = line.split()
    # a username line contains a datetime and starts with an upper-case name;
    # str.isupper() on the whole word ignores digits, so USERNAME2398 passes
    if datetime_re.search(line) and words[0].isupper():
        # pull out username
        cur_user = words[0]
        # pull out datetime
        cur_datetime = datetime_re.search(line).group(0)
        # convert to datetime object
        cur_datetime = datetime.strptime(cur_datetime, '%m/%d/%Y %H:%M')
        # store the row for this comment
        rows.append({'user': cur_user,
                     'datetime': cur_datetime,
                     'text': cur_text.strip()})
        # reinit cur_text
        cur_text = ''
    else:
        # if not a username line, continue appending the commentary for the user
        cur_text += line + '\n'
# build the dataframe once, after all rows have been collected
masterdf = pd.DataFrame(rows, columns=['user', 'datetime', 'text'])
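One thing worth checking before relying on the converted datetimes: the format string '%m/%d/%Y %H:%M' reads the dates month-first, and the sample timestamps alone do not show whether 12/05/2017 is month-first or day-first. The short sketch below only illustrates the difference between the two readings; which one is right depends on the system that produced the notes, which is not stated here.

```python
from datetime import datetime

stamp = "12/05/2017 14:18"  # timestamp copied from the sample text

# Month-first, as in the format string used above: 5 December 2017.
month_first = datetime.strptime(stamp, "%m/%d/%Y %H:%M")
# Day-first alternative: 12 May 2017.
day_first = datetime.strptime(stamp, "%d/%m/%Y %H:%M")

print(month_first.isoformat())  # 2017-12-05T14:18:00
print(day_first.isoformat())    # 2017-05-12T14:18:00
```

If the source turns out to be day-first, swapping the format string for '%d/%m/%Y %H:%M' should be the only change needed.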
Answer 1: (score: 0)
I would do this with a regular expression.
Just loop over every line (a for-each) and test the line against this expression:
(USERNAME\S*) has entered the above notes on (\d{1,2}/\d{1,2}/\d{4}) (\d{1,2}:\d{1,2})
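If the line matches, you get 3 pieces of information (the parenthesized groups): the username, the date and the time. Keep the preceding lines in an array (a buffer), and when a match comes up the buffer holds the comment text for that user.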
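The buffer idea can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `parse_notes`, the dict keys and the sample text are made up for the example, and the username group is loosened from `USERNAME\S*` to `\w+` on the assumption that real names such as JOHNNYSMITH need to match too.

```python
import re

# Expression from the answer, with the username group loosened to \w+
# (an assumption; keep USERNAME\S* if every name carries that prefix).
SIGNATURE = re.compile(
    r"(\w+) has entered the above notes on "
    r"(\d{1,2}/\d{1,2}/\d{4}) (\d{1,2}:\d{1,2})"
)


def parse_notes(text):
    """Return one dict per signature line, using the buffered lines as its comment."""
    entries = []
    buffer = []  # lines seen since the previous signature line
    for line in text.splitlines():
        match = SIGNATURE.search(line)
        if match:
            user, date, time = match.groups()
            # Text before the signature on the same line still belongs to the comment.
            buffer.append(line[:match.start()])
            comment = "\n".join(part.strip() for part in buffer if part.strip())
            entries.append(
                {"user": user, "date": date, "time": time, "comment": comment}
            )
            # Text after the signature on the same line starts the next comment.
            buffer = [line[match.end():]]
        else:
            buffer.append(line)
    return entries


# Hypothetical sample in the same shape as the question's data.
sample = """first comment line one
first comment line two
JOHNNYSMITH has entered the above notes on 12/05/2017 14:18
second comment
SARAHJOHNSON has entered the above notes on 12/05/2017 17:45
"""

entries = parse_notes(sample)
for entry in entries:
    print(entry["user"], entry["date"], entry["time"], repr(entry["comment"]))
```

A signature line with no text before it, such as two signatures in a row, simply yields an entry with an empty comment, which the caller can skip or keep as needed.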
Answer 2: (score: 0)
Bernz's solution works; the code I used is shown below. datawrestler's answer would work too.
import re

for line in stringg.split('\n'):
    # signature line: name, fixed phrase, date, time
    if re.search(r'(\w+) has entered the above notes on (\d{1,2}/\d{1,2}/\d{4}) (\d{1,2}:\d{1,2})', line):
        print(line.split()[0] + "--" + line.split()[-2] + line.split()[-1])
    else:
        print(line)