My data looks like this:
04/07/16, 12:51 AM - User1: Hi
04/07/16, 8:19 PM - User2: Here’s a link for you
https://www.abcd.com/folder/1SyuIUCa10tM37lT0F8Y3D
04/07/16, 8:29 PM - User2: Thanks
Using the code below, I can read each message as a separate line:
data = []
for line in open('/content/drive/My Drive/sample.txt'):
    items = line.rstrip('\r\n').split('\t')  # strip new-line characters and split on column delimiter
    items = [item.strip() for item in items]  # strip extra whitespace off data items
    data.append(items)
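However, I don't want to split at a line break that is followed by a link. For example, lines 3 and 4 are a single message, but they get split because of the newline between them.
Is there a way to avoid splitting when a newline is followed by http?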
Answer 0 (score: 0)
It could probably be optimized, but it works:
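data = []
with open('C:/Users/kavanaghal/python/sample.txt', 'r', encoding='utf-8') as f:
    prev = f.readline().strip()
    while True:
        nxt = f.readline().strip()
        if 'http' in nxt:
            # the next line is a link: join it onto the current message
            data.append(prev + ": " + nxt)
            prev = f.readline().strip()  # strip here too, otherwise the newline is kept
            continue
        if prev:  # skip the empty string left over when the file ends right after a link
            data.append(prev)
        prev = nxt
        if not nxt:  # note: an empty line also ends the loop, not only end of file
            break
print(data)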
>> ['04/07/16, 12:51 AM - User1: Hi',
'04/07/16, 8:19 PM - User2: Here’s a link for you: https://www.abcd.com/folder/1SyuIUCa10tM37lT0F8Y3D',
'04/07/16, 8:29 PM - User2: Thanks']
Answer 1 (score: 0)
One approach is to append the link to the last entry in the list:
import re

data = []
with open('sample.txt', 'r') as f:  # use "with" so the file closes automatically
    for line in f:
        # a line that starts with a URL belongs to the previous message
        if data and re.match(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', line):
            data[-1] += f" {line.strip()}"
        else:
            data.append(line.strip())

for x in data:
    print(x)
Output:
04/07/16, 12:51 AM - User1: Hi
04/07/16, 8:19 PM - User2: Here’s a link for you https://www.abcd.com/folder/1SyuIUCa10tM37lT0F8Y3D
04/07/16, 8:29 PM - User2: Thanks
Credit for the regular expression: Regex to extract URLs...
Answer 2 (score: 0)
You need to read the whole file at once:
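# read the whole file at once; 'sample.txt' is an assumed file name
with open('sample.txt', 'r', encoding='utf-8') as f:
    split = f.read().splitlines()

all_lines = []
for index, line in enumerate(split):
    next_index = index + 1
    if next_index < len(split) and "https" in split[next_index]:
        # the following line is a link: join it to this message and drop it
        line += " " + split[next_index]
        del split[next_index]
    all_lines.append(line)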
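Once the whole file is in memory, the same idea can also be sketched as a single regular-expression substitution that replaces any newline followed by a URL with a space. The helper name `join_link_lines` and the inline sample text are illustrative assumptions, not part of the answer above.

```python
import re

def join_link_lines(text):
    """Join each line that starts with http:// or https:// onto the previous line."""
    # the lookahead keeps the URL itself; only the preceding newline is replaced
    merged = re.sub(r'\r?\n(?=https?://)', ' ', text)
    return merged.splitlines()

# sample text copied from the question; in practice this would be f.read()
sample = (
    "04/07/16, 12:51 AM - User1: Hi\n"
    "04/07/16, 8:19 PM - User2: Here’s a link for you\n"
    "https://www.abcd.com/folder/1SyuIUCa10tM37lT0F8Y3D\n"
    "04/07/16, 8:29 PM - User2: Thanks\n"
)

for message in join_link_lines(sample):
    print(message)
```

This avoids modifying a list while iterating over it, at the cost of only recognising links that sit at the very start of a line.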
Answer 3 (score: 0)
Handle it after the fact:
data = []
for line in open('/content/drive/My Drive/sample.txt'):
    items = [item.strip() for item in line.rstrip('\r\n').split('\t')]
    ### now it is different from your code ###############################
    if data and items[0].startswith('http'):  # "data and" guards against a link on the very first line
        data[-1].append(items[0])
    else:
        data.append(items)
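You may want to use a regular expression or something else instead of .startswith() to get better control over what is matched, but this should get you started.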
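As a rough sketch of the regular-expression variant, the check below only treats a line as a continuation when its first field is a complete URL, so a line that merely begins with the letters "http" is left alone. The names `URL_ONLY` and `merge_rows`, the pattern itself, and the "httpd restarted" sample line are made up for illustration.

```python
import re

# assumed pattern: an http:// or https:// scheme followed by non-space characters
URL_ONLY = re.compile(r'https?://\S+')

def merge_rows(lines):
    """Split each line on tabs, attaching URL-only lines to the previous row."""
    data = []
    for line in lines:
        items = [item.strip() for item in line.rstrip('\r\n').split('\t')]
        # fullmatch: the first field must be a URL and nothing else
        if data and URL_ONLY.fullmatch(items[0]):
            data[-1].append(items[0])
        else:
            data.append(items)
    return data

rows = merge_rows([
    "04/07/16, 8:19 PM - User2: Here’s a link for you\n",
    "https://www.abcd.com/folder/1SyuIUCa10tM37lT0F8Y3D\n",
    "httpd restarted\n",  # starts with "http" but is not a URL
])
print(rows)
```

With .startswith('http') the last sample line would have been appended to the previous message; the stricter pattern keeps it as its own row.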