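I'm new to Python and can't work out how to think about this problem in Python. I have a text file of SMS messages, and I want to capture the multi-line messages in it.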
import fileinput

parsed = {}
for linenum, line in enumerate(fileinput.input()):
    ### Process the input data ###
    try:
        parsed[linenum] = line
    except (KeyError, TypeError, ValueError):
        value = None

###############################################
### Now have dict with value: "data" pairing ##
### for every text message in the archive #####
###############################################
for item in parsed:
    sent_or_rcvd = parsed[item][:4]
    if sent_or_rcvd != "rcvd" and sent_or_rcvd != "sent" and sent_or_rcvd != '--\n':
        ###########################################
        ### Know we have a second or third line ###
        ###########################################
        pass  # this is where I'm stuck
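But this is where I hit a wall. I'm not sure what the best way is to collect the strings I get at this point. I'd love some expert input. I'm using Python 2.7.3 but am happy to move to 3.
Goal: a human-readable file containing the three-line quotes from these text messages.
Sample text: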
12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump
--
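(Yes, before you ask, that is a haiku about poop. I'm trying to capture them from the last 5 years of texting my best friend.)
Ideally this would produce something like: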
Haipu 3 2011-03-19
More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump
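Answer 0 (score: 1)
A good start might look like the following. I'm reading the data from a file named data2, but the read_messages generator will consume lines from any iterable.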
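#!/usr/bin/env python

def read_messages(file_input):
    """Yield each message as a list of its text lines."""
    message = []
    for line in file_input:
        line = line.strip()
        if line[:4].lower() in ('rcvd', 'sent', '--'):
            # A marker line ends the message collected so far.
            if message:
                yield message
                message = []
        else:
            message.append(line)
    # Emit the last message if the input does not end with a marker.
    if message:
        yield message

with open('data2') as file_input:
    for msg in read_messages(file_input):
        print(msg)  # the parenthesized form works on Python 2.7 and 3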
This expects the input to look like this:
sent
message sent away
it has multiple lines
--
rcvd
message received
rcvd
message sent away
it has multiple lines
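The sample in the question uses a different layout (number|timestamp|text, followed by continuation lines and -- separators), so the marker test above would probably need adjusting. Below is a minimal sketch of the same generator idea for that layout. It assumes every message starts with digits, a pipe, and a YYYY-MM-DD HH:MM:SS timestamp. The names HEADER and group_messages are made up for illustration.

```python
import re

# Assumed header layout: digits | YYYY-MM-DD HH:MM:SS | first line of text
HEADER = re.compile(r'^(\d+)\|(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\|(.*)$')


def group_messages(lines):
    """Yield (timestamp, [text lines]) for each message in an iterable of lines."""
    current = None
    for line in lines:
        stripped = line.strip()
        if stripped in ('', '--'):    # blank lines and separators carry no text
            continue
        match = HEADER.match(stripped)
        if match:                     # a new message begins here
            if current is not None:
                yield current
            current = (match.group(2), [match.group(3)])
        elif current is not None:     # continuation of the current message
            current[1].append(stripped)
    if current is not None:           # do not lose the final message
        yield current


sample = [
    "12482904328|2011-03-19 11:13:31|words words words words\n",
    "--\n",
    "12482904328|2011-03-19 15:50:48|More bolder than flow\n",
    "More cumbersome than pleasure;\n",
    "Goodbye rocky dump\n",
    "--\n",
]
# Keep only the messages that span exactly three lines.
three_liners = [m for m in group_messages(sample) if len(m[1]) == 3]
print(three_liners)
```

Filtering on the line count outside the generator keeps the grouping step reusable if messages of other lengths turn out to be interesting too.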
Answer 1 (score: 1)
import time

data = """12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump""".splitlines()


def get_haikus(lines):
    """Yield (timestamp, [text lines]) for every three-line message."""
    haiku = None
    for line in lines:
        line = line.strip()
        if not line or line == '--':  # skip blank lines and '--' separators
            continue
        try:
            ID, timestamp, txt = line.split('|', 2)
            time.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
            int(ID)
        except ValueError:  # split(), strptime() or int() failed: continuation line
            if haiku is not None:  # ignore stray text before the first message
                haiku[1].append(line)
            continue
        # A new message starts here; emit the previous one if it had three lines.
        if haiku and len(haiku[1]) == 3:
            yield haiku
        haiku = (timestamp, [txt])
    if haiku and len(haiku[1]) == 3:  # the last message in the input
        yield haiku


# get_haikus() yields tuples of (timestamp, [lines])
for timestamp, text in get_haikus(data):
    date = timestamp.split()[0]
    print("{d}\n{txt}".format(d=date, txt='\n'.join(text)))
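To get from there to the human-readable file the question asks for, the (timestamp, lines) tuples could be numbered and written out. This is a rough sketch for Python 3, not a definitive layout: the heading format (label, running number, date) is a guess based on the desired output in the question, and format_haikus, write_haikus, and the label default are hypothetical names.

```python
def format_haikus(haikus, label="Haiku"):
    """Return one text block per haiku: a numbered heading, then its lines."""
    blocks = []
    for number, (timestamp, lines) in enumerate(haikus, 1):
        date = timestamp.split()[0]  # keep the date, drop the time of day
        heading = "{0} {1} {2}".format(label, number, date)
        blocks.append("\n".join([heading] + list(lines)))
    # Blank line between haikus, single newline at the end of the file.
    return "\n\n".join(blocks) + "\n"


def write_haikus(path, haikus, label="Haiku"):
    """Write the formatted haikus to the file at path."""
    with open(path, "w") as out:
        out.write(format_haikus(haikus, label))


example = [("2011-03-19 15:50:48",
            ["More bolder than flow",
             "More cumbersome than pleasure;",
             "Goodbye rocky dump"])]
report = format_haikus(example)
print(report)
```

Calling write_haikus("haikus.txt", haikus) with the tuples collected from the archive would then produce the file. The file name is only an example.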