Question

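I'm new to Python and can't work out how to think about this problem in Python. I have a text file of SMS messages, and I want to capture the multi-line entries.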

import fileinput

parsed = {}

# Process the input data: store every line under its line number.
for linenum, line in enumerate(fileinput.input()):
    try:
        parsed[linenum] = line
    except (KeyError, TypeError, ValueError):
        value = None

# Now have a dict with a linenum: "data" pair
# for every line of the archive.
for item in parsed:
    sent_or_rcvd = parsed[item][:4]
    if sent_or_rcvd != "rcvd" and sent_or_rcvd != "sent" and sent_or_rcvd != '--\n':
        # Know we have a second or third line of a message,
        # but not what to do with it.
        pass

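But this is where I hit a wall. I'm not sure what the best way is to collect the strings I have at this point. I'd love some expert input. I'm using Python 2.7.3 but am happy to move to 3.

Goal: a human-readable file containing the three-line quotes from these text messages.

Sample text: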

12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump 
--

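(Yes, before you ask, these are haiku about poop. I'm trying to capture them from the last 5 years of texts to my best friend.)

Ideally this would produce something like: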

Haipu 3   2011-03-19
  More bolder than flow   More cumbersome than pleasure;
  Goodbye rocky dump

Answer 1

A good start might look like the following. I'm reading the data from a file named data2, but the read_messages generator will accept lines from any iterable.

#!/usr/bin/env python

def read_messages(file_input):
    message = []
    for line in file_input:
        line = line.strip()
        if line[:4].lower() in ('rcvd', 'sent', '--'):
            if message:
                yield message
                message = []
        else:
            message.append(line)
    if message:
        yield message


with open('data2') as file_input:
    for msg in read_messages(file_input):
        print(msg)  # parenthesized form runs on Python 2 and 3

This expects input that looks like this:

sent
message sent away
it has multiple lines
--
rcvd
message received
rcvd
message sent away
it has multiple lines
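
The sample in the question has no sent/rcvd marker lines; each message starts with a number|timestamp|text header instead. The same grouping idea could be adapted to that shape roughly as follows. This is only a sketch: the HEADER pattern and the group_messages name are my own assumptions based on the sample lines, not something the real archive is known to follow everywhere.

```python
import re

# Assumed header shape, taken from the sample in the question:
# digits, '|', 'YYYY-MM-DD HH:MM:SS', '|', first line of text.
HEADER = re.compile(r'^(\d+)\|(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\|(.*)$')


def group_messages(lines):
    """Yield (sender, timestamp, [text lines]) for each message."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line == '--':
            continue  # separators and blank lines carry no text
        match = HEADER.match(line)
        if match:
            # A header starts a new message, so emit the previous one.
            if current is not None:
                yield current
            sender, timestamp, text = match.groups()
            current = (sender, timestamp, [text])
        elif current is not None:
            current[2].append(line)  # continuation of the current message
    if current is not None:
        yield current


sample = """12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump
--""".splitlines()

messages = list(group_messages(sample))
# Keep only the messages that span exactly three lines.
three_liners = [m for m in messages if len(m[2]) == 3]
print(three_liners)
```

Matching the header with a regular expression means a continuation line that happens to contain a '|' is still treated as text, as long as it does not also look like a number and timestamp.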

Answer 2

import time

data = """12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump """.splitlines()

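def get_haikus(lines):
    haiku = None
    for line in lines:
        if line.strip() == '--':
            continue  # separator line, not part of any message
        try:
            # maxsplit=2 keeps any '|' inside the message text intact
            ID, timestamp, txt = line.split('|', 2)
            time.strptime(timestamp, "%Y-%m-%d %H:%M:%S")  # validation only
            ID = int(ID)
            if haiku and len(haiku[1]) == 3:
                yield haiku
            haiku = (timestamp, [txt])
        except ValueError:  # split(), time or int conversion failed: continuation line
            if haiku is not None:
                haiku[1].append(line)
    # the last message has no header after it, so check it here
    if haiku and len(haiku[1]) == 3:
        yield haiku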

# get_haikus() yields (timestamp, [lines]) tuples
for haiku in get_haikus(data):
    timestamp, text = haiku
    date = timestamp.split()[0]
    text = '\n'.join(text)
    print("{d}\n{txt}".format(d=date, txt=text))
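
To get closer to the layout asked for in the question (a numbered heading with the date, then the indented lines, written to a file), something like the following might work. It is a minimal sketch: format_haikus, the label default and the haikus.txt file name are all placeholders of mine, and the input list is a hand-written stand-in for the (timestamp, [lines]) tuples produced above.

```python
def format_haikus(haikus, label='Haiku'):
    """Return one numbered, indented text block per (timestamp, lines) pair."""
    blocks = []
    for number, (timestamp, lines) in enumerate(haikus, 1):
        date = timestamp.split()[0]  # keep the date, drop the time
        header = '{0} {1}   {2}'.format(label, number, date)
        body = ['  ' + line.strip() for line in lines]
        blocks.append('\n'.join([header] + body))
    return '\n\n'.join(blocks) + '\n'


# Hand-written stand-in for the (timestamp, [lines]) tuples.
found = [('2011-03-19 15:50:48',
          ['More bolder than flow',
           'More cumbersome than pleasure;',
           'Goodbye rocky dump '])]

report = format_haikus(found)
print(report)

# 'haikus.txt' is a placeholder file name.
with open('haikus.txt', 'w') as out:
    out.write(report)
```

Building the whole report as a string first keeps the formatting separate from the file handling, so the same text can be printed, written out, or checked by hand.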

Determining the pattern of lines in Python
