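Identifying a pattern of lines in Python

Time: 2014-04-05 13:05:22

Tags: python text

I'm new to Python and can't work out how to think about this problem in Python. I have a text file of SMS messages, and I want to capture the messages that span multiple lines.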

import fileinput

parsed = {}

for linenum, line in enumerate(fileinput.input()):
### Process the input data ###
    try:
        parsed[linenum] = line
    except (KeyError, TypeError, ValueError):
        value = None
###############################################
### Now have dict with value: "data" pairing ##
### for every text message in the archive #####
###############################################
for item in parsed:
    sent_or_rcvd = parsed[item][:4]
    if sent_or_rcvd != "rcvd" and sent_or_rcvd != "sent" and sent_or_rcvd != '--\n':
        ###########################################
        ### Know we have a second or third line ###
        ###########################################
        pass  # stuck here: how do I attach this line to its message?

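But this is where I hit a wall. I'm not sure what the best way is to collect the strings I have at this point, and I'd love some expert input. I'm using Python 2.7.3 but am happy to move to 3.

Goal: a human-readable file containing the three-line quotes from these text messages.

Sample text: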

12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump 
--

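(Yes, before you ask, that is a haiku about poop. I'm trying to capture the ones I've texted to my best friend over the past 5 years.)

Ideally this would produce something like: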

  

Haiku 3   2011-03-19
  More bolder than flow
  More cumbersome than pleasure;
  Goodbye rocky dump

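2 answers:

Answer 0 (score: 1)

A good start might look like the following. I'm reading the data from a file named data2, but the read_messages generator will consume lines from any iterable.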

#!/usr/bin/env python

def read_messages(file_input):
    message = []
    for line in file_input:
        line = line.strip()
        if line[:4].lower() in ('rcvd', 'sent', '--'):
            if message:
                yield message
                message = []
        else:
            message.append(line)
    if message:
        yield message


with open('data2') as file_input:
    for msg in read_messages(file_input):
        print(msg)  # the parenthesized form works on Python 2.7 and 3

This expects input that looks like this:

sent
message sent away
it has multiple lines
--
rcvd
message received
rcvd
message sent away
it has multiple lines
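
The sample in the question uses `number|timestamp|text` header lines rather than `sent`/`rcvd` markers, so the same grouping idea could be adapted roughly as below. This is only a sketch under assumptions: the `HEADER` regex, the `group_messages` name, and the choice to treat `--` as an ignorable separator are illustrative and not part of the answer above.

```python
import re

# Assumed header shape: digits, '|', "YYYY-MM-DD HH:MM:SS", '|', text.
HEADER = re.compile(r'^(\d+)\|(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\|(.*)$')


def group_messages(lines):
    """Yield (sender, timestamp, [text lines]) for each message."""
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line == '--':
            continue  # blank lines and "--" separators carry no text
        match = HEADER.match(line)
        if match:
            # A header starts a new message, so emit the previous one.
            if current is not None:
                yield current
            sender, timestamp, text = match.groups()
            current = (sender, timestamp, [text])
        elif current is not None:
            current[2].append(line)  # continuation of the current message
    if current is not None:
        yield current


sample = [
    "12482904328|2011-03-19 11:13:31|words words words words",
    "--",
    "12482904328|2011-03-19 15:50:48|More bolder than flow",
    "More cumbersome than pleasure;",
    "Goodbye rocky dump ",
    "--",
]
messages = list(group_messages(sample))
# Keep only the messages whose text spans exactly three lines.
three_liners = [m for m in messages if len(m[2]) == 3]
print(three_liners)
```

Because the generator only needs an iterable of strings, an open file object or `fileinput.input()` could be passed in place of `sample`.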

Answer 1 (score: 1)

import time

data = """12425234123|2011-03-19 11:03:44|words words words words
12425234123|2011-03-19 11:04:27|words words words words
12425234123|2011-03-19 11:05:04|words words words words
12482904328|2011-03-19 11:13:31|words words words words
--
12482904328|2011-03-19 15:50:48|More bolder than flow
More cumbersome than pleasure;
Goodbye rocky dump """.splitlines()

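def get_haikus(lines):
    haiku = None
    for line in lines:
        line = line.strip()
        if not line or line == '--':
            continue  # skip blank lines and the "--" separators
        try:
            # maxsplit=2 keeps any '|' inside the message text intact
            ID, timestamp, txt = line.split('|', 2)
            time.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
            int(ID)
        except ValueError:  # raised by the unpacking, strptime() or int()
            if haiku is not None:
                haiku[1].append(line)  # continuation of the current message
            continue
        # a new message starts: emit the previous one if it had three lines
        if haiku and len(haiku[1]) == 3:
            yield haiku
        haiku = (timestamp, [txt])
    # the last message is never followed by a header, so check it here
    if haiku and len(haiku[1]) == 3:
        yield haiku

# get_haikus() yields tuples of (timestamp, [lines])
for haiku in get_haikus(data):
    timestamp, text = haiku
    date = timestamp.split()[0]
    text = '\n'.join(text)
    print("{d}\n{txt}".format(d=date, txt=text))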
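
To get closer to the numbered, indented layout the question asks for, the `(timestamp, [lines])` tuples could be formatted along these lines. This is a minimal sketch, not part of the answer above: `format_haikus`, its `label` parameter, and the exact spacing are assumptions made for illustration.

```python
def format_haikus(haikus, label="Haiku"):
    """Return one numbered, human-readable block per (timestamp, lines) pair."""
    blocks = []
    for number, (timestamp, lines) in enumerate(haikus, start=1):
        date = timestamp.split()[0]  # keep the date, drop the time of day
        header = "{label} {n}   {d}".format(label=label, n=number, d=date)
        body = "\n".join("  " + text for text in lines)  # indent each line
        blocks.append(header + "\n" + body)
    # Blank line between blocks, single newline at the end.
    return "\n\n".join(blocks) + "\n"


# Hand-written input in the same shape that get_haikus() yields.
example = [
    ("2011-03-19 15:50:48",
     ["More bolder than flow",
      "More cumbersome than pleasure;",
      "Goodbye rocky dump"]),
]
report = format_haikus(example)
print(report)
```

The returned string can then be written to a file opened in text mode, which would give the human-readable file the question describes.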