Question

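Hi all, I have a large file in the format shown below. The data is organized in "chunks". A chunk consists of three lines: time T, user U, and content W. For example, this is one chunk: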

T   2009-06-11 21:57:23
U   tracygazzard
W   David Letterman is good man

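I only need the chunks that contain a specific keyword, so I want to slice the data out of the original huge file chunk by chunk instead of loading the whole file into memory. Each time, read one chunk, and if its content line contains the word "bike", write that chunk to disk.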

You can use the following two chunks to test your script.

T   2009-06-11 21:57:23
U   tracygazzard
W   David Letterman is good man

T   2009-06-11 21:57:23
U   charilie
W   i want a bike

I tried to do the job line by line:

data = open("OWS.txt", 'r')
output = open("result.txt", 'w')

for line in data:
    if line.find("bike") != -1:
        output.write(line)

Answer 1

You can use a regular expression:

import re

# Note: read() loads the entire file into memory as one string
with open("OWS.txt", 'r') as data:
    text = data.read()

with open("result.txt", 'w') as output:
    for match in re.finditer(
        r"""(?mx)          # Verbose regex, ^ matches start of line
        ^T\s+(?P<T>.*)\s*  # Match first line
        ^U\s+(?P<U>.*)\s*  # Match second line
        ^W\s+(?P<W>.*)\s*  # Match third line""",
        text):
        if "bike" in match.group("W"):
            output.write(match.group())  # outputs entire match
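
To see what the pattern above captures, here is a small sketch that runs it on the two sample chunks from the question, held in a string instead of a file. The names CHUNK_RE and find_chunks are made up for this illustration; they are not part of the original answer.

```python
import re

# Same pattern as above, compiled once. CHUNK_RE is a hypothetical name.
CHUNK_RE = re.compile(
    r"""(?mx)              # multiline + verbose mode
    ^T\s+(?P<T>.*)\s*      # time line
    ^U\s+(?P<U>.*)\s*      # user line
    ^W\s+(?P<W>.*)\s*      # content line
    """
)

def find_chunks(text, keyword):
    """Return (time, user, content) tuples for chunks whose W line has keyword."""
    return [
        (m.group("T"), m.group("U"), m.group("W"))
        for m in CHUNK_RE.finditer(text)
        if keyword in m.group("W")
    ]

# The two sample chunks from the question.
sample = (
    "T   2009-06-11 21:57:23\n"
    "U   tracygazzard\n"
    "W   David Letterman is good man\n"
    "\n"
    "T   2009-06-11 21:57:23\n"
    "U   charilie\n"
    "W   i want a bike\n"
)

result = find_chunks(sample, "bike")
print(result)
```

Only the second chunk is returned, because the keyword is checked against the W group alone.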

Answer 2

Since the chunk format is constant, you can use a list to hold the current chunk and then check whether "bike" appears in it:

chunk = []
with open("OWS.txt", 'r') as data, open("result.txt", 'w') as output:
    for line in data:
        chunk.append(line)
        if line.startswith('W'):  # the W line ends a chunk
            # Check only the content line, so "bike" in a user name is ignored
            if 'bike' in line:
                output.writelines(chunk)
            chunk = []
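
The same idea can also be written as a generator, so that only one chunk is held in memory at a time. This is just a sketch: iter_chunks and filter_chunks are hypothetical helper names, and it assumes every chunk is exactly three non-blank lines in the order T, U, W. It uses io.StringIO with the two sample chunks in place of the real file.

```python
import io
from itertools import islice

def iter_chunks(lines):
    """Yield one chunk (a list of three lines: T, U, W) at a time."""
    # Drop the blank separator lines so each chunk is exactly T, U, W.
    content = (line for line in lines if line.strip())
    while True:
        chunk = list(islice(content, 3))
        if not chunk:
            return
        yield chunk

def filter_chunks(lines, keyword):
    """Yield only the chunks whose W line contains the keyword."""
    for chunk in iter_chunks(lines):
        # Assumption: the last line of every chunk is the W (content) line.
        if keyword in chunk[-1]:
            yield chunk

# The two sample chunks from the question, standing in for the real file.
sample = io.StringIO(
    "T   2009-06-11 21:57:23\n"
    "U   tracygazzard\n"
    "W   David Letterman is good man\n"
    "\n"
    "T   2009-06-11 21:57:23\n"
    "U   charilie\n"
    "W   i want a bike\n"
)

matches = list(filter_chunks(sample, "bike"))
for chunk in matches:
    print("".join(chunk), end="")
```

With a real file, pass the open file object instead of the StringIO and write each yielded chunk with output.writelines(chunk).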

Slicing data chunk by chunk with Python
