Question

I have a text file with the following high-level structure:

CATEG:
DATA1
DATA2
...
DATA_N
CATEG:
DATA1
....

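I want to open this text file and split it at each occurrence of CATEG:, separating out the content in between. However, I'm struggling with the open method and with how it handles the newline at the end of each line as it reads.

That is, using f = open('mydata.txt', 'r') followed by f.readlines() leaves a lot of unwanted newline characters, which makes splitting on the structure above tedious. Does anyone have any tips? Unfortunately, the annoying part is the dataset itself.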

Answer 1

Try read().splitlines().
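A minimal sketch of how this suggestion might be used, assuming the file is small enough to read into memory in one go. The helper name `read_lines` is hypothetical and only here for illustration.

```python
def read_lines(path):
    """Return the lines of the file at `path` without trailing newlines."""
    with open(path) as f:
        # read() returns the whole file as one string; splitlines() splits it
        # on line boundaries and drops the line-ending characters.
        return f.read().splitlines()


# The same call on an in-memory string shaped like the file in the question:
sample = "CATEG:\nDATA1\nDATA2\nCATEG:\nDATA1\n"
print(sample.splitlines())
# ['CATEG:', 'DATA1', 'DATA2', 'CATEG:', 'DATA1']
```

Unlike f.readlines(), the resulting strings carry no '\n', so they can be compared directly against 'CATEG:'.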

Answer 2

Try the following code:

with open('mydata.txt') as f:
  for line in f:
    line = line.strip(' \t\r\n')  # remove spaces and line endings
    if line.endswith(':'):
      pass  # this is a category definition
    else:
      pass  # this is a data line
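One way the two branches above could be filled in, as a rough sketch. It assumes every header line ends with a colon and that no data line does. The function name `parse_categories` and the list-of-tuples result are illustrative choices, not something the question specifies.

```python
def parse_categories(lines):
    """Group data lines under the category header that precedes them.

    Returns a list of (category_name, data_lines) tuples. A list is used
    rather than a dict because every header in the question has the same name.
    """
    groups = []
    for raw in lines:
        line = raw.strip(' \t\r\n')  # remove spaces and line endings
        if not line:
            continue  # skip blank lines
        if line.endswith(':'):
            # Category definition: start a new group, dropping the colon.
            groups.append((line[:-1], []))
        elif groups:
            # Data line: attach it to the most recent category.
            groups[-1][1].append(line)
        # Data lines that appear before the first header are ignored.
    return groups
```

Because it only iterates over its argument, it can be handed an open file object directly, for example parse_categories(f) inside the with block.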

Answer 3

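You can use itertools.groupby:

from itertools import groupby

with open(filename) as f:
    lines = f.read().splitlines()  # file objects have no splitlines(), so read first

# groupby yields alternating runs of header lines (key True) and data lines
# (key False); keep only the data runs.
categs = [list(group) for is_header, group in groupby(lines, key=lambda line: line == 'CATEG:') if not is_header]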

Answer 4

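Try this:

with open('text.txt') as file:
    text = file.read()
text = text.replace('\n', ' ')  # join all lines into one space-separated string
s = text.split('CATEG:')  # one chunk per category
s = [x.strip() for x in s if x.strip()]  # trim whitespace and drop empty chunks
print(s)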
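If you would rather get a list of data lines per category than one space-joined string, a variation on the same split idea might look like the sketch below. `split_categories` is a hypothetical name, and the approach assumes the marker text never occurs inside a data line.

```python
def split_categories(text, marker="CATEG:"):
    """Return one list of data lines for each occurrence of `marker` in `text`."""
    # Everything before the first marker is not part of any category.
    chunks = text.split(marker)[1:]
    return [
        [line.strip() for line in chunk.splitlines() if line.strip()]
        for chunk in chunks
    ]
```

Like the code above, this reads the whole file into memory first, e.g. split_categories(file.read()).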

Answer 5

Write a small wrapper around your sequence that strips the newlines:

def newline_stripper(seq):
    for s in seq:
        # or change this to just s.rstrip() to remove all trailing whitespace
        yield s.rstrip('\n')

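Then wrap the file object with it when iterating:

with open('text_file.txt') as f:
    for line in newline_stripper(f):
        # do something with your now newline-free lines
        pass

This keeps your file reading one line at a time, rather than reading everything at once the way read().splitlines() does.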
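The same line-at-a-time idea can be extended to do the grouping as well. The generator below is only a sketch: the name `iter_groups` is made up for illustration, and it assumes each header line is exactly 'CATEG:' with nothing else on the line.

```python
def iter_groups(lines, marker="CATEG:"):
    """Yield one list of data lines per marker, consuming `lines` lazily."""
    current = None  # stays None until the first marker has been seen
    for line in lines:
        line = line.rstrip('\n')
        if line == marker:
            if current is not None:
                yield current  # the previous category is complete
            current = []
        elif current is not None and line:
            current.append(line)  # non-empty data line of the open category
    if current is not None:
        yield current  # flush the last category at end of input
```

Passing an open file object, as in for group in iter_groups(f), keeps only one category's lines in memory at a time.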

Extracting data from a text file in Python
