Question

I'm trying to parse a space-delimited text file in Python 2.7.5 that looks a bit like this:

variable         description      useless data
a1                asdfsdf           2342354 
            Sometimes it goes into further detail about the 
            variable/description here
a2                asdsfda           32123

Edit: Sorry about the spaces added at the beginning; I didn't see them.

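I want to be able to split the text file into an array with the variable and description in two separate columns, drop all the useless data, and skip any line that doesn't start with a string. The way I've set up my code is: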

import os
import pandas
import numpy
os.chdir(r'C:\folderwithfiles')  # raw string, so \f is not read as a form feed
f = open('Myfile.txt', 'r')
lines = f.readlines()
for line in lines:
    if not line.strip():
        continue
    else:
        print(line)
print(lines)

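As of now, this code skips most of the descriptive lines between the variable lines, but some still show up in the output. Help with my line-skipping problem, or help getting started on the column-forming part, would be great! I don't have much experience with Python. Thanks!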

Edit: part of the file before running the code:

CASEID            (id) Case Identification                   1   15   AN



MIDX              (id) Index to Birth History                16   1  No
                           1:6

After:

CASEID            (id) Case Identification                   1   15   AN

MIDX              (id) Index to Birth History                16   1  No
                           1:6

Answer 1

You want to filter out the lines that start with a space, and split all the other lines to get the first two columns.

Translating those two rules into code:

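with open('Myfile.txt') as f:
    for line in f:
        # Skip blank lines (they would break the unpacking below)
        # and indented continuation lines.
        if line.strip() and not line.startswith(' '):
            # Split on whitespace at most twice; the remainder is discarded.
            variable, description, _ = line.split(None, 2)
            print(variable, description)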

That's all there is to it.

Or, as a more direct translation:

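with open('Myfile.txt') as f:
    # Keep only the lines that do not start with a space.
    non_descriptions = filter(lambda line: not line.startswith(' '), f)
    # Split each remaining line and keep the first two columns.
    values = (line.split(None, 2)[:2] for line in non_descriptions)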

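Now values is an iterator over (variable, description) pairs. And it reads nicely: the first line says "filter out the lines that start with a space", and the second says "split each line to get the first two columns". (You could write the first as a genexpr instead of filter, or the second as map instead of a genexpr, but I think this is the closest to the English description.) One caveat: on Python 3, filter is lazy, so values has to be consumed inside the with block, before the file is closed.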
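
To try the two rules without touching a file, here is a minimal sketch that runs them over the sample from the question held in a string. The names SAMPLE and parse_rows are made up for illustration, and the extra checks for blank and short lines are my own assumption about what the real file may contain.

```python
# Sample text from the question, with one blank line added on purpose.
SAMPLE = """\
variable         description      useless data
a1                asdfsdf           2342354
            Sometimes it goes into further detail about the
            variable/description here

a2                asdsfda           32123
"""


def parse_rows(lines):
    """Return [variable, description] pairs from non-blank, non-indented lines."""
    rows = []
    for line in lines:
        # Blank lines and lines that start with whitespace are skipped.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split(None, 2)
        # Ignore lines that do not have at least two columns.
        if len(fields) >= 2:
            rows.append(fields[:2])
    return rows


rows = parse_rows(SAMPLE.splitlines())
# The first surviving row is the header line of the file.
header, data = rows[0], rows[1:]
```

Note that the header line is not indented, so it comes through as the first row; slicing it off as above is one way to deal with that.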

Answer 2

Assuming there are no spaces within your variables or descriptions, this will work:

with open('path/to/file') as infile:
    answer = []
    for line in infile:
        if not line.strip():  # skipping blank lines
            continue
        if line.startswith(' '): # skipping descriptions
            continue
        splits = line.split()
        var, desc = splits[:2]
        answer.append([var, desc])
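
If the descriptions do contain single spaces, as in the CASEID sample from the question, one possible variation is to split on runs of two or more whitespace characters instead. This is only a sketch: it assumes columns are always separated by at least two spaces and that every kept line has at least two columns. SAMPLE and split_columns are made-up names.

```python
import re

# Lines taken from the second sample in the question.
SAMPLE = """\
CASEID            (id) Case Identification                   1   15   AN

MIDX              (id) Index to Birth History                16   1  No
                           1:6
"""


def split_columns(line):
    """Split a line on runs of two or more whitespace characters."""
    return re.split(r'\s{2,}', line.strip())


answer = []
for line in SAMPLE.splitlines():
    # Skip blank lines and indented description lines, as above.
    if not line.strip() or line[0].isspace():
        continue
    var, desc = split_columns(line)[:2]
    answer.append([var, desc])
```

Single spaces inside a description stay intact because only gaps of two or more characters count as a column boundary.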

Answer 3

If you are using pandas, try this:

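from pandas import read_csv
# Skip malformed rows, then drop the unwanted column (axis=1 selects columns).
# Newer pandas versions replace error_bad_lines with on_bad_lines='skip'.
data = read_csv('file.txt', error_bad_lines=False).drop(['useless data'], axis=1)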

If your file is fixed-width (rather than comma-separated values), use pandas.read_fwf instead.
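
To show what "fixed-width" means here, below is a plain-Python sketch that cuts fields out by character position. It is not how pandas does it internally, just the same idea done by hand. The widths 18 and 43 are assumptions measured from the sample lines in the question, and ROW_FORMAT and slice_fixed_width are made-up names.

```python
# Build two sample rows with known widths: 18 characters for the variable,
# 43 for the description, then whatever columns remain. These widths are an
# assumption based on the sample in the question.
ROW_FORMAT = "%-18s%-43s%s"
lines = [
    ROW_FORMAT % ("CASEID", "(id) Case Identification", "1   15   AN"),
    ROW_FORMAT % ("MIDX", "(id) Index to Birth History", "16   1  No"),
]


def slice_fixed_width(line, var_width=18, desc_width=43):
    """Cut the first two fixed-width fields out of a line and trim the padding."""
    variable = line[:var_width].strip()
    description = line[var_width:var_width + desc_width].strip()
    return variable, description


parsed = [slice_fixed_width(line) for line in lines]
```

With pandas, read_fwf can take the column boundaries through its colspecs or widths parameters, so the same offsets could be passed there once you know them for your real file.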
