Question

I'm looking for a way to extract data from a file that uses more than one comment symbol. The input file looks like this:

# filename: sample.txt
# Comment 1
# Comment 2
$ Comment 3
1,10
2,20
3,30
4,40
# Comment 4

With the following code I can only strip one type of comment, and I can't find any documentation on how to strip both.

import numpy as np
data = np.loadtxt('sample.txt',comments="#") # I need to also filter out '$'

Is there any other way to achieve this?

Answer 1

Just pass a list of comment symbols, for example:

data = np.loadtxt('sample.txt', comments=['#', '$', '@'], delimiter=',')  # delimiter=',' because the sample rows are comma-separated
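
As a self-contained check, the sketch below feeds the sample data from the question to np.loadtxt through an in-memory buffer instead of a file. It assumes a NumPy version whose comments parameter accepts a sequence of strings (older releases only take a single string), and it passes delimiter=',' because the rows are comma-separated. The names SAMPLE and data are only illustrative.

```python
import io
import numpy as np

# Sample content from the question, held in memory instead of sample.txt.
SAMPLE = """\
# filename: sample.txt
# Comment 1
# Comment 2
$ Comment 3
1,10
2,20
3,30
4,40
# Comment 4
"""

# Assumption: this NumPy version accepts a list of strings for `comments`.
data = np.loadtxt(io.StringIO(SAMPLE), comments=['#', '$'], delimiter=',')
print(data.shape)
```

loadtxt returns floats by default; pass dtype=int if integers are wanted.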

Answer 2

I would create a generator that skips the comment lines and then pass it to np.genfromtxt():

with open('sample.txt') as f:
    gen = (r for r in f if not r.startswith(('$', '#')))  # skip comment lines
    a = np.genfromtxt(gen, delimiter=',')

Answer 3

For this case you need to fall back on a plain Python loop over the input, something like this:

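data = []
with open("input.txt") as fd:
    for line in fd:
        # Skip lines that start with either comment symbol
        if line.startswith('#') or line.startswith('$'):
            continue
        # Build a real list; in Python 3 map() would only return an iterator
        data.append([int(x) for x in line.strip().split(',')])

print(data)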

Output:

[[1, 10], [2, 20], [3, 30], [4, 40]]

Answer 4

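Since each of your lines contains either only a comment or only data, I would simply read the file in before processing it with numpy. The comment lines are removed with a regular expression.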

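import re
from io import StringIO  # Python 3; in Python 2 this lived in the StringIO module
import numpy as np
with open('sample.txt', 'r') as f:
    # Remove whole comment lines. Anchoring at line starts (re.MULTILINE)
    # keeps the newline that ends the preceding data row.
    data = re.sub(r'^[ \t]*[#$].*\n?', '', f.read(), flags=re.MULTILINE)
data = np.genfromtxt(StringIO(data), dtype=int, delimiter=',')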

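This gives you the desired numpy array data. Note that the approach still works if a line (accidentally) starts with some whitespace followed by a comment character.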
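
To see what such a substitution removes, here is a minimal sketch on a hypothetical in-memory string (the sample text and the names text, pattern and cleaned are made up for illustration). A pattern that begins with \s* would also match the newline that ends a preceding data row, so the sketch anchors at line starts instead and only allows spaces and tabs before the comment symbol.

```python
import re

# Hypothetical sample: one comment is indented, one sits between data rows.
text = "# header\n  $ indented comment\n1,10\n2,20\n# mid-file comment\n3,30\n"

# With re.MULTILINE, ^ matches at the start of every line. [ \t]* permits
# leading blanks without consuming the newline of the previous row, and
# \n? also covers a comment on the last line that lacks a trailing newline.
pattern = re.compile(r'^[ \t]*[#$].*\n?', flags=re.MULTILINE)
cleaned = pattern.sub('', text)
print(repr(cleaned))
```

The cleaned string can then be wrapped in a StringIO object and handed to np.genfromtxt as in the answer.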

Answer 5

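I looked at the numpy.loadtxt code, and, at least in the linked version (v1.8.1), it is not possible to use more than one comment symbol, because it relies on str.split: https://github.com/numpy/numpy/blob/v1.8.1/numpy/lib/npyio.py#L790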

I think you could load the file line by line, check whether each line contains a comment, and then pass the remaining lines to numpy.fromstring.
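
A minimal sketch of that idea follows. The helper name parse_lines and its comment_chars parameter are hypothetical, and the sketch only handles lines that are comments in their entirety (as in the question's sample), not comments trailing after data.

```python
import numpy as np

def parse_lines(lines, comment_chars=('#', '$')):
    """Hypothetical helper: skip comment lines, parse the rest with np.fromstring."""
    rows = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and lines that start with a comment symbol.
        if not stripped or stripped.startswith(comment_chars):
            continue
        # Text mode of np.fromstring: values separated by `sep`.
        rows.append(np.fromstring(stripped, dtype=int, sep=','))
    return np.array(rows)

sample = ["# Comment 1", "$ Comment 3", "1,10", "2,20", "# Comment 4"]
result = parse_lines(sample)
print(result)
```

Because the helper accepts any iterable of strings, an open file object can be passed in place of the list.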

Answer 6

If you want to keep the full power of loadtxt, you can simply modify it to suit your needs. As David Marek pointed out, the line where comments are stripped is this one:

line = asbytes(line).split(comments)[0].strip(asbytes('\r\n'))

which becomes:

for com in comments:
    line = asbytes(line).split(com)[0]
line = line.strip(asbytes('\r\n'))

You also need to change L717:

comments = asbytes(comments)

to:

comments = [asbytes(com) for com in comments]

If you want to stay fully compatible, so that a single string is still accepted, wrap it in a list first:

if isinstance(comments, basestring):
    comments = [comments]
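
The effect of the patch can be tried in isolation with the sketch below. It is not numpy's code: strip_comments is a hypothetical stand-alone function, and it works on Python 3 str objects rather than going through asbytes.

```python
def strip_comments(line, comments):
    """Hypothetical stand-alone version of the patched comment handling."""
    # Accept a single string as well as a list of comment symbols.
    if isinstance(comments, str):
        comments = [comments]
    for com in comments:
        # Keep only the text before the first occurrence of this symbol.
        line = line.split(com)[0]
    return line.strip('\r\n')

print(repr(strip_comments("1,10  # trailing note\n", ['#', '$'])))
print(repr(strip_comments("$ Comment 3\n", ['#', '$'])))
```

A line that consists only of a comment comes back as an empty string, which the caller can then skip.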

Filtering out multiple comment symbols with numpy
