Question

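I have a large text file in svmlight format. Each line is a string of space-separated pairs, where each pair is an index (int) and a value (float) separated by a colon.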

Example:

1:2 4:12 5:3 ...
2:34 4:2 12:5 ...

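The file can be very large, so it cannot be read into a numpy array as a whole.

How can I read such a file in chunks most efficiently? Or, more precisely, how can I efficiently create numpy arrays in this case?

Right now I use the following code, where lines is a list of lines read from the file.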

    x = []
    for line in lines:
        tmp = re.split('[ :]', line)
        out = [0] * len(self.__varnames)
        for i in range(0, len(tmp), 2):
            out[int(tmp[i])] = float(tmp[i+1])
        x.append(out)
    x = np.asarray(x)

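It is quite fast relative to my other attempts, but I believe it can be sped up.

Notes:

1) load_svmlight_file from the sklearn package reads the file as a whole, and it cannot read files that omit the leading class label, which is optional.

2) I would like to find a fast solution without external library dependencies, if one exists. numpy and scipy are of course allowed.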

Answer 1

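You can read the lines in chunks.

Here is a simple solution for each of the two cases: reading line by line, and reading fixed-size chunks.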

    def yield_file(infile):
        '''(file_path) >> line
        A simple generator that yields the lines of a file.
        '''
        with open(infile, 'r') as f:
            for line in f:
                yield line


    def read_in_chunks(infile, chunk_size=1024):
        '''(file_path, int) >> str
        Simple generator to read a file in chunks.
        '''
        with open(infile, 'r') as f:
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                yield data
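Note that reading a fixed number of characters can cut a line in the middle, so for svmlight data it is usually easier to batch whole lines. The following is only a sketch of that idea, not a tested drop-in: iter_line_batches, batch_to_array, batch_size and n_features are names made up for this illustration, and the indices are assumed to be usable directly as column positions, as in the question's code.

```python
from itertools import islice

import numpy as np


def iter_line_batches(path, batch_size=10000):
    """Yield lists of at most batch_size whole lines from the file at path."""
    with open(path, 'r') as f:
        while True:
            # islice pulls the next batch_size lines without loading the rest.
            batch = list(islice(f, batch_size))
            if not batch:
                break
            yield batch


def batch_to_array(lines, n_features):
    """Parse 'index:value' lines into a dense (len(lines), n_features) array."""
    # Preallocate once instead of building a list of lists per row.
    out = np.zeros((len(lines), n_features), dtype=float)
    for row, line in enumerate(lines):
        for entry in line.split():
            index, value = entry.split(':')
            # Assumption: indices are used as-is, as in the question's code.
            out[row, int(index)] = float(value)
    return out
```

Each batch can then be processed and discarded before the next one is read, for example with: for batch in iter_line_batches('data.txt'): x = batch_to_array(batch, n_features). The file name here is a placeholder.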

Answer 2

You can do this more cleanly, without a regular expression:

    x = []
    for line in lines:
        out = [0] * len(self.__varnames)
        for entry in line.split():
            index, value = entry.split(':')
            out[int(index)] = float(value)
        x.append(out)
    x = np.array(x)

I don't know whether this is any faster; it needs to be tested.
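One way to run that test is with the standard timeit module. The sketch below wraps the regex version from the question and the split version above as two functions and times them on the same input; parse_regex, parse_split, compare and n_features are hypothetical names for this illustration, and the regex variant strips the line first so a trailing newline does not produce an empty token.

```python
import re
import timeit


def parse_regex(line, n_features):
    """Parse one line by splitting on spaces and colons with a regex."""
    tmp = re.split('[ :]', line.strip())
    out = [0] * n_features
    for i in range(0, len(tmp), 2):
        out[int(tmp[i])] = float(tmp[i + 1])
    return out


def parse_split(line, n_features):
    """Parse one line with str.split only."""
    out = [0] * n_features
    for entry in line.split():
        index, value = entry.split(':')
        out[int(index)] = float(value)
    return out


def compare(lines, n_features, number=5):
    """Return the seconds each parser takes for `number` passes over lines."""
    results = {}
    for fn in (parse_regex, parse_split):
        results[fn.__name__] = timeit.timeit(
            lambda: [fn(line, n_features) for line in lines], number=number)
    return results
```

Calling compare(['1:2 4:12 5:3'] * 10000, 6) returns a dict of timings keyed by function name. The numbers depend on the machine and on how long the real lines are, so it is worth measuring on a sample of the actual file.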

Fast reading of svmlight file by chunks