Question

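I have about 30 data files, and I need to extract columns 4, 5, and 6, then skip 14 columns and grab the next 3 columns, and so on until the end of the file. Each data file has about 400 rows and 17000 columns. So far I have this: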

import glob
import numpy as np

file_list = glob.glob('*.dat')

with open("result.dat", "wb") as outfile:
    for f in file_list:
        with open(f, "rb") as infile:
            outfile.write(infile.read())

data = np.loadtxt('result.dat')

arr = np.array(data)
a = arr[:, 4:-1:17]
b = arr[:, 5:-1:17]
c = arr[:, 6:-1:17]

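This writes a file named result.dat containing all the data from the individual files, and then I extract the columns I need. However, creating the array takes a long time, because it writes out all the information I don't need. Is there a way to read in only the specific columns I'm interested in, instead of going through the result.dat file? That should cut the time significantly.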

Answer 1

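numpy.loadtxt is a pure Python implementation, which makes it somewhat slow. Using pandas.read_csv() will be faster. You also don't need to write another file with the full contents (unless you need that file for some other purpose).

Here is the equivalent code using pandas.read_csv: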

import glob
import pandas as pd

file_list = glob.glob('*.dat')
cols = [4, 21, 38] # add more columns here

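# DataFrame.append was removed in pandas 2.0, so collect the
# per-file frames in a list and concatenate them once at the end.
frames = [
    pd.read_csv(f, sep=r'\s+', header=None, usecols=cols)
    for f in file_list
]
df = pd.concat(frames, ignore_index=True)

arr = df.to_numpy()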

The equivalent numpy code is:

import glob
import numpy as np

file_list = glob.glob('*.dat')
cols = [0, 1, 2]  # add more columns here

data = []
for f in file_list:
    data.append(np.loadtxt(f, usecols=cols))

arr = np.vstack(data)

Timings with 10 files of random numbers, each of shape (10000, 10):

pandas solution: 0.95 s

numpy solution: 2.6 s
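The cols list in the pandas code is left for the reader to extend. As a rough sketch, the "take 3, skip 14" pattern from the question can be generated rather than typed out. The helper names pattern_columns and read_selected, and the small 40-column demo files, are made up for illustration; the indices are 0-based and follow the question's own slices (4, 5, 6, then every 17th column after that).

```python
import glob
import os
import tempfile

import numpy as np
import pandas as pd


def pattern_columns(num_col, start=4, width=3, period=17):
    """Return 0-based indices: `width` columns from `start`, repeated every `period`."""
    return [
        c
        for base in range(start, num_col, period)
        for c in range(base, min(base + width, num_col))
    ]


def read_selected(file_list, cols):
    """Read only `cols` from each whitespace-delimited file and stack the rows."""
    frames = [
        pd.read_csv(f, sep=r"\s+", header=None, usecols=cols) for f in file_list
    ]
    return pd.concat(frames, ignore_index=True).to_numpy()


# Demo on two small synthetic files (3 rows x 40 columns each), where every
# value equals its own column index so the selection is easy to check.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(2):
        np.savetxt(
            os.path.join(tmp, f"demo{i}.dat"),
            np.tile(np.arange(40), (3, 1)),
            fmt="%d",
        )
    demo_cols = pattern_columns(40)
    demo_arr = read_selected(sorted(glob.glob(os.path.join(tmp, "*.dat"))), demo_cols)
```

For the real data this would presumably be called as read_selected(file_list, pattern_columns(17000)), assuming every file has the same 17000 columns.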

Answer 2

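The numpy.loadtxt function accepts an optional usecols argument.

You can generate the column indices like this (num_col is the number of columns in the file):

usecols = sorted(set(range(4, num_col, 17)) | set(range(5, num_col, 17)) | set(range(6, num_col, 17)))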
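To see what that selection does, here is a small self-contained sketch on Python 3 (where range replaces xrange). The 40-column in-memory "file" and the value of num_col are invented for the demo; each value equals its column index, so the picked columns are easy to read off.

```python
import io

import numpy as np

num_col = 40  # hypothetical width; the real files have about 17000 columns

# Union of the three strided ranges, sorted so the output columns come out in file order.
usecols = sorted(
    set(range(4, num_col, 17))
    | set(range(5, num_col, 17))
    | set(range(6, num_col, 17))
)

# Three identical rows "0 1 2 ... 39" standing in for a data file.
text = "\n".join(" ".join(str(c) for c in range(num_col)) for _ in range(3))

# loadtxt parses only the requested columns into the result array.
selected = np.loadtxt(io.StringIO(text), usecols=usecols)
```

The resulting array has one column per entry in usecols, in the order given.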

Answer 3

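loadtxt accepts any iterable, including a generator. You can iterate over the files and feed their lines directly to loadtxt instead of writing an intermediate file. There's no guarantee it will save much time, but it may be worth experimenting with.

Here is my test: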

def foo(filelist):
    for name in filelist:
        with open(name) as f:
            for line in f:
                yield line

A simple test file:

In [71]: cat urls.txt
one.com
two.url
three.four

Reading it twice with foo:

In [72]: list(foo(['urls.txt','urls.txt']))
Out[72]: 
['one.com\n',
 'two.url\n',
 'three.four\n',
 'one.com\n',
 'two.url\n',
 'three.four\n']

Using it in loadtxt:

In [73]: np.loadtxt(foo(['urls.txt','urls.txt']),dtype=str,delimiter='.',usecols=[1])
Out[73]: 
array(['com', 'url', 'four', 'com', 'url', 'four'], 
      dtype='|S4')
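The same idea applied to numeric data might look like the sketch below. The generator name iter_lines, the two small temporary files, and the chosen columns are all made up for the demo; it only assumes that every file has the same number of columns.

```python
import os
import tempfile

import numpy as np


def iter_lines(filelist):
    """Yield the lines of each file in turn, as if they were one long file."""
    for name in filelist:
        with open(name) as f:
            yield from f


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        path = os.path.join(tmp, f"part{i}.dat")
        # 3 rows x 4 columns per file; the second file is offset by 100.
        np.savetxt(path, np.arange(12).reshape(3, 4) + 100 * i, fmt="%d")
        paths.append(path)

    # One loadtxt call over all files, keeping only columns 1 and 3.
    combined = np.loadtxt(iter_lines(paths), usecols=[1, 3])
```

No intermediate result.dat is written; loadtxt just sees a single stream of lines.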

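An alternative is to load each file, collect the arrays in a list, and concatenate them.

Writing usecols for a 'use 3, skip 14' pattern is a bit awkward. Using 3 slices is a nice idea, but you don't want to call loadtxt once per slice.

np.r_ may make the task easier: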

In [81]: np.r_[4:100:17, 5:100:17, 6:100:17]
Out[81]: 
array([ 4, 21, 38, 55, 72, 89,  5, 22, 39, 56, 73, 90,  6, 23, 40, 57, 74,
       91])

In [82]: np.sort(np.r_[4:100:17, 5:100:17, 6:100:17])
Out[82]: 
array([ 4,  5,  6, 21, 22, 23, 38, 39, 40, 55, 56, 57, 72, 73, 74, 89, 90,
       91])

usecols doesn't have to be sorted, so you can use either of these.
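Putting the pieces together, a minimal sketch of the per-file approach could look like this. The function name load_columns and the 34-column demo files are invented here; the unsorted np.r_ order is kept on purpose, because it groups the columns so the result can be split back into the question's a, b and c.

```python
import os
import tempfile

import numpy as np


def load_columns(file_list, num_col):
    """Load the 4/5/6-every-17 columns from each file and stack the rows."""
    cols = np.r_[4:num_col:17, 5:num_col:17, 6:num_col:17]
    # ndmin=2 keeps a single-row file two-dimensional so vstack still works.
    arrays = [np.loadtxt(f, usecols=cols, ndmin=2) for f in file_list]
    return np.vstack(arrays)


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        path = os.path.join(tmp, f"demo{i}.dat")
        # 3 rows x 34 columns; each value equals its column index.
        np.savetxt(path, np.tile(np.arange(34), (3, 1)), fmt="%d")
        paths.append(path)

    stacked = load_columns(paths, num_col=34)

# The three slices have equal length here, so an even split recovers
# the question's a, b and c arrays.
a, b, c = np.split(stacked, 3, axis=1)
```

The even split only works when the three slices contain the same number of columns, which holds when the column count is a multiple of 17 (as with 17000).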

Reading certain columns from multiple data files into one file in Python

3 answers: