Question

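For certain reasons, I have split my code into two parts: the first is written in C and the second in Python. I write the output of the C code to a file and use that file as the input in Python. My problem is that loading the file into a numpy array takes about 18 minutes, which is far too long, and I need to reduce this time. The file is about 300 MB.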

The C code that writes the file is as follows:

struct point {
    float fpr;
    float tpr;
    point(float x, float y)
    {
        fpr = x;
        tpr = y;
    }
};
vector<point> current_points;
// filling current_points ......
ofstream files;
files.open("./allpoints.txt");
// Write one point per line: fpr, a tab, then tpr.
for (unsigned int i = 0; i < current_points.size(); i++)
    files << current_points[i].fpr << '\t' << current_points[i].tpr << "\n";

In Python, the file is read like this:

import numpy

with open("./allpoints.txt") as f:
    just_comb = numpy.loadtxt(f)  # The problem is here (took 18 minutes)

allpoints.txt looks like this (as you can see, it holds the coordinates of points in 2D):

0.989703    1
0   0
0.0102975   0
0.0102975   0
1   1
0.989703    1
1   1
0   0
0.0102975   0
0.989703    1
0.979405    1
0   0
0.020595    0
0.020595    0
1   1
0.979405    1
1   1
0   0
0.020595    0
0.979405    1
0.969108    1
...
...
...
0   0
0.0308924   0
0.0308924   0
1   1
0.969108    1
1   1
0   0
0.0308924   0
0.969108    1
0.95881 1
0   0

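So my question is: is there a better way to store the vector of points in a file (something like a binary format) so that Python can read it into a 2D numpy array faster?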

Answer 1

If you want a pre-baked library solution, use HDF5. If you want something more bare-metal with no dependencies, do it like this:

files.write(reinterpret_cast<char*>(current_points.data()),
    current_points.size() * sizeof(point));

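This gives you a simple 2D float array written directly to the file. You can then read the file with numpy.fromfile(). When reading, pass a 32-bit float dtype, because numpy.fromfile() defaults to 64-bit floats while a C float is normally 4 bytes, and reshape the result to two columns.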
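The reading side might look roughly like the sketch below. It assumes the file was produced by the `files.write(...)` call above on a stream opened in binary mode, and that `point` is stored as exactly two native-endian 4-byte floats with no padding. The file name `allpoints.bin` and the helper `load_points` are hypothetical, and the `tofile` call only stands in for the C++ writer so that the example runs on its own.

```python
import os
import tempfile

import numpy as np


def load_points(path):
    """Read raw (fpr, tpr) float32 pairs into an (N, 2) array."""
    # Assumption: each record is two native-endian 32-bit floats, no padding.
    flat = np.fromfile(path, dtype=np.float32)
    return flat.reshape(-1, 2)


# Stand-in for the C++ writer: dump a few float32 pairs as raw bytes.
sample = np.array(
    [[0.989703, 1.0], [0.0, 0.0], [0.0102975, 0.0]], dtype=np.float32
)
path = os.path.join(tempfile.mkdtemp(), "allpoints.bin")  # hypothetical name
sample.tofile(path)

points = load_points(path)
print(points.shape)
```

Because no text parsing is involved, the read is essentially a bulk copy from disk, which is why the binary route tends to be much faster than numpy.loadtxt.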

Answer 2

Have you tried numpy.fromfile?

>>> import numpy
>>> data = numpy.fromfile('./allpoints.txt', dtype=float, count=-1, sep=' ')
>>> data = numpy.reshape(data, (len(data) // 2, 2))
>>> print(data[0:10])
[[ 0.989703   1.       ]
 [ 0.         0.       ]
 [ 0.0102975  0.       ]
 [ 0.0102975  0.       ]
 [ 1.         1.       ]
 [ 0.989703   1.       ]
 [ 1.         1.       ]
 [ 0.         0.       ]
 [ 0.0102975  0.       ]
 [ 0.989703   1.       ]]

For a 300 MB input file, this takes 20 seconds.
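For anyone who wants to try the text-mode approach without the original data, here is a small self-contained sketch. The four sample rows and the temporary file are made up for illustration; the reshape uses `-1` so the row count is inferred.

```python
import os
import tempfile

import numpy as np

# Tiny stand-in for allpoints.txt: tab-separated, one point per line.
text = "0.989703\t1\n0\t0\n0.0102975\t0\n1\t1\n"
txt_path = os.path.join(tempfile.mkdtemp(), "allpoints.txt")
with open(txt_path, "w") as f:
    f.write(text)

# With sep=" ", any run of whitespace (spaces, tabs, newlines) separates items.
flat = np.fromfile(txt_path, dtype=float, sep=" ")

# -1 lets numpy work out the number of rows from the total length.
data = flat.reshape(-1, 2)
print(data.shape)
```

Note that numpy.fromfile does no validation of the column layout, so this only works when every line really has two values.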
