Question

Is there a way to filter data before or while reading it into a DataFrame?

For example, I have the following CSV data file:

 time       Event    price     Volume
00:00:00.000, B,    920.5,    57
00:00:00.000, A,    920.75,   128
00:00:00.898, T,    920.75,   1
00:00:00.898, T,    920.75,   19
00:00:00.906, B,    920.5,    60
00:00:41.284, T,    920.75,   5
00:00:57.589, B,    920.5,    53
00:01:06.745, T,    920.75,   3
00:01:06.762, T,    920.75,   2

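I want to read only the rows where 'Event'=='T' and 'Volume'>=100. This is easy to do if we read the whole dataset and then filter it (which is what I am doing now).

Each file I have is 10 MB, and there are thousands of them (about 15 GB of data in total), so this process takes forever. So I am wondering whether there is a way to filter the data as it is read in, or some other way to speed things up. Maybe use a database instead?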

Answer 1

I don't believe there is a way to filter which rows you want while they are being read from a CSV file.
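
That said, read_csv can read a file in pieces through its chunksize parameter, so each piece can be filtered before the next one is loaded. The following is only a minimal sketch. It assumes the header row is comma-separated like the data rows. The helper name read_filtered and the last sample row (a 'T' event with Volume 150) are made up for illustration, since none of the rows shown in the question pass the filter.

```python
import io

import pandas as pd


def read_filtered(source, chunksize=100_000):
    """Read a CSV in chunks, keeping only rows with Event == 'T' and Volume >= 100."""
    kept = []
    # skipinitialspace=True drops the padding spaces that follow each comma.
    for chunk in pd.read_csv(source, skipinitialspace=True, chunksize=chunksize):
        mask = (chunk["Event"] == "T") & (chunk["Volume"] >= 100)
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)


# Small in-memory stand-in for one of the 10 MB files.
sample = io.StringIO(
    "time,Event,price,Volume\n"
    "00:00:00.000, B,    920.5,    57\n"
    "00:00:00.000, A,    920.75,   128\n"
    "00:00:00.898, T,    920.75,   1\n"
    "00:00:01.500, T,    921.0,    150\n"  # hypothetical row, not from the question
)

filtered = read_filtered(sample, chunksize=2)
print(filtered)
```

Every line is still parsed, so this mainly keeps memory use low rather than reducing parsing time.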

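Try using HDFStore. It offers excellent performance for reading and writing. You can read all the data from the CSV files, save it to H5 files, and use those H5 files as a database. Some comparison results are on this page: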

http://pandas.pydata.org/pandas-docs/dev/io.html

I have copied the results here for comparison.

Write performance:

In [15]: %timeit test_hdf_fixed_write(df)
1 loops, best of 3: 237 ms per loop

In [26]: %timeit test_hdf_fixed_write_compress(df)
1 loops, best of 3: 245 ms per loop

In [16]: %timeit test_hdf_table_write(df)
1 loops, best of 3: 901 ms per loop

In [27]: %timeit test_hdf_table_write_compress(df)
1 loops, best of 3: 952 ms per loop

In [17]: %timeit test_csv_write(df)
1 loops, best of 3: 3.44 s per loop

Read performance:

In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop

In [28]: %timeit test_hdf_fixed_read_compress()
10 loops, best of 3: 36.3 ms per loop

In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop

In [29]: %timeit test_hdf_table_read_compress()
10 loops, best of 3: 60.6 ms per loop

In [22]: %timeit test_csv_read()
1 loops, best of 3: 620 ms per loop
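
To make the HDFStore suggestion concrete, here is a rough sketch of the one-time conversion and a later filtered read. It assumes the optional PyTables package is installed (pandas needs it for HDF5) and that the CSV header is comma-separated. The function names csv_to_hdf and select_trades and the key "ticks" are hypothetical. Only the "table" format supports where queries, and only on columns listed in data_columns.

```python
import pandas as pd


def csv_to_hdf(csv_path, h5_path, key="ticks"):
    """Append one CSV file to an HDF5 table stored under `key`."""
    df = pd.read_csv(csv_path, skipinitialspace=True)
    with pd.HDFStore(h5_path) as store:
        # data_columns makes Event and Volume usable in `where` queries.
        store.append(key, df, format="table", data_columns=["Event", "Volume"])


def select_trades(h5_path, key="ticks", min_volume=100):
    """Return only the rows with Event == 'T' and Volume >= min_volume."""
    with pd.HDFStore(h5_path, mode="r") as store:
        return store.select(key, where=f"Event == 'T' & Volume >= {min_volume}")
```

Calling csv_to_hdf once per CSV file builds the store. After that, select_trades reads back only the matching rows, so the filter is applied on disk rather than after loading everything into memory.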
