
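I have 2 large CSV datasets, each shaped like a data frame with 4 columns. column1 + column2 form the key, and column4 is the value I want to compare. Two rows can have their val compared when both key1 and key2 match. key1 and key2 can each repeat; key1 has about 1M unique values and key2 has about 4000 unique values.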

col1, col2, col3, col4
key1, key2, ignore, val

This would be easy with data frames, but at 100GB per dataset I cannot load everything into memory, and pandas data frames limit the object size to 48GB.

My current program works like this:

1. Iterate over dataset1 and collect the set of unique column1 values (about 1M records).
2. Load the set into a queue and build a thread pool.
3. Each thread takes 10k records from the queue and iterates over both datasets to build 2 hashmaps, with (key1+key2) as the key and column4 as the value.
4. Compare the two hashmaps and write the results to a single CSV file.
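
To make steps 3 and 4 concrete, here is a minimal sketch of what one thread does for one batch of keys. It is not my actual code. The function names (`build_map`, `compare_maps`, `compare_batch`) are made up for illustration, and it assumes the files have no header row, that "compare" means reporting keys present in both datasets whose values differ, and that the last row wins when a key repeats.

```python
import csv
import io


def build_map(rows, wanted_key1):
    """Map (key1, key2) -> val for rows whose key1 belongs to this batch."""
    result = {}
    for key1, key2, _ignored, val in rows:
        if key1 in wanted_key1:
            # Assumption: if a (key1, key2) pair repeats, the last row wins.
            result[(key1, key2)] = val
    return result


def compare_maps(map1, map2):
    """Yield (key1, key2, val1, val2) for shared keys whose values differ."""
    for key in sorted(map1.keys() & map2.keys()):
        if map1[key] != map2[key]:
            yield (*key, map1[key], map2[key])


def compare_batch(file1, file2, wanted_key1):
    """Scan both files once for one batch of key1 values (steps 3 and 4)."""
    rows1 = csv.reader(file1, skipinitialspace=True)
    rows2 = csv.reader(file2, skipinitialspace=True)
    map1 = build_map(rows1, wanted_key1)
    map2 = build_map(rows2, wanted_key1)
    return list(compare_maps(map1, map2))


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the two 100GB files.
    data1 = "a, x, 0, 1\na, y, 0, 2\nb, x, 0, 3\n"
    data2 = "a, x, 0, 1\na, y, 0, 5\nb, x, 0, 4\n"
    diffs = compare_batch(io.StringIO(data1), io.StringIO(data2), {"a"})
    print(diffs)  # [('a', 'y', '2', '5')]
```

Note that every batch rescans both files from start to finish, so with 1M unique key1 values and 10k keys per batch, each dataset is read about 100 times in total.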

This takes about 5 hours to run. My question is: how can I get the runtime down to about 30 minutes? Redesigning the program is fine too.

Thanks in advance.

How do I compare one column of 2 large CSV datasets that use col1 + col2 as the key?

0 answers: