Using multiprocessing to speed up pandas loading when chunking a large CSV file

Asked: 2019-04-01 05:51:05

Tags: python pandas csv multiprocessing python-multiprocessing

I am trying to read a CSV file with 7 million rows and 10 columns. My hardware specifications are as follows:

(screenshot of hardware specifications)

My strategy is to load the dataset in chunks:

import pandas as pd
df = pd.read_csv(filename, chunksize = 1000000, low_memory=False)
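As an aside, `read_csv` with `chunksize` returns a `TextFileReader` iterator rather than a DataFrame, so the chunks still have to be consumed in a loop. A minimal self-contained sketch, using a tiny throwaway CSV in place of the real file:

```python
import pandas as pd

# Tiny throwaway CSV standing in for the real 7-million-row file.
pd.DataFrame({"a": range(10), "b": range(10)}).to_csv("tiny.csv", index=False)

# chunksize makes read_csv return a TextFileReader (an iterator of
# DataFrames), not a DataFrame -- the chunks must be consumed in a loop.
chunks = list(pd.read_csv("tiny.csv", chunksize=4))
print([len(c) for c in chunks])  # [4, 4, 2]

# Concatenate the chunks if a single DataFrame is needed at the end.
df = pd.concat(chunks, ignore_index=True)
print(len(df))  # 10
```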

Unfortunately, I get this error message:

bash: fork: Cannot allocate memory

which apparently points to a problem with memory usage. So I decided to use multiprocessing to load the DataFrame:

import pandas as pd, numpy as np
from multiprocessing import Pool

def read_csv(filename):
    return pd.read_csv(filename, chunksize = 1000000, low_memory=False)

if __name__ == '__main__':
    pool = Pool(processes = 6)
    df_list = pool.map(read_csv, 'm_datasets.csv.gz')

But I get this error:

  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
FileNotFoundError: [Errno 2] File b'm' does not exist: b'm'

Running the script again shows the same error, just with a different letter:

FileNotFoundError: File b'a' does not exist

What is going wrong here?
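For what it's worth, `Pool.map` expects an iterable of tasks, and a bare string iterates character by character, which matches the single-letter filenames in the traceback. A minimal sketch (plain Python, no pandas needed):

```python
# Pool.map treats its second argument as an iterable of tasks.
# A bare string iterates character by character, so each worker
# would receive a single letter instead of the whole filename.
tasks_from_string = list("m_datasets.csv.gz")
print(tasks_from_string[:3])  # ['m', '_', 'd']

# Wrapping the filename in a list yields exactly one task: the full path.
tasks_from_list = list(["m_datasets.csv.gz"])
print(tasks_from_list)  # ['m_datasets.csv.gz']
```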


Update: I have worked around the problem above by using apply_async:

if __name__ == '__main__':
    pool = Pool(processes = 6)
    df_list = pool.apply_async(read_csv, 'm_datasets.csv.gz')
    for i in df_list:
        res = i.get()

But that raises a new problem:

Traceback (most recent call last):
  File "eda.py", line 15, in <module>
    for i in df_list:
TypeError: 'ApplyResult' object is not iterable

However, when I use:

if __name__ == '__main__':
    jobs = []
    pool = Pool(5)
    lists = pool.apply_async(file_reader, ['m_datasets.csv.gz'])
    for i in lists.get():
        print(i)

I get:

Traceback (most recent call last):
  File "eda.py", line 12, in <module>
    for i in lists.get():
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<pandas.io.parsers.TextFileReader object at 0x7f2417ebf748>'. Reason: 'AttributeError("Can't pickle local object '_make_date_converter.<locals>.converter'",)'

0 Answers:

No answers yet.