I am trying to read a CSV file with 7 million rows and 10 columns. My hardware specs are as follows:
My strategy is to load the dataset in chunks:
import pandas as pd
df = pd.read_csv(filename, chunksize=1000000, low_memory=False)
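For context, my understanding is that passing `chunksize` makes `read_csv` return a `TextFileReader` that yields one DataFrame per chunk when iterated, rather than loading everything at once. A minimal sketch of the pattern I intended, using a tiny in-memory CSV purely for illustration:

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for the real 7-million-row file
csv_data = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(5)))

# With chunksize set, read_csv returns an iterator of DataFrames
reader = pd.read_csv(csv_data, chunksize=2)
sizes = [len(chunk) for chunk in reader]
print(sizes)  # 5 rows in chunks of 2 -> [2, 2, 1]
```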
Unfortunately, I get this error message:
bash: fork: Cannot allocate memory
which apparently tells me there is a problem with memory usage. So I decided to use multiprocessing to load the DataFrame:
import pandas as pd, numpy as np
from multiprocessing import Pool

def read_csv(filename):
    return pd.read_csv(filename, chunksize=1000000, low_memory=False)

if __name__ == '__main__':
    pool = Pool(processes=6)
    df_list = pool.map(read_csv, 'm_datasets.csv.gz')
But I get an error:
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
FileNotFoundError: [Errno 2] File b'm' does not exist: b'm'
Running the script shows the same error, just with a different letter:
FileNotFoundError: File b'a' does not exist
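If I had to guess, `pool.map` iterates over its second argument, and iterating a string yields one character at a time, so `read_csv` would be called once per character of the filename rather than once with the whole name:

```python
filename = 'm_datasets.csv.gz'

# map-style iteration over a string yields single characters, which
# would explain read_csv being handed 'm', '_', 'd', ... one at a time
print(list(filename)[:3])  # ['m', '_', 'd']
```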
What seems to be the problem here?
I have since gotten past the problem above by using apply_async:
if __name__ == '__main__':
    pool = Pool(processes=6)
    df_list = pool.apply_async(read_csv, 'm_datasets.csv.gz')
    for i in df_list:
        res = i.get()
But now I have a new problem:
Traceback (most recent call last):
File "eda.py", line 15, in <module>
for i in df_list:
TypeError: 'ApplyResult' object is not iterable
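Presumably `apply_async` submits a single call and hands back one pending `ApplyResult`, so it has to be read with `.get()` rather than iterated. A sketch with a trivial stand-in function (`double` is hypothetical, just for illustration):

```python
from multiprocessing import Pool

def double(x):  # hypothetical stand-in for read_csv
    return x * 2

if __name__ == '__main__':
    with Pool(2) as pool:
        res = pool.apply_async(double, (21,))  # one pending result, not a list
        print(res.get())  # 42
```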
But when I use:
if __name__ == '__main__':
    jobs = []
    pool = Pool(5)
    lists = pool.apply_async(file_reader, ['m_datasets.csv.gz'])
    for i in lists.get():
        print(i)
I get:
Traceback (most recent call last):
File "eda.py", line 12, in <module>
for i in lists.get():
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<pandas.io.parsers.TextFileReader object at 0x7f2417ebf748>'. Reason: 'AttributeError("Can't pickle local object '_make_date_converter.<locals>.converter'",)'
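My suspicion is that the worker itself succeeds, but the `TextFileReader` it returns cannot be pickled to send back to the parent process. A minimal check of that suspicion (again with an in-memory CSV just for illustration):

```python
import io
import pickle

import pandas as pd

# read_csv with chunksize returns a TextFileReader, not a DataFrame
reader = pd.read_csv(io.StringIO("a,b\n1,2\n3,4\n"), chunksize=1)

try:
    pickle.dumps(reader)  # multiprocessing pickles worker return values
    picklable = True
except Exception:
    picklable = False

print(picklable)  # the reader holds open parser state, so this fails
```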