C error: out of memory. Segmentation fault with Pandas on a machine with 512GB of RAM

Asked: 2016-09-06 22:49:21

Tags: python pandas out-of-memory

I am processing two datasets with Pandas (0.17.1) using the read_csv function. When the datasets are 4.7GB and 3.5GB, I get:

pandas.parser.CParserError: Error tokenizing data. C error: out of memory
/var/spool/gridscheduler/execd/node018/job_scripts/22813: line 10: 99644 Segmentation fault (core dumped)
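
Stripped down, the failing step is just a pair of plain read_csv calls; the paths below are placeholders for the real files:

import pandas as pd  # pandas 0.17.1

# The two inputs are 4.7GB and 3.5GB CSVs; paths are illustrative.
df_a = pd.read_csv('/data/dataset_a.csv')  # crashes here with the C error above
df_b = pd.read_csv('/data/dataset_b.csv')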

When the datasets are 3.3GB and 2.5GB, everything works fine. I suspected the stack size, but checking with ulimit -a shows the following:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) 4294967296
pending signals                 (-i) 2062226
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2062226
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
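
Since the job runs under a grid scheduler (note the node018 job-script path in the output above), the limits inside the job might differ from those in my login shell. A quick sanity check from within the Python process itself, using the standard resource module:

import resource

# Soft/hard limits as seen by this process; -1 (RLIM_INFINITY) means unlimited.
for name, rlim in [('RLIMIT_AS', resource.RLIMIT_AS),
                   ('RLIMIT_DATA', resource.RLIMIT_DATA),
                   ('RLIMIT_STACK', resource.RLIMIT_STACK)]:
    soft, hard = resource.getrlimit(rlim)
    print('%s: soft=%d hard=%d' % (name, soft, hard))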

pd.show_versions() shows:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.4.5.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.23.4
numpy: 1.11.1
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: None

The machine has 512GB of RAM and is running CentOS. Why do I get this error when ingesting only ~8GB of data?
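
For what it's worth, reading in chunks should bound the C parser's working memory (the final concat still needs room for the full frame), but I would like to understand the root cause rather than just work around it. A sketch, with an illustrative path and chunk size:

import pandas as pd

# Parse 1M rows at a time so the C parser's intermediate buffers stay small,
# then assemble the full frame at the end.
chunks = pd.read_csv('/data/dataset_a.csv', chunksize=10**6)  # path is illustrative
df_a = pd.concat(chunks, ignore_index=True)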

0 Answers