I am using pandas (0.17.1) and its read_csv function to process two datasets, but when the datasets are 4.7 GB and 3.5 GB I get:
pandas.parser.CParserError: Error tokenizing data. C error: out of memory
/var/spool/gridscheduler/execd/node018/job_scripts/22813: line 10: 99644 Segmentation fault (core dumped)
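
For reference, a minimal sketch of the failing calls (the file names are illustrative placeholders, not my actual paths):

import pandas as pd

# Sketch of the ingest; dataset_a.csv / dataset_b.csv stand in for the
# real files, which are ~4.7 GB and ~3.5 GB on disk.
df_a = pd.read_csv("dataset_a.csv")  # dies with the C error above
df_b = pd.read_csv("dataset_b.csv")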
With datasets of 3.3 GB and 2.5 GB everything works fine. I suspected the stack size, but checking with ulimit -a shows the following:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) 4294967296
pending signals (-i) 2062226
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 2062226
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
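
Since the grid scheduler could in principle apply its own limits to the job (different from my login shell), I can also re-check the limits from inside the Python process itself; this loop is just my own illustration, using the standard resource module:

import resource

# Print the soft/hard limits the running process actually sees;
# -1 means RLIM_INFINITY (unlimited).
for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_STACK"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print("%s soft=%s hard=%s" % (name, soft, hard))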
pd.show_versions() shows:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.4.5.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.17.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.23.4
numpy: 1.11.1
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.0.3
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: None
The machine has 512 GB of RAM and runs CentOS. Why do I get this error when ingesting ~8 GB?
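
A workaround I am considering is to make the C parser tokenize the file in pieces rather than all at once; a sketch, with an assumed file name and chunk size (note the final concat still builds the full frame in RAM, only the parse itself is chunked):

import pandas as pd

# Parse in 1,000,000-row chunks so the tokenizer's working buffers stay
# bounded; the file name and chunk size are illustrative assumptions.
reader = pd.read_csv("dataset_a.csv", chunksize=1000000)
pieces = [chunk for chunk in reader]
df_a = pd.concat(pieces, ignore_index=True)

But I would still like to understand why the one-shot read fails in the first place.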