Error appending to an HDFStore with pandas

Time: 2014-09-12 19:47:48

Tags: python pandas pytables hdfstore

I am getting the following error:

    exportStore.append(key, hdfStoreLocal, index = False, data_columns = True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 911, in append
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 1270, in _write_to_group
    s.write(obj=value, append=append, complib=complib, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 3605, in write
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py", line 3293, in create_axes
    raise e
ValueError: invalid itemsize in generic type tuple

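Any ideas as to why this might be happening? This is a fairly large project, so I'm not sure what code I can provide, but it happens on the very first append. Any help is greatly appreciated.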

EDIT:

show_versions output:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: None
Cython: 0.20.2
numpy: 1.8.1
scipy: 0.13.3
statsmodels: None
IPython: 1.2.1
sphinx: 1.2.2
patsy: None
scikits.timeseries: None
dateutil: 1.5
pytz: 2012c
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

info() output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61500 entries, 0 to 61499
Data columns (total 48 columns):
Sequential_Code_1        61500 non-null float64
Age_1                    61500 non-null float64
Sex_1                    61500 non-null object
Race_1                   61500 non-null object
Ethnicity_1              61500 non-null object
Principal_Code_1         61500 non-null object
Admitting_Code_1         61500 non-null object
Principal_Code_2         61500 non-null object
Other_Codes_1            61500 non-null object
Other_Codes_2            61500 non-null object
Other_Codes_3            61500 non-null object
Other_Codes_4            61500 non-null object
Other_Codes_5            61500 non-null object
Other_Codes_6            61500 non-null object
Other_Codes_7            61500 non-null object
Other_Codes_8            61500 non-null object
Other_Codes_9            61500 non-null object
Other_Codes_10           61500 non-null object
Other_Codes_11           61500 non-null object
Other_Codes_12           61500 non-null object
Other_Codes_13           61500 non-null object
Other_Codes_14           61500 non-null object
Other_Codes_15           61500 non-null object
Other_Codes_16           61500 non-null object
Other_Codes_17           61500 non-null object
Other_Codes_18           61500 non-null object
Other_Codes_19           61500 non-null object
Other_Codes_20           61500 non-null object
Other_Codes_21           61500 non-null object
Other_Codes_22           61500 non-null object
Other_Codes_23           61500 non-null object
Other_Codes_24           61500 non-null object
External_Code_1          61500 non-null object
Place_Code_1             61500 non-null object

head() output:

head       Sequential_Number_1  Age_1 Sex_1 Race_1  \
1128                   2.000000e+13     73             F             01   
2185                   2.000000e+13     52             M             01   
2202                   2.000000e+13     64             M             01   
2283                   2.000000e+13     72             F             01   
4471                   2.000000e+13     62             F             01 

1 answer:

Answer 0 (score: 1)

The issue is that you need to specify min_itemsize; see the pandas documentation on string columns in HDFStore.

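This controls the column size for string-like columns. If none of the values in such a column has any length (every string is empty), the append fails; the error message could probably be better. Otherwise, pandas takes the maximum length of the values passed in to determine the size the column needs.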

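The reason for specifying it is that you may be appending multiple chunks. Chunk 2 could contain a longer string, which means the column should be at least that size, but looking at chunk 1 alone does not tell you this.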
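As a rough sketch of that idea, assuming all chunks are available before the first append, the widths can be computed once and passed as a min_itemsize dict. The helper names string_itemsizes and append_chunks are hypothetical, not part of pandas, and the floor of 1 for all-empty columns is my own assumption about how to avoid a zero item size.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def string_itemsizes(chunks):
    """Return {column: longest UTF-8 byte length} over all chunks (hypothetical helper)."""
    sizes = {}
    for chunk in chunks:
        for col in chunk.columns:
            series = chunk[col]
            if not (is_object_dtype(series) or is_string_dtype(series)):
                continue  # numeric columns need no item size
            lengths = series.dropna().map(lambda v: len(str(v).encode("utf-8")))
            longest = int(lengths.max()) if len(lengths) else 0
            # Assumption: never go below 1, so an all-empty column still gets a size.
            sizes[col] = max(sizes.get(col, 1), longest)
    return sizes


def append_chunks(path, key, chunks):
    """Append every chunk with one shared min_itemsize (requires PyTables)."""
    chunks = list(chunks)
    sizes = string_itemsizes(chunks)
    with pd.HDFStore(path) as store:
        for chunk in chunks:
            # data_columns=True makes every column a data column, so the
            # per-column keys in `sizes` refer to columns the table knows about.
            store.append(key, chunk, index=False, data_columns=True,
                         min_itemsize=sizes)
```

A call might look like append_chunks("export.h5", "records", [chunk1, chunk2]), where the file name, key and chunk variables are placeholders. If the chunks cannot all be inspected up front, a generous fixed width per column serves the same purpose.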

Further, pre-process the data so that it contains no zero-length strings; use np.nan as the missing value instead, which HDFStore/pandas handles properly.
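A minimal sketch of that pre-processing step, assuming missing values are stored as exactly the empty string "". The helper name blank_to_nan and the sample values are illustrative; only the column names come from the question.

```python
import numpy as np
import pandas as pd


def blank_to_nan(df):
    """Return a copy of df with zero-length strings replaced by np.nan."""
    return df.replace("", np.nan)


# Illustrative data: one partly empty column and one entirely empty column.
df = pd.DataFrame({
    "Sex_1": ["F", "", "M"],
    "Other_Codes_24": ["", "", ""],
})
cleaned = blank_to_nan(df)
print(cleaned.isna().sum())  # count of missing values per column
```

Columns that contain whitespace-only strings would need an extra strip step before the replacement; that case is not covered here.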