SIGSEGV error in pandas Series.rank(ascending=False)

Time: 2016-06-14 06:50:20

Tags: python-3.x numpy pandas hdf5

1. I want to read an HDF5 file and rank one of its columns.

import pandas as pd


def test_df_ranks(f):
    # Load the DataFrame stored under key "t" in the HDF5 file
    df = pd.read_hdf(f, key="t")
    print(df.shape)
    print(type(df))
    print(df)
    s = df.non_current_asset_to_total_asset
    # s.rank()  # rank() works properly
    s.rank(ascending=False)  # rank(ascending=False) crashes

Then I get a SIGSEGV error. Here is the version list:

numpy==1.11.0
pandas==0.17.1
pymongo==3.2.2
python-dateutil==2.5.3
pytz==2016.4
ricequant-utility==0.1.0
six==1.10.0
tables==3.2.2
os: 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
gcc: 4.8.5 

I tried to get a stack trace with gdb, but it shows different stacks from run to run:

#1:
 ....
#7  OBJECT_compare (ip1=0x47a3ef4b2e420, ip2=0x7f5c5413f128, __NPY_UNUSED_TAGGEDap=0x7f5cd0100760) at numpy/core/src/multiarray/arraytypes.c.src:2753
#8  0x00007f5d0142c50e in npy_aquicksort (vv=vv@entry=0x7f5c5413f060, tosort=tosort@entry=0x7f5c5413cc80, num=num@entry=52, varr=varr@entry=0x7f5cd0100760) at numpy/core/src/npysort/quicksort.c.src:480
#9  0x00007f5d0139a78a in _new_argsortlike (op=op@entry=0x7f5cd0100760, axis=0, argsort=argsort@entry=0x7f5d0142c310 <npy_aquicksort>, argpart=argpart@entry=0x0, kth=kth@entry=0x0, nkth=nkth@entry=0)
at numpy/core/src/multiarray/item_selection.c:1035
#10 0x00007f5d0139dd7b in PyArray_ArgSort (op=op@entry=0x7f5cd0100760, axis=0, which=<optimized out>) at numpy/core/src/multiarray/item_selection.c:1309
#11 0x00007f5d013dd012 in array_argsort (self=0x7f5cd0100760, args=<optimized out>, kwds=<optimized out>) at numpy/core/src/multiarray/methods.c:1278
#12 0x00007f5cf4eef28f in __Pyx_PyObject_Call (func=0x7f5cd1a1acc8, arg=0x7f5d0f900048, kw=0x0) at pandas/algos.c:201388
#13 0x00007f5cf504e006 in __pyx_pf_6pandas_5algos_8rank_1d_generic (__pyx_v_in_arr=__pyx_v_in_arr@entry=0x7f5cd0100620, __pyx_v_retry=1, __pyx_v_ties_method=0x7f5cf6999768, __pyx_v_ascending=0x7f5d0f6bd700 <_Py_FalseStruct>, 
__pyx_v_na_option=<optimized out>, __pyx_v_pct=0x7f5d0f6bd700 <_Py_FalseStruct>, __pyx_self=<optimized out>) at pandas/algos.c:14942
#14 0x00007f5cf5050481 in __pyx_pw_6pandas_5algos_9rank_1d_generic (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=0x7f5cd8659488) at pandas/algos.c:14439
#15 0x00007f5d0f3b9477 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
#16 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx () from /lib64/libpython3.4m.so.1.0
#17 0x00007f5d0f3b7a12 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
#18 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx () from /lib64/libpython3.4m.so.1.0
#19 0x00007f5d0f3b7a12 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
#20 0x00007f5d0f3b8e40 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
#21 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx () from /lib64/libpython3.4m.so.1.0
#22 0x00007f5d0f3b7a12 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
#23 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx () from /lib64/libpython3.4m.so.1.0
#24 0x00007f5d0f32a4b3 in function_call () from /lib64/libpython3.4m.so.1.0
#25 0x00007f5d0f301dcc in PyObject_Call () from /lib64/libpython3.4m.so.1.0
#26 0x00007f5d0f3b57c9 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0

...

Now the stack is:

#0  0x00007ffff6c985f7 in raise () from /lib64/libc.so.6
#1  0x00007ffff6c99ce8 in abort () from /lib64/libc.so.6
#2  0x00007ffff6cd8317 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff6ce0023 in _int_free () from /lib64/libc.so.6
#4  0x00007fffd15785a9 in H5FL_reg_gc_list () from /lib64/libhdf5.so.8
#5  0x00007fffd1578626 in H5FL_reg_gc () from /lib64/libhdf5.so.8
#6  0x00007fffd157b0be in H5FL_garbage_coll () from /lib64/libhdf5.so.8
#7  0x00007fffd157b34e in H5FL_term_interface () from /lib64/libhdf5.so.8
#8  0x00007fffd14ae466 in H5_term_library () from /lib64/libhdf5.so.8
#9  0x00007ffff6c9be69 in __run_exit_handlers () from /lib64/libc.so.6
#10 0x00007ffff6c9beb5 in exit () from /lib64/libc.so.6
#11 0x00007ffff6c84b1c in __libc_start_main () from /lib64/libc.so.6
#12 0x0000000000400b89 in _start ()

2. I saved the data to CSV and then loaded it as a pd.Series via pd.read_csv(). Both series.rank(ascending=True) and series.rank(ascending=False) work fine.
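
One possible reason the CSV round trip behaves differently (an assumption, not verified against test.h5): pd.read_csv() infers float64 for a column that is entirely empty, so ranking never has to compare Python objects. A minimal sketch with made-up CSV text, where the "code" column name is hypothetical:

```python
import io

import pandas as pd

# Hypothetical CSV text: the last column is empty on every row.
csv_text = (
    "code,pb_ratio,non_current_asset_to_total_asset\n"
    "000022.XSHE,7.4091,\n"
    "000089.XSHE,1.7244,\n"
)

df_csv = pd.read_csv(io.StringIO(csv_text), index_col=0)
col = df_csv["non_current_asset_to_total_asset"]

# An all-empty column is parsed as float64 NaN values.
print(col.dtype)

# Ranking a float64 column of NaN returns NaN for every row.
ranks = col.rank(ascending=False)
print(ranks)
```

Comparing this dtype with the dtype of the same column loaded through pd.read_hdf() would show whether the two code paths actually differ.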

3. Could the problem be in tables (PyTables)? Or in hdf5? My hdf5 data: https://github.com/HaoXJ/codefail/blob/master/data/test.h5

4. I need your help.

1 answer:

Answer 0: (score: 0)

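First of all, your non_current_asset_to_total_asset column does not contain any non-NaN values. Even so, this looks like a numpy or pandas bug. You might want to check whether it has already been reported as a bug here, or open a new issue...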

In [1]: fn = r'D:\download\test.h5'

In [2]: df = pd.read_hdf(fn, key='t')

List the rows where non_current_asset_to_total_asset is not NaN:

In [3]: df[pd.notnull(df.non_current_asset_to_total_asset)]
Out[3]:
Empty DataFrame
Columns: [pb_ratio, pe_ratio_1, inc_operating_revenue, inc_total_asset, non_current_asset_to_total_asset]
Index: []

Note: there are no rows where non_current_asset_to_total_asset is not NaN.

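
The same check can be written as a short diagnostic that also reports the column's dtype. This is only a sketch on a hypothetical stand-in frame. The object dtype is an assumption, suggested by the OBJECT_compare and rank_1d_generic frames in the gdb trace but not confirmed for test.h5:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame in test.h5; dtype=object is an assumption.
df_demo = pd.DataFrame(
    {
        "pb_ratio": [7.4091, 1.7244, 1.7782],
        "non_current_asset_to_total_asset": [np.nan, np.nan, np.nan],
    },
    index=["000022.XSHE", "000089.XSHE", "000099.XSHE"],
    dtype=object,
)

col = df_demo["non_current_asset_to_total_asset"]

# dtype of the column as stored in the frame
print(col.dtype)

# Number of entries that are not NaN
n_valid = int(col.notna().sum())
print(n_valid)

# True when the column has nothing to rank
all_missing = n_valid == 0
print(all_missing)
```
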
In [4]: df.head()
Out[4]:
            pb_ratio pe_ratio_1 inc_operating_revenue inc_total_asset non_current_asset_to_total_asset
000022.XSHE   7.4091    14.9739               30.5996         13.1342                              NaN
000089.XSHE   1.7244    14.3574                7.5837          2.8343                              NaN
000099.XSHE   1.7782    23.6805                8.7495         -0.0933                              NaN
000429.XSHE   1.7264    17.5882               15.1496         -0.9485                              NaN
000507.XSHE   1.1563    46.9562               26.9032          4.4909                              NaN

rank(ascending=True) works:

In [10]: df.non_current_asset_to_total_asset.rank(ascending=True).head()
Out[10]:
000022.XSHE   NaN
000089.XSHE   NaN
000099.XSHE   NaN
000429.XSHE   NaN
000507.XSHE   NaN
Name: non_current_asset_to_total_asset, dtype: float64

rank(ascending=False) crashes my IPython:

In [5]: df.non_current_asset_to_total_asset.rank(ascending=False)

Crash info:

<EXE NAME="multiarray.cp35-win_amd64.pyd" FILTER="CMI_FILTER_THISFILEONLY">
    <MATCHING_FILE NAME="multiarray.cp35-win_amd64.pyd" SIZE="1510912" CHECKSUM="0xD8B922AB" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="05/02/2016 21:19:46" UPTO_LINK_DATE="05/02/2016 21:19:46" EXPORT_NAME="multiarray.cp35-win_amd64.pyd" EXE_WRAPPER="0x0" />
</EXE>
<EXE NAME="kernel32.dll" FILTER="CMI_FILTER_THISFILEONLY">
    <MATCHING_FILE NAME="kernel32.dll" SIZE="1163264" CHECKSUM="0xADFC88B8" BIN_FILE_VERSION="6.1.7601.23418" BIN_PRODUCT_VERSION="6.1.7601.23418" PRODUCT_VERSION="6.1.7601.18015" FILE_DESCRIPTION="Windows NT BASE API Client DLL" COMPANY_NAME="Microsoft Corporation" PRODUCT_NAME="Microsoft® Windows® Operating System" FILE_VERSION="6.1.7601.18015 (win7sp1_gdr.121129-1432)" ORIGINAL_FILENAME="kernel32" INTERNAL_NAME="kernel32" LEGAL_COPYRIGHT="© Microsoft Corporation. All rights reserved." VERDATEHI="0x0" VERDATELO="0x0" VERFILEOS="0x40004" VERFILETYPE="0x2" MODULE_TYPE="WIN32" PE_CHECKSUM="0x122E58" LINKER_VERSION="0x60001" UPTO_BIN_FILE_VERSION="6.1.7601.23418" UPTO_BIN_PRODUCT_VERSION="6.1.7601.23418" LINK_DATE="04/09/2016 07:00:43" UPTO_LINK_DATE="04/09/2016 07:00:43" EXPORT_NAME="KERNEL32.dll" VER_LANGUAGE="English (United States) [0x409]" EXE_WRAPPER="0x0" />
</EXE>
</DATABASE>
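
A possible workaround while the bug is open, assuming the column really is stored as object dtype (an assumption, not something I verified): coerce it to float64 with pd.to_numeric() before ranking, so the numeric ranking path is used. The series below is a hypothetical stand-in for the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the column: all NaN, stored as object dtype.
s = pd.Series(
    [np.nan] * 5,
    index=["000022.XSHE", "000089.XSHE", "000099.XSHE", "000429.XSHE", "000507.XSHE"],
    dtype=object,
    name="non_current_asset_to_total_asset",
)

# Coerce to a numeric dtype; values that cannot be parsed become NaN.
s_float = pd.to_numeric(s, errors="coerce")
print(s_float.dtype)

# Rank the float64 series instead of the object series.
ranks = s_float.rank(ascending=False)
print(ranks)
```

I have not tested this against test.h5 on the versions above, so treat it as a sketch rather than a confirmed fix.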

My versions:

In [5]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 21.2.1
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.4
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.5
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.8.7
lxml: 3.6.0
bs4: 4.4.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1