1. I want to read an HDF5 file and rank one of its columns.
import pandas as pd

def test_df_ranks(f):
    df = pd.read_hdf(f, key="t")
    print(df.shape)
    print(type(df))
    print(df)
    s = df.non_current_asset_to_total_asset
    # s.rank()  # rank() works properly
    s.rank(ascending=False)  # rank(ascending=False) crashes
Then I get a SIGSEGV. Here is the list of versions:
numpy==1.11.0
pandas==0.17.1
pymongo==3.2.2
python-dateutil==2.5.3
pytz==2016.4
ricequant-utility==0.1.0
six==1.10.0
tables==3.2.2
os: 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
gcc: 4.8.5
I tried to get a stack trace with gdb, but the stack differs from run to run:
#1:
....
#7 OBJECT_compare (ip1=0x47a3ef4b2e420, ip2=0x7f5c5413f128, __NPY_UNUSED_TAGGEDap=0x7f5cd0100760) at numpy/core/src/multiarray/arraytypes.c.src:2753
#8 0x00007f5d0142c50e in npy_aquicksort (vv=vv@entry=0x7f5c5413f060, tosort=tosort@entry=0x7f5c5413cc80, num=num@entry=52, varr=varr@entry=0x7f5cd0100760) at numpy/core/src/npysort/quicksort.c.src:480
#9 0x00007f5d0139a78a in _new_argsortlike (op=op@entry=0x7f5cd0100760, axis=0, argsort=argsort@entry=0x7f5d0142c310 <npy_aquicksort>, argpart=argpart@entry=0x0, kth=kth@entry=0x0, nkth=nkth@entry=0)
at numpy/core/src/multiarray/item_selection.c:1035
#10 0x00007f5d0139dd7b in PyArray_ArgSort (op=op@entry=0x7f5cd0100760, axis=0, which=<optimized out>) at numpy/core/src/multiarray/item_selection.c:1309
#11 0x00007f5d013dd012 in array_argsort (self=0x7f5cd0100760, args=<optimized out>, kwds=<optimized out>) at numpy/core/src/multiarray/methods.c:1278
#12 0x00007f5cf4eef28f in __Pyx_PyObject_Call (func=0x7f5cd1a1acc8, arg=0x7f5d0f900048, kw=0x0) at pandas/algos.c:201388
#13 0x00007f5cf504e006 in __pyx_pf_6pandas_5algos_8rank_1d_generic (__pyx_v_in_arr=__pyx_v_in_arr@entry=0x7f5cd0100620, __pyx_v_retry=1, __pyx_v_ties_method=0x7f5cf6999768, __pyx_v_ascending=0x7f5d0f6bd700 <_Py_FalseStruct>,
__pyx_v_na_option=<optimized out>, __pyx_v_pct=0x7f5d0f6bd700 <_Py_FalseStruct>, __pyx_self=<optimized out>) at pandas/algos.c:14942
#14 0x00007f5cf5050481 in __pyx_pw_6pandas_5algos_9rank_1d_generic (__pyx_self=<optimized out>, __pyx_args=<optimized out>, __pyx_kwds=0x7f5cd8659488) at pandas/algos.c:14439
#15 0x00007f5d0f3b9477 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
#16 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx () from /lib64/libpython3.4m.so.1.0
#17 0x00007f5d0f3b7a12 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
#18 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx () from /lib64/libpython3.4m.so.1.0
#19 0x00007f5d0f3b7a12 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
#20 0x00007f5d0f3b8e40 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
#21 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx () from /lib64/libpython3.4m.so.1.0
#22 0x00007f5d0f3b7a12 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
#23 0x00007f5d0f3b9f3e in PyEval_EvalCodeEx () from /lib64/libpython3.4m.so.1.0
#24 0x00007f5d0f32a4b3 in function_call () from /lib64/libpython3.4m.so.1.0
#25 0x00007f5d0f301dcc in PyObject_Call () from /lib64/libpython3.4m.so.1.0
#26 0x00007f5d0f3b57c9 in PyEval_EvalFrameEx () from /lib64/libpython3.4m.so.1.0
...
Now the stack is:
#0 0x00007ffff6c985f7 in raise () from /lib64/libc.so.6
#1 0x00007ffff6c99ce8 in abort () from /lib64/libc.so.6
#2 0x00007ffff6cd8317 in __libc_message () from /lib64/libc.so.6
#3 0x00007ffff6ce0023 in _int_free () from /lib64/libc.so.6
#4 0x00007fffd15785a9 in H5FL_reg_gc_list () from /lib64/libhdf5.so.8
#5 0x00007fffd1578626 in H5FL_reg_gc () from /lib64/libhdf5.so.8
#6 0x00007fffd157b0be in H5FL_garbage_coll () from /lib64/libhdf5.so.8
#7 0x00007fffd157b34e in H5FL_term_interface () from /lib64/libhdf5.so.8
#8 0x00007fffd14ae466 in H5_term_library () from /lib64/libhdf5.so.8
#9 0x00007ffff6c9be69 in __run_exit_handlers () from /lib64/libc.so.6
#10 0x00007ffff6c9beb5 in exit () from /lib64/libc.so.6
#11 0x00007ffff6c84b1c in __libc_start_main () from /lib64/libc.so.6
#12 0x0000000000400b89 in _start ()
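2. I saved the data to CSV and loaded it back with pd.read_csv() to get a pd.Series. Both series.rank(ascending=True) and series.rank(ascending=False) work fine.
3. Could the problem be in tables (PyTables), or in HDF5? My HDF5 data: https://github.com/HaoXJ/codefail/blob/master/data/test.h5
4. I need your help.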
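One possible reason the CSV round trip behaves differently: the gdb frames above mention OBJECT_compare and rank_1d_generic, which suggests the column is held as object dtype after read_hdf, while read_csv re-infers an all-empty column as float64. The sketch below illustrates this on hypothetical stand-in data (the all-NaN object Series is an assumption about test.h5, not something read from it):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the column in test.h5:
# every value is NaN and the dtype is object (an assumption).
s = pd.Series(
    [np.nan] * 3,
    index=["000022.XSHE", "000089.XSHE", "000099.XSHE"],
    name="non_current_asset_to_total_asset",
    dtype=object,
)

# Round-trip through an in-memory CSV, as in step 2.
buf = io.StringIO()
s.to_frame().to_csv(buf)
buf.seek(0)
from_csv = pd.read_csv(buf, index_col=0)["non_current_asset_to_total_asset"]

print(s.dtype)         # dtype before the round trip
print(from_csv.dtype)  # dtype after read_csv re-infers it
```

If the two dtypes differ on the real data as well, the two rank() calls are going through different code paths, which would explain why only the HDF5 version crashes.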
Answer 0 (score: 0)
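First of all, your non_current_asset_to_total_asset column does not contain a single non-NaN value. Even so, this looks like a numpy or pandas bug. You may want to check whether it has already been reported as an issue here, or open a new issue.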
In [1]: fn = r'D:\download\test.h5'
In [2]: df = pd.read_hdf(fn, key='t')
List the rows where non_current_asset_to_total_asset is not NaN:
In [3]: df[pd.notnull(df.non_current_asset_to_total_asset)]
Out[3]:
Empty DataFrame
Columns: [pb_ratio, pe_ratio_1, inc_operating_revenue, inc_total_asset, non_current_asset_to_total_asset]
Index: []
Note: there are no rows where non_current_asset_to_total_asset is not NaN.
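The same check can be run per column, together with the dtype, which is worth looking at because the crash happens in an object-comparison routine. This is a minimal sketch on hypothetical stand-in data; describe_column is an illustrative helper, not a pandas function:

```python
import numpy as np
import pandas as pd

def describe_column(s):
    """Return (dtype, number of non-NaN values) for a Series."""
    return s.dtype, int(s.notnull().sum())

# Hypothetical stand-in for the frame read from test.h5.
df = pd.DataFrame(
    {
        "pb_ratio": [7.4091, 1.7244, 1.7782],
        "non_current_asset_to_total_asset": [np.nan] * 3,
    },
    index=["000022.XSHE", "000089.XSHE", "000099.XSHE"],
)

for name in df.columns:
    dtype, n_valid = describe_column(df[name])
    print(name, dtype, n_valid)
```

On the real file, replace the stand-in frame with the result of pd.read_hdf(fn, key='t').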
In [4]: df.head()
Out[4]:
pb_ratio pe_ratio_1 inc_operating_revenue inc_total_asset non_current_asset_to_total_asset
000022.XSHE 7.4091 14.9739 30.5996 13.1342 NaN
000089.XSHE 1.7244 14.3574 7.5837 2.8343 NaN
000099.XSHE 1.7782 23.6805 8.7495 -0.0933 NaN
000429.XSHE 1.7264 17.5882 15.1496 -0.9485 NaN
000507.XSHE 1.1563 46.9562 26.9032 4.4909 NaN
rank(ascending=True) works:
In [10]: df.non_current_asset_to_total_asset.rank(ascending=True).head()
Out[10]:
000022.XSHE NaN
000089.XSHE NaN
000099.XSHE NaN
000429.XSHE NaN
000507.XSHE NaN
Name: non_current_asset_to_total_asset, dtype: float64
But rank(ascending=False) crashes my IPython:
In [5]: df.non_current_asset_to_total_asset.rank(ascending=False)
Crash information:
<EXE NAME="multiarray.cp35-win_amd64.pyd" FILTER="CMI_FILTER_THISFILEONLY">
<MATCHING_FILE NAME="multiarray.cp35-win_amd64.pyd" SIZE="1510912" CHECKSUM="0xD8B922AB" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="05/02/2016 21:19:46" UPTO_LINK_DATE="05/02/2016 21:19:46" EXPORT_NAME="multiarray.cp35-win_amd64.pyd" EXE_WRAPPER="0x0" />
</EXE>
<EXE NAME="kernel32.dll" FILTER="CMI_FILTER_THISFILEONLY">
<MATCHING_FILE NAME="kernel32.dll" SIZE="1163264" CHECKSUM="0xADFC88B8" BIN_FILE_VERSION="6.1.7601.23418" BIN_PRODUCT_VERSION="6.1.7601.23418" PRODUCT_VERSION="6.1.7601.18015" FILE_DESCRIPTION="Windows NT BASE API Client DLL" COMPANY_NAME="Microsoft Corporation" PRODUCT_NAME="Microsoft® Windows® Operating System" FILE_VERSION="6.1.7601.18015 (win7sp1_gdr.121129-1432)" ORIGINAL_FILENAME="kernel32" INTERNAL_NAME="kernel32" LEGAL_COPYRIGHT="© Microsoft Corporation. All rights reserved." VERDATEHI="0x0" VERDATELO="0x0" VERFILEOS="0x40004" VERFILETYPE="0x2" MODULE_TYPE="WIN32" PE_CHECKSUM="0x122E58" LINKER_VERSION="0x60001" UPTO_BIN_FILE_VERSION="6.1.7601.23418" UPTO_BIN_PRODUCT_VERSION="6.1.7601.23418" LINK_DATE="04/09/2016 07:00:43" UPTO_LINK_DATE="04/09/2016 07:00:43" EXPORT_NAME="KERNEL32.dll" VER_LANGUAGE="English (United States) [0x409]" EXE_WRAPPER="0x0" />
</EXE>
</DATABASE>
My versions:
In [5]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US
pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 21.2.1
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.4
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.5
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.8.7
lxml: 3.6.0
bs4: 4.4.1
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: 0.2.1
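Until the underlying bug is tracked down, a possible workaround is to coerce the column to float64 before ranking, so that the ranking does not go through the object-dtype path seen in the gdb trace. This is only a sketch: rank_as_float is a hypothetical helper, the stand-in data assumes the column is an all-NaN object Series, and it has not been verified against the pandas and numpy versions listed above.

```python
import numpy as np
import pandas as pd

def rank_as_float(s, ascending=True):
    """Coerce a Series to float64, then rank it.

    Entries that cannot be parsed as numbers become NaN.
    """
    numeric = pd.to_numeric(s, errors="coerce").astype("float64")
    return numeric.rank(ascending=ascending)

# Hypothetical stand-in for the problematic column.
s = pd.Series([np.nan] * 3, dtype=object,
              name="non_current_asset_to_total_asset")

ranks = rank_as_float(s, ascending=False)
print(ranks)
```

With the real data this would be called as rank_as_float(df.non_current_asset_to_total_asset, ascending=False).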