How can I find patterns of non-NaN occurrences in a multi-index data frame?

Asked: 2018-05-25 19:55:19

Tags: pandas multi-index

I am working with a multi-index data frame that looks like this:

[Image: multi-indexed data frame]

(Sorry for writing null instead of NaN.)

What is the most efficient way to find the highlighted patterns?

I would like to get a result like this:

[Image: pattern occurrences I am looking for]

Thanks in advance for any insights!

For anyone who wants to play with it:

from io import StringIO
import pandas as pd


df1_text = """       A  B C
STAND1 CH1 NaN NaN NaN
STAND1 CH2 NaN 11.2 NaN
STAND1 CH3 12.4 7.0 NaN
STAND1 CH4 10.2 2.0 NaN
STAND2 CH1 NaN 2.5 NaN
STAND2 CH2 NaN 11.2 NaN
STAND2 CH3 NaN NaN 6.3
STAND2 CH4 NaN NaN 23.5
STAND3 CH1 NaN NaN NaN
STAND3 CH2 12.3 NaN NaN
STAND3 CH3 5.3 4.5 NaN
STAND3 CH4 7.2 25.6 NaN"""

# sep=r"\s+" splits on whitespace; delim_whitespace is deprecated in recent pandas
df1 = pd.read_csv(StringIO(df1_text), sep=r"\s+")

1 Answer:

Answer 0 (score: 1)

Here is one way. In short, you can use

import numpy as np
df2 = df1.swaplevel(0, 1).unstack().notnull()
print(pd.Series(np.dot(df2.index, df2)).value_counts())

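The first line creates df2, which has the channels as rows and 9 columns (one per letter/stand combination) of boolean indicators marking the non-null cells, e.g.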

         # A                    B                    C
    # STAND1 STAND2 STAND3 STAND1 STAND2 STAND3 STAND1 STAND2 STAND3
# CH1  False  False  False  False   True  False  False  False  False
# CH2  False  False   True   True   True  False  False  False  False
# CH3   True  False   True   True  False   True  False   True  False
# CH4   True  False   True   True  False   True  False   True  False
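To see what each of the three chained calls contributes, here is a minimal sketch on a smaller, made-up frame (two stands, two channels, columns A and B). The names `small`, `swapped`, `wide` and `mask` are only for illustration and are not part of the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data: 2 stands x 2 channels.
index = pd.MultiIndex.from_product([["STAND1", "STAND2"], ["CH1", "CH2"]])
small = pd.DataFrame(
    {"A": [np.nan, 2.0, np.nan, 4.0], "B": [1.0, 3.0, np.nan, np.nan]},
    index=index,
)

swapped = small.swaplevel(0, 1)  # index order becomes (channel, stand)
wide = swapped.unstack()         # innermost level (stand) moves into the columns
mask = wide.notnull()            # True wherever a value is present

print(mask)
```

After these steps the rows are the channels and every (letter, stand) pair has its own boolean column, which is the layout the second line of the solution relies on.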

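The goal of the second step is to replace each column of df2 with a string that represents its occurrence pattern. Using the fact that Python strings can be multiplied by integers, we get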

np.dot([CH1, CH2, CH3, CH4], [True, True, False, False])      <==>
'CH1' * True + 'CH2' * True + 'CH3' * False + 'CH4' * False   <==>
'CH1' * 1 + 'CH2' * 1 + 'CH3' * 0 + 'CH4' * 0                 <==>
'CH1' + 'CH2'                                                 <==>
'CH1CH2'
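The chain of equivalences above can be checked directly. This is a small sketch that assumes the labels are held in an object-dtype array, so that NumPy falls back to Python's own `*` and `+` on `str`; `labels`, `present` and `pattern` are illustrative names.

```python
import numpy as np

# dtype=object keeps the labels as Python str objects,
# so the dot product uses str * bool and str + str.
labels = np.array(["CH1", "CH2", "CH3", "CH4"], dtype=object)
present = np.array([True, True, False, False])

# Multiplying a string by True keeps it, by False gives "".
print("CH1" * True, repr("CH3" * False))

# The dot product sums (here: concatenates) the element-wise products.
pattern = np.dot(labels, present)
print(pattern)
```

With a fixed-width string dtype instead of `object` the same call would not work, which is why the sketch sets the dtype explicitly.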

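This has the cosmetic flaw of omitting commas and including an empty "occurrence" (the empty string, for columns that have no non-null values).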
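If the missing commas and the empty pattern matter, one possible variation (not the answer's method, just a sketch) is to join the labels per column with `str.join` and filter out the empty string before counting. The frame `mask` below is a made-up stand-in for a boolean frame shaped like df2, and its column names `s1` to `s4` are hypothetical.

```python
import pandas as pd

# Hypothetical boolean mask: channels as rows, one column per series.
mask = pd.DataFrame(
    {
        "s1": [False, False, True, True],
        "s2": [False, False, False, False],
        "s3": [True, True, False, False],
        "s4": [False, False, True, True],
    },
    index=["CH1", "CH2", "CH3", "CH4"],
)

# For each column, join the labels of the rows that are True with commas.
patterns = mask.apply(lambda col: ",".join(col.index[col.to_numpy()]))

# Drop columns with no values at all, then count each pattern.
counts = patterns[patterns != ""].value_counts()
print(counts)
```

This trades the compactness of the `np.dot` trick for a Python-level loop over columns, which is usually fine for a modest number of columns.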

Complete example:

from io import StringIO
import numpy as np
import pandas as pd


df1_text = """       A  B C
STAND1 CH1 NaN NaN NaN
STAND1 CH2 NaN 11.2 NaN
STAND1 CH3 12.4 7.0 NaN
STAND1 CH4 10.2 2.0 NaN
STAND2 CH1 NaN 2.5 NaN
STAND2 CH2 NaN 11.2 NaN
STAND2 CH3 NaN NaN 6.3
STAND2 CH4 NaN NaN 23.5
STAND3 CH1 NaN NaN NaN
STAND3 CH2 12.3 NaN NaN
STAND3 CH3 5.3 4.5 NaN
STAND3 CH4 7.2 25.6 NaN"""

# sep=r"\s+" splits on whitespace; delim_whitespace is deprecated in recent pandas
df1 = pd.read_csv(StringIO(df1_text), sep=r"\s+")

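# solution: channels as rows, one boolean column per (letter, stand) pair
df2 = df1.swaplevel(0, 1).unstack().notnull()
# concatenate the channel labels of the non-null rows in each column, then count
print(pd.Series(np.dot(df2.index, df2)).value_counts())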

# In [559]: df.swaplevel(0,1).unstack().notnull()
# Out[559]:
         # A                    B                    C
    # STAND1 STAND2 STAND3 STAND1 STAND2 STAND3 STAND1 STAND2 STAND3
# CH1  False  False  False  False   True  False  False  False  False
# CH2  False  False   True   True   True  False  False  False  False
# CH3   True  False   True   True  False   True  False   True  False
# CH4   True  False   True   True  False   True  False   True  False

# In [560]: np.dot(df2.index, df2)
# Out[560]: 
# array(['CH3CH4', '', 'CH2CH3CH4', 'CH2CH3CH4', 'CH1CH2', 'CH3CH4', '',
       # 'CH3CH4', ''], dtype=object)

# In [561]: pd.Series(np.dot(df2.index, df2)).value_counts()
# Out[561]: 
# CH3CH4       3
             # 3
# CH2CH3CH4    2
# CH1CH2       1
# dtype: int64