pandas: how do I select only the groups with a small within-group standard deviation?

Date: 2016-12-02 07:05:17

Tags: python pandas

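I have a dataframe with roughly 100 rows per group ID. I want to group by the group ID and then keep only the groups in which the standard deviation of certain columns is below a threshold. I am using the following code: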

# df is the dataframe with all rows
# group on groupID
df_grouped = df.groupby('groupID')

# this gives a table with groupID and the std within a group 
df_grouped_std = df_grouped.std() 

# from the df with standard deviations, I select only the groups 
# where the standard deviation is within limits
selection = df_grouped_std[df_grouped_std['col1']<1][df_grouped_std['col2']<0.05]

# now I try to select from the original dataframe 'df_grouped' the groups that were selected in the previous step.
df_plot = df_grouped[selection]

Stacktrace:

   Traceback (most recent call last):

  File "<ipython-input-72-2cd045ecb262>", line 1, in <module>
    runfile('C:/Documents and Settings/a708818/Desktop/coloredByRol.py', wdir='C:/Documents and Settings/a708818/Desktop')

  File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 682, in runfile
    execfile(filename, namespace)

  File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/Documents and Settings/a708818/Desktop/coloredByRol.py", line 50, in <module>
    df_plot = df_grouped[selection]

  File "C:\Anaconda\lib\site-packages\pandas\core\groupby.py", line 3170, in __getitem__
    if key not in self.obj:

  File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 688, in __contains__
    return key in self._info_axis

  File "C:\Anaconda\lib\site-packages\pandas\core\index.py", line 885, in __contains__
    hash(key)

  File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 647, in __hash__
    ' hashed'.format(self.__class__.__name__))

TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed

I cannot figure out how to select the data I want. Any hints?

1 answer:

Answer 0: (score: 1)

I think you can use:

df_grouped = df.groupby('groupID')
# get the std of each group
df_grouped_std = df_grouped.std() 
print (df_grouped_std)
#select by conditions 
selection = df_grouped_std[ (df_grouped_std['col1']<1) & (df_grouped_std['col2']<0.05)]
print (selection)

#select all rows of original df where groupID is same as index of 'selection'
df_plot = df[df.groupID.isin(selection.index)]
print (df_plot)

Sample:

import pandas as pd

df = pd.DataFrame({'groupID':[1,1,1,2,3,3,2],
                   'col1':[5,3,6,4,7,8,9],
                   'col2':[7,8,9,1,2,3,8]})

print (df)
   col1  col2  groupID
0     5     7        1
1     3     8        1
2     6     9        1
3     4     1        2
4     7     2        3
5     8     3        3
6     9     8        2
df_grouped = df.groupby('groupID')
# get the std of each group
df_grouped_std = df_grouped.std() 
print (df_grouped_std)
             col1      col2
groupID                    
1        1.527525  1.000000
2        3.535534  4.949747
3        0.707107  0.707107

#change conditions for testing only 
selection = df_grouped_std[ (df_grouped_std['col1']>1) & (df_grouped_std['col2']>3)]
print (selection)
             col1      col2
groupID                    
2        3.535534  4.949747

# select all rows of original df where groupID is in the index of 'selection'
df_plot = df[df.groupID.isin(selection.index)]
print (df_plot)
   col1  col2  groupID
3     4     1        2
6     9     8        2
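
As a variation on the isin approach above, here is a minimal sketch that skips the intermediate 'selection' table. It swaps in groupby(...).transform('std'), which is not part of the original answer: transform returns each group's std repeated on every row of that group, so the result lines up with the original index and can be used directly as a boolean mask. The name 'stds' is only illustrative, and the thresholds are the testing thresholds from the sample, not the ones from the question.

```python
import pandas as pd

df = pd.DataFrame({'groupID': [1, 1, 1, 2, 3, 3, 2],
                   'col1': [5, 3, 6, 4, 7, 8, 9],
                   'col2': [7, 8, 9, 1, 2, 3, 8]})

# per-group std broadcast back to every row; same index as df
stds = df.groupby('groupID')[['col1', 'col2']].transform('std')

# row-level mask built from the group-level statistics
# (testing thresholds from the sample above)
mask = (stds['col1'] > 1) & (stds['col2'] > 3)

df_plot = df[mask]
print(df_plot)
```

With the sample data this should keep the same two rows of group 2 as the isin version.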

Edit:

Another possible solution is to use filter, which keeps every row of the groups for which the function returns True:

print (df.groupby('groupID')
         .filter(lambda x: (x.col1.std() > 1) & (x.col2.std() > 3)))

   col1  col2  groupID
3     4     1        2
6     9     8        2
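
The filter approach can also be wrapped in a small helper so that the thresholds are passed in rather than hard-coded. This is only a sketch: the function name keep_low_std_groups and the 'limits' dict (column name to maximum allowed std) are hypothetical and not part of pandas, and the limits in the usage line are picked to suit the sample data, not taken from the question.

```python
import pandas as pd


def keep_low_std_groups(df, group_col, limits):
    """Return the rows of df whose group has a std below every limit.

    limits is a hypothetical mapping of column name -> maximum allowed std.
    """
    def within_limits(group):
        # True only if every listed column is below its limit
        return all(group[col].std() < limit for col, limit in limits.items())

    return df.groupby(group_col).filter(within_limits)


df = pd.DataFrame({'groupID': [1, 1, 1, 2, 3, 3, 2],
                   'col1': [5, 3, 6, 4, 7, 8, 9],
                   'col2': [7, 8, 9, 1, 2, 3, 8]})

# limits chosen for the sample data only
result = keep_low_std_groups(df, 'groupID', {'col1': 1, 'col2': 1})
print(result)
```

With these limits only group 3 should remain, since its std is about 0.707 in both columns. Note that a group with a single row has a sample std of NaN, so the comparison is False and that group is dropped.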