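I have a dataframe with roughly 100 rows per group ID. I want to group by group ID and then keep only the groups where the standard deviation of certain columns is below a threshold. I am using the following code: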
# df is the dataframe with all rows
# group on groupID
df_grouped = df.groupby('groupID')
# this gives a table with groupID and the std within a group
df_grouped_std = df_grouped.std()
# from the df with standard deviations, I select only the groups
# where the standard deviation is within limits
selection = df_grouped_std[df_grouped_std['col1']<1][df_grouped_std['col2']<0.05]
# now I try to select from the original dataframe 'df_grouped' the groups that were selected in the previous step.
df_plot = df_grouped[selection]
Stacktrace:
Traceback (most recent call last):
File "<ipython-input-72-2cd045ecb262>", line 1, in <module>
runfile('C:/Documents and Settings/a708818/Desktop/coloredByRol.py', wdir='C:/Documents and Settings/a708818/Desktop')
File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 682, in runfile
execfile(filename, namespace)
File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Documents and Settings/a708818/Desktop/coloredByRol.py", line 50, in <module>
df_plot = df_grouped[selection]
File "C:\Anaconda\lib\site-packages\pandas\core\groupby.py", line 3170, in __getitem__
if key not in self.obj:
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 688, in __contains__
return key in self._info_axis
File "C:\Anaconda\lib\site-packages\pandas\core\index.py", line 885, in __contains__
hash(key)
File "C:\Anaconda\lib\site-packages\pandas\core\generic.py", line 647, in __hash__
' hashed'.format(self.__class__.__name__))
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
I can't figure out how to select the data I want. Any hints?
Answer 0 (score: 1)
I think you can use:
df_grouped = df.groupby('groupID')
#get std per groups
df_grouped_std = df_grouped.std()
print (df_grouped_std)
#select by conditions
selection = df_grouped_std[ (df_grouped_std['col1']<1) & (df_grouped_std['col2']<0.05)]
print (selection)
#select all rows of original df where groupID is same as index of 'selection'
df_plot = df[df.groupID.isin(selection.index)]
print (df_plot)
Sample:
import pandas as pd

df = pd.DataFrame({'groupID':[1,1,1,2,3,3,2],
                   'col1':[5,3,6,4,7,8,9],
                   'col2':[7,8,9,1,2,3,8]})
print (df)
   col1  col2  groupID
0     5     7        1
1     3     8        1
2     6     9        1
3     4     1        2
4     7     2        3
5     8     3        3
6     9     8        2
df_grouped = df.groupby('groupID')
#get std per group
df_grouped_std = df_grouped.std()
print (df_grouped_std)
             col1      col2
groupID
1        1.527525  1.000000
2        3.535534  4.949747
3        0.707107  0.707107
#change conditions for testing only
selection = df_grouped_std[ (df_grouped_std['col1']>1) & (df_grouped_std['col2']>3)]
print (selection)
             col1      col2
groupID
2        3.535534  4.949747
#select all rows of original df whose groupID is in the index of 'selection'
df_plot = df[df.groupID.isin(selection.index)]
print (df_plot)
   col1  col2  groupID
3     4     1        2
6     9     8        2
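As for why the original attempt fails: indexing a GroupBy object with [...] selects columns, so pandas treats the DataFrame 'selection' as a column key and tries to hash it, which raises the TypeError in the traceback. The steps above can also be wrapped in a small helper. This is only a sketch: the function name keep_low_std_groups, the limits dictionary, and the toy data are made up for illustration and are not from the question.

```python
import pandas as pd

def keep_low_std_groups(df, group_col, limits):
    """Return the rows of df whose group has a std below every limit.

    limits maps column name -> upper bound for that column's per-group std
    (hypothetical parameter, chosen for this sketch).
    """
    stds = df.groupby(group_col)[list(limits)].std()
    # Start with every group accepted, then AND in one condition per column
    mask = pd.Series(True, index=stds.index)
    for col, limit in limits.items():
        mask &= stds[col] < limit
    good_groups = mask[mask].index
    # Select from the original frame, not from the GroupBy object
    return df[df[group_col].isin(good_groups)]

# Toy data, invented for this example
toy = pd.DataFrame({'groupID': [1, 1, 1, 2, 2, 3, 3],
                    'col1': [5.0, 5.1, 5.2, 4.0, 9.0, 7.0, 7.5],
                    'col2': [0.10, 0.11, 0.12, 0.10, 0.90, 0.20, 0.21]})

result = keep_low_std_groups(toy, 'groupID', {'col1': 1, 'col2': 0.05})
print(result)
```

With the thresholds from the question (1 for col1, 0.05 for col2), group 2 is dropped here because both of its columns vary too much, while groups 1 and 3 are kept.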
Edit:
Another possible solution is to use filter:
print (df.groupby('groupID')
         .filter(lambda x: (x.col1.std() > 1) & (x.col2.std() > 3)))
   col1  col2  groupID
3     4     1        2
6     9     8        2
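A further variant, which is not part of the answer above, is to use transform('std'). It broadcasts each group's standard deviation back onto that group's rows, so the row mask can be built directly on the original dataframe. A minimal sketch, reusing the sample data and the test-only conditions from above:

```python
import pandas as pd

df = pd.DataFrame({'groupID': [1, 1, 1, 2, 3, 3, 2],
                   'col1': [5, 3, 6, 4, 7, 8, 9],
                   'col2': [7, 8, 9, 1, 2, 3, 8]})

# One row per original row, holding the std of the group that row belongs to
row_std = df.groupby('groupID')[['col1', 'col2']].transform('std')

# Same test-only conditions as in the sample (swap in < 1 and < 0.05 for real use)
mask = (row_std['col1'] > 1) & (row_std['col2'] > 3)
df_plot = df[mask]
print(df_plot)
```

Compared with filter, this avoids calling a Python lambda once per group, which may matter when there are many groups; whether it is actually faster for a given dataset would need to be measured.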