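I have a sample dataset that I want to group by one column and then generate 4 new columns from all the values of an existing column.
Here is some sample data: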
import pandas as pd
from numpy import nan

data = {'AlignmentId': {0: u'ENSMUST00000000001.4-1',
                        1: u'ENSMUST00000000001.4-1',
                        2: u'ENSMUST00000000003.13-0',
                        3: u'ENSMUST00000000003.13-0',
                        4: u'ENSMUST00000000003.13-0'},
        'name': {0: u'NonCodingDeletion',
                 1: u'NonCodingInsertion',
                 2: u'CodingDeletion',
                 3: u'CodingInsertion',
                 4: u'NonCodingDeletion'},
        'value_CDS': {0: nan, 1: nan, 2: 1.0, 3: 1.0, 4: nan},
        'value_mRNA': {0: 21.0, 1: 26.0, 2: 1.0, 3: 1.0, 4: 2.0}}
df = pd.DataFrame.from_dict(data)
It looks like this:
               AlignmentId                name  value_mRNA  value_CDS
0   ENSMUST00000000001.4-1   NonCodingDeletion        21.0        NaN
1   ENSMUST00000000001.4-1  NonCodingInsertion        26.0        NaN
2  ENSMUST00000000003.13-0      CodingDeletion         1.0        1.0
3  ENSMUST00000000003.13-0     CodingInsertion         1.0        1.0
4  ENSMUST00000000003.13-0   NonCodingDeletion         2.0        NaN
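I want to return booleans based on whether certain values are present in the name column, depending on whether value_CDS contains only nulls. I wrote this function: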
def aggfunc(s):
    if s.value_CDS.any():
        c = set(s.name)
    else:
        c = set(s.name)
    return ('CodingDeletion' in c or 'CodingInsertion' in c,
            'CodingInsertion' in c, 'CodingDeletion' in c,
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)
and called it like this:
merged = df.groupby('AlignmentId').aggregate(aggfunc)
This gives me the error ValueError: Shape of passed values is (318, 4), indices imply (318, 3).
How can I return multiple new columns from a groupby aggregate?
The output I am looking for is:
ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (True, True, True, False)
Ideally I would then put this into a 5-column DataFrame.
If I use .apply, the output is incorrect:
ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (False, False, False, False)
But if I take one group at a time, it is correct:
In [380]: for aln_id, d in df.groupby('AlignmentId'):
   .....:     print aggfunc(d)
   .....:
(False, False, False, False)
(True, True, True, False)
Answer 0 (score: 3)
You need to change .name to ['name'], because .name returns the group name (the value of the grouping column):
def aggfunc(s):
    if s.value_CDS.any():
        c = set(s['name'])
    else:
        c = set(s['name'])
    return ('CodingDeletion' in c or 'CodingInsertion' in c,
            'CodingInsertion' in c, 'CodingDeletion' in c,
            'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)
AlignmentId
ENSMUST00000000001.4-1 (False, False, False, False)
ENSMUST00000000003.13-0 (True, True, True, False)
dtype: object
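If the tuple-valued Series is already in hand, it can also be split into separate columns after the fact. The following is a minimal sketch, not part of the original answer: it rebuilds the Series printed above by hand, and the column names col1 to col4 are placeholders chosen for illustration.

```python
import pandas as pd

# Rebuild the tuple-valued result printed above (values copied from that output).
merged = pd.Series(
    [(False, False, False, False), (True, True, True, False)],
    index=pd.Index(['ENSMUST00000000001.4-1', 'ENSMUST00000000003.13-0'],
                   name='AlignmentId'),
)

# Placeholder column names (an assumption); pick whatever describes the four flags.
cols = ['col1', 'col2', 'col3', 'col4']

# Each tuple becomes one row; the group labels are kept as the index.
expanded = pd.DataFrame(merged.tolist(), index=merged.index, columns=cols)
print(expanded)
```

This keeps AlignmentId as the index, so the frame has four data columns plus the index.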
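To see the difference, print both inside the function. Note that the first group appears twice in the output; older pandas versions call the function twice on the first group.
def aggfunc(s):
    print ('Name of group is: {}'.format(s.name))
    print ('Column name is:\n {}'.format(s['name']))

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)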
Name of group is: ENSMUST00000000001.4-1
Column name is:
0 NonCodingDeletion
1 NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000001.4-1
Column name is:
0 NonCodingDeletion
1 NonCodingInsertion
Name: name, dtype: object
Name of group is: ENSMUST00000000003.13-0
Column name is:
2 CodingDeletion
3 CodingInsertion
4 NonCodingDeletion
Name: name, dtype: object
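This also suggests why looping over the groups in the question gave the right answer while .apply did not: during plain iteration each group is an ordinary DataFrame, so .name falls through to the name column, whereas inside .apply the group key is attached as .name. The sketch below illustrates the safe pattern. The group keys 'a' and 'b' are toy values (an assumption, not from the question), and the apply call relies only on bracket access, which behaves the same in both situations.

```python
import pandas as pd

# Toy frame: the group keys 'a' and 'b' are made up for illustration.
df = pd.DataFrame({
    'AlignmentId': ['a', 'a', 'b'],
    'name': ['NonCodingDeletion', 'NonCodingInsertion', 'CodingDeletion'],
})

# Plain iteration: attribute access and bracket access return the same column.
same_when_iterating = [
    group.name.equals(group['name'])
    for _, group in df.groupby('AlignmentId')
]

# Inside apply, bracket access always returns the column values.
joined = df.groupby('AlignmentId')[['name']].apply(
    lambda g: ','.join(sorted(g['name']))
)

print(same_when_iterating)
print(joined)
```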
Improved code:
def aggfunc(s):
    # the if and else branches built the same set, so the check is dropped
    c = set(s['name'])
    # return a Series instead of a tuple so each flag becomes its own column
    cols = ['col1', 'col2', 'col3', 'col4']
    return pd.Series(('CodingDeletion' in c or 'CodingInsertion' in c,
                      'CodingInsertion' in c, 'CodingDeletion' in c,
                      'CodingMult3Deletion' in c or 'CodingMult3Insertion' in c),
                     index=cols)

merged = df.groupby('AlignmentId').apply(aggfunc)
print (merged)
                          col1   col2   col3   col4
AlignmentId
ENSMUST00000000001.4-1   False  False  False  False
ENSMUST00000000003.13-0   True   True   True  False
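For the 5-column DataFrame asked for in the question, calling reset_index() on the result turns the AlignmentId index into an ordinary column. As an alternative to apply, the same flags can probably be computed without calling a Python function per group. The sketch below is one such approach, not the answer's method. It reuses the sample data from the question and the placeholder names col1 to col4, and like the improved function it ignores value_CDS.

```python
import numpy as np
import pandas as pd

# Sample data from the question, written as plain lists.
df = pd.DataFrame({
    'AlignmentId': ['ENSMUST00000000001.4-1', 'ENSMUST00000000001.4-1',
                    'ENSMUST00000000003.13-0', 'ENSMUST00000000003.13-0',
                    'ENSMUST00000000003.13-0'],
    'name': ['NonCodingDeletion', 'NonCodingInsertion', 'CodingDeletion',
             'CodingInsertion', 'NonCodingDeletion'],
    'value_CDS': [np.nan, np.nan, 1.0, 1.0, np.nan],
    'value_mRNA': [21.0, 26.0, 1.0, 1.0, 2.0],
})

# One boolean column per flag, evaluated row by row.
flags = pd.DataFrame({
    'col1': df['name'].isin(['CodingDeletion', 'CodingInsertion']),
    'col2': df['name'].eq('CodingInsertion'),
    'col3': df['name'].eq('CodingDeletion'),
    'col4': df['name'].isin(['CodingMult3Deletion', 'CodingMult3Insertion']),
})

# A flag is True for a group if it is True for any row in that group.
merged = flags.groupby(df['AlignmentId']).any()

# Move the group labels out of the index to get five ordinary columns.
result = merged.reset_index()
print(result)
```

Here result has the columns AlignmentId, col1, col2, col3 and col4.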