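Context:
My DataFrame contains the following columns: HapID, Marker, Start_position, End_position. For each HapID, I want to get:
- the marker with the smallest Start_position (call it leftMarker)
- the marker with the largest End_position (call it rightMarker)
- the interval, i.e. the difference (largest End_position - smallest Start_position)
My question is how to retrieve the marker name once I know its index. I get the error below, and although I have spent several hours on it, I am not sure how to solve it.
Here is the error message:
AttributeError: Cannot access callable attribute 'iloc' of 'SeriesGroupBy' objects, try using the 'apply' method
Here is the data: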
HapID Marker Start_position End_position
hap_1 mk1 1107207 1107256
hap_1 mk2 1104711 1104760
hap_1 mk3 1106845 1106894
hap_2 mk4 11901413 11901462
hap_2 mk5 206031250 206031299
hap_2 mk6 11498893 11498942
hap_2 mk7 17236023 17236072
hap_2 mk8 11692209 11692258
hap_2 mk9 11691512 11691561
hap_2 mk10 11615664 11615713
This is the expected output:
HapID leftMarker rightMarker Start_position End_position Interval
hap_1 mk2 mk1 1104711 1107256 2545
hap_2 mk6 mk5 11498893 206031299 194532406
Code:
import pandas as pd
data = {
'HapID':['hap_1','hap_1','hap_1','hap_2','hap_2','hap_2','hap_2','hap_2','hap_2','hap_2'],
'Marker':['mk1','mk2','mk3','mk4','mk5','mk6','mk7','mk8','mk9','mk10'],
'Start_position':[1107207,1104711,1106845,11901413,206031250,11498893,17236023,11692209,11691512,11615664],
'End_position':[1107256,1104760,1106894,11901462,206031299,11498942,17236072,11692258,11691561,11615713]}
df = pd.DataFrame(data)
haplotypes = df.groupby(df['HapID'])
posi_1 = haplotypes.Start_position.min()
posi_2 = haplotypes.End_position.max()
diff_posi = posi_2 - posi_1
a = haplotypes.Start_position.idxmin()#index at minimum Start_position
b = haplotypes.End_position.idxmax() #index at maximum End_position
#print('{} {} {}'.format(posi_1,posi_2,diff_posi))
#print('{} {}'.format(a,b)) #just to see if I'm getting the index
Now, the problem is how to retrieve the markers at those positions for each haplotype:
leftMarker = haplotypes.Marker.iloc(a)
rightMarker = haplotypes.Marker.iloc(b)
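Answer 0 (score: 1)
I think you need to retrieve the markers from the original DataFrame, using the row labels returned by idxmin() and idxmax().
leftMarker = df.loc[a,['HapID','Marker']]
rightMarker = df.loc[b,['HapID','Marker']]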
print(leftMarker)
HapID Marker
1 hap_1 mk2
5 hap_2 mk6
print(rightMarker)
HapID Marker
0 hap_1 mk1
4 hap_2 mk5
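To get from these two lookups to the full table in the question, the pieces can be combined into one DataFrame. The following is only a sketch of one way to do that, not part of the original answer; the names `left_idx`, `right_idx` and `result` are illustrative, and `to_numpy()` is used so the values are placed by position rather than aligned on the original row labels.

```python
import pandas as pd

df = pd.DataFrame({
    'HapID': ['hap_1', 'hap_1', 'hap_1', 'hap_2', 'hap_2', 'hap_2',
              'hap_2', 'hap_2', 'hap_2', 'hap_2'],
    'Marker': ['mk1', 'mk2', 'mk3', 'mk4', 'mk5', 'mk6', 'mk7', 'mk8',
               'mk9', 'mk10'],
    'Start_position': [1107207, 1104711, 1106845, 11901413, 206031250,
                       11498893, 17236023, 11692209, 11691512, 11615664],
    'End_position': [1107256, 1104760, 1106894, 11901462, 206031299,
                     11498942, 17236072, 11692258, 11691561, 11615713],
})

grouped = df.groupby('HapID')
# Each is a Series mapping HapID -> row label in df
left_idx = grouped['Start_position'].idxmin()
right_idx = grouped['End_position'].idxmax()

# Look the rows up in the original frame and assemble one row per HapID
result = pd.DataFrame({
    'leftMarker': df.loc[left_idx, 'Marker'].to_numpy(),
    'rightMarker': df.loc[right_idx, 'Marker'].to_numpy(),
    'Start_position': df.loc[left_idx, 'Start_position'].to_numpy(),
    'End_position': df.loc[right_idx, 'End_position'].to_numpy(),
}, index=left_idx.index)
result['Interval'] = result['End_position'] - result['Start_position']
result = result.reset_index()  # turn HapID back into a regular column
print(result)
```

The key point is that `.loc` is applied to `df` itself, not to the groupby object, which is what triggered the AttributeError in the question.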
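Answer 1 (score: 0)
This is a fairly straightforward case of applying a function to a pandas groupby. You should read the pandas docs on how to use groupby to get a better idea of how and when to use this technique.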
def my_fn(group):
    # Rows with the smallest Start_position and the largest End_position
    mk_min = group.loc[group['Start_position'].idxmin()]
    mk_max = group.loc[group['End_position'].idxmax()]
    vals = [mk_min['Marker'], mk_max['Marker'],
            mk_min['Start_position'], mk_max['End_position'],
            mk_max['End_position'] - mk_min['Start_position']]
    idx = ['leftMarker', 'rightMarker', 'Start_position', 'End_position', 'Interval']
    return pd.Series(vals, index=idx)
df.groupby('HapID').apply(my_fn)
Returns:
leftMarker rightMarker Start_position End_position Interval
HapID
hap_1 mk2 mk1 1104711 1107256 2545
hap_2 mk6 mk5 11498893 206031299 194532406
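The result above has HapID as the index, whereas the expected output in the question lists it as a column. A minimal sketch of closing that gap follows, assuming the same sample data. Selecting the columns before `apply`, and calling `reset_index()` and `infer_objects()` afterwards, are additions for illustration rather than part of the original answer. The explicit column selection is there so the function only sees the non-grouping columns, and `infer_objects()` is a precaution in case the numeric columns come back with object dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'HapID': ['hap_1', 'hap_1', 'hap_1', 'hap_2', 'hap_2', 'hap_2',
              'hap_2', 'hap_2', 'hap_2', 'hap_2'],
    'Marker': ['mk1', 'mk2', 'mk3', 'mk4', 'mk5', 'mk6', 'mk7', 'mk8',
               'mk9', 'mk10'],
    'Start_position': [1107207, 1104711, 1106845, 11901413, 206031250,
                       11498893, 17236023, 11692209, 11691512, 11615664],
    'End_position': [1107256, 1104760, 1106894, 11901462, 206031299,
                     11498942, 17236072, 11692258, 11691561, 11615713],
})

def my_fn(group):
    # Rows with the smallest Start_position and the largest End_position
    mk_min = group.loc[group['Start_position'].idxmin()]
    mk_max = group.loc[group['End_position'].idxmax()]
    return pd.Series({
        'leftMarker': mk_min['Marker'],
        'rightMarker': mk_max['Marker'],
        'Start_position': mk_min['Start_position'],
        'End_position': mk_max['End_position'],
        'Interval': mk_max['End_position'] - mk_min['Start_position'],
    })

cols = ['Marker', 'Start_position', 'End_position']
summary = (
    df.groupby('HapID')[cols]   # pass only the non-grouping columns to my_fn
      .apply(my_fn)
      .reset_index()            # HapID becomes a regular column
      .infer_objects()          # tighten object columns where possible
)
print(summary)
```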