Hello, I need help adding two new columns to a dataframe such as:
Name start1 end1
OK0100087.1_0 0 375
OK0100087.1_1 376 750
OK0100087.1_2 751 1000
OK0100088.1 0 87766
OK0100089.1 0 66778
OK0100090.1_0 0 47519
OK0100090.1_1 47520 73733
The idea is to add start2 and end2 columns, for example:
Name start1 end1 start2 end2
OK0100087.1_0 0 375 1000 625
OK0100087.1_1 376 750 624 250
OK0100087.1_2 751 1000 249 0
OK0100088.1 0 87766 87766 0
OK0100089.1 0 66778 66778 0
OK0100090.1_0 0 47519 73733 26214
OK0100090.1_1 47520 73733 26213 0
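So the idea for finding the new start2 and end2 values is to work within each Name group (content_number, i.e. the rows that share the same prefix before the underscore), for example OK0100087.1: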
Name start1 end1 start2 end2
OK0100087.1_0 0 375
OK0100087.1_1 376 750
OK0100087.1_2 751 1000
Take the highest end1 value of the group, which is 1000. The first start2 is then 1000.
Name start1 end1 start2 end2
OK0100087.1_0 0 375 1000
OK0100087.1_1 376 750
OK0100087.1_2 751 1000
The first end2 is then start2 - (end1 - start1), so 1000 - (375 - 0) = 625.
Name start1 end1 start2 end2
OK0100087.1_0 0 375 1000 625
OK0100087.1_1 376 750
OK0100087.1_2 751 1000
The second start2 is then the previous end2 - 1, so 625 - 1 = 624.
Name start1 end1 start2 end2
OK0100087.1_0 0 375 1000 625
OK0100087.1_1 376 750 624
Then end2 is again start2 - (end1 - start1), so 624 - (750 - 376) = 250.
Name start1 end1 start2 end2
OK0100087.1_0 0 375 1000 625
OK0100087.1_1 376 750 624 250
And so on. In the end we should get:
Name start1 end1 start2 end2
OK0100087.1_0 0 375 1000 625
OK0100087.1_1 376 750 624 250
OK0100087.1_2 751 1000 249 0
OK0100088.1 0 87766 87766 0
OK0100089.1 0 66778 66778 0
OK0100090.1_0 0 47519 73733 26214
OK0100090.1_1 47520 73733 26213 0
Does anyone have an idea how to do this? Thank you very much for your help.
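The step-by-step rule above can be sketched in plain Python before reaching for pandas. This is only an illustration of the recurrence as stated in the question: the function name reverse_coords is hypothetical, and it assumes the rows of each group are adjacent and already in order, with the group key taken as the part of Name before the underscore.

```python
from itertools import groupby

def reverse_coords(rows):
    """rows: list of (name, start1, end1), ordered within each group.
    Returns a list of (name, start1, end1, start2, end2)."""
    result = []
    # Assumption: the group key is the part of the name before "_<n>".
    # itertools.groupby only merges adjacent rows, so groups must be contiguous.
    for _, grp in groupby(rows, key=lambda r: r[0].split("_")[0]):
        grp = list(grp)
        # The first start2 of a group is the largest end1 in that group.
        start2 = max(end1 for _, _, end1 in grp)
        for name, start1, end1 in grp:
            end2 = start2 - (end1 - start1)
            result.append((name, start1, end1, start2, end2))
            # The next start2 is the current end2 minus one.
            start2 = end2 - 1
    return result

rows = [
    ("OK0100087.1_0", 0, 375),
    ("OK0100087.1_1", 376, 750),
    ("OK0100087.1_2", 751, 1000),
    ("OK0100088.1", 0, 87766),
    ("OK0100089.1", 0, 66778),
    ("OK0100090.1_0", 0, 47519),
    ("OK0100090.1_1", 47520, 73733),
]
result = reverse_coords(rows)
```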
Answer 0 (score: 3)
This is just groupby().transform(), since you can extract the unique names:
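# Group key: the part of Name before the first dot (the raw string avoids an invalid-escape warning)
total = df.groupby(df.Name.str.extract(r'^([^.]+)')[0])['end1'].transform('max')
df['start2'] = total - df['start1']
df['end2'] = total - df['end1']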
Output:
Name start1 end1 start2 end2
0 OK0100087.1_0 0 375 1000 625
1 OK0100087.1_1 376 750 624 250
2 OK0100087.1_2 751 1000 249 0
3 OK0100088.1 0 87766 87766 0
4 OK0100089.1 0 66778 66778 0
5 OK0100090.1_0 0 47519 73733 26214
6 OK0100090.1_1 47520 73733 26213 0
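Why the subtraction matches the step-by-step rule: if each group starts at 0 and every piece starts exactly one past the previous end1, then the previous end2 minus 1 equals max - (previous end1 + 1), which is max - start1 of the current row. The sketch below is a self-contained run of the same idea that first checks this contiguity assumption. The variable names key, prev_end and contiguous are illustrative and not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["OK0100087.1_0", "OK0100087.1_1", "OK0100087.1_2",
             "OK0100088.1", "OK0100089.1",
             "OK0100090.1_0", "OK0100090.1_1"],
    "start1": [0, 376, 751, 0, 0, 0, 47520],
    "end1": [375, 750, 1000, 87766, 66778, 47519, 73733],
})

# Group key: the part of Name before the first dot.
key = df["Name"].str.extract(r"^([^.]+)")[0]

# Assumption check: the first piece of each group starts at 0 and every
# later piece starts at the previous end1 + 1.
prev_end = df.groupby(key)["end1"].shift(1)
contiguous = (df["start1"] == (prev_end + 1).fillna(0)).all()

# Under that assumption, both new columns are simple subtractions
# from the group maximum of end1.
total = df.groupby(key)["end1"].transform("max")
df["start2"] = total - df["start1"]
df["end2"] = total - df["end1"]
```

If contiguous comes out False for some other dataset, the shortcut would no longer reproduce the recurrence from the question, and an explicit per-group computation would be the safer choice.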
Answer 1 (score: 1)
import pandas as pd

df = pd.DataFrame({'Name': ['OK0100087.1_0',
                            'OK0100087.1_1',
                            'OK0100087.1_2',
                            'OK0100088.1',
                            'OK0100089.1',
                            'OK0100090.1_0',
                            'OK0100090.1_1'],
                   'start1': [0, 376, 751, 0, 0, 0, 47520],
                   'end1': [375, 750, 1000, 87766, 66778, 47519, 73733]})
# Group key: the part of Name before the "_<n>" suffix
df['base'] = df['Name'].str.split('_').str[0]
# Seed every start2 with the largest end1 of its group
df['start2'] = df.groupby('base')['end1'].transform('max')
pieces = []
for _, group in df.groupby('base'):
    t = group.copy()
    # Repeat once per row so the values propagate down the group
    for _ in range(len(t)):
        t['end2'] = t['start2'] - (t['end1'] - t['start1'])
        # Later rows take the previous end2 - 1; the first row keeps the group max
        t['start2'] = (t['end2'] - 1).shift(1).fillna(t['start2']).astype(int)
    pieces.append(t)
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate
output = pd.concat(pieces).drop(columns='base')
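The same recurrence can also be written without an explicit loop by using a grouped cumulative sum. This is a sketch under the assumption that rows are already ordered within each group. Each row consumes (end1 - start1) + 1 positions counted down from the group maximum, so start2 is the maximum minus everything consumed by earlier rows. The names length, step and consumed_before are illustrative and not taken from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["OK0100087.1_0", "OK0100087.1_1", "OK0100087.1_2",
             "OK0100088.1", "OK0100089.1",
             "OK0100090.1_0", "OK0100090.1_1"],
    "start1": [0, 376, 751, 0, 0, 0, 47520],
    "end1": [375, 750, 1000, 87766, 66778, 47519, 73733],
})

# Group key: the part of Name before the "_<n>" suffix.
key = df["Name"].str.split("_").str[0]

# Span of each row, and how far one row moves the next start2 down.
length = df["end1"] - df["start1"]
step = length + 1

# Positions consumed by the earlier rows of the same group.
consumed_before = step.groupby(key).cumsum() - step

group_max = df.groupby(key)["end1"].transform("max")
df["start2"] = group_max - consumed_before
df["end2"] = df["start2"] - length
```

Unlike the plain subtraction from the group maximum, this version follows the rule from the question literally, so it does not rely on each start1 being exactly one past the previous end1.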