Given this table:
╔═══╦══════════╦═══════════╦═════════════╗
║ ║ position ║ amino_var ║ sequence ║
╠═══╬══════════╬═══════════╬═════════════╣
║ 0 ║ 3 ║ A ║ MWSWKCLLFWA ║
║ 1 ║ 4 ║ G ║ MWSWKCLLFWH ║
║ 2 ║ 6 ║ I ║ MWSWKCLFLVH ║
║ 3 ║ 3 ║ C ║ MWSWVESFLVH ║
║ 4 ║ 2 ║ V ║ MWEQAQPWGAH ║
╚═══╩══════════╩═══════════╩═════════════╝
Alternatively, you can build this DataFrame with:
import pandas as pd
uniprots = pd.DataFrame({'position': [3,4,6,3,2], 'amino_var': ['A', 'G', 'I', 'C', 'V'], 'sequence': ['MWSWKCLLFWA', 'MWSWKCLLFWH', 'MWSWKCLFLVH', 'MWSWVESFLVH', 'MWEQAQPWGAH']})
I want to slice each sequence between position - 1 and position + 1, and then replace the letter at position with the letter from amino_var.
I tried:
uniprots.sequence.str[uniprots.position - 1 : uniprots.position + 1]
But I get a Series full of NaN. My expected output is:
╔═══╦════════╗
║ ║ output ║
╠═══╬════════╣
║ 0 ║ WAW ║
║ 1 ║ SGK ║
║ 2 ║ KIL ║
║ 3 ║ WCW ║
║ 4 ║ MVE ║
╚═══╩════════╝
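Answer 0 (score: 2)
I think you need to take the values before the range first, then take the values inside the range and apply replace to them, and finally append all the values after the range: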
print (uniprots)
  uniprot  position amino amino_var     sequence
0  P11362         3     W         A  WWWWWWWWWWW
1  P11362         4     E         G  MEEEEEELFWH
2  P11362         6     N         I  MWSWKCNNLVH
3  P11362         3     S         C  MWSWVESFLVH
4  P11362         3     W         V  MWEQAQPWGAH
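N = 2  # number of residues on each side of the position
def repl(x):
    s = x['sequence']
    p = x['position']   # 1-based position
    a1 = x['amino']
    a2 = x['amino_var']
    # head + window with a1 replaced by a2 + tail
    return s[:p-N-1] + s[p-N-1:p+N].replace(a1,a2) + s[p+N:]
uniprots['sequence'] = uniprots.apply(repl, axis=1)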
print (uniprots)
  uniprot  position amino amino_var     sequence
0  P11362         3     W         A  AAAAAWWWWWW
1  P11362         4     E         G  MGGGGGELFWH
2  P11362         6     N         I  MWSWKCIILVH
3  P11362         3     S         C  MWCWVESFLVH
4  P11362         3     W         V  MVEQAQPWGAH
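To make the three slices in repl easier to follow, here is a hand trace of row 1 from the sample data above. It is only a sketch in plain Python; the names head, window and tail are mine, not part of the answer. Note that replace swaps every occurrence of amino inside the window, not just the one at position, and that the trace assumes p - N - 1 is not negative.

```python
# Row 1 of the answer's sample data, traced by hand.
s = 'MEEEEEELFWH'   # sequence
p = 4               # 1-based position
a1, a2 = 'E', 'G'   # amino -> amino_var
N = 2               # residues on each side of the position

head = s[:p - N - 1]          # everything before the window
window = s[p - N - 1:p + N]   # 2*N + 1 characters centred on the position
tail = s[p + N:]              # everything after the window

print(head, window, tail)                    # M EEEEE ELFWH
result = head + window.replace(a1, a2) + tail
print(result)                                # MGGGGGELFWH
```

All five E's in the window become G, which matches row 1 of the printed result.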
Edit, following the edit to the question:
Extract the values on each side and join them with the amino_var column:
N = 1
a = uniprots.apply(lambda x: x['sequence'][x['position']-N-1 : x['position']-1] , axis=1)
b = uniprots.apply(lambda x: x['sequence'][x['position'] : x['position']+N] , axis=1)
uniprots['sequence'] = a + uniprots['amino_var'] + b
print (uniprots)
   position amino_var sequence
0         3         A      WAW
1         4         G      SGK
2         6         I      KIL
3         3         C      WCW
4         2         V      MVE
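If you would rather avoid two separate apply calls, the same slicing can be written as one list comprehension over the three columns. This is a sketch rather than a tested replacement: window_with_variant is a hypothetical helper name, the data is copied from the question as plain lists, and it assumes every position is greater than n so that no slice start goes negative.

```python
# Column values copied from the question's table.
positions = [3, 4, 6, 3, 2]
amino_vars = ['A', 'G', 'I', 'C', 'V']
sequences = ['MWSWKCLLFWA', 'MWSWKCLLFWH', 'MWSWKCLFLVH',
             'MWSWVESFLVH', 'MWEQAQPWGAH']

def window_with_variant(seq, pos, var, n=1):
    """Hypothetical helper: n residues left of pos, the variant, n residues right.

    pos is 1-based, as in the table. Assumes pos > n.
    """
    left = seq[pos - n - 1:pos - 1]
    right = seq[pos:pos + n]
    return left + var + right

output = [window_with_variant(s, p, v)
          for s, p, v in zip(sequences, positions, amino_vars)]
print(output)  # ['WAW', 'SGK', 'KIL', 'WCW', 'MVE']
```

With the DataFrame, the same comprehension can zip uniprots['sequence'], uniprots['position'] and uniprots['amino_var'], and the resulting list can be assigned to a new column.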
Answer 1 (score: 2)
You can use DataFrame.apply:
def get_subsequence(row, width=1):
    seq = row['sequence']
    pos = row['position']-1  # convert the 1-based position to a 0-based index
    return seq[pos-width:pos] + row['amino_var'] + seq[pos+1:pos+width+1]
# note: this overwrites the original 'sequence' column
uniprots['sequence'] = uniprots.apply(get_subsequence, axis=1)
Applied to the original DataFrame, this gives:
>>> uniprots.apply(get_subsequence, axis=1)
0 WAW
1 SGK
2 KIL
3 WCW
4 MVE
dtype: object
If we want a wider window, we can set the width parameter, for example with functools.partial:
from functools import partial
uniprots['sequence'] = uniprots.apply(partial(get_subsequence, width=3), axis=1)
Applied to the original DataFrame, the result is:
>>> uniprots.apply(partial(get_subsequence, width=3), axis=1)
0 AWKC
1 MWSGKCL
2 SWKILFL
3 CWVE
4 VEQA
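The strings are not all the same length because the window runs past the start of the string: when pos - width is negative, Python counts the slice start from the end of the string, so the left part comes back empty.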
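If you want to keep whatever residues do exist to the left of the position, one option is to clamp the slice start at 0. The sketch below is a variant of get_subsequence, not the answer's original code; the name get_subsequence_clamped is hypothetical, and the rows are plain dicts holding three rows of the question's data so that it runs without pandas.

```python
def get_subsequence_clamped(row, width=1):
    """Variant of get_subsequence that never lets the slice start go negative."""
    seq = row['sequence']
    pos = row['position'] - 1        # 1-based position -> 0-based index
    start = max(pos - width, 0)      # clamp so the slice cannot wrap around
    return seq[start:pos] + row['amino_var'] + seq[pos + 1:pos + width + 1]

# Rows 0, 1 and 4 of the question's table, written as plain dicts.
rows = [
    {'position': 3, 'amino_var': 'A', 'sequence': 'MWSWKCLLFWA'},
    {'position': 4, 'amino_var': 'G', 'sequence': 'MWSWKCLLFWH'},
    {'position': 2, 'amino_var': 'V', 'sequence': 'MWEQAQPWGAH'},
]

clamped = [get_subsequence_clamped(r, width=3) for r in rows]
print(clamped)  # ['MWAWKC', 'MWSGKCL', 'MVEQA']
```

Because the function only indexes row by key, it should also work with uniprots.apply(partial(get_subsequence_clamped, width=3), axis=1). The right-hand side needs no clamp, since a slice that runs past the end of a string is simply cut short.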
Answer 2 (score: 0)
The following one-liner also works:
uniprots['output'] = uniprots.apply(lambda x: x['sequence'][x['position']-1-1] +x['amino_var']+x['sequence'][x['position']-1+1], axis=1)
The following layout is more readable:
uniprots['output'] = uniprots.apply(lambda x:
    x['sequence'][x['position']-1-1] +
    x['amino_var'] +
    x['sequence'][x['position']-1+1], axis=1)
Output:
print(uniprots)
  amino_var  position     sequence output
0         A         3  MWSWKCLLFWA    WAW
1         G         4  MWSWKCLLFWH    SGK
2         I         6  MWSWKCLFLVH    KIL
3         C         3  MWSWVESFLVH    WCW
4         V         2  MWEQAQPWGAH    MVE
The 'position' values in this table start at 1, whereas Python indexing starts at 0, so 1 has to be subtracted from each position.
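A small illustration of that offset, using row 0 of the table. It also shows two edge cases of single-character indexing that the table happens not to hit. The variable names are mine and this is only a sketch.

```python
seq = 'MWSWKCLLFWA'
position = 3                 # 1-based, as in the table

idx = position - 1           # 0-based index of the residue being replaced
print(seq[idx])              # S
print(seq[idx - 1] + 'A' + seq[idx + 1])   # WAW

# Edge case 1: position 1 has no left neighbour, and index -1
# silently wraps around to the last character.
wrapped = seq[1 - 1 - 1]
print(wrapped)               # A

# Edge case 2: the last position has no right neighbour, and
# indexing one past the end raises IndexError.
try:
    seq[len(seq) - 1 + 1]
    raised = False
except IndexError:
    raised = True
print(raised)                # True
```

So the one-liner is fine for the data shown, but a position at either end of a sequence would need a slice or an explicit check instead of a single index.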