Question

Given this table:

╔═══╦══════════╦═══════════╦═════════════╗
║   ║ position ║ amino_var ║ sequence    ║
╠═══╬══════════╬═══════════╬═════════════╣
║ 0 ║ 3        ║ A         ║ MWSWKCLLFWA ║
║ 1 ║ 4        ║ G         ║ MWSWKCLLFWH ║
║ 2 ║ 6        ║ I         ║ MWSWKCLFLVH ║
║ 3 ║ 3        ║ C         ║ MWSWVESFLVH ║
║ 4 ║ 2        ║ V         ║ MWEQAQPWGAH ║
╚═══╩══════════╩═══════════╩═════════════╝

Alternatively, you can build this data frame with the following:

import pandas as pd

uniprots = pd.DataFrame({'position': [3,4,6,3,2], 'amino_var': ['A', 'G', 'I', 'C', 'V'], 'sequence': ['MWSWKCLLFWA', 'MWSWKCLLFWH', 'MWSWKCLFLVH', 'MWSWVESFLVH', 'MWEQAQPWGAH']})

I want to slice each sequence between position - 1 and position + 1, and then replace the letter at position with the letter from amino_var.

I tried:

uniprots.sequence.str[uniprots.position - 1 : uniprots.position + 1]

But I get a Series full of NaN. My expected output is:

╔═══╦════════╗
║   ║ output ║
╠═══╬════════╣
║ 0 ║ WAW    ║
║ 1 ║ SGK    ║
║ 2 ║ KIL    ║
║ 3 ║ WCW    ║
║ 4 ║ MVE    ║
╚═══╩════════╝

Answer 1

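I think you need to take the values before the range first, then extract the range and apply replace to it, and finally append all values after the range: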

print (uniprots)
  uniprot  position amino amino_var     sequence
0  P11362         3     W         A  WWWWWWWWWWW
1  P11362         4     E         G  MEEEEEELFWH
2  P11362         6     N         I  MWSWKCNNLVH
3  P11362         3     S         C  MWSWVESFLVH
4  P11362         3     W         V  MWEQAQPWGAH

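N = 2  # number of residues to include on each side of position
def repl(x):
    s = x['sequence']
    p = x['position']    # 1-based position
    a1 = x['amino']      # residue to look for
    a2 = x['amino_var']  # replacement residue
    # untouched prefix + window with every a1 replaced by a2 + untouched suffix
    return s[:p-N-1] + s[p-N-1:p+N].replace(a1, a2) + s[p+N:]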

uniprots['sequence'] = uniprots.apply(repl, axis=1)
print (uniprots)
  uniprot  position amino amino_var     sequence
0  P11362         3     W         A  AAAAAWWWWWW
1  P11362         4     E         G  MGGGGGELFWH
2  P11362         6     N         I  MWSWKCIILVH
3  P11362         3     S         C  MWCWVESFLVH
4  P11362         3     W         V  MVEQAQPWGAH
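To see how the three slices in repl fit together, here is a worked sketch on a single row, using the values from row 2 of the frame printed above. The names before, window and after are illustrative only and are not part of the original answer.

```python
# Values copied from row 2 of the frame above
s, p, a1, a2 = 'MWSWKCNNLVH', 6, 'N', 'I'
N = 2

before = s[:p - N - 1]        # everything left of the window
window = s[p - N - 1:p + N]   # N residues either side of the 1-based position p
after = s[p + N:]             # everything right of the window

result = before + window.replace(a1, a2) + after
print(before, window, after)  # MWS WKCNN LVH
print(result)                 # MWSWKCIILVH
```

Note that replace acts on every occurrence of a1 inside the window, which is why both N residues become I here.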

EDIT, after the question was edited:

Extract the values on either side of position and join them with the amino_var column:

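N = 1  # number of residues to keep on each side
# a: the N characters before position (position is 1-based)
a = uniprots.apply(lambda x: x['sequence'][x['position']-N-1 : x['position']-1], axis=1)
# b: the N characters after position
b = uniprots.apply(lambda x: x['sequence'][x['position'] : x['position']+N], axis=1)

uniprots['sequence'] = a + uniprots['amino_var'] + b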
print (uniprots)
   position amino_var sequence
0         3         A      WAW
1         4         G      SGK
2         6         I      KIL
3         3         C      WCW
4         2         V      MVE
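The same slicing can also be written as a single pass over the three columns with zip instead of two apply calls. The sketch below is an alternative, not the answer's own method, and it uses plain lists holding the question's data so that it runs without pandas.

```python
# Data copied from the question; plain lists stand in for the DataFrame columns
positions = [3, 4, 6, 3, 2]
amino_vars = ['A', 'G', 'I', 'C', 'V']
sequences = ['MWSWKCLLFWA', 'MWSWKCLLFWH', 'MWSWKCLFLVH',
             'MWSWVESFLVH', 'MWEQAQPWGAH']

N = 1  # residues kept on each side of the 1-based position
output = [
    s[p - N - 1:p - 1] + a + s[p:p + N]  # left flank + variant + right flank
    for p, a, s in zip(positions, amino_vars, sequences)
]
print(output)  # ['WAW', 'SGK', 'KIL', 'WCW', 'MVE']
```

With the data frame itself, the columns can be passed to zip directly, e.g. zip(uniprots['position'], uniprots['amino_var'], uniprots['sequence']), and the resulting list assigned to a new column.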

Answer 2

You can use DataFrame.apply:

def get_subsequence(row, width=1):
    seq = row['sequence']
    pos = row['position']-1
    return seq[pos-width:pos] + row['amino_var'] + seq[pos+1:pos+width+1]

uniprots['sequence'] = uniprots.apply(get_subsequence, axis=1)

We then obtain:

>>> uniprots.apply(get_subsequence, axis=1)
0    WAW
1    SGK
2    KIL
3    WCW
4    MVE
dtype: object

If we want a wider window, we can set the width parameter, for example with functools.partial:

from functools import partial

uniprots['sequence'] = uniprots.apply(partial(get_subsequence, width=3), axis=1)

The result is:

>>> uniprots.apply(partial(get_subsequence, width=3), axis=1)
0       AWKC
1    MWSGKCL
2    SWKILFL
3       CWVE
4       VEQA

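The strings do not all have the same length because the window runs past the string boundary: in rows 0, 3 and 4, pos-width is negative, so the left-hand slice comes back empty.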
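If the residues that do exist to the left of the position should be kept, one option is to clamp the start of the window at 0. The sketch below is a hypothetical variant of get_subsequence (the name get_subsequence_clamped is made up for illustration), shown on plain dicts that mimic two rows of the question's data.

```python
def get_subsequence_clamped(row, width=1):
    """Hypothetical variant: keep whatever residues exist on the left side."""
    seq = row['sequence']
    pos = row['position'] - 1      # convert 1-based position to 0-based index
    start = max(pos - width, 0)    # clamp so the start index never goes negative
    # Python truncates a slice whose stop exceeds len(seq), so the right side needs no clamp
    return seq[start:pos] + row['amino_var'] + seq[pos + 1:pos + width + 1]

# Plain dicts standing in for rows 0 and 4 of the question's data
row0 = {'position': 3, 'amino_var': 'A', 'sequence': 'MWSWKCLLFWA'}
row4 = {'position': 2, 'amino_var': 'V', 'sequence': 'MWEQAQPWGAH'}

print(get_subsequence_clamped(row0, width=3))  # MWAWKC
print(get_subsequence_clamped(row4, width=3))  # MVEQA
```

Because the function only indexes the row by column name, it should also work with uniprots.apply(get_subsequence_clamped, axis=1, width=3), since DataFrame.apply forwards extra keyword arguments to the function.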

Answer 3

The following one-liner also works:

uniprots['output'] = uniprots.apply(lambda x: x['sequence'][x['position']-1-1] +x['amino_var']+x['sequence'][x['position']-1+1], axis=1)

The following formatting is more readable:

uniprots['output'] = uniprots.apply(lambda x: 
            x['sequence'][x['position']-1-1] +
            x['amino_var'] +
            x['sequence'][x['position']-1+1], axis=1)

Output:

print(uniprots)
  amino_var  position     sequence output
0         A         3  MWSWKCLLFWA    WAW
1         G         4  MWSWKCLLFWH    SGK
2         I         6  MWSWKCLFLVH    KIL
3         C         3  MWSWVESFLVH    WCW
4         V         2  MWEQAQPWGAH    MVE

The 'position' values in this table start at 1, but Python indexing starts at 0, so 1 has to be subtracted.
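As a small illustration of that index shift, here is the arithmetic for row 1 of the question's table written out step by step. The variable name idx is introduced here only for illustration.

```python
seq = 'MWSWKCLLFWH'     # row 1 of the question's table
position = 4            # 1-based, as stored in the table
idx = position - 1      # 0-based index used by Python

print(seq[idx])                            # W (the residue being replaced)
print(seq[idx - 1] + 'G' + seq[idx + 1])   # SGK
```

One caveat with single-character indexing: for position 1, idx - 1 is -1, which Python interprets as the last character of the string, and for a position at the very end of the sequence, seq[idx + 1] raises an IndexError.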

Slicing a pandas object column using other columns
