Question

I have a pandas DataFrame like this:

    sequence  positions
0          -          8
1          N          9
2          M         10
3          S         11
4          L         12
5          V         13
6          -         14
7          E         15
8          T         16
9          V         17
10         D         18

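In the sequence column there are letters (amino acid codes) and dashes that represent gaps in the amino acid sequence. In the positions column I want to put the positions of those amino acids. They are just a run of consecutive numbers (starting from 8 in this example), so I generated the column with range(). But this numbering should apply to the amino acids only, not to the gaps. The positions column should be filled with dashes at the gaps and the numbers shifted accordingly: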

    sequence  positions
0          -          -
1          N          8
2          M          9
3          S         10
4          L         11
5          V         12
6          -          -
7          E         13
8          T         14
9          V         15
10         D         16

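So I thought about iterating over the rows and making this modification, but the pandas documentation says that is not a good idea. Perhaps writing some function and combining it with pandas apply and shift would solve the problem, but I can't figure out how to do it.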
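For anyone who wants to try the answers below, the input frame shown above could be rebuilt roughly like this. This is only a sketch: the variable name `sequence` is an assumption, and the values are copied from the first table.

```python
import pandas as pd

# Rebuild the example frame from the question: gaps are "-" and the
# positions column starts as a plain consecutive range beginning at 8.
sequence = list("-NMSLV-ETVD")
df = pd.DataFrame({
    "sequence": sequence,
    "positions": range(8, 8 + len(sequence)),
})
print(df)
```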

Answer 1

Here is one way:

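import numpy as np
import pandas as pd

# find the rows where sequence is a dash
dash = df.sequence == "-"

# cast to object so the column can hold both numbers and "-"
df["positions"] = df["positions"].astype(object)

# assign a dash to positions where sequence is a dash
df.loc[dash, "positions"] = "-"

# number the non-dash rows consecutively, starting from 8
# (pd.np was removed from pandas, so NumPy is imported directly)
df.loc[~dash, "positions"] = np.arange(8, (~dash).sum() + 8)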
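The same masking idea can be written without the literal 8, by taking the start value from the existing column and counting residues with a cumulative sum. This is a sketch of a variant, not the answer's own code; the names `start` and `numbered` are assumptions, and the frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "sequence": list("-NMSLV-ETVD"),
    "positions": range(8, 19),
})

dash = df["sequence"] == "-"
start = df["positions"].iloc[0]  # first number of the original range (8 here)

# Running count of residues seen so far; the first residue gets `start`.
numbered = start + (~dash).cumsum() - 1

# Cast to object so numbers and "-" can share a column, then blank out the gaps.
df["positions"] = numbered.astype(object).where(~dash, "-")
print(df)
```

Because the count only advances on non-dash rows, the numbers after a gap continue where they left off.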

Answer 2

Here is my solution, I hope you like it:

import pandas as pd

df = pd.DataFrame({'sequence': ['-', 'A', 'B', 'C', '-', 'D'], 'positions': range(8, 14)})
seq = df['sequence'].tolist()
# iterator over the original positions; a value is consumed only for non-dash rows
pos = iter(df['positions'].tolist())
pos = [next(pos) if a != '-' else '-' for a in seq]
df['positions'] = pos

Note that nothing is hard-coded in this solution.
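Applied to the data from the question, the iterator approach might be wrapped up as follows. The helper name `renumber_positions` and its `gap` parameter are hypothetical, introduced only for illustration.

```python
import pandas as pd

def renumber_positions(df, gap="-"):
    """Hypothetical helper: hand the existing positions, in order, to non-gap rows."""
    remaining = iter(df["positions"].tolist())
    out = df.copy()  # leave the caller's frame untouched
    out["positions"] = [gap if s == gap else next(remaining) for s in out["sequence"]]
    return out

df = pd.DataFrame({"sequence": list("-NMSLV-ETVD"), "positions": range(8, 19)})
result = renumber_positions(df)
print(result)
```

Since the iterator is advanced only on non-gap rows, the leftover values at the end of the original range are simply never used.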

Answer 3

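Build a boolean mask by comparing the string values against the dash "-", then use the .loc accessor twice. The values assigned to the non-dash rows must match the number of those rows, so only the first mask.sum() positions are used.

df['positions'] = df['positions'].astype(object)  # column must hold numbers and "-"
mask = df.sequence != "-"
df.loc[mask, 'positions'] = df['positions'].values[:mask.sum()].copy()
df.loc[~mask, 'positions'] = "-"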

Filling values and shifting a column in a pandas DataFrame
