Question

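I have a df with numbers in the second column. Each number is the length of a DNA sequence. I want to create two new columns: the first giving the position where each sequence starts, and the second giving the position where it ends.

This is my current df: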

              Names  LEN
0     Ribosomal_S9:  121
1     Ribosomal_S8:  129
2    Ribosomal_L10:  100
3             GrpE:  166
4           DUF150:  141
..              ...  ...
115      TIGR03632:  117
116      TIGR03654:  175
117      TIGR03723:  314
118      TIGR03725:  212
119      TIGR03953:  188

[120 rows x 2 columns]

This is what I want to get:

              Names  LEN  Start  End
0     Ribosomal_S9:  121      0  121
1     Ribosomal_S8:  129    121  250
2    Ribosomal_L10:  100    250  350
3             GrpE:  166    350  516
4           DUF150:  141    516  657
..              ...  ...    ...  ...
115      TIGR03632:  117
116      TIGR03654:  175
117      TIGR03723:  314
118      TIGR03725:  212
119      TIGR03953:  188

[120 rows x 4 columns]

Could anyone point me in the right direction?

Answer 1

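Use DataFrame.assign with new columns built from Series.cumsum: the cumulative sum of LEN gives End, and the same series moved down one row with Series.shift (filling the first row with 0) gives Start: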

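import pandas as pd

# convert the column to integers
df['LEN'] = df['LEN'].astype(int)
# alternative: turn non-numeric values into missing values (NaN)
# df['LEN'] = pd.to_numeric(df['LEN'], errors='coerce')

# running total of the lengths = end position of each sequence
s = df['LEN'].cumsum()
# each sequence starts where the previous one ended; the first starts at 0
df = df.assign(Start = s.shift(fill_value=0), End = s)
print(df)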
            Names  LEN  Start  End
0   Ribosomal_S9:  121      0  121
1   Ribosomal_S8:  129    121  250
2  Ribosomal_L10:  100    250  350
3           GrpE:  166    350  516
4         DUF150:  141    516  657
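
For anyone who wants to try this without the original file, here is a minimal self-contained sketch of the same cumsum/shift approach. It assumes the first five rows from the question as sample data and assumes that LEN arrives as strings (the question does not say which dtype it has); the variable name `end` is only illustrative.

```python
import pandas as pd

# Sample data: the first five rows shown in the question.
# Assumption: LEN is stored as strings, so it must be converted first.
df = pd.DataFrame({
    "Names": ["Ribosomal_S9:", "Ribosomal_S8:", "Ribosomal_L10:", "GrpE:", "DUF150:"],
    "LEN": ["121", "129", "100", "166", "141"],
})

# Convert the lengths to integers so they can be summed.
df["LEN"] = df["LEN"].astype(int)

# Running total of the lengths: the end position of each sequence.
end = df["LEN"].cumsum()

# Shift the end positions down one row to get the start positions;
# fill_value=0 makes the first sequence start at 0 and keeps the dtype integer.
df = df.assign(Start=end.shift(fill_value=0), End=end)

print(df)
```

An equivalent way to get Start is `df["End"] - df["LEN"]`, which avoids the shift entirely. Note that if the `pd.to_numeric(..., errors='coerce')` alternative produces NaN values, the cumulative sum skips them and the result columns become floats, so it is probably worth checking for missing lengths first.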
