Question

我有一个df：

import pandas as pd
df.head(20)
                          id  ch      start        end strand
0   10:100026072-100029645(+)  10  100026072  100029645      +
1   10:110931880-110932381(+)  10  110931880  110932381      +
2   10:110932431-110933096(+)  10  110932431  110933096      +
3   10:111435307-111439556(-)  10  111435307  111439556      -
4   10:115954439-115964883(-)  10  115954439  115964883      -
5   10:115986231-116018509(-)  10  115986231  116018509      -
6   10:116500106-116500762(-)  10  116500106  116500762      -
7   10:116654355-116657389(-)  10  116654355  116657389      -
8   10:117146840-117147002(-)  10  117146840  117147002      -
9   10:126533798-126533971(-)  10  126533798  126533971      -
10  10:127687390-127688824(+)  10  127687390  127688824      +
11    10:19614164-19624369(-)  10   19614164   19624369      -
12    10:42537888-42543687(+)  10   42537888   42543687      +
13    10:61927486-61931038(-)  10   61927486   61931038      -
14    10:70699779-70700206(-)  10   70699779   70700206      -
15    10:76532243-76532565(-)  10   76532243   76532565      -
16    10:79336852-79337034(-)  10   79336852   79337034      -
17    10:79342487-79343173(+)  10   79342487   79343173      +
18    10:79373277-79373447(-)  10   79373277   79373447      -
19    10:82322045-82337358(+)  10   82322045   82337358      +

df.shape
(501, 5)

>>>df.dtypes
 id        object
 ch        object
 start     object
 end       object
 strand    object
 dtype: object

问题：

我想基于＆＃39; start＆＃39;执行多项操作。并且＆＃39;结束＆＃39;柱

首先创建两个名为

的附加列

newstart newend  

desiredoperation: if strand == '+':
                     df['newstart'] = end - int(27)
                     df['newend'] = end + 2
                  elif: 
                      strand == '-'
                      df['newstart'] = start - int(3)
                      df['newend'] = start + 26

我怎么能用熊猫这样做，我找到了下面的链接，但不知道如何执行它。如果任何人都可以提供伪代码就会在其上构建。 adding multiple columns to pandas simultaneously

Answer 1

你可以使用np.where，2行但可读

df['newstart'] = np.where(df.strand == '+', df.end-int(27), df.start-int(3))
df['newend'] = np.where(df.strand == '+', df.end+int(2), df.start+int(26))

    id                         ch   start       end      strand newstart    newend
0   10:100026072-100029645(+)   10  100026072   100029645   +   100029618   100029647
1   10:110931880-110932381(+)   10  110931880   110932381   +   110932354   110932383
2   10:110932431-110933096(+)   10  110932431   110933096   +   110933069   110933098
3   10:111435307-111439556(-)   10  111435307   111439556   -   111435304   111435333
4   10:115954439-115964883(-)   10  115954439   115964883   -   115954436   115954465
5   10:115986231-116018509(-)   10  115986231   116018509   -   115986228   115986257
6   10:116500106-116500762(-)   10  116500106   116500762   -   116500103   116500132
7   10:116654355-116657389(-)   10  116654355   116657389   -   116654352   116654381
8   10:117146840-117147002(-)   10  117146840   117147002   -   117146837   117146866
9   10:126533798-126533971(-)   10  126533798   126533971   -   126533795   126533824

Answer 2

如果你想在熊猫中这样做，df.loc是一个很好的候选者：

df['newstart'] = df['start'] - 3 
df['newend'] = df['start'] + 26
subset = df['strand'] == '+'
df.loc[subset,'newstart']=df.loc[subset,'end']-27
df.loc[subset,'newend']=df.loc[subset,'end']+2

我认为继续使用pandas处理数据是一个好主意：它会保持代码的一致性，并且可能有一种更好，更短的方式来编写上面的代码。

df.loc是一个非常有用的函数来执行数据查找和处理，试图摆弄它，因为它是一个很棒的工具。

享受

pandas多条件并添加多列

2 个答案: