Pandas: iterate over dataframe rows, modify them, then rebuild the dataframe in a for loop

Time: 2018-07-24 02:01:02

Tags: python pandas

I am probably making this much harder than it needs to be.

The dataframe looks like this:

CHROMOSOME START END
CHR1       100   200
CHR2       300   400

My goal is to create a dataframe from it that contains 4 rows, like this.

CHROMOSOME START END LABEL
CHR1       150   250 ROW_1_A
CHR1       170   270 ROW_1_B
CHR2       350   300 ROW_2_A
CHR2       370   400 ROW_2_B

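So I need to take each row, split it into an A row and a B row, modify START and END, label each row as A or B, and rebuild everything into a dataframe.

Here is my function that splits, modifies, and labels a single row.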

def getcoordinates(df, awindow = 500, bwindow = 500):

    index = df[0]
    chromosome = df[1]
    start = df[2]
    end = df[3]
    sv_length = df[8]

    track = {'CHROMOSOME': chromosome,
            'START': start,
            'END': end}

    track = pd.DataFrame(data=track, index=[0])

    trackA = track.copy()
    trackB = track.copy()

    trackA = trackA.assign(LABEL = ("AVN_DEL_" + str(index) + "_A"))
    trackB = trackB.assign(LABEL = ("AVN_DEL_" + str(index) + "_B"))

    trackA = trackA.assign(END = trackA["START"])
    trackA = trackA.assign(START = trackA["START"] - awindow)

    trackB = trackB.assign(START = trackB["END"])
    trackB = trackB.assign(END = trackB["END"] + bwindow)

    return trackA.append(trackB)

Here is my for loop, which does this for every row of the dataframe and reassembles the result.

appended_data = []
for row in SV.itertuples():
    print(row)
    out = getcoordinates(row)
    appended_data.append(out)

appended_data = pd.concat(appended_data, axis=1)

Here is the actual code being run, along with its output.

appended_data = []
for row in SV.itertuples():
    print(row)
    out = getcoordinates(row)
    appended_data.append(out)
appended_data = pd.concat(appended_data, axis=1)
Pandas(Index=0, CHROMOSOME=u'chr1', START=56365453, END=56369289, SV_TYPE=u'DEL', CALLERS=u'GROM;delly;manta;lumpy', LEFT_JUNCTION=u'L1M', RIGHT_JUNCTION=u'L1M', SV_LENGTH=3836, _9=u'DGV', FULL_INFO_ABOUT_ME=u'4_L1MC4_56365281_56365445_92_2.4;L1HS_56365452_56369282_101_2.63;L1HS_56365452_56369282_93_2.42;L1MC4_56369289_56369625_100_2.61')
Pandas(Index=1, CHROMOSOME=u'chr1', START=75645801, END=79014667, SV_TYPE=u'DEL', CALLERS=u'GROM;manta;lumpy', LEFT_JUNCTION=u'L1P', RIGHT_JUNCTION=u'L1P', SV_LENGTH=3368866, _9=u' ', FULL_INFO_ABOUT_ME=u'2_L1PA5_75644642_75646421_300_0.01;L1PA4_79013861_79016088_300_0.01')
appended_data.head()
  CHROMOSOME       END     START     ...            END     START        LABEL
0       chr1  56365453  56364953     ...       75645801  75645301  AVN_DEL_1_A
0       chr1  56369789  56369289     ...       79015167  79014667  AVN_DEL_1_B

Note how the rows are incorrectly joined side by side in the final result. I think this is caused by this line in the getcoordinates function:

track = pd.DataFrame(data=track, index=[0])

I would like to set the index to the index value I get when each dataframe row is converted to a tuple, but I keep getting this error:

ValueError: Shape of passed values is (8, 6), indices imply (8, 4)

I am having a hard time moving from the tidyverse to pandas, so please go easy on me.
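One thing worth checking in the loop above is the axis argument passed to pd.concat: axis=1 places frames side by side, which would match the widened output shown, while axis=0 stacks them as rows. Below is a minimal sketch of the difference. The frames first and second are made-up stand-ins for two getcoordinates results, not the real data.

```python
import pandas as pd

# Hypothetical stand-ins for two per-row results (two output rows each).
first = pd.DataFrame({"START": [50, 200], "LABEL": ["ROW_0_A", "ROW_0_B"]})
second = pd.DataFrame({"START": [250, 400], "LABEL": ["ROW_1_A", "ROW_1_B"]})

# axis=1 aligns on the index and puts the frames next to each other,
# so the result gains columns instead of rows.
side_by_side = pd.concat([first, second], axis=1)

# axis=0 (the default) stacks the frames vertically, one block of rows
# after another; ignore_index=True renumbers the rows 0..n-1.
stacked = pd.concat([first, second], axis=0, ignore_index=True)

print(side_by_side.shape)  # 2 rows, 4 columns
print(stacked.shape)       # 4 rows, 2 columns
```

If the goal is one output row per A or B window, the stacked form is the shape to aim for.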

1 Answer:

Answer 0: (score: 1)

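I am not sure this is the best approach, but you can do it by defining the following function, which creates 2 new rows for each row of the old df: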

def get_new(row, awindow, bwindow):
    # Row A: the window of width awindow that ends at the original START.
    new_row_A = {}
    new_row_A['CHROMOSOME'] = row['CHROMOSOME']
    new_row_A['START'] = row['START'] - awindow
    new_row_A['END'] = row['START']
    new_row_A['LABEL'] = 'AVN_DEL_' + str(row.name) + '_A'
    # Row B: the window of width bwindow that starts at the original END.
    new_row_B = {}
    new_row_B['CHROMOSOME'] = row['CHROMOSOME']
    new_row_B['START'] = row['END']
    new_row_B['END'] = row['END'] + bwindow
    new_row_B['LABEL'] = 'AVN_DEL_' + str(row.name) + '_B'
    return [new_row_A, new_row_B]

Then call this function on each row, like this:

awindow = 500
bwindow = 500
# DataFrame.append was removed in pandas 2.0, so collect the pieces
# in a list and concatenate them once at the end.
pieces = []
for new_rows in df.apply(lambda row: get_new(row, awindow, bwindow), axis=1):
    pieces.append(pd.DataFrame(new_rows))
new_df = pd.concat(pieces, ignore_index=True)
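As an alternative to building the frame row by row, the same result can probably be reached without an explicit loop. This is a minimal sketch, assuming the column names CHROMOSOME, START and END from the question. The function name flank_windows and its prefix parameter are hypothetical, introduced only for illustration.

```python
import pandas as pd

def flank_windows(df, awindow=500, bwindow=500, prefix="AVN_DEL_"):
    """Return two flanking windows (A before START, B after END) per row."""
    # One label stem per input row, based on the row's index value.
    labels = prefix + df.index.astype(str)

    # A rows: window of width awindow ending at the original START.
    track_a = df[["CHROMOSOME"]].assign(
        START=df["START"] - awindow,
        END=df["START"],
        LABEL=labels + "_A",
    )
    # B rows: window of width bwindow starting at the original END.
    track_b = df[["CHROMOSOME"]].assign(
        START=df["END"],
        END=df["END"] + bwindow,
        LABEL=labels + "_B",
    )
    # Stack A and B, then use a stable sort on the original index so each
    # row's A window comes directly before its B window.
    out = pd.concat([track_a, track_b]).sort_index(kind="mergesort")
    return out.reset_index(drop=True)

# Toy input taken from the question's small example.
df = pd.DataFrame({
    "CHROMOSOME": ["CHR1", "CHR2"],
    "START": [100, 300],
    "END": [200, 400],
})
out = flank_windows(df, awindow=50, bwindow=50)
print(out)
```

Working on whole columns avoids creating one small DataFrame per row, which tends to matter once the input has many rows.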