I am probably making this much harder than it needs to be.
My data frame looks like this:
CHROMOSOME START END
CHR1 100 200
CHR2 300 400
My goal is to create a data frame from it with 4 rows, like this:
CHROMOSOME START END LABEL
CHR1 150 250 ROW_1_A
CHR1 170 270 ROW_1_B
CHR2 350 300 ROW_2_A
CHR2 370 400 ROW_2_B
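So I need to take each row, split it into an A row and a B row, modify START and END, label each row as A or B, and reassemble everything into a data frame.
Here is my function that splits, modifies, and labels a single row.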
def getcoordinates(df, awindow = 500, bwindow = 500):
    # df is one row from SV.itertuples(): position 0 is the index,
    # the remaining positions are the columns in order
    index = df[0]
    chromosome = df[1]
    start = df[2]
    end = df[3]
    sv_length = df[8]
    track = {'CHROMOSOME': chromosome,
             'START': start,
             'END': end}
    track = pd.DataFrame(data=track, index=[0])
    trackA = track.copy()
    trackB = track.copy()
    trackA = trackA.assign(LABEL = ("AVN_DEL_" + str(index) + "_A"))
    trackB = trackB.assign(LABEL = ("AVN_DEL_" + str(index) + "_B"))
    # A: the awindow bases before START
    trackA = trackA.assign(END = trackA["START"])
    trackA = trackA.assign(START = trackA["START"] - awindow)
    # B: the bwindow bases after END
    trackB = trackB.assign(START = trackB["END"])
    trackB = trackB.assign(END = trackB["END"] + bwindow)
    # pd.concat stacks the two rows (DataFrame.append was removed in pandas 2.0)
    return pd.concat([trackA, trackB])
Here is my for loop that does this for every row of the data frame and reassembles the result.
appended_data = []
for row in SV.itertuples():
    print(row)
    out = getcoordinates(row)
    appended_data.append(out)
appended_data = pd.concat(appended_data, axis=1)
Here is the actual code being run, with its output.
appended_data = []
for row in SV.itertuples():
    print(row)
    out = getcoordinates(row)
    appended_data.append(out)
appended_data = pd.concat(appended_data, axis=1)
Pandas(Index=0, CHROMOSOME=u'chr1', START=56365453, END=56369289, SV_TYPE=u'DEL', CALLERS=u'GROM;delly;manta;lumpy', LEFT_JUNCTION=u'L1M', RIGHT_JUNCTION=u'L1M', SV_LENGTH=3836, _9=u'DGV', FULL_INFO_ABOUT_ME=u'4_L1MC4_56365281_56365445_92_2.4;L1HS_56365452_56369282_101_2.63;L1HS_56365452_56369282_93_2.42;L1MC4_56369289_56369625_100_2.61')
Pandas(Index=1, CHROMOSOME=u'chr1', START=75645801, END=79014667, SV_TYPE=u'DEL', CALLERS=u'GROM;manta;lumpy', LEFT_JUNCTION=u'L1P', RIGHT_JUNCTION=u'L1P', SV_LENGTH=3368866, _9=u' ', FULL_INFO_ABOUT_ME=u'2_L1PA5_75644642_75646421_300_0.01;L1PA4_79013861_79016088_300_0.01')
appended_data.head()
CHROMOSOME END START ... END START LABEL
0 chr1 56365453 56364953 ... 75645801 75645301 AVN_DEL_1_A
0 chr1 56369789 56369289 ... 79015167 79014667 AVN_DEL_1_B
Notice how the rows are incorrectly joined side by side in the final result. I think this is caused by this line in the getcoordinates function:
track = pd.DataFrame(data=track, index=[0])
I would like to set the index to the Index value I get when each data frame row is converted to a tuple, but I keep getting this error:
ValueError: Shape of passed values is (8, 6), indices imply (8, 4)
I am having a hard time moving from the tidyverse to pandas, so please go easy on me.
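One thing worth checking here is the axis argument of pd.concat. The following is a minimal sketch with two made-up single-row frames (the names first and second and their values are illustrative, not taken from the real SV data). It shows that axis=1 places frames next to each other as extra columns, which matches the side-by-side layout in the output above, while axis=0 stacks them as rows.

```python
import pandas as pd

# Two small per-row results, each indexed 0 like the frame built inside
# getcoordinates (values are made up for illustration).
first = pd.DataFrame({"CHROMOSOME": ["chr1"], "START": [100], "END": [200]}, index=[0])
second = pd.DataFrame({"CHROMOSOME": ["chr2"], "START": [300], "END": [400]}, index=[0])

# axis=1 aligns on the index and puts the frames next to each other.
side_by_side = pd.concat([first, second], axis=1)

# axis=0 (the default) stacks them as rows; ignore_index renumbers the rows.
stacked = pd.concat([first, second], axis=0, ignore_index=True)

print(side_by_side.shape)  # (1, 6)
print(stacked.shape)       # (2, 3)
```

If stacked rows are the goal, dropping axis=1 from the final pd.concat call is probably the smallest change to try.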
Answer 0 (score: 1)
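I am not sure this is the best approach, but you can do it by defining the following function, which creates 2 new rows for each row of the old df: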
def get_new(row, awindow, bwindow):
    # Row A: the awindow bases immediately before START
    new_row_A = {}
    new_row_A['CHROMOSOME'] = row['CHROMOSOME']
    new_row_A['START'] = row['START']-awindow
    new_row_A['END'] = row['START']
    new_row_A['LABEL'] = 'AVN_DEL_'+str(row.name)+'_A'
    # Row B: the bwindow bases immediately after END
    new_row_B = {}
    new_row_B['CHROMOSOME'] = row['CHROMOSOME']
    new_row_B['START'] = row['END']
    new_row_B['END'] = row['END']+bwindow
    new_row_B['LABEL'] = 'AVN_DEL_'+str(row.name)+'_B'
    return [new_row_A, new_row_B]
Then call this function on every row, like this:
awindow = 500
bwindow = 500
# Build one two-row frame per input row, then stack them in a single concat
# (DataFrame.append was removed in pandas 2.0).
pieces = [pd.DataFrame(pair)
          for pair in df.apply(lambda row: get_new(row, awindow, bwindow), axis=1)]
new_df = pd.concat(pieces, ignore_index=True)
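As an alternative to going row by row, the same A and B rows can probably be built with column arithmetic. This is only a sketch under assumptions: the input values are made up, the names flank_a, flank_b, and prefix are illustrative, and the window logic (A ends at START, B starts at END) is copied from the function in the question.

```python
import pandas as pd

# Toy input using the column names from the question; values are made up.
df = pd.DataFrame({
    "CHROMOSOME": ["chr1", "chr2"],
    "START": [1000, 3000],
    "END": [2000, 4000],
})
awindow = 500
bwindow = 500

# Label prefix per original row, e.g. AVN_DEL_0
prefix = "AVN_DEL_" + df.index.to_series().astype(str)

# A: the window of size awindow that ends at START
flank_a = pd.DataFrame({
    "CHROMOSOME": df["CHROMOSOME"],
    "START": df["START"] - awindow,
    "END": df["START"],
    "LABEL": prefix + "_A",
})
# B: the window of size bwindow that starts at END
flank_b = pd.DataFrame({
    "CHROMOSOME": df["CHROMOSOME"],
    "START": df["END"],
    "END": df["END"] + bwindow,
    "LABEL": prefix + "_B",
})

# Stack as rows; a stable sort on the original index keeps each A
# directly above its B before the index is renumbered.
new_df = (
    pd.concat([flank_a, flank_b])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(new_df)
```

This avoids building many small frames, which tends to matter once the input has many rows.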