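I would like to merge two columns of unequal length in a Pandas DataFrame.
I have tried many approaches using merge, concat and join, but none of them worked.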
import pandas as pd

keyList = ["Clone", "Chain", "Fragment", "R0", "R1", "R2"]
dataDict = {key: [] for key in keyList}

# Example for different list length
plist1 = ["ABCD", "DJFZ", "DHRZ"]
plist2 = ["ABCD", "DJFZ", "DHRZ", "JGJZ"]

filelist = ["E2_VH_Fab_R0.fasta", "E2_VH_scFV_R0.fasta", "E2_VH_Fab_R1.fasta",
            "E2_VH_scFV_R1.fasta", "E2_VH_Fab_R2.fasta"]

# Subsets are:
# E1 || E2 with VH || VL with Fab || scFV with R0 || R1 || R2
for datafile in filelist:
    # Get list with emits from class function (clseq is my own object, not shown here)
    peptidelist = clseq.processEmits()
    # Split filename into its parts: clone, chain, fragment, round
    sclone, schain, sfragment, sround = datafile.split('.')[0].split('_')
    # Iterate through peptide list and add the subsets into the dict
    for peptide in peptidelist:
        dataDict.setdefault("Clone", []).append(sclone)
        dataDict.setdefault("Chain", []).append(schain)
        dataDict.setdefault("Fragment", []).append(sfragment)
        # Set the other rounds to "NaN" so that all lists keep the same length
        if "R0" in sround:
            dataDict.setdefault("R0", []).append(peptide)
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append("NaN")
        elif "R1" in sround:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append(peptide)
            dataDict.setdefault("R2", []).append("NaN")
        elif "R2" in sround:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append(peptide)
        else:
            dataDict.setdefault("R0", []).append("NaN")
            dataDict.setdefault("R1", []).append("NaN")
            dataDict.setdefault("R2", []).append("NaN")

# dtframe is an existing DataFrame (its construction is not shown here)
dtframe.merge(pd.DataFrame(dataDict), on=['Clone', 'Chain', 'Fragment'], how='inner')
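The problem is that my lists have different lengths, and I want to merge them into a single DataFrame, with the missing entries filled with NaN.
This: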
0 E2 VH Fab r0 nan
1 E2 VH Fab r0 nan
2 E2 VH Fab r0 nan
3 E2 VH Fab r0 nan
4 E2 VH Fab r0 nan
5 E2 VH Fab r0 nan
and this:
0 E2 VH Fab nan r1
1 E2 VH Fab nan r1
2 E2 VH Fab nan r1
3 E2 VH Fab nan r1
4 E2 VH Fab nan r1
5 E2 VH Fab nan r1
6 E2 VH Fab nan r1
7 E2 VH Fab nan r1
should lead to the following result:
0 E2 VH Fab r0 r1
1 E2 VH Fab r0 r1
2 E2 VH Fab r0 r1
3 E2 VH Fab r0 r1
4 E2 VH Fab r0 r1
5 E2 VH Fab r0 r1
6 E2 VH Fab nan r1
7 E2 VH Fab nan r1
Please note that all of my data fields are strings.
Answer 0 (score: 1)
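This is a job for combine_first. We need to set the index to the three columns you want to merge on, and, for real data with many different groups, create an additional cumcount level so that each row within a group gets its own index entry.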
df1['idx'] = df1.groupby(['Clone', 'Chain', 'Fragment']).cumcount()
df2['idx'] = df2.groupby(['Clone', 'Chain', 'Fragment']).cumcount()
df1 = df1.set_index(['Clone', 'Chain', 'Fragment', 'idx'])
df2 = df2.set_index(['Clone', 'Chain', 'Fragment', 'idx'])
df1.combine_first(df2).reset_index()
# Clone Chain Fragment idx R0 R1
#0 E2 VH Fab 0 r0 r1
#1 E2 VH Fab 1 r0 r1
#2 E2 VH Fab 2 r0 r1
#3 E2 VH Fab 3 r0 r1
#4 E2 VH Fab 4 r0 r1
#5 E2 VH Fab 5 r0 r1
#6 E2 VH Fab 6 NaN r1
#7 E2 VH Fab 7 NaN r1
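To see why the extra cumcount level matters, here is a minimal sketch with small made-up frames (the names left, right and keys are illustrative and not taken from the question). Since every row shares the same Clone/Chain/Fragment combination, a plain merge on those three columns pairs each row of one frame with each row of the other, whereas cumcount numbers the rows within each group so that the keys become unique.

```python
import pandas as pd

keys = ["Clone", "Chain", "Fragment"]

# Illustrative frames: 2 rows and 3 rows sharing one key combination
left = pd.DataFrame({"Clone": ["E2"] * 2, "Chain": ["VH"] * 2,
                     "Fragment": ["Fab"] * 2, "R0": ["a", "b"]})
right = pd.DataFrame({"Clone": ["E2"] * 3, "Chain": ["VH"] * 3,
                      "Fragment": ["Fab"] * 3, "R1": ["x", "y", "z"]})

# Duplicate keys: merge yields every left/right pairing (2 * 3 rows)
crossed = left.merge(right, on=keys, how="inner")
print(len(crossed))  # 6

# cumcount numbers the rows within each key group: 0, 1, 2, ...
left["idx"] = left.groupby(keys).cumcount()
right["idx"] = right.groupby(keys).cumcount()

# With idx added, each key is unique, so rows pair up one-to-one;
# the third row of right has no partner and gets NaN in R0
aligned = left.merge(right, on=keys + ["idx"], how="outer")
print(aligned)
```

The same uniqueness is what lets combine_first line the rows up on the index instead of multiplying them.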
df1
Clone Chain Fragment R0 R1
0 E2 VH Fab r0 NaN
1 E2 VH Fab r0 NaN
2 E2 VH Fab r0 NaN
3 E2 VH Fab r0 NaN
4 E2 VH Fab r0 NaN
5 E2 VH Fab r0 NaN
df2
Clone Chain Fragment R0 R1
0 E2 VH Fab NaN r1
1 E2 VH Fab NaN r1
2 E2 VH Fab NaN r1
3 E2 VH Fab NaN r1
4 E2 VH Fab NaN r1
5 E2 VH Fab NaN r1
6 E2 VH Fab NaN r1
7 E2 VH Fab NaN r1
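For completeness, here is a self-contained sketch that rebuilds the two sample frames and applies the steps above. One assumption worth flagging: the loop in the question appends the string "NaN" rather than a real missing value, and combine_first only fills entries that are genuinely missing, so the sketch first turns those placeholders into real NaN with mask. The helper names make_frame and combine_rounds are hypothetical.

```python
import pandas as pd

keys = ["Clone", "Chain", "Fragment"]

def make_frame(n, r0, r1):
    # Build n identical rows, mimicking the dict filled in the question
    return pd.DataFrame({"Clone": ["E2"] * n, "Chain": ["VH"] * n,
                         "Fragment": ["Fab"] * n,
                         "R0": [r0] * n, "R1": [r1] * n})

df1 = make_frame(6, "r0", "NaN")  # "NaN" is a plain string here
df2 = make_frame(8, "NaN", "r1")

def combine_rounds(a, b):
    # Hypothetical helper wrapping the combine_first approach
    frames = []
    for df in (a, b):
        df = df.mask(df == "NaN")                # string "NaN" -> real NaN
        df["idx"] = df.groupby(keys).cumcount()  # position within each group
        frames.append(df.set_index(keys + ["idx"]))
    # Values from the first frame win; gaps are filled from the second
    return frames[0].combine_first(frames[1]).reset_index()

result = combine_rounds(df1, df2)
print(result)
```

If the placeholders were left as strings, combine_first would treat "NaN" as an ordinary value in df1 and would not fill R1 from df2.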