Basically, I have a data frame df:
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 A G NA NA F
Pathway8 A G NA NA E
Pathway9 A G Z H F
Pathway6 A G Z H E
Pathway2 A G D NA F
Pathway5 A G D NA E
Pathway1 A D K NA F
Pathway7 A B C D F
Pathway4 A B C D E
Now I want to consolidate the rows so that it looks like this:
newdf
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 A G NA NA F, E
Pathway9 A G Z H F, E
Pathway2 A G D NA F, E
Pathway1 A D K NA F
Pathway4 A B C D F, E
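This is a continuation of a past question I asked (Consolidating duplicate rows in a dataframe). That approach works for this data set, but for my larger data set it does not seem to combine the values. For example, here are the first few rows of output (after I adapted the code given by @Matt Jewett, or used the explanation provided in Concatenate strings by group with dplyr):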
Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway1 Smoothened Gl-1 Osteopontin
Pathway2 Smoothened Gl-1 BMP2 Osteopontin
Pathway3 Smoothened Gl-1 BMP2 DLX5
Pathway4 Smoothened Gl-1 BMP2 Osteopontin
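As you can see, there are a couple of problems. First, the Biomarker1 column does not appear to be aggregated. Second, several rows are duplicated. I've hit a wall looking for a solution, so any solution you can think of would be greatly appreciated!
Thank you very much for your help!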
Answer 0 (score: 1)
Using data.table:
library(data.table)
dat <- fread("Pathway Beginning1 Protein2 Protein3 Protein4 Biomarker1
Pathway3 A G NA NA F
Pathway8 A G NA NA E
Pathway9 A G Z H F
Pathway6 A G Z H E
Pathway2 A G D NA F
Pathway5 A G D NA E
Pathway1 A D K NA F
Pathway7 A B C D F
Pathway4 A B C D E")
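# Group by the protein columns; keep the first Pathway label in each group
# and paste that group's Biomarker1 values into one comma-separated string.
dat_collapse <- dat[, .(Pathway = Pathway[1],
                        Biomarker1 = paste0(Biomarker1, collapse = ", ")),
                    by = .(Beginning1, Protein2, Protein3, Protein4)]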
setcolorder(dat_collapse, names(dat))
dat_collapse
Result:
Pathway Beginning1 Protein2 Protein3 Protein4 Biomarker1
1: Pathway3 A G NA NA F, E
2: Pathway9 A G Z H F, E
3: Pathway2 A G D NA F, E
4: Pathway1 A D K NA F
5: Pathway7 A B C D F, E
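On the larger data set, one possible reason the rows fail to merge (an assumption, since the real data isn't shown) is that the grouping columns contain blank strings or stray whitespace instead of NA, so keys that look identical do not compare equal. A minimal sketch of cleaning the keys before collapsing follows; the sample values below are hypothetical, loosely based on the output in the question, and unique() is added so a repeated biomarker is listed only once:

```r
library(data.table)

# Hypothetical sample: blank strings and a trailing space stand in for the
# kind of inconsistencies a larger data set might contain (an assumption).
dat <- data.table(
  Pathway    = c("Pathway1", "Pathway2", "Pathway3", "Pathway4"),
  Beginning1 = c("Smoothened", "Smoothened ", "Smoothened", "Smoothened"),
  Protein2   = c("Gl-1", "Gl-1", "Gl-1", "Gl-1"),
  Protein3   = c("BMP2", "BMP2", "", NA),
  Protein4   = c(NA, "", NA, NA),
  Biomarker1 = c("Osteopontin", "DLX5", "Osteopontin", "Osteopontin")
)

key_cols <- c("Beginning1", "Protein2", "Protein3", "Protein4")

# Trim whitespace and turn empty strings into NA so that keys which look
# the same actually compare equal when grouping.
for (col in key_cols) {
  cleaned <- trimws(dat[[col]])
  cleaned[!is.na(cleaned) & cleaned == ""] <- NA_character_
  set(dat, j = col, value = cleaned)
}

# Same collapse as above, with unique() to drop repeated biomarkers.
dat_collapse <- dat[, .(Pathway = Pathway[1],
                        Biomarker1 = paste(unique(Biomarker1), collapse = ", ")),
                    by = key_cols]
setcolorder(dat_collapse, names(dat))
dat_collapse
```

If the keys in the real data differ for another reason (for example inconsistent capitalisation), the cleaning loop would need to be adapted accordingly.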