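I have a large table that I want to collapse by ID, pasting the values in a column together only when they differ within that ID.
Here is a small subset of my data: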
df <- structure(list(Uploaded_variation = c("rs616488", "rs616488",
"rs616488", "rs2992756", "rs140850326", "rs17426269", "rs17426269",
"rs11552449", "rs11552449"), Location = c("1:10506158-10506158",
"1:10506158-10506158", "1:10506158-10506158", "1:18480845-18480845",
"1:50380360-50380381", "1:87691240-87691240", "1:87691240-87691240",
"1:113905767-113905767", "1:113905767-113905767"), Allele = c("G",
"G", "G", "C", "-", "A", "A", "T", "T"), Consequence = c("intron_variant",
"intron_variant,non_coding_transcript_variant", "intron_variant",
"upstream_gene_variant", "intergenic_variant", "intron_variant,non_coding_transcript_variant",
"intron_variant,non_coding_transcript_variant", "upstream_gene_variant",
"missense_variant"), IMPACT = c("MODIFIER", "MODIFIER", "MODIFIER",
"MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODERATE"
)), .Names = c("Uploaded_variation", "Location", "Allele", "Consequence",
"IMPACT"), row.names = c(NA, 9L), class = "data.frame")
Uploaded_variation Location Allele Consequence IMPACT
1 rs616488 1:10506158-10506158 G intron_variant MODIFIER
2 rs616488 1:10506158-10506158 G intron_variant,non_coding_transcript_variant MODIFIER
3 rs616488 1:10506158-10506158 G intron_variant MODIFIER
4 rs2992756 1:18480845-18480845 C upstream_gene_variant MODIFIER
5 rs140850326 1:50380360-50380381 - intergenic_variant MODIFIER
6 rs17426269 1:87691240-87691240 A intron_variant,non_coding_transcript_variant MODIFIER
7 rs17426269 1:87691240-87691240 A intron_variant,non_coding_transcript_variant MODIFIER
8 rs11552449 1:113905767-113905767 T upstream_gene_variant MODIFIER
9 rs11552449 1:113905767-113905767 T missense_variant MODERATE
What I can do is group_by Uploaded_variation and then paste every value together:
x <- group_by(df, Uploaded_variation) %>%
summarise_all(funs(paste(., collapse = "; ")))
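However, this also pastes duplicated information together. What I want is for the values to be pasted together only when they differ. Desired output: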
Uploaded_variation Location Allele Consequence IMPACT
1 rs616488 1:10506158-10506158 G intron_variant; intron_variant,non_coding_transcript_variant MODIFIER
2 rs2992756 1:18480845-18480845 C upstream_gene_variant MODIFIER
3 rs140850326 1:50380360-50380381 - intergenic_variant MODIFIER
4 rs17426269 1:87691240-87691240 A intron_variant,non_coding_transcript_variant MODIFIER
5 rs11552449 1:113905767-113905767 T upstream_gene_variant; missense_variant MODIFIER; MODERATE
Answer 0 (score: 1)
Just add unique() inside your paste call:
x <- group_by(df, Uploaded_variation) %>%
summarise_all(funs(paste(unique(.), collapse = "; ")))
# showing just one column
x$Location
[1] "1:113905767-113905767" "1:50380360-50380381" "1:87691240-87691240"
[4] "1:18480845-18480845" "1:10506158-10506158"
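For readers on a recent dplyr: funs() has been deprecated since dplyr 0.8.0, and summarise_all() was superseded by across() in dplyr 1.0.0. The same idea can be sketched as below, assuming dplyr >= 1.0.0 is installed. The small data frame and the name `collapsed` are illustrative only, not part of the original question.

```r
library(dplyr)

# Illustrative subset of the question's data (two IDs, two value columns).
df <- data.frame(
  Uploaded_variation = c("rs616488", "rs616488", "rs11552449", "rs11552449"),
  Consequence = c("intron_variant", "intron_variant",
                  "upstream_gene_variant", "missense_variant"),
  IMPACT = c("MODIFIER", "MODIFIER", "MODIFIER", "MODERATE"),
  stringsAsFactors = FALSE
)

# For every non-grouping column, keep the distinct values within each ID
# and join them with "; ". A column with a single distinct value is left as is.
collapsed <- df %>%
  group_by(Uploaded_variation) %>%
  summarise(
    across(everything(), ~ paste(unique(.x), collapse = "; ")),
    .groups = "drop"
  )

print(collapsed)
```

As with the x$Location output above, the groups come back sorted by Uploaded_variation rather than in their original row order.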