Collapse columns by ID and paste values together only when they differ

Time: 2019-03-22 01:28:03

Tags: r dplyr tidyr

I have a large table that I want to collapse by ID, pasting the values in a column together only when they differ within that ID.

Here is a small subset of the data:

df <- structure(list(Uploaded_variation = c("rs616488", "rs616488", 
"rs616488", "rs2992756", "rs140850326", "rs17426269", "rs17426269", 
"rs11552449", "rs11552449"), Location = c("1:10506158-10506158", 
"1:10506158-10506158", "1:10506158-10506158", "1:18480845-18480845", 
"1:50380360-50380381", "1:87691240-87691240", "1:87691240-87691240", 
"1:113905767-113905767", "1:113905767-113905767"), Allele = c("G", 
"G", "G", "C", "-", "A", "A", "T", "T"), Consequence = c("intron_variant", 
"intron_variant,non_coding_transcript_variant", "intron_variant", 
"upstream_gene_variant", "intergenic_variant", "intron_variant,non_coding_transcript_variant", 
"intron_variant,non_coding_transcript_variant", "upstream_gene_variant", 
"missense_variant"), IMPACT = c("MODIFIER", "MODIFIER", "MODIFIER", 
"MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODIFIER", "MODERATE"
)), .Names = c("Uploaded_variation", "Location", "Allele", "Consequence", 
"IMPACT"), row.names = c(NA, 9L), class = "data.frame")

  Uploaded_variation              Location Allele                                  Consequence   IMPACT
1           rs616488   1:10506158-10506158      G                               intron_variant MODIFIER
2           rs616488   1:10506158-10506158      G intron_variant,non_coding_transcript_variant MODIFIER
3           rs616488   1:10506158-10506158      G                               intron_variant MODIFIER
4          rs2992756   1:18480845-18480845      C                        upstream_gene_variant MODIFIER
5        rs140850326   1:50380360-50380381      -                           intergenic_variant MODIFIER
6         rs17426269   1:87691240-87691240      A intron_variant,non_coding_transcript_variant MODIFIER
7         rs17426269   1:87691240-87691240      A intron_variant,non_coding_transcript_variant MODIFIER
8         rs11552449 1:113905767-113905767      T                        upstream_gene_variant MODIFIER
9         rs11552449 1:113905767-113905767      T                             missense_variant MODERATE

What I can do is group_by Uploaded_variation and then paste every value in each group together:

x <- group_by(df, Uploaded_variation) %>%
        summarise_all(funs(paste(., collapse = "; ")))

However, this also pastes repeated values together. What I want is for values to be pasted together only when they differ. Desired output:

  Uploaded_variation              Location Allele                                                  Consequence             IMPACT
1           rs616488   1:10506158-10506158      G intron_variant; intron_variant,non_coding_transcript_variant           MODIFIER
2          rs2992756   1:18480845-18480845      C                                        upstream_gene_variant           MODIFIER
3        rs140850326   1:50380360-50380381      -                                           intergenic_variant           MODIFIER
4         rs17426269   1:87691240-87691240      A                 intron_variant,non_coding_transcript_variant           MODIFIER
5         rs11552449 1:113905767-113905767      T                      upstream_gene_variant; missense_variant MODIFIER; MODERATE

1 answer:

Answer 0 (score: 1)

Just wrap the values in unique() inside your paste call, so each distinct value within a group is pasted only once:

library(dplyr)

# unique() drops repeated values within each group before they are pasted
x <- group_by(df, Uploaded_variation) %>%
  summarise_all(funs(paste(unique(.), collapse = "; ")))

# showing just one column
x$Location
[1] "1:113905767-113905767" "1:50380360-50380381"   "1:87691240-87691240"  
[4] "1:18480845-18480845"   "1:10506158-10506158"
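
In recent dplyr releases funs() is deprecated, so the same idea may be easier to maintain when written with across(). The following is a minimal sketch, assuming dplyr 1.0.0 or later; `df_small` and `collapsed` are hypothetical names, and `df_small` is a cut-down stand-in for the question's data, not the original table.

```r
library(dplyr)

# Hypothetical stand-in for the question's data: same column names, fewer rows
df_small <- data.frame(
  Uploaded_variation = c("rs616488", "rs616488", "rs11552449", "rs11552449"),
  Consequence = c("intron_variant", "intron_variant",
                  "upstream_gene_variant", "missense_variant"),
  IMPACT = c("MODIFIER", "MODIFIER", "MODIFIER", "MODERATE"),
  stringsAsFactors = FALSE
)

# For every non-grouping column, keep the distinct values per ID and join them
collapsed <- df_small %>%
  group_by(Uploaded_variation) %>%
  summarise(across(everything(), ~ paste(unique(.x), collapse = "; ")),
            .groups = "drop")

print(collapsed)
```

Note that group_by() on a character column returns the groups in sorted order, which is why the output above does not follow the row order of the input. If first-appearance order matters, one option is to convert the ID to a factor whose levels are unique(Uploaded_variation) before grouping.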