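R - Group by with dplyr and remove duplicates only if all members of the group are duplicates

Date: 2017-09-10 03:07:18

Tags: r duplicates dplyr

I have a large data frame with many duplicates in one column. I am trying to parse the data frame so that only one entry remains for each duplicate, unless all of the entries are duplicates. (I couldn't find any Stack Overflow answer that helps with the second part...)

Example df code: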

mydf <- data.frame(accession=c("A", "A", "A", "A", "B", "B", "C", "C", "D"), gene=c("unknown", "red1", "red2", "blue", "green1", "green2", "unknown", "unknown2", "violet"), ident=c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86))

The df looks like this:

   accession   gene      ident
1  A           unknown   100.0   
2  A           red1      95.3
3  A           red2      80.2
4  A           blue      65.1
5  B           green1    94.2
6  B           green2    100.0
7  C           unknown   97.1
8  C           unknown2  90.0
9  D           violet    86.0

The output table I want looks like this:

   accession   gene      ident   
2  A           red1      95.3
6  B           green2    100.0
7  C           unknown   97.1
8  C           unknown2  90.0

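Keep only one row per unique accession value, chosen as the "known" gene with the highest ident, unless all of the duplicated entries for that accession contain the string unknown*.

I am stuck on the last part: if every gene for a duplicated accession contains unknown*, keep all of its rows. This is what I have done so far: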

library(dplyr)
mydf$dup <- duplicated(mydf$accession, fromLast = FALSE)|duplicated(mydf$accession, fromLast = TRUE)
mydf <- mydf %>% group_by(accession) %>% mutate(count=n())
mydf <- subset.data.frame(mydf, mydf$dup == TRUE)
mydf <- mydf %>% group_by(accession) %>% filter(!grepl("unknown", gene)) %>% top_n(1,ident)

Which gives:

   accession   gene      ident   dup    count   
2  A           red1      95.3    TRUE   4
6  B           green2    100.0   TRUE   2

My intuition was to use an if statement:

mydf <- mydf %>% group_by(accession) %>% 
if(count(grepl("unknown", mydf$gene))!= mydf$count)
      {filter(!grepl("unknown", gene))} 
%>% top_n(1, ident)

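But I ran into an error:

Error in if (.) count(grepl("unknown", mydf$gene)) != mydf$count else { : argument is not interpretable as logical
In addition: Warning message:
In if (.) count(grepl("unknown", mydf$gene)) != mydf$count else { : the condition has length > 1 and only the first element will be used

What is the right solution? I'm not married to dplyr if there is a better way! Thanks!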
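For context on the error: a pipe passes its left-hand side as the first argument of the next call, so the data frame itself ends up as the `if` condition, which is what the `if (.)` in the message reflects. One possible way to express the intended rule without `if` is to evaluate the condition per group inside `filter()`. The following is only a sketch on the sample data; the names `keep_all` and `result` are made up for illustration.

```r
library(dplyr)

mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2",
           "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86)
)

result <- mydf %>%
  group_by(accession) %>%
  # drop accessions that occur only once
  filter(n() > 1) %>%
  # keep_all (hypothetical helper column): TRUE for every row of a group
  # in which all genes match "unknown"
  mutate(keep_all = all(grepl("unknown", gene))) %>%
  # in mixed groups, drop the unknown rows
  filter(keep_all | !grepl("unknown", gene)) %>%
  # in mixed groups, keep only the row(s) with the highest remaining ident
  filter(keep_all | ident == max(ident)) %>%
  ungroup() %>%
  select(accession, gene, ident)

print(result)
```

Because `all()` collapses the per-row matches to a single value per group, the condition is a single TRUE or FALSE for each accession, which is what the `if` attempt was trying to test.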

2 Answers:

Answer 0: (score: 2)

You can try this:

mydf %>%
  group_by(accession) %>%
  mutate(n = n()) %>%
  filter(n > 1) %>%
  mutate(ident_rnk = min_rank(ident),
         ident_rnk = if_else(grepl("unknown",gene),-1L,ident_rnk)) %>%
  top_n(n = 1,wt = ident_rnk) %>%
  select(accession,gene,ident)
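To see why this works, it can help to inspect the intermediate ranks before `top_n()` is applied. The sketch below reruns the first steps of the pipeline on the sample data from the question; the name `ranked` is only illustrative.

```r
library(dplyr)

mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2",
           "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86)
)

ranked <- mydf %>%
  group_by(accession) %>%
  filter(n() > 1) %>%
  # rank ident within each accession (1 = lowest), then push every
  # "unknown" row below all known rows by giving it rank -1
  mutate(ident_rnk = min_rank(ident),
         ident_rnk = if_else(grepl("unknown", gene), -1L, ident_rnk)) %>%
  ungroup()

print(ranked)
```

In group A the ranks come out as -1, 3, 2, 1, so red1 wins. In group C both rows are tied at -1, and since `top_n()` keeps ties, both unknown rows survive. In dplyr 1.0.0 and later `top_n()` is superseded by `slice_max()`, which also keeps ties by default.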

Answer 1: (score: 2)

Another option:

1) First arrange the data frame so that unknown is sorted to the end of each group, while sorting ident in descending order;

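2) Filter per group: make sure the group has more than one row, and then keep a row if either the first gene starts with unknown (which means the whole group consists of unknown entries, since unknown has been sorted to the end) or the row is in the first position.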
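The two steps described in this answer can be sketched roughly as follows. This is a reconstruction from the description on the question's sample data, not necessarily the answerer's exact code, and `result` is an illustrative name.

```r
library(dplyr)

mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2",
           "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86)
)

result <- mydf %>%
  # step 1: within each accession, known genes first (FALSE sorts before TRUE),
  # then highest ident first
  arrange(accession, grepl("^unknown", gene), desc(ident)) %>%
  group_by(accession) %>%
  # step 2: keep groups with more than one row; within them keep every row
  # if the first gene is unknown (so the whole group is), else only the first row
  filter(n() > 1 & (grepl("^unknown", first(gene)) | row_number() == 1)) %>%
  ungroup()

print(result)
```

Sorting on the logical `grepl()` result is what makes the first row of each group informative: if even the first gene is unknown, no known gene exists in that group.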