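I have a large data frame with many duplicated values in one column. I am trying to reduce the data frame so that only one entry remains for each duplicated value, unless all of the entries are duplicates. (I couldn't find any Stack Overflow answers that help with the second part...)
Sample df code: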
mydf <- data.frame(accession=c("A", "A", "A", "A", "B", "B", "C", "C", "D"), gene=c("unknown", "red1", "red2", "blue", "green1", "green2", "unknown", "unknown2", "violet"), ident=c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86))
The df looks like this:
accession gene ident
1 A unknown 100.0
2 A red1 95.3
3 A red2 80.2
4 A blue 65.1
5 B green1 94.2
6 B green2 100.0
7 C unknown 97.1
8 C unknown2 90.0
9 D violet 86.0
The output table I want looks like this:
accession gene ident
2 A red1 95.3
6 B green2 100.0
7 C unknown 97.1
8 C unknown2 90.0
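Only one row is kept for each unique value of accession, chosen as the "known" gene with the highest ident, unless all of the duplicated entries for a particular accession contain the string unknown*.
I'm stuck on the last part - keeping all rows for a duplicated accession if every gene contains unknown*. This is what I've done so far: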
library(dplyr)
mydf$dup <- duplicated(mydf$accession, fromLast = FALSE)|duplicated(mydf$accession, fromLast = TRUE)
mydf <- mydf %>% group_by(accession) %>% mutate(count=n())
mydf <- subset.data.frame(mydf, mydf$dup == TRUE)
mydf <- mydf %>% group_by(accession) %>% filter(!grepl("unknown", gene)) %>% top_n(1,ident)
Which gives:
accession gene ident dup count
2 A red1 95.3 TRUE 4
6 B green2 100.0 TRUE 2
My instinct is to use an if statement:
mydf <- mydf %>% group_by(accession) %>%
if(count(grepl("unknown", mydf$gene))!= mydf$count)
{filter(!grepl("unknown", gene))}
%>% top_n(1, ident)
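But I get an error:
Error in if (.) count(grepl("unknown", mydf$gene)) != mydf$count else { :
  argument is not interpretable as logical
In addition: Warning message:
In if (.) count(grepl("unknown", mydf$gene)) != mydf$count else { :
  the condition has length > 1 and only the first element will be used
What is the right solution? If there's a better way, I'm not married to dplyr! Thanks!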
Answer 0: (score: 2)
You can try this:
mydf %>%
group_by(accession) %>%
mutate(n = n()) %>%
filter(n > 1) %>%
mutate(ident_rnk = min_rank(ident),
ident_rnk = if_else(grepl("unknown",gene),-1L,ident_rnk)) %>%
top_n(n = 1,wt = ident_rnk) %>%
select(accession,gene,ident)
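The same rule can also be written as a single grouped filter, which avoids the if inside the pipe: an if condition must be a single TRUE or FALSE, while grepl() on a whole column returns a vector. This is a minimal sketch, assuming dplyr is installed; the names unk and result are introduced here only for illustration.

```r
library(dplyr)

mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2",
           "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86)
)

result <- mydf %>%
  group_by(accession) %>%
  filter(n() > 1) %>%                       # drop accessions that occur only once
  mutate(unk = grepl("unknown", gene)) %>%  # flag genes containing "unknown"
  filter(
    if (all(unk)) {
      rep(TRUE, n())                        # whole group is unknown: keep every row
    } else {
      !unk & ident == max(ident[!unk])      # otherwise keep the best known gene
    }
  ) %>%
  ungroup() %>%
  select(-unk)

print(result)
```

Because the if is evaluated once per group inside filter(), all(unk) is a single value and the length problem from the question should not arise. Ties on the highest ident among known genes would all be kept.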
Answer 1: (score: 2)
Another option:
1) First arrange the data frame so that unknown genes are sorted to the end of each group, while sorting ident in descending order;
2) Filter per group, making sure the group has more than one row, and then keep a row if either the first gene starts with unknown, which means the whole group consists of unknown genes because unknown has already been sorted to the end, or the row is the first row of its group.
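The two steps described above (sort unknown genes to the end of each group, then filter per group) might be sketched as follows. This is an illustration assuming dplyr, not necessarily the answerer's exact code; the name result2 is hypothetical.

```r
library(dplyr)

mydf <- data.frame(
  accession = c("A", "A", "A", "A", "B", "B", "C", "C", "D"),
  gene = c("unknown", "red1", "red2", "blue", "green1", "green2",
           "unknown", "unknown2", "violet"),
  ident = c(100.0, 95.3, 80.2, 65.1, 94.2, 100.0, 97.1, 90.0, 86)
)

result2 <- mydf %>%
  # Step 1: within each accession, unknown genes go last (FALSE sorts
  # before TRUE), and rows are then ordered by descending ident.
  arrange(accession, grepl("^unknown", gene), desc(ident)) %>%
  group_by(accession) %>%
  # Step 2: the group must have more than one row; keep every row if the
  # first gene is unknown (so the whole group is unknown), else only row 1.
  filter(n() > 1 & (grepl("^unknown", gene[1]) | row_number() == 1)) %>%
  ungroup()

print(result2)
```

Since known genes are sorted ahead of unknown ones, the first row of a mixed group is the known gene with the highest ident, so no explicit ranking step is needed.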