Question

I have a data frame that looks like this:

Reach Chem HQ 
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34

I want to prune the data frame by keeping only the record with the highest HQ for each Reach, so that the result is:

 Reach Chem HQ 
 a Nickel  1.65
 b Cadmium 3.12
 c Nickel 2.34

What is the best way to do this?

Answer 1

Here is a one-liner (or close to it) in base R.

Get the data:

test <- read.table(textConnection("Reach Chem HQ 
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34"),header=TRUE)

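Use by to split the data on Reach, and which.max to return the row with the highest HQ in each group. The do.call(rbind, ...) call just binds the selected rows back together into a single data frame.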

do.call(rbind,by(test,test$Reach,function(x) x[which.max(x$HQ),]))

Result:

  Reach    Chem   HQ
a     a  Nickel 1.65
b     b Cadmium 3.12
c     c  Nickel 2.34

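Edit - to address the discussion with mindless.panda and joran about ties at the maximum, this version keeps every row that ties for the highest HQ in its Reach: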

do.call(rbind,by(test,test$Reach,function(x) x[x$HQ==max(x$HQ),]))
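To make the difference between the two versions concrete, here is a minimal sketch. The data frame tied is made up for illustration (it is not from the question) and contains a tie in Reach "a".

```r
# Hypothetical data with a tie for the maximum HQ in Reach "a"
tied <- data.frame(
  Reach = c("a", "a", "b"),
  Chem  = c("Mercury", "Nickel", "Cadmium"),
  HQ    = c(1.65, 1.65, 3.12)
)

# which.max() returns only the first index of the maximum,
# so each Reach contributes exactly one row
first_only <- do.call(rbind, by(tied, tied$Reach, function(x) x[which.max(x$HQ), ]))

# Comparing against max() keeps every row that equals the maximum,
# so both tied rows of Reach "a" are retained
all_ties <- do.call(rbind, by(tied, tied$Reach, function(x) x[x$HQ == max(x$HQ), ]))

nrow(first_only)  # 2
nrow(all_ties)    # 3
```

Which behaviour is preferable depends on whether a tie should yield one representative row or all tied rows.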

Answer 2

Perhaps you could try ?order and ?duplicated, like this:

my_df = data.frame(
    Reach = c("a","a","b","b","b","c","c"), 
    Chem = c("Mercury","Nickel","Mercury","Nickel","Cadmium","Mercury","Nickel"),
    HQ = c(1.12,1.65,1.54,2.34,3.12,2.12,2.34)
    )

my_df = my_df[order(my_df$HQ,decreasing=TRUE),]
my_df = my_df[!duplicated(my_df$Reach),]
my_df = my_df[order(my_df$Reach),]

Edit: the result is shown below for clarity.

  Reach    Chem   HQ
2     a  Nickel 1.65
5     b Cadmium 3.12
7     c  Nickel 2.34
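The three steps above can probably be folded into two by sorting on both keys at once. This is only a sketch of that variation; the name reach_df is introduced here for illustration so that my_df is left untouched.

```r
# Same data as in the question, under a hypothetical name
reach_df <- data.frame(
  Reach = c("a", "a", "b", "b", "b", "c", "c"),
  Chem  = c("Mercury", "Nickel", "Mercury", "Nickel", "Cadmium", "Mercury", "Nickel"),
  HQ    = c(1.12, 1.65, 1.54, 2.34, 3.12, 2.12, 2.34)
)

# Sort by Reach, then by HQ descending within each Reach
sorted <- reach_df[order(reach_df$Reach, -reach_df$HQ), ]

# duplicated() flags every occurrence of a Reach after the first,
# so negating it keeps the first (highest-HQ) row of each group
top <- sorted[!duplicated(sorted$Reach), ]
top
```

As with the original version, a tie for the maximum yields a single row per Reach, namely whichever tied row sorts first.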

Answer 3

If you prefer a plyr approach:

data <- read.table(text="Reach Chem HQ 
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34", header=TRUE)

require(plyr)
ddply(data, .(Reach), summarize, Chem=Chem[which.max(HQ)], MaxHQ=max(HQ))

  Reach    Chem  MaxHQ
1     a  Nickel   1.65
2     b Cadmium   3.12
3     c  Nickel   2.34

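Edit

Partly motivated by this similar question, I considered the case where there is more than one Chem-like column (columns that are not being subset on), so that repeating Chem=Chem[which.max(HQ)] for each of them would get verbose. I came up with the following. I am curious whether any plyr wizards can weigh in with a better way: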

# add the within-group max HQ as a column
df <- ddply(data, .(Reach), transform, MaxHQByReach=max(HQ))
# now select the rows where the HQ equals the Max HQ, dropping the above column
subset(df, df$HQ==df$MaxHQByReach)[,1:(ncol(df)-1)]
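The same idea, comparing each HQ with its within-group maximum, can also be sketched in base R with ave(), which avoids adding and then dropping a helper column. This is an alternative sketch, not part of the plyr answer; group_max is a name introduced here for illustration.

```r
data <- read.table(text = "Reach Chem HQ
a Mercury 1.12
a Nickel  1.65
b Mercury 1.54
b Nickel 2.34
b Cadmium 3.12
c Mercury 2.12
c Nickel 2.34", header = TRUE)

# ave() returns a vector as long as HQ holding, for every row,
# the maximum HQ within that row's Reach group
group_max <- ave(data$HQ, data$Reach, FUN = max)

# Keep the rows whose HQ equals their group maximum; all other columns come along
result <- data[data$HQ == group_max, ]
result
```

Because the filter is an equality test, rows that tie for the maximum are all kept, just as in the ddply version.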

Answer 4

Hi, you can also use max and lapply like this:

# one entry per distinct Reach
Reach <- unique(my_df$Reach)
# maximum HQ within each Reach
HQ <- unlist(lapply(1:length(unique(my_df$Reach)),function(x) max(my_df$HQ[which(my_df$Reach == unique(my_df$Reach)[x])])))

# Chem of the first row whose HQ equals each group maximum
# (note: match() searches the whole HQ column, not only the rows of that Reach)
Chem <- my_df$Chem[match(lapply(1:length(unique(my_df$Reach)),function(x) max(my_df$HQ[which(my_df$Reach == unique(my_df$Reach)[x])])),my_df$HQ)]

new.df <- data.frame(Reach,Chem,HQ)
new.df

  Reach    Chem   HQ
1     a  Nickel 1.65
2     b Cadmium 3.12
3     c  Nickel 2.34
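One caveat with looking up Chem through match() on the HQ column: the lookup is not restricted to the matching Reach, so an HQ value that also occurs earlier in another group can pull in the wrong Chem. The sketch below uses made-up data (demo is hypothetical, not from the question) to show the effect, together with a group-local lookup that avoids it.

```r
# Hypothetical data: the maximum of Reach "b" (2.34) also appears earlier in Reach "a"
demo <- data.frame(
  Reach = c("a", "a", "b"),
  Chem  = c("Mercury", "Nickel", "Cadmium"),
  HQ    = c(2.34, 3.00, 2.34)
)

# Maximum HQ per Reach: a = 3.00, b = 2.34
group_max <- tapply(demo$HQ, demo$Reach, max)

# Global lookup: match() returns the first row with HQ 2.34, which belongs to Reach "a"
global_chem <- as.character(demo$Chem[match(group_max, demo$HQ)])

# Group-local lookup: pick the Chem inside each Reach's own rows
local_chem <- sapply(split(demo, demo$Reach),
                     function(g) as.character(g$Chem[which.max(g$HQ)]))

global_chem  # "Nickel" "Mercury"  (second entry is wrong for Reach "b")
local_chem   # a = "Nickel", b = "Cadmium"
```

With the data from the question the two lookups agree, because the repeated value 2.34 happens to belong to Nickel in both Reach b and Reach c.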

Remove records from a data frame based on a comparison with other records
