Question

实际上我有一个非常大的数据框架。一列包含ID，另一列包含与该ID关联的值。但是，每个ID都会出现多次不同的值，我希望在丢弃其余ID时记录每个ID的最大值。以下是使用quakes中的R数据集的可复制示例：

data <- as.data.frame(quakes)

##Create output matrix
output <- matrix(,length(unique(data[,5])),2)
colnames(output) <- c("Station ID", "Max Mag")

##Grab unique station IDs
uni <- unique(data[,5])

##Go through each station ID and record the maximum magnitude
for (i in 1:dim(output)[1])
{
  sub.data <- data[which(data[,5]==uni[i]),]
  ##Put station ID in column 1
  output[i,1] <- uni[i]
  ##Put biggest magnitude in column 2
  output[i,2] <- max(sub.data[,4])
}

考虑到我的真实数据，我的数据框的尺寸为100000行，这是一个缓慢的过程。有没有更快的方法来执行这样的任务？

非常感谢任何帮助！

Answer 1

library(plyr)
ddply(data, "stations", function(data){data[which.max(data$mag),]})
       lat   long depth mag stations
1   -27.21 182.43    55 4.6       10
2   -27.60 182.40    61 4.6       11
3   -16.24 168.02    53 4.7       12
4   -27.38 181.70    80 4.8       13
-----

您也可以使用：

> data2 <- data[order(data$mag,decreasing=T),]
> data2[!duplicated(data2$stations),]
        lat   long depth mag stations
152  -15.56 167.62   127 6.4      122
15   -20.70 169.92   139 6.1       94
17   -13.64 165.96    50 6.0       83
870  -12.23 167.02   242 6.0      132
1000 -21.59 170.56   165 6.0      119
558  -22.91 183.95    64 5.9      118
109  -22.55 185.90    42 5.7       76
151  -23.34 184.50    56 5.7      106
176  -32.22 180.20   216 5.7       90
275  -22.13 180.38   577 5.7      104

另外：

> library(data.table)
> data <- data.table(data)
> data[,.SD[which.max(mag)],by=stations]
     stations    lat   long depth mag
  1:       41 -23.46 180.11   539 5.0
  2:       15 -13.40 166.90   228 4.8
  3:       43 -26.00 184.10    42 5.4
  4:       19 -19.70 186.20    47 4.8
  5:       11 -27.60 182.40    61 4.6
 ---                                 
 98:       77 -21.19 181.58   490 5.0
 99:      132 -12.23 167.02   242 6.0
100:      115 -17.85 181.44   589 5.6
101:      121 -20.25 184.75   107 5.6
102:      110 -19.33 186.16    44 5.4

data.table适用于大型数据集

Answer 2

您也可以尝试tapply：

tapply(data$mag, data$stations, FUN=max)

Answer 3

您也可以尝试新的'dplyr'软件包，它比'plyr'更快，更容易使用。通过将操作与％。％链接在一起，使用Hadley所谓的“像数据操作的语法”，如下所示：

library(dplyr)
df <- as.data.frame(quakes)

df %.%
  group_by(stations) %.%
  summarise(Max = max(mag)) %.%
  arrange(desc(Max)) %.%
  head(5)

Source: local data frame [5 x 2]

  stations Max
1      122 6.4
2       94 6.1
3      132 6.0
4      119 6.0
5       83 6.0

从一列中找到最大值的最快方法，其他列中有多个重复项？

3 个答案: