Question

I have a data frame, call it A, that looks like this:

GroupID  Dist1   Dist2 ...
1        4       4 
1        5       4 
1        3       16 
2        0       4 
2        7       2 
2        8       0 
2        6       4 
2        7       4 
2        8       2 
3        7       4 
3        5       6
...

GroupID is a factor; Dist1 and Dist2 are integers.

I also have a derived data frame, SummaryA:

GroupID  AveD1  AveD2 ...
1        4       8 
2        6       2
3        6       5
...

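For each GroupID, I need to find the ROW NUMBER holding the minimum value, both for further manipulation and to pull data into my summary set. For example, I need: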

GroupID  MinRowD1  
1        3 
2        4 
3        11

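In case of a tie it doesn't matter which row I pick, but I'm stumped on how to get this. I can't use which(), since it doesn't operate nicely on factors, and I can't use ave(FUN = min), since what I need is the position, not the minimum. If I match against each group's minimum, I can get multiple matches, which will screw things up.

Any suggestions on how to do this?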

Answer 1

Using by and the rownames of the data:

> dat$row <- 1:nrow(dat)
>  by(dat,dat$GroupID,FUN = function(x) rownames(x)[which.min(x$Dist1)])
dat$GroupID: 1
[1] "3"
---------------------------------------------------------------------------------------- 
dat$GroupID: 2
[1] "4"
---------------------------------------------------------------------------------------- 
dat$GroupID: 3
[1] "11"

Here I assume dat is:

dat <- read.table(text = 'GroupID  Dist1   Dist2
1        4       4 
1        5       4 
1        3       16 
2        0       4 
2        7       2 
2        8       0 
2        6       4 
2        7       4 
2        8       2 
3        7       4 
3        5       6', header = T)

Edit: another solution, using the data.table package

I think data.table offers a more elegant solution:

library(data.table)

dat$row <- 1:nrow(dat)
dtb <- as.data.table (dat)
dtb [,.SD[which.min(Dist1)],by=c('GroupID')]
   GroupID Dist1 Dist2 row
1:       1     3    16   3
2:       2     0     4   4
3:       3     5     6  11

Edit 1: getting the row number without creating a row column (per @Arun's comment)

dtb[, {i = which.min(Dist1); list(Dist1=Dist1[i], 
    Dist2=Dist2[i], rowNew=.I[i])}, by=GroupID]

  GroupID Dist1 Dist2 rowNew
1:       1     3    16   3
2:       2     0     4   4
3:       3     5     6  11
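
If only the row numbers are needed, the same idea can be reduced to the `.I` symbol alone. The following is a minimal sketch, assuming the `dat` shown earlier in this answer; the names `min_rows` and `MinRowD1` are illustrative and not part of the original answer.

```r
library(data.table)

dat <- read.table(text = "GroupID Dist1 Dist2
1 4 4
1 5 4
1 3 16
2 0 4
2 7 2
2 8 0
2 6 4
2 7 4
2 8 2
3 7 4
3 5 6", header = TRUE)
dtb <- as.data.table(dat)

# .I holds the row numbers of the full table for the rows of the current group,
# so .I[which.min(Dist1)] is the global row number of the group's first minimum
min_rows <- dtb[, .(MinRowD1 = .I[which.min(Dist1)]), by = GroupID]
min_rows

# The row numbers can then be used to pull the complete rows
dtb[min_rows$MinRowD1]
```

For this data, `MinRowD1` should come out as 3, 4 and 11, matching the `row` column above.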

Answer 2

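Here is a base R solution. The basic idea is to split the data by GroupID, find the row holding the minimum within each piece, and then put the results back together. Some people find the plyr functions a more intuitive way to do this; I'm sure a solution using one of them will turn up soon...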

A$row <- 1:nrow(A)
As <- split(A, A$GroupID)
sapply(As, function(Ai) {Ai$row[which.min(Ai$Dist1)]})

For large data sets, split is faster when it runs on a vector rather than on a whole data frame, like this:

rows <- split(1:nrow(A), A$GroupID)
sapply(rows, function(rowi) {rowi[which.min(A$Dist1[rowi])]})
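
To get from the named vector to the summary table the question asks for, something along these lines may work. It is only a sketch: `A` is rebuilt here from the example data in the question, and the column name `Dist2AtMinD1` is made up for illustration.

```r
A <- read.table(text = "GroupID Dist1 Dist2
1 4 4
1 5 4
1 3 16
2 0 4
2 7 2
2 8 0
2 6 4
2 7 4
2 8 2
3 7 4
3 5 6", header = TRUE)
A$GroupID <- factor(A$GroupID)  # the question states that GroupID is a factor

# Global row number of the first minimum of Dist1 within each group
rows <- split(seq_len(nrow(A)), A$GroupID)
min_row <- sapply(rows, function(rowi) rowi[which.min(A$Dist1[rowi])])

# Collect the result in a summary data frame (illustrative column names)
SummaryA <- data.frame(GroupID = names(min_row),
                       MinRowD1 = unname(min_row))

# Use the row numbers to pull other values from the matching rows
SummaryA$Dist2AtMinD1 <- A$Dist2[SummaryA$MinRowD1]
SummaryA
```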

Answer 3

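Assuming dat from @agstudy's answer, aggregate() is a nice base function that easily does what you want. (This answer uses which.min(), which behaves in a particular way when several values in the input vector share the minimum. See the warning at the end!) For example: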

aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat, FUN = which.min)

> aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat, FUN = which.min)
  GroupID Dist1 Dist2
1       1     3     1
2       2     1     3
3       3     2     1

To get the row IDs, or rather the rownames, we can do this (after adding some rownames to the example):

rownames(dat) <- letters[seq_len(nrow(dat))] ## add rownames for effect

## function, pull out for clarity
foo <- function(x, rn) rn[which.min(x)]
## apply via aggregate
aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat, FUN = foo,
          rn = rownames(dat))

which gives

>     rownames(dat) <- letters[seq_len(nrow(dat))] ## add rownames for effect
> 
>     ## function, pull out for clarity
>     foo <- function(x, rn) rn[which.min(x)]
>     ## apply via aggregate
>     aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat, FUN = foo,
+               rn = rownames(dat))
  GroupID Dist1 Dist2
1       1     c     a
2       2     a     c
3       3     b     a
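
One thing to watch in the output above: foo() indexes the full rownames(dat) vector with a position that which.min() computed inside a single group, so the letters reflect within-group positions. Group 2, for example, shows a for Dist1 although its minimum sits in row d. A minimal sketch that returns the row names of the whole data frame instead, using tapply() over row positions (a substitution on my part, not the original answer's method; `min_rowname` is a hypothetical helper):

```r
dat <- read.table(text = "GroupID Dist1 Dist2
1 4 4
1 5 4
1 3 16
2 0 4
2 7 2
2 8 0
2 6 4
2 7 4
2 8 2
3 7 4
3 5 6", header = TRUE)
rownames(dat) <- letters[seq_len(nrow(dat))]

# Hypothetical helper: for one column, return the row name of the first
# minimum in each group, indexing the whole data frame rather than the group
min_rowname <- function(col, data = dat) {
  idx <- tapply(seq_len(nrow(data)), data$GroupID,
                function(i) i[which.min(data[[col]][i])])
  rownames(data)[idx]
}

res <- data.frame(GroupID = sort(unique(dat$GroupID)),
                  Dist1 = min_rowname("Dist1"),
                  Dist2 = min_rowname("Dist2"))
res
```

With the example data this should report rows c, d, k for Dist1 and a, f, j for Dist2.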

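I find the output from aggregate() nicer than that from by(), and the formula interface (although not the most efficient way to use it) is certainly quite intuitive.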

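Warning

which.min() works well if the minimum has no duplicate values. If it does, which.min() returns only the first position that holds the minimum. The alternative is the which(x == min(x)) idiom, but any solution then has to deal with the fact that the minimum is duplicated.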

dat2 <- dat
dat2 <- rbind(dat2, data.frame(GroupID = 1, Dist1 = 3, Dist2 = 8))

aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat2, FUN = which.min)

misses the duplicate:

> aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat2, FUN = which.min)
  GroupID Dist1 Dist2
1       1     3     1
2       2     1     3
3       3     2     1

Contrast this with the which(x == min(x)) idiom:

out <- aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat2,
          FUN = function(x) which(x == min(x)))
> (out <- aggregate(cbind(Dist1, Dist2) ~ GroupID, data = dat2,
+                   FUN = function(x) which(x == min(x))))
  GroupID Dist1 Dist2
1       1  3, 4  1, 2
2       2     1     3
3       3     2     1

While the output from which(x == min(x)) looks appealing, the object itself is a bit more complex: it is a data frame with lists as components:

> str(out)
'data.frame':   3 obs. of  3 variables:
 $ GroupID: num  1 2 3
 $ Dist1  :List of 3
  ..$ 0: int  3 4
  ..$ 1: int 1
  ..$ 2: int 2
 $ Dist2  :List of 3
  ..$ 0: int  1 2
  ..$ 1: int 3
  ..$ 2: int 1
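
If the list columns are awkward to work with, one option is to keep the tied rows in a long, flat data frame instead. The sketch below is only one way to do that, assuming the dat2 built above; it reports row positions in the whole data frame (3 and 12 for group 1) rather than the within-group positions that aggregate() shows, and the names `all_min`, `long` and `first` are illustrative.

```r
dat <- read.table(text = "GroupID Dist1 Dist2
1 4 4
1 5 4
1 3 16
2 0 4
2 7 2
2 8 0
2 6 4
2 7 4
2 8 2
3 7 4
3 5 6", header = TRUE)
dat2 <- rbind(dat, data.frame(GroupID = 1, Dist1 = 3, Dist2 = 8))

# Row positions (in dat2) of every minimum of Dist1, per group
rows <- split(seq_len(nrow(dat2)), dat2$GroupID)
all_min <- lapply(rows, function(i) i[dat2$Dist1[i] == min(dat2$Dist1[i])])

# One row per (group, tied row) pair: an ordinary data frame, no list columns
long <- data.frame(GroupID = rep(names(all_min), lengths(all_min)),
                   MinRowD1 = unlist(all_min, use.names = FALSE))
long

# If any one of the tied rows will do, keep the first per group
first <- long[!duplicated(long$GroupID), ]
first
```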

Answer 4

Assuming dFrame contains your data:

 install.packages('plyr')
 library('plyr')

Try this:

 dFrame$row_name <- 1:nrow(dFrame) ## record the original row numbers before sorting
 dFrame <- arrange(dFrame, Dist1) ## sort by Dist1 so each group's minimum comes first
 minRows <- tapply(dFrame$row_name, dFrame$GroupID, FUN = function(x) x[1]) ## first (smallest) row per group

 newFrame <- data.frame(GroupID = names(minRows), MinRowD1 = as.numeric(minRows))

Answer 5

A bit more convoluted, but this should do the trick:

x <- data.frame(GroupID=rep(1:3,each=3),Dist1=rpois(9,5))
x
  GroupID Dist1
1       1    10
2       1     5
3       1     3
4       2     9
5       2     9
6       2    13
7       3    10
8       3    10
9       3     4
sapply(lapply(lapply(split(x,x$GroupID),
    function(y) y[order(y$Dist1),]),head,1),rownames)
  1   2   3 
"3" "4" "9"

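The same ordering idea can be applied to the whole data frame at once, without splitting it. This is a sketch, not the answer's own code: it reuses the Dist1 values printed above in place of the random rpois() draw, and `ord` and `min_row` are illustrative names.

```r
x <- data.frame(GroupID = rep(1:3, each = 3),
                Dist1 = c(10, 5, 3, 9, 9, 13, 10, 10, 4))

# Row positions sorted by group, then by Dist1; order() leaves ties
# in their original order, so the first tied row wins
ord <- order(x$GroupID, x$Dist1)

# The first position of each group in that ordering is the group's minimum
min_row <- ord[!duplicated(x$GroupID[ord])]
names(min_row) <- x$GroupID[min_row]
min_row
```

For these values the result should be rows 3, 4 and 9, the same as the output above.
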
Answer 6

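This returns the rownames associated with the first minimum within each group, for both columns. It returns them as a character matrix with named columns: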

do.call(rbind, 
   by(dat,dat$GroupID,FUN = function(x) c(
                               minD1=rownames(x)[which.min(x[['Dist1']])], 
                               minD2=rownames(x)[which.min(x[['Dist2']])] ) ) )
#-------------
  minD1 minD2
1 "3"   "1"  
2 "4"   "6"  
3 "11"  "10"
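
When there are more than two distance columns (the question's "..." suggests there are), the per-column calls can be generated rather than typed out. A minimal sketch, assuming the dat from Answer 1; `dist_cols` is a hypothetical vector naming the columns to scan, and the result holds integer row numbers rather than character rownames.

```r
dat <- read.table(text = "GroupID Dist1 Dist2
1 4 4
1 5 4
1 3 16
2 0 4
2 7 2
2 8 0
2 6 4
2 7 4
2 8 2
3 7 4
3 5 6", header = TRUE)

dist_cols <- c("Dist1", "Dist2")  # columns to scan; extend as needed

# Row positions belonging to each group
rows <- split(seq_len(nrow(dat)), dat$GroupID)

# For every column, the row number of the first minimum in each group;
# the result is an integer matrix with one row per group
min_rows <- sapply(dist_cols, function(col)
  sapply(rows, function(i) i[which.min(dat[[col]][i])]))
min_rows
```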

How do I find the row number of the minimum value for each factor level in R?