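How to find the maximum and minimum of sequences of values in a column in R?

Time: 2016-06-16 20:02:02

Tags: r

This question may be trivial, but I am finding it hard to solve. Please guide me.

Data

Here is the sample data: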

df <- structure(list(Vehicle.ID2 = c("39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25", 
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25"
), OC_DV = c(".", ".", ".", ".", ".", "CLDV", ".", ".", ".", 
".", ".", ".", ".", ".", ".", "OPDV", ".", ".", ".", ".", ".", 
".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", 
".", "CLDV", ".", ".", "."), frspacing = c(35.83373, 35.75742, 
35.70391, 35.67694, 35.67792, 35.70669, 35.7619, 35.84096, 35.93962, 
36.05109, 36.16704, 36.28056, 36.3861, 36.47762, 36.5485, 36.59359, 
36.61402, 36.61791, 36.61383, 36.60651, 36.59694, 36.58372, 36.56525, 
36.54044, 36.50771, 36.46458, 36.40831, 36.33713, 36.25086, 36.15089, 
36.04004, 35.92236, 35.80322, 35.68935, 35.58883, 35.51032, 35.4618, 
35.4492, 35.47479)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-39L), .Names = c("Vehicle.ID2", "OC_DV", "frspacing"))  

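What I want to do

I want to find the maximum and minimum of the sets of values in column frspacing that lie between the labels CLDV and OPDV in column OC_DV. Then I want to find their difference.

Desired output

Here are the max and min values: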

  Group      Max    Min
1 CLDV-OPDV 36.54   35.70
2 OPDV-CLDV 36.62   35.59  

Here are the absolute differences (max of group 1 minus min of group 2, and vice versa):

1 0.95
2 0.92
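To make the arithmetic explicit, here is a minimal sketch that reproduces the two differences from the rounded values in the desired output table. The names `res` and `cross_diff` are only for illustration and do not appear in the original post.

```r
# Rounded values copied from the desired output table above
res <- data.frame(Group = c("CLDV-OPDV", "OPDV-CLDV"),
                  Max   = c(36.54, 36.62),
                  Min   = c(35.70, 35.59))

# Max of each group minus Min of the other group;
# rev() swaps the two rows so each Max is paired with the other group's Min
cross_diff <- abs(res$Max - rev(res$Min))
round(cross_diff, 2)
# [1] 0.95 0.92
```

Note that `rev()` only works as a "the other group" lookup because there are exactly two groups here.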

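I don't have any code to show what I tried because, honestly, I have no idea how to approach this. Obviously, simply calling max and min on the column does not work. I am using dplyr and have not found anything relevant.

3 Answers:

Answer 0: (score: 4)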

 library(zoo) # for na.locf
 library(dplyr)

 df[df=="."] = NA
 df$group = paste((na.locf(df$OC_DV, na.rm = FALSE)), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")

 df %>% group_by(group) %>% 
   summarise(Max = max(frspacing), Min = min(frspacing)) %>% 
   filter(!grepl("NA",group ))

Source: local data frame [2 x 3]

      group      Max      Min
      (chr)    (dbl)    (dbl)
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883

With multiple occurrences, I would count the changes and use that as another grouping variable (I duplicated the data for this example):

df$group2 = NA
df$group2[which(df$group != lag(df$group))] = 1:length(which(df$group != lag(df$group)))
df$group2 = na.locf(df$group2, na.rm = FALSE)

df %>% group_by(group, group2) %>% 
  summarise(Max = max(frspacing), Min = min(frspacing)) %>% 
   filter(!grepl("NA",group ))

Source: local data frame [5 x 4]
Groups: group [3]

      group group2      Max      Min
      (chr)  (int)    (dbl)    (dbl)
1 CLDV-CLDV      3 38.09082 34.30454
2 CLDV-OPDV      1 36.54850 35.70669
3 CLDV-OPDV      4 38.90356 34.08951
4 OPDV-CLDV      2 36.61791 35.58883
5 OPDV-CLDV      5 38.18983 34.27874

But if the combinations in OC_DV are different within each Vehicle.ID2, you can simply paste the ID into the group...
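One possible reading of that suggestion is sketched below. It assumes the labels should be filled within each vehicle (so a label from one vehicle does not leak into the next), which the answer does not state explicitly. The data frame `toy` and its values are made up for illustration.

```r
library(zoo)    # for na.locf
library(dplyr)

# Made-up data: two vehicles, labels already converted from "." to NA
toy <- data.frame(
  Vehicle.ID2 = rep(c("A", "B"), each = 5),
  OC_DV       = c("CLDV", NA, "OPDV", NA, "CLDV",
                  "OPDV", NA, "CLDV", NA, "OPDV"),
  frspacing   = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

res <- toy %>%
  group_by(Vehicle.ID2) %>%                 # fill labels per vehicle (assumption)
  mutate(group = paste(Vehicle.ID2,         # paste the ID into the group key
                       na.locf(OC_DV, na.rm = FALSE),
                       lead(na.locf(OC_DV, na.rm = FALSE, fromLast = TRUE)),
                       sep = "-")) %>%
  group_by(group) %>%
  summarise(Max = max(frspacing), Min = min(frspacing)) %>%
  filter(!grepl("NA", group)) %>%           # drop incomplete trailing segments
  as.data.frame()

res
#         group Max Min
# 1 A-CLDV-OPDV   2   1
# 2 A-OPDV-CLDV   4   3
# 3 B-CLDV-OPDV  40  30
# 4 B-OPDV-CLDV  20  10
```

The `grepl("NA", group)` filter would also drop any vehicle whose ID contains the text "NA", so a real data set may need a stricter pattern.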

Answer 1: (score: 1)

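d <- your_dput  # placeholder: the data frame built from the dput() output in the question
# Build your subsetted dataframes
# e: rows from the first CLDV through OPDV (both boundary rows included)
e <- d[grep("CLDV", d$OC_DV)[1]: grep("OPDV", d$OC_DV),]
# f: rows from OPDV through the second CLDV (both boundary rows included)
f <- d[(grep("OPDV", d$OC_DV): grep("CLDV", d$OC_DV)[2]),]
# Make the diff() calls
# diff(c(a, b)) returns b - a, so these are negative; wrap in abs() for absolute differences
diff(c(max(e$frspacing), min(f$frspacing)))
diff(c(max(f$frspacing), min(e$frspacing)))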

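My values differ from yours; you can adjust the grep indices manually, depending on how you want to handle inclusion/exclusion of the boundary rows.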
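As a small illustration of how the boundary choice changes the result, the sketch below uses made-up vectors (not the question's data) and compares three ways of slicing between two labels.

```r
# Made-up stand-ins for the OC_DV and frspacing columns
OC_DV     <- c(".", "CLDV", ".", ".", "OPDV", ".")
frspacing <- c(5, 1, 2, 3, 9, 4)

start <- grep("CLDV", OC_DV)[1]   # position of the opening label: 2
end   <- grep("OPDV", OC_DV)[1]   # position of the closing label: 5

incl_both <- frspacing[start:end]               # 1 2 3 9 (both label rows kept)
excl_end  <- frspacing[start:(end - 1)]         # 1 2 3   (closing label row dropped)
excl_both <- frspacing[(start + 1):(end - 1)]   # 2 3     (both label rows dropped)

max(incl_both)  # 9
max(excl_end)   # 3
```

The other two answers on this page keep the opening label row and drop the closing one, which corresponds to `excl_end` here.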

Answer 2: (score: 1)

Here is a base R solution:

MaxMinSeq <- function(df) {
    myInd <- which(df$OC_DV != ".")
    myVals <- df$frspacing
    myTitles <- df$OC_DV[myInd]
    myLen <- length(myInd)-1L
    NewDf <- as.data.frame(t(sapply(1:myLen, function(x) {
               list(Group = paste(c(myTitles[x],"-",myTitles[x+1L]), collapse = ""),
                   Max = max(myVals[myInd[x]:(myInd[x+1L]-1L)]),
                   Min = min(myVals[myInd[x]:(myInd[x+1L]-1L)]))})))
    for (i in 1:3) {NewDf[,i] <- unlist(NewDf[,i])}
    NewDf
}

df2 <- MaxMinSeq(df)
df2
      Group      Max      Min
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883

This is much faster than the dplyr solution posted above. Observe:

TestDplyr <- function(df) {
    df[df=="."] <- NA
    df$group <- paste((na.locf(df$OC_DV, na.rm = FALSE)), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")

    df$group2 <- NA
    df$group2[which(df$group != lag(df$group))] <- 1:length(which(df$group != lag(df$group)))
    df$group2 <- na.locf(df$group2, na.rm = FALSE)

    df %>% group_by(group, group2) %>% 
        summarise(Max = max(frspacing), Min = min(frspacing)) %>% 
        filter(!grepl("NA",group ))
}

microbenchmark(Joseph = MaxMinSeq(df), Cabana = TestDplyr(df))
Unit: microseconds
expr      min        lq      mean    median       uq      max neval
Joseph  338.671  377.6695  405.0257  405.9945  429.188  496.718   100
Cabana 2622.336 2698.2810 2890.5430 2765.6045 2977.427 7772.180   100

Here is a non-trivial example:

myDfs <- lapply(1:10000, function(x) df)
bigDf <- do.call(rbind, myDfs)
bigDf$frspacing[40:nrow(bigDf)] <- runif((nrow(bigDf)-39), 10, 100)

a <- MaxMinSeq(bigDf)
b <- TestDplyr(bigDf)
b <- b[order(b$group2),]

identical(a$Max, b$Max)
[1] TRUE
identical(a$Min, b$Min)
[1] TRUE

system.time(TestDplyr(bigDf))
 user  system elapsed 
 1.54    0.00    1.54 
system.time(MaxMinSeq(bigDf))
 user  system elapsed 
  0.3     0.0     0.3

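As for the second part of the question, I am not sure how general the OP wants the answer to be, especially when there are more than two distinct final pairs. For example, does the OP want to take the max of one row and compare it with the min of all other rows, or only with its neighbors? The function below takes the first (i.e., general) approach.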

GetDiff <- function(df) {
    df2 <- cbind(df, t(sapply(1:nrow(df), function(x) {
                        c(rowMin = min(df[x,2:3]),
                          rowMax = max(df[x,2:3]))})))
    myRows <- 1:nrow(df)
    sapply(myRows, function(x) df2$rowMax[x] - min(df2$rowMin[-x]))
}

GetDiff(df2)   ## df2 comes from above
[1] 0.95967 0.91122
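For comparison, the second reading (comparing only with neighbors) might look like the sketch below. `GetNeighborDiff` is a hypothetical helper, not part of the answer, and it assumes the summary table has at least two rows with `Max` and `Min` columns; `df2` is re-created here from the output shown earlier so the block runs on its own.

```r
# Summary table re-created from the MaxMinSeq output above
df2 <- data.frame(Group = c("CLDV-OPDV", "OPDV-CLDV"),
                  Max   = c(36.54850, 36.61791),
                  Min   = c(35.70669, 35.58883))

# Hypothetical helper: Max of row i minus the smallest Min among adjacent rows only
GetNeighborDiff <- function(df) {
  n <- nrow(df)
  sapply(seq_len(n), function(i) {
    # candidate neighbors i-1 and i+1, dropping positions outside 1..n
    nb <- setdiff(c(i - 1L, i + 1L), c(0L, n + 1L))
    df$Max[i] - min(df$Min[nb])
  })
}

GetNeighborDiff(df2)
# [1] 0.95967 0.91122
```

With only two rows, each row's sole neighbor is the other row, so the result coincides with the general approach; the two would differ once there are three or more segments.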