This question may be trivial, but I am finding it hard to solve. Please guide me.
Here is the sample data:
structure(list(Vehicle.ID2 = c("39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25",
"39-25", "39-25", "39-25", "39-25", "39-25", "39-25", "39-25"
), OC_DV = c(".", ".", ".", ".", ".", "CLDV", ".", ".", ".",
".", ".", ".", ".", ".", ".", "OPDV", ".", ".", ".", ".", ".",
".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".", ".",
".", "CLDV", ".", ".", "."), frspacing = c(35.83373, 35.75742,
35.70391, 35.67694, 35.67792, 35.70669, 35.7619, 35.84096, 35.93962,
36.05109, 36.16704, 36.28056, 36.3861, 36.47762, 36.5485, 36.59359,
36.61402, 36.61791, 36.61383, 36.60651, 36.59694, 36.58372, 36.56525,
36.54044, 36.50771, 36.46458, 36.40831, 36.33713, 36.25086, 36.15089,
36.04004, 35.92236, 35.80322, 35.68935, 35.58883, 35.51032, 35.4618,
35.4492, 35.47479)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-39L), .Names = c("Vehicle.ID2", "OC_DV", "frspacing"))
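I want to find the maximum and minimum of the values in the column frspacing that fall between the labels CLDV and OPDV in the column OC_DV. Then I want to find the difference between them.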
Here are the expected maximums and minimums:
Group Max Min
1 CLDV-OPDV 36.54 35.70
2 OPDV-CLDV 36.62 35.59
Here are the absolute differences (max of group 1 minus min of group 2, and vice versa):
1 0.95
2 0.92
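I have no code to show for what I tried because, honestly, I don't know how to approach this. Obviously, simply taking the max or min of the column does not work. I am using dplyr and have not found anything relevant there.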
Answer 0 (score: 4)
library(zoo) # for na.locf
library(dplyr)
df[df=="."] = NA
df$group = paste((na.locf(df$OC_DV, na.rm = FALSE)), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")
df %>% group_by(group) %>%
summarise(Max = max(frspacing), Min = min(frspacing)) %>%
filter(!grepl("NA",group ))
Source: local data frame [2 x 3]
group Max Min
(chr) (dbl) (dbl)
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883
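When the same label pair occurs more than once, I would compute where the group changes and use that as a second grouping variable (I duplicated the data for this example):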
df$group2 = NA
df$group2[which(df$group != lag(df$group))] = 1:length(which(df$group != lag(df$group)))
df$group2 = na.locf(df$group2, na.rm = FALSE)
df %>% group_by(group, group2) %>%
summarise(Max = max(frspacing), Min = min(frspacing)) %>%
filter(!grepl("NA",group ))
Source: local data frame [5 x 4]
Groups: group [3]
group group2 Max Min
(chr) (int) (dbl) (dbl)
1 CLDV-CLDV 3 38.09082 34.30454
2 CLDV-OPDV 1 36.54850 35.70669
3 CLDV-OPDV 4 38.90356 34.08951
4 OPDV-CLDV 2 36.61791 35.58883
5 OPDV-CLDV 5 38.18983 34.27874
But if the combinations of OC_DV differ for each Vehicle.ID2, you can simply paste the ID into the group...
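One way to read the ID-based grouping suggested above is to build the segments separately for each vehicle. The sketch below is base R only and is not the code from this answer: `toy`, the second vehicle ID "40-11", all frspacing values in it, and the helper name `segment_ranges` are made up for illustration. It replaces the `na.locf` bookkeeping with a `cumsum` run counter, and it assumes a segment starts on a label row and stops just before the next label row.

```r
# Toy data: two vehicles, each with its own sequence of labels (values are made up).
toy <- data.frame(
  Vehicle.ID2 = rep(c("39-25", "40-11"), each = 6),
  OC_DV       = c(".", "CLDV", ".", "OPDV", ".", "CLDV",
                  "OPDV", ".", ".", "CLDV", ".", "."),
  frspacing   = c(10, 12, 15, 11, 9, 14,
                  20, 25, 22, 21, 18, 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Max/Min per label-to-label segment within ONE vehicle.
segment_ranges <- function(d) {
  is_label <- d$OC_DV != "."
  # Run counter: increases by one at every label row; 0 means "before the first label".
  run <- cumsum(is_label)
  if (sum(is_label) < 2L) return(NULL)
  labels <- d$OC_DV[is_label]
  # Keep only runs that are closed by a following label.
  closed <- seq_len(length(labels) - 1L)
  do.call(rbind, lapply(closed, function(k) {
    vals <- d$frspacing[run == k]
    data.frame(Vehicle.ID2 = d$Vehicle.ID2[1],
               Group = paste(labels[k], labels[k + 1L], sep = "-"),
               Max = max(vals), Min = min(vals),
               stringsAsFactors = FALSE)
  }))
}

# Apply per vehicle so that a segment never spans two vehicles.
res <- do.call(rbind, lapply(split(toy, toy$Vehicle.ID2), segment_ranges))
rownames(res) <- NULL
res
```

Because the run counter restarts inside each `split()` piece, repeated label pairs within a vehicle stay in separate rows without a second grouping column.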
Answer 1 (score: 1)
d <- your_dput
# Build your subsetted dataframes
e <- d[grep("CLDV", d$OC_DV)[1]: grep("OPDV", d$OC_DV),]
f <- d[(grep("OPDV", d$OC_DV): grep("CLDV", d$OC_DV)[2]),]
# Make the diff() calls
diff(c(max(e$frspacing), min(f$frspacing)))
diff(c(max(f$frspacing), min(e$frspacing)))
My values differ from yours; you can adjust the grep indices by hand, depending on whether you want the boundary rows included or excluded.
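To make the boundary question concrete, here is a small sketch with a hypothetical helper, `range_between()`, and made-up toy values. It is not the code from this answer; it only illustrates how including or dropping the label rows themselves can change the max and min.

```r
# Toy data (made-up values): one CLDV label, then one OPDV label.
toy <- data.frame(
  OC_DV     = c(".", "CLDV", ".", ".", "OPDV", "."),
  frspacing = c(5, 40, 12, 15, 1, 7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: max/min of frspacing between the first `from` label
# and the first `to` label that follows it.
range_between <- function(d, from, to, include_start = TRUE, include_end = FALSE) {
  start <- which(d$OC_DV == from)[1]
  end_candidates <- which(d$OC_DV == to)
  end <- end_candidates[end_candidates > start][1]
  if (is.na(start) || is.na(end)) stop("labels not found in the expected order")
  lo <- if (include_start) start else start + 1L
  hi <- if (include_end) end else end - 1L
  if (lo > hi) stop("no rows left between the labels")
  vals <- d$frspacing[lo:hi]
  c(Max = max(vals), Min = min(vals))
}

incl_start <- range_between(toy, "CLDV", "OPDV")                         # rows 2-4
both_ends  <- range_between(toy, "CLDV", "OPDV", include_end = TRUE)     # rows 2-5
strict     <- range_between(toy, "CLDV", "OPDV", include_start = FALSE)  # rows 3-4
rbind(incl_start, both_ends, strict)
```

In this toy data the extreme values sit on the label rows, so each of the three conventions gives a different Max/Min pair. That is the kind of discrepancy to expect when two solutions disagree only in the last digits.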
Answer 2 (score: 1)
Here is a base R solution:
MaxMinSeq <- function(df) {
myInd <- which(df$OC_DV != ".")
myVals <- df$frspacing
myTitles <- df$OC_DV[myInd]
myLen <- length(myInd)-1L
NewDf <- as.data.frame(t(sapply(1:myLen, function(x) {
list(Group = paste(c(myTitles[x],"-",myTitles[x+1L]), collapse = ""),
Max = max(myVals[myInd[x]:(myInd[x+1L]-1L)]),
Min = min(myVals[myInd[x]:(myInd[x+1L]-1L)]))})))
for (i in 1:3) {NewDf[,i] <- unlist(NewDf[,i])}
NewDf
}
df2 <- MaxMinSeq(df)
df2
Group Max Min
1 CLDV-OPDV 36.54850 35.70669
2 OPDV-CLDV 36.61791 35.58883
This is considerably faster than the dplyr solution posted above. Observe:
TestDplyr <- function(df) {
df[df=="."] <- NA
df$group <- paste((na.locf(df$OC_DV, na.rm = FALSE)), lead(na.locf(df$OC_DV, na.rm = FALSE, fromLast = TRUE)), sep = "-")
df$group2 <- NA
df$group2[which(df$group != lag(df$group))] <- 1:length(which(df$group != lag(df$group)))
df$group2 <- na.locf(df$group2, na.rm = FALSE)
df %>% group_by(group, group2) %>%
summarise(Max = max(frspacing), Min = min(frspacing)) %>%
filter(!grepl("NA",group ))
}
microbenchmark(Joseph = MaxMinSeq(df), Cabana = TestDplyr(df))
Unit: microseconds
expr min lq mean median uq max neval
Joseph 338.671 377.6695 405.0257 405.9945 429.188 496.718 100
Cabana 2622.336 2698.2810 2890.5430 2765.6045 2977.427 7772.180 100
Here is a larger, less trivial example:
myDfs <- lapply(1:10000, function(x) df)
bigDf <- do.call(rbind, myDfs)
bigDf$frspacing[40:nrow(bigDf)] <- runif((nrow(bigDf)-39), 10, 100)
a <- MaxMinSeq(bigDf)
b <- TestDplyr(bigDf)
b <- b[order(b$group2),]
identical(a$Max, b$Max)
[1] TRUE
identical(a$Min, b$Min)
[1] TRUE
system.time(TestDplyr(bigDf))
user system elapsed
1.54 0.00 1.54
system.time(MaxMinSeq(bigDf))
user system elapsed
0.3 0.0 0.3
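As for the second part of the question, I'm not sure how general the OP wants the answer to be, especially when there are more than two distinct pairings in the result. For example, does the OP want to take the max of one row and compare it with the min of all the other rows, or only with its neighbors? The function below takes the first (i.e. the general) approach.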
GetDiff <- function(df) {
df2 <- cbind(df, t(sapply(1:nrow(df), function(x) {
c(rowMin = min(df[x,2:3]),
rowMax = max(df[x,2:3]))})))
myRows <- 1:nrow(df)
sapply(myRows, function(x) df2$rowMax[x] - min(df2$rowMin[-x]))
}
GetDiff(df2) ## df2 comes from above
[1] 0.95967 0.91122
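For the other reading, where each group is compared only with its neighbor, a sketch could look like the following. `GetNeighborDiff` is a hypothetical name and not part of the answer above; the input is the two-row summary table printed earlier, retyped by hand. Wrapping the last group around to the first is an assumption made only so that every row gets a partner.

```r
# Summary table as printed above (retyped by hand).
df2 <- data.frame(
  Group = c("CLDV-OPDV", "OPDV-CLDV"),
  Max   = c(36.54850, 36.61791),
  Min   = c(35.70669, 35.58883),
  stringsAsFactors = FALSE
)

# Hypothetical helper: each group's Max minus the Min of the NEXT group;
# the last group wraps around to the first (assumption for this sketch).
GetNeighborDiff <- function(d) {
  n <- nrow(d)
  if (n < 2L) stop("need at least two groups")
  next_idx <- c(seq_len(n)[-1L], 1L)
  d$Max - d$Min[next_idx]
}

GetNeighborDiff(df2)
```

With only two groups, the "all other rows" and "neighbors only" readings coincide, so this gives the same two differences as GetDiff; the two approaches only diverge once there are three or more groups.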