My friend and I have been racking our brains over how to find the median from the following example dataset:
A <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) #15 minute intervals
B <- c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0) #Blood glucose mmolperL
C <- c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA) #Systolic Blood pressure
DF <- cbind(A,B,C)
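From the dataset above, we want to know the median blood glucose value (column B) at the time the systolic blood pressure (column C) was taken. The problem is that the first blood glucose reading (11.7), which is on the same row as the systolic blood pressure reading (130), is completely different from the other readings around that time point.
We want to cluster the blood glucose data points around this 11.7 value, calculate their median, and assign it to the corresponding blood pressure reading.
Note: this is just one example dataset from a single experiment. In other experiments the time intervals are not as regular, so we cannot use a regular subsetting criterion based on column A. The real data frames are also much larger, with more rows between blood pressure readings. I simplified the data frame for this example.
Answer 0 (score: 2)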
A possible solution:
w <- which(!is.na(DF$C))
DF[w, 'B'] <- aggregate(B ~ rep(1:length(w), each = 3), DF[rep(w, each = 3) + c(-1,0,1),], median)$B
which gives:
> DF
    A    B   C
1   1  4.1  NA
2   2  3.3  NA
3   3  3.9 130
4   4  3.9  NA
5   5  2.9  NA
6   6  3.6  NA
7   7  4.8  NA
8   8  4.8 115
9   9  5.0  NA
10 10  4.4  NA
11 11  4.9  NA
12 12  8.5 120
13 13  8.5  NA
14 14 11.0  NA
15 15 14.0  NA
What this does:
w <- which(!is.na(DF$C)) creates an index w of the rows where C is not NA.
With aggregate you can calculate the median for the needed rows. In this case I chose to take only the row itself and the rows directly before and after the row where C has a value.
DF[rep(w, each = 3) + c(-1,0,1),] filters DF to only the needed rows.
rep(1:length(w), each = 3) creates a grouping vector for aggregate.
The resulting medians are then assigned to the B-column for the row numbers in w.
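To make those steps concrete, the intermediate objects can be inspected one at a time. This is only an illustrative sketch on the sample data: the names rows, grp and med are introduced here and are not part of the answer, and tapply stands in for aggregate purely to keep the output compact.

```r
# Sample data from the question, built as a data.frame
DF <- data.frame(A = 1:15,
                 B = c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0),
                 C = c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA))

# Rows that hold a blood pressure reading
w <- which(!is.na(DF$C))                  # 3 8 12

# Each of those rows together with its previous and next row
rows <- rep(w, each = 3) + c(-1, 0, 1)    # 2 3 4 7 8 9 11 12 13

# One group id per blood pressure reading
grp <- rep(seq_along(w), each = 3)        # 1 1 1 2 2 2 3 3 3

# Median of B within each group of three rows (tapply used instead of aggregate)
med <- tapply(DF$B[rows], grp, median)    # 3.9 4.8 8.5
```

The three medians line up with the values that end up in rows 3, 8 and 12 of the B-column in the output shown earlier.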
You can also use this logic with the data.table-package:
# load the 'data.table'-package and convert 'DF' to a data.table with 'setDT'
library(data.table)
setDT(DF)
# create two indexes:
# 'i1' for when 'C' has a value
# 'i2' which includes the previous and the next row for each value in 'i1'
i1 <- DF[, .I[!is.na(C)]]
i2 <- rep(i1, each = 3)
# replace 'B' by reference with the median
DF[i1, B := DF[i2 + -1:1, median(B), i2]$V1][]
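One caveat of the fixed -1/0/+1 offsets: if a blood pressure reading sits in the first or last row, the offsets produce a row index of 0 or nrow(DF) + 1, which does not select a real row. The following is a minimal base R sketch of one way to guard against that, and to make the window size adjustable. The helper window_median and its half-width parameter k are hypothetical names introduced for illustration, not part of either answer.

```r
# Hypothetical helper: median of x in a window of +/- k positions around each
# index in 'at', with the window clamped to the valid range of x
window_median <- function(x, at, k = 1) {
  vapply(at, function(i) {
    idx <- max(1, i - k):min(length(x), i + k)
    median(x[idx], na.rm = TRUE)
  }, numeric(1))
}

# Sample data from the question
B <- c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0)
C <- c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA)

w <- which(!is.na(C))
m1 <- window_median(B, w)          # same three-row window as above
m2 <- window_median(B, w, k = 2)   # wider five-row window
edge <- window_median(B, 1)        # first row: window is clamped to rows 1:2
```

This loops over the readings with vapply, so it is meant to clarify the windowing logic rather than to compete with the data.table version on speed.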
Because the actual data is a lot larger (as stated in the question), it is worthwhile to test the different solutions on a much larger dataset.
First, let's create a big dataset that mimics the original DF from the question:
DFbig <- DF[sample(nrow(DF), 1e7, TRUE),]
setDT(DFbig)
i <- DFbig[, .I[!is.na(C) & (!is.na(shift(C, type = 'lag')) | !is.na(shift(C, type = 'lead')))]]
d <- c(2L,diff(i))
i <- i[d > 1]
DFbig2 <- DFbig[!i]
The timings for the base R solution:
DFtest <- as.data.frame(DFbig2)
system.time(
{w <- which(!is.na(DFtest$C)); DFtest[w, 'B'] <- aggregate(B ~ rep(1:length(w), each = 3), DFtest[rep(w, each = 3) + c(-1,0,1),], median)$B}
)
   user  system elapsed
 52.049   0.997  53.084
The timings for the dplyr solution:
DFtest <- as.data.frame(DFbig2)
system.time(
DFtest %>% mutate(lag_B = lag(B), lead_B = lead(B)) %>% rowwise() %>% mutate(B = ifelse(is.na(C), NA_integer_, median(c(lag_B, B, lead_B))) ) %>% select(A, B, C)
)
   user  system elapsed
174.725   1.652 176.721
The timings for the data.table solution:
DFtest <- copy(DFbig2)
system.time(
{i1 <- DFtest[, .I[!is.na(C)]]; i2 <- rep(i1, each = 3); DFtest[i1, B := DFtest[i2 + -1:1, median(B), i2]$V1][]}
)
   user  system elapsed
  0.300   0.057   0.359
As is quite clear from the test results, the data.table-solution is the fastest, followed by the base R solution, while the dplyr-solution is by far the slowest.
Used data:
DF <- data.frame(A = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
B = c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0),
C = c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA))
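Answer 1 (score: 0)
Although @Jaap has provided a very good solution to the original question, I was still looking for an approach that does not use aggregate.
I wanted to use the previous, current and next readings of B (for rows where C contains a valid value) to calculate the median.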
library(dplyr)
DF %>%
mutate(lag_B = lag(B), lead_B = lead(B)) %>%
rowwise() %>%
mutate(median_B = ifelse(is.na(C), NA_integer_,median(c(lag_B, B, lead_B))) ) %>%
select(A, B, C, median_B)
Results:
# A tibble: 15 x 4
# A B C median_B
# <dbl> <dbl> <dbl> <dbl>
# 1 1.00 4.10 NA NA
# 2 2.00 3.30 NA NA
# 3 3.00 11.7 130 3.90
# 4 4.00 3.90 NA NA
# 5 5.00 2.90 NA NA
# 6 6.00 3.60 NA NA
# 7 7.00 4.80 NA NA
# 8 8.00 3.50 115 4.80
# 9 9.00 5.00 NA NA
#10 10.0 4.40 NA NA
#11 11.0 4.90 NA NA
#12 12.0 9.90 120 8.50
#13 13.0 8.50 NA NA
#14 14.0 11.0 NA NA
#15 15.0 14.0 NA NA