Based on another column

Time: 2018-01-23 18:45:29

Tags: r

My friend and I have been racking our brains over how to find a median from the following example dataset:

A <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) #15 minute intervals
B <- c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0) #Blood glucose mmolperL
C <- c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA) #Systolic Blood pressure
DF <- cbind(A,B,C)

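From the dataset above, we want to know the median blood glucose (column B) at the times when systolic blood pressure (column C) was measured. The problem is that the first blood glucose reading (11.7), which sits in the same row as a systolic blood pressure reading (130), is completely different from the other readings around that time point.

We would like to take the blood glucose data points around this 11.7 value, calculate their median, and assign it to the corresponding blood pressure reading.

Note!! This is an example dataset from one experiment. In other experiments the time intervals are not as tidy, so we cannot use a regular subsetting criterion based on column A. The real data frame is also much larger, with many more rows between blood pressure readings. I simplified the data frame for this example.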

2 answers:

Answer 0: (score: 2)

A possible solution:

w <- which(!is.na(DF$C))

DF[w, 'B'] <- aggregate(B ~ rep(1:length(w), each = 3), DF[rep(w, each = 3) + c(-1,0,1),], median)$B

which gives:

> DF
    A    B   C
1   1  4.1  NA
2   2  3.3  NA
3   3  3.9 130
4   4  3.9  NA
5   5  2.9  NA
6   6  3.6  NA
7   7  4.8  NA
8   8  4.8 115
9   9  5.0  NA
10 10  4.4  NA
11 11  4.9  NA
12 12  8.5 120
13 13  8.5  NA
14 14 11.0  NA
15 15 14.0  NA

What this does:

  • w <- which(!is.na(DF$C)) creates an index w where C is not NA.
  • With aggregate you can calculate the median for the needed rows. In this case I chose to take only the row itself and the row before and after the row where C has a value.
  • DF[rep(w, each = 3) + c(-1,0,1),] filters DF to only the needed rows
  • rep(1:length(w), each = 3) creates a grouping vector for aggregate
  • The result is assigned back to the B column for the row numbers in w.
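The steps above can be followed one at a time by printing the intermediate objects. This is a minimal sketch on the example data, not the answer's exact code: the names idx, grp and med are introduced here for illustration, and tapply is swapped in for aggregate to keep the result a plain vector.

```r
# Example data from the question, built as a data.frame so that DF$C works
DF <- data.frame(A = 1:15,
                 B = c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4,
                       4.9, 9.9, 8.5, 11.0, 14.0),
                 C = c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA,
                       NA, 120, NA, NA, NA))

w   <- which(!is.na(DF$C))              # rows with a blood pressure reading: 3 8 12
idx <- rep(w, each = 3) + c(-1, 0, 1)   # previous, current, next row: 2 3 4 7 8 9 11 12 13
grp <- rep(seq_along(w), each = 3)      # one group per reading: 1 1 1 2 2 2 3 3 3

# Median of B within each group of three rows (illustrative alternative to aggregate)
med <- tapply(DF$B[idx], grp, median)
print(med)                              # 3.9 4.8 8.5
```

Note that idx would contain 0 or nrow(DF) + 1 if a reading sat in the first or last row, so this sketch assumes every reading has a neighbour on both sides.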

You can also use this logic with the data.table-package:

# load the 'data.table'-package and convert 'DF' to a data.table with 'setDT'
library(data.table)
setDT(DF)

# create two indexes:
# 'i1' for when 'C' has a value
# 'i2' which includes the previous and the next row for each value in 'i1'
i1 <- DF[, .I[!is.na(C)]]
i2 <- rep(i1, each = 3)

# replace 'B' by reference with the median
DF[i1, B := DF[i2 + -1:1, median(B), i2]$V1][]
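Both versions use a fixed window of three rows and assume each reading has a row before and after it. Since the question mentions that the real data has more rows between readings, a wider window may be wanted. The following is only a sketch in base R under those assumptions: window_median and its parameter k are hypothetical names introduced here, and the window is clamped at the first and last row.

```r
# Hypothetical helper: median of x over rows (i - k):(i + k) for each i in rows,
# with the window clamped to the valid row range 1..length(x)
window_median <- function(x, rows, k = 1L) {
  n <- length(x)
  vapply(rows, function(i) {
    median(x[max(1L, i - k):min(n, i + k)], na.rm = TRUE)
  }, numeric(1))
}

B <- c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4,
       4.9, 9.9, 8.5, 11.0, 14.0)
C <- c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA)

w  <- which(!is.na(C))                 # rows with a blood pressure reading
m1 <- window_median(B, w, k = 1L)      # three-row window:  3.9 4.8 8.5
m2 <- window_median(B, w, k = 2L)      # five-row window:   3.9 4.4 8.5
```

Because it loops over the readings one by one, this is likely slower than the data.table version on large data; it is meant to show the windowing logic, not to be the fastest option.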

Because the actual data is a lot larger (as stated in the question), it is worthwhile to test the different solutions on a much larger dataset.

First, let's create a big dataset that mimics the original DF from the question:

DFbig <- DF[sample(nrow(DF), 1e7, TRUE),]
setDT(DFbig)
i <- DFbig[, .I[!is.na(C) & (!is.na(shift(C, type = 'lag')) | !is.na(shift(C, type = 'lead')))]]
d <- c(2L,diff(i))
i <- i[d > 1]
DFbig2 <- DFbig[!i]

The timings for the base R solution:

DFtest <- as.data.frame(DFbig2)

system.time(
  {w <- which(!is.na(DFtest$C)); DFtest[w, 'B'] <- aggregate(B ~ rep(1:length(w), each = 3), DFtest[rep(w, each = 3) + c(-1,0,1),], median)$B}
)
   user  system elapsed 
 52.049   0.997  53.084

The timings for the dplyr solution:

library(dplyr)  # needed for %>%, mutate, lag, lead, rowwise and select
DFtest <- as.data.frame(DFbig2)

system.time(
  DFtest %>% mutate(lag_B = lag(B), lead_B = lead(B)) %>% rowwise() %>% mutate(B = ifelse(is.na(C), NA_integer_, median(c(lag_B, B, lead_B))) ) %>% select(A, B, C)
)
   user  system elapsed 
174.725   1.652 176.721

The timings for the data.table solution:

DFtest <- copy(DFbig2)

system.time(
  {i1 <- DFtest[, .I[!is.na(C)]]; i2 <- rep(i1, each = 3); DFtest[i1, B := DFtest[i2 + -1:1, median(B), i2]$V1][]}
)
   user  system elapsed 
  0.300   0.057   0.359

As is quite clear from the test results, the data.table solution is the fastest, followed by the base R solution, while the dplyr solution is by far the slowest.
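A likely reason for the dplyr timing is rowwise(), which makes median() run once per row. When the window is exactly three rows, the median can instead be computed with vectorized pmin/pmax calls. The sketch below assumes that fixed window; median3, prev_B and next_B are names introduced here for illustration, and it has not been benchmarked against the timings above.

```r
# Hypothetical helper: element-wise median of three vectors,
# using median(x, y, z) = max(min(x, y), min(max(x, y), z))
median3 <- function(x, y, z) pmax(pmin(x, y), pmin(pmax(x, y), z))

DF <- data.frame(A = 1:15,
                 B = c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4,
                       4.9, 9.9, 8.5, 11.0, 14.0),
                 C = c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA,
                       NA, 120, NA, NA, NA))

n      <- nrow(DF)
prev_B <- c(NA, DF$B[-n])   # B from the previous row (lag)
next_B <- c(DF$B[-1], NA)   # B from the next row (lead)

# Keep the median only where a blood pressure reading exists
DF$median_B <- ifelse(is.na(DF$C), NA_real_, median3(prev_B, DF$B, next_B))
```

On the example data this gives 3.9, 4.8 and 8.5 in rows 3, 8 and 12, and NA elsewhere.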


Used data:

DF <- data.frame(A = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
                 B = c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0),
                 C = c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA))

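Answer 1: (score: 0)

Although @Jaap has provided a very good solution to the original question, I was still trying to find an approach that does not use aggregate.

The idea is to take the previous, next and current reading of B (where C contains a valid value) and calculate the median of those: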

    library(dplyr)
    DF %>%
      mutate(lag_B = lag(B), lead_B = lead(B)) %>%
      rowwise() %>%
      mutate(median_B = ifelse(is.na(C), NA_integer_,median(c(lag_B, B, lead_B))) ) %>%
      select(A, B, C, median_B)

Results:
# A tibble: 15 x 4
#       A     B     C median_B
#   <dbl> <dbl> <dbl>    <dbl>
# 1  1.00  4.10    NA    NA   
# 2  2.00  3.30    NA    NA   
# 3  3.00 11.7    130     3.90
# 4  4.00  3.90    NA    NA   
# 5  5.00  2.90    NA    NA   
# 6  6.00  3.60    NA    NA   
# 7  7.00  4.80    NA    NA   
# 8  8.00  3.50   115     4.80
# 9  9.00  5.00    NA    NA   
#10 10.0   4.40    NA    NA   
#11 11.0   4.90    NA    NA   
#12 12.0   9.90   120     8.50
#13 13.0   8.50    NA    NA   
#14 14.0  11.0     NA    NA   
#15 15.0  14.0     NA    NA