
Time: 2018-01-23 18:45:29

Tags: r


A <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15) #15 minute intervals
B <- c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0) #Blood glucose mmolperL
C <- c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA) #Systolic Blood pressure
DF <- data.frame(A, B, C) # a data frame (not a matrix) so that DF$C works



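Note: this is a sample dataset from one experiment. In other experiments the time intervals are not as regular, so we cannot use ordinary subsetting criteria based on column A. The real data frame is also much larger, with many more rows between blood pressure readings. I have simplified the data frame for this example.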

2 answers:

Answer 0 (score: 2)

A possible solution:

w <- which(!is.na(DF$C))

DF[w, 'B'] <- aggregate(B ~ rep(1:length(w), each = 3), DF[rep(w, each = 3) + c(-1,0,1),], median)$B

which gives:

> DF
    A    B   C
1   1  4.1  NA
2   2  3.3  NA
3   3  3.9 130
4   4  3.9  NA
5   5  2.9  NA
6   6  3.6  NA
7   7  4.8  NA
8   8  4.8 115
9   9  5.0  NA
10 10  4.4  NA
11 11  4.9  NA
12 12  8.5 120
13 13  8.5  NA
14 14 11.0  NA
15 15 14.0  NA

What this does:

  • w <- which(!is.na(DF$C)) creates an index w where C is not NA.
  • With aggregate you can calculate the median for the needed rows. In this case I chose to take only the row itself and the row before and after the row where C has a value.
  • DF[rep(w, each = 3) + c(-1,0,1),] filters DF to only the needed rows
  • rep(1:length(w), each = 3) creates a grouping vector for aggregate
  • The result is assigned back to the B column for the row numbers in w.
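The steps in the list above can be inspected one at a time on the sample data. This is only a sketch: the names idx, grp and med are introduced here for illustration, and tapply stands in for aggregate to keep the intermediate result a plain vector.

```r
# Sample data from the question
DF <- data.frame(
  A = 1:15,
  B = c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0),
  C = c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA)
)

w   <- which(!is.na(DF$C))               # rows with a reading: 3 8 12
idx <- rep(w, each = 3) + c(-1, 0, 1)    # needed rows: 2 3 4 7 8 9 11 12 13
grp <- rep(seq_along(w), each = 3)       # grouping vector: 1 1 1 2 2 2 3 3 3

# Assumption: no reading sits in the first or last row, otherwise idx
# would point outside the data frame (row 0 or row nrow(DF) + 1)
stopifnot(min(idx) >= 1, max(idx) <= nrow(DF))

# One median per group of three rows (tapply used here instead of aggregate)
med <- as.vector(tapply(DF$B[idx], grp, median))
med                                      # 3.9 4.8 8.5

DF$B[w] <- med
```

Because every window is read from the original B values before anything is assigned, the result does not depend on the order in which the readings are processed.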

You can also use this logic with the data.table-package:

# load the 'data.table' package and convert 'DF' to a data.table with 'setDT'
library(data.table)
setDT(DF)

# create two indexes:
# 'i1' for when 'C' has a value
# 'i2' which includes the previous and the next row for each value in 'i1'
i1 <- DF[, .I[!is.na(C)]]
i2 <- rep(i1, each = 3)

# replace 'B' by reference with the median
DF[i1, B := DF[i2 + -1:1, median(B), i2]$V1][]
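To see what the grouped inner call returns before it is assigned back by reference, here is a small sketch on a fresh copy of the sample data. The names DT and meds are introduced only for this illustration and are not part of the answer.

```r
library(data.table)

# 'DT' is a hypothetical name used only for this illustration
DT <- data.table(
  A = 1:15,
  B = c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0),
  C = c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA)
)

i1 <- DT[, .I[!is.na(C)]]   # rows where C has a value: 3 8 12
i2 <- rep(i1, each = 3)     # each of those rows repeated three times

# i2 + -1:1 recycles the offsets -1, 0, 1, selecting rows 2:4, 7:9 and 11:13;
# grouping by i2 then gives one median per blood pressure reading
meds <- DT[i2 + -1:1, median(B), by = i2]
meds$V1                     # 3.9 4.8 8.5
```

The column V1 is the default name data.table gives to an unnamed result in j, which is why the answer extracts it with $V1.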

Because the actual data is a lot larger (as stated in the question), it is worthwhile to test the different solutions on a much larger dataset.

First, let's create a big dataset that mimics the original DF from the question:

DFbig <- DF[sample(nrow(DF), 1e7, TRUE),]
i <- DFbig[, .I[!is.na(C) & (!is.na(shift(C, type = 'lag')) | !is.na(shift(C, type = 'lead')))]]
d <- c(2L,diff(i))
i <- i[d > 1]
DFbig2 <- DFbig[!i]

The timings for the base R solution:

DFtest <- as.data.frame(DFbig2)

system.time({w <- which(!is.na(DFtest$C)); DFtest[w, 'B'] <- aggregate(B ~ rep(1:length(w), each = 3), DFtest[rep(w, each = 3) + c(-1,0,1),], median)$B})
   user  system elapsed 
 52.049   0.997  53.084

The timings for the dplyr solution:

library(dplyr)
DFtest <- as.data.frame(DFbig2)

system.time(DFtest %>% mutate(lag_B = lag(B), lead_B = lead(B)) %>% rowwise() %>% mutate(B = ifelse(is.na(C), NA_integer_, median(c(lag_B, B, lead_B))) ) %>% select(A, B, C))
   user  system elapsed 
174.725   1.652 176.721

The timings for the data.table solution:

DFtest <- copy(DFbig2)

system.time({i1 <- DFtest[, .I[!is.na(C)]]; i2 <- rep(i1, each = 3); DFtest[i1, B := DFtest[i2 + -1:1, median(B), i2]$V1][]})
   user  system elapsed 
  0.300   0.057   0.359

As is quite clear from the test results, the data.table solution is the fastest, followed by the base R solution, while the dplyr solution is by far the slowest.
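Part of the cost in the slower solutions comes from calling median once per row or per group. As a rough sketch only, and not one of the solutions timed above, the median of three values can also be computed for all rows at once with pmin and pmax. The names prev_B, next_B and med3 are hypothetical helpers, and no timing is claimed for this variant.

```r
# Sample data from the question
DF <- data.frame(
  A = 1:15,
  B = c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0),
  C = c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA)
)

# Previous and next value of B for every row (NA at the two ends)
prev_B <- c(NA, head(DF$B, -1))
next_B <- c(tail(DF$B, -1), NA)

# Median of three = the middle value:
# max(min(a, b), min(max(a, b), c)), applied element-wise
med3 <- pmax(pmin(prev_B, DF$B), pmin(pmax(prev_B, DF$B), next_B))

# Replace B only where C has a reading.
# Assumption: a reading in the first or last row would get NA here,
# because one of its neighbours does not exist.
w <- which(!is.na(DF$C))
DF$B[w] <- med3[w]
DF$B[w]   # 3.9 4.8 8.5
```

This only works for a fixed window of three rows; a wider window would need a different approach.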

Used data:

DF <- data.frame(A = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15),
                 B = c(4.1, 3.3, 11.7, 3.9, 2.9, 3.6, 4.8, 3.5, 5.0, 4.4, 4.9, 9.9, 8.5, 11.0, 14.0),
                 C = c(NA, NA, 130, NA, NA, NA, NA, 115, NA, NA, NA, 120, NA, NA, NA))

Answer 1 (score: 0)



    library(dplyr)

    DF %>%
      mutate(lag_B = lag(B), lead_B = lead(B)) %>%
      rowwise() %>%
      mutate(median_B = ifelse(is.na(C), NA_integer_,median(c(lag_B, B, lead_B))) ) %>%
      select(A, B, C, median_B)

# A tibble: 15 x 4
#       A     B     C median_B
#   <dbl> <dbl> <dbl>    <dbl>
# 1  1.00  4.10    NA    NA   
# 2  2.00  3.30    NA    NA   
# 3  3.00 11.7    130     3.90
# 4  4.00  3.90    NA    NA   
# 5  5.00  2.90    NA    NA   
# 6  6.00  3.60    NA    NA   
# 7  7.00  4.80    NA    NA   
# 8  8.00  3.50   115     4.80
# 9  9.00  5.00    NA    NA   
#10 10.0   4.40    NA    NA   
#11 11.0   4.90    NA    NA   
#12 12.0   9.90   120     8.50
#13 13.0   8.50    NA    NA   
#14 14.0  11.0     NA    NA   
#15 15.0  14.0     NA    NA