Extracting observations based on rownames and colnames ranges

Time: 2017-09-22 12:03:35

Tags: r

I have two data frames: a base data frame and a query data frame.

Base data frame (base_df):

    Mon   Tue   Wed   Thu   Fri   Sat
A  5.23  0.01  6.81  8.67  0.10  6.21
B  6.26  2.19  4.28  5.57  0.16  2.81
C  7.41  2.63  4.32  6.57  0.20  1.69
D  6.17  1.50  5.30  9.22  2.19  5.47
E  1.23  9.01  8.09  1.29  7.65  4.57

Query data frame (query_df):

Person  Start  End
     A    Tue  Thu
     C    Mon  Wed
     D    Thu  Sat
     C    Thu  Sat
     B    Wed  Fri

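For each row of query_df, I want to extract all observations for that person between the start day and the end day. The span from the start day to the end day is always 3 days (start and end inclusive).

So the desired output is: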

Person  Start  End     D1    D2    D3
     A    Tue  Thu   0.01  6.81  8.67
     C    Mon  Wed   7.41  2.63  4.32
     D    Thu  Sat   9.22  2.19  5.47
     C    Thu  Sat   6.57  0.20  1.69
     B    Wed  Fri   4.28  5.57  0.16

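I want to avoid loops because the actual base_df has more than 35000 rows. Is there a data.table solution? Solutions using other data structures are welcome too. Thanks!

5 answers: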

Answer 0 (score: 3)

Another base R solution, using mapply ...

query_df <- cbind(query_df,
                  t(mapply(function(p,s,e) {
                     base_df[p, match(s, names(base_df)):match(e, names(base_df))]},
                           query_df$Person,
                           query_df$Start,
                           query_df$End)))
names(query_df)[4:6] <- c("D1", "D2", "D3")

query_df
  Person Start End   D1   D2   D3
1      A   Tue Thu 0.01 6.81 8.67
2      C   Mon Wed 7.41 2.63 4.32
3      D   Thu Sat 9.22 2.19 5.47
4      C   Thu Sat 6.57  0.2 1.69
5      B   Wed Fri 4.28 5.57 0.16

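Answer 1 (score: 2)

Thanks to a non-equi join, the following data.table solution also works for a varying number of days between Start and End (not only 3 days). It uses melt() / dcast() for reshaping: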

library(data.table)
setDT(base_df)
setDT(query_df)

# reshape from wide to long
long <- melt(base_df, id.vars = "Person", variable.name = "Day")

# align factor levels
cols <- c("Start", "End")
query_df[, (cols) := lapply(.SD, factor, levels = levels(long$Day)), .SDcols = cols][
  # add row id because Person is not unique
  , rn := .I]

# non-equi right join, i.e., take all rows of query_df
long[query_df, on = .(Person, Day >= Start, Day <= End), 
     .(rn, Person, Start = i.Start, End = i.End, value)][
       # reshape from long to wide
       , dcast(.SD, rn + Person + ... ~ rowid(rn, prefix = "D"))]
   rn Person Start End   D1   D2   D3
1:  1      A   Tue Thu 0.01 6.81 8.67
2:  2      C   Mon Wed 7.41 2.63 4.32
3:  3      D   Thu Sat 9.22 2.19 5.47
4:  4      C   Thu Sat 6.57 0.20 1.69
5:  5      B   Wed Fri 4.28 5.57 0.16

Note that Day is a factor whose levels are the weekday names in order of appearance:

 str(long)
Classes ‘data.table’ and 'data.frame':    30 obs. of  3 variables:
 $ Person: chr  "A" "B" "C" "D" ...
 $ Day   : Factor w/ 6 levels "Mon","Tue","Wed",..: 1 1 1 1 1 2 2 2 2 2 ...
 $ value : num  5.23 6.26 7.41 6.17 1.23 0.01 2.19 2.63 1.5 9.01 ...
 - attr(*, ".internal.selfref")=<externalptr>

Aligning the factor levels is crucial for the non-equi join.

Data
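To see why the level order matters, here is a minimal base R sketch (not part of the answer's code; the names x, weekday_levels, default_codes and aligned_codes are illustrative). A factor's underlying integer codes follow its level order, so a range condition such as Day >= Start presumably only behaves as intended when Start, End and Day share the same levels in weekday order.

```r
# Illustrative vector of day names (an assumption for this sketch)
x <- c("Tue", "Mon", "Thu")

# Default factor levels are sorted alphabetically: Mon, Thu, Tue
default_codes <- as.integer(factor(x))

# Explicit levels in weekday order, as in levels(long$Day)
weekday_levels <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
aligned_codes <- as.integer(factor(x, levels = weekday_levels))

default_codes  # 3 1 2
aligned_codes  # 2 1 4
```

With the default levels, "Thu" gets a smaller code than "Tue", which no longer reflects the order of the weekdays.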

library(data.table)

base_df <- fread(
  "Person    Mon   Tue   Wed   Thu   Fri   Sat
A  5.23  0.01  6.81  8.67  0.10  6.21
B  6.26  2.19  4.28  5.57  0.16  2.81
C  7.41  2.63  4.32  6.57  0.20  1.69
D  6.17  1.50  5.30  9.22  2.19  5.47
E  1.23  9.01  8.09  1.29  7.65  4.57"
)

query_df <- fread(
  "Person  Start  End
  A    Tue  Thu
  C    Mon  Wed
  D    Thu  Sat
  C    Thu  Sat
  B    Wed  Fri"
)

Answer 2 (score: 1)

data.table solution:

Here I use get to extract columns (for example, Mon) from the data.table object.

library(data.table)
# Prepare data
# Order of days (taken before the Person column is added)
days <- names(base_df)
base_df$Person <- rownames(base_df)
d <- merge(query_df, base_df, "Person", sort = FALSE)
setDT(d)

# Extract mid day (day between start and end)
d[, Mid := days[which(Start == days) + 1], 1:nrow(d)]
# Extract columns using get
d[, .(Person, Start, End, 
      D1 = get(Start), D2 = get(Mid), D3 = get(End)), 1:nrow(d)][, nrow := NULL][]

   Person Start End   D1   D2   D3
1:      A   Tue Thu 0.01 6.81 8.67
2:      C   Mon Wed 7.41 2.63 4.32
3:      D   Thu Sat 9.22 2.19 5.47
4:      C   Thu Sat 6.57 0.20 1.69
5:      B   Wed Fri 4.28 5.57 0.16

Base R solution:

# Order of days
days <- names(base_df)
# Order of persons
subjects <- rownames(base_df)

res <- apply(query_df, 1, function(x) {
    # Extract observation between start:end date
    foo <- base_df[x[1] == subjects, which(x[2] == days):which(x[3] == days)]
    colnames(foo) <- paste0("D", 1:3)
    foo})
# Merge with original query_df
res <- cbind(query_df, do.call("rbind", res))
rownames(res) <- NULL
res

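Answer 3 (score: 1)

tidyverse answer

I reshape base_df to long format, join it with query_df, slice out the right days, and then reshape back to wide format.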

library(tidyr)
library(dplyr)

base_df <- tibble::rownames_to_column(base_df, 'Person')
days <- names(base_df)[-1]

base_df %>% 
  gather(day, value, -Person) %>% 
  right_join(mutate(query_df, i = row_number())) %>% 
  group_by(i) %>% 
  slice(which(days == Start):which(days == End)) %>% 
  mutate(col = c('D1', 'D2', 'D3')) %>% 
  select(-day, -i) %>% 
  spread(col, value) 

Answer 4 (score: 1)

A base R solution using numeric matrix indexing:

ri <- match(query_df$Person, rownames(base_df))
ci <- match(query_df$Start, names(base_df))
cbind(query_df, `dim<-`(base_df[cbind(ri, rep(ci, 3) + rep(0:2, each = nrow(query_df)))],
                        c(nrow(query_df), 3)))

#   Person Start End    1    2    3
# 1      A   Tue Thu 0.01 6.81 8.67
# 2      C   Mon Wed 7.41 2.63 4.32
# 3      D   Thu Sat 9.22 2.19 5.47
# 4      C   Thu Sat 6.57 0.20 1.69
# 5      B   Wed Fri 4.28 5.57 0.16
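
The same matrix-indexing idea can be sketched in a self-contained form that also names the result columns D1 to D3. This is a variant under assumptions, not the answer's exact code: it converts base_df to a matrix first, and the names m, offsets, vals and res are illustrative.

```r
# Rebuild the example data from the question
base_df <- data.frame(
  Mon = c(5.23, 6.26, 7.41, 6.17, 1.23),
  Tue = c(0.01, 2.19, 2.63, 1.50, 9.01),
  Wed = c(6.81, 4.28, 4.32, 5.30, 8.09),
  Thu = c(8.67, 5.57, 6.57, 9.22, 1.29),
  Fri = c(0.10, 0.16, 0.20, 2.19, 7.65),
  Sat = c(6.21, 2.81, 1.69, 5.47, 4.57),
  row.names = c("A", "B", "C", "D", "E")
)
query_df <- data.frame(
  Person = c("A", "C", "D", "C", "B"),
  Start  = c("Tue", "Mon", "Thu", "Thu", "Wed"),
  End    = c("Thu", "Wed", "Sat", "Sat", "Fri"),
  stringsAsFactors = FALSE
)

m  <- as.matrix(base_df)
ri <- match(query_df$Person, rownames(m))  # row index of each person
ci <- match(query_df$Start, colnames(m))   # column index of each start day

# A two-column index matrix picks one cell per query row;
# repeat for the start day and the two days after it
offsets <- 0:2
vals <- sapply(offsets, function(k) m[cbind(ri, ci + k)])
colnames(vals) <- paste0("D", seq_along(offsets))

res <- cbind(query_df, vals)
res
```

Like the answer above, this relies on the span always being 3 days; a varying span would need a different approach, such as the non-equi join in Answer 1.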