I have two data frames: a base data frame and a query data frame.
Base data frame (base_df):
Mon Tue Wed Thu Fri Sat
A 5.23 0.01 6.81 8.67 0.10 6.21
B 6.26 2.19 4.28 5.57 0.16 2.81
C 7.41 2.63 4.32 6.57 0.20 1.69
D 6.17 1.50 5.30 9.22 2.19 5.47
E 1.23 9.01 8.09 1.29 7.65 4.57
Query data frame (query_df):
Person Start End
A Tue Thu
C Mon Wed
D Thu Sat
C Thu Sat
B Wed Fri
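I want to extract all observations for a given person between the start day and the end day. The span from start to end always covers 3 days (including both the start day and the end day).
So the desired output is: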
Person Start End D1 D2 D3
A Tue Thu 0.01 6.81 8.67
C Mon Wed 7.41 2.63 4.32
D Thu Sat 9.22 2.19 5.47
C Thu Sat 6.57 0.20 1.69
B Wed Fri 4.28 5.57 0.16
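I want to avoid loops, because the actual base_df has more than 35,000 rows. Is there a data.table solution? Solutions using other data structures are fine too. Thanks!
Answer 0 (score: 3)
Another base R solution, using mapply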
...
query_df <- cbind(query_df,
t(mapply(function(p,s,e) {
base_df[p, match(s, names(base_df)):match(e, names(base_df))]},
query_df$Person,
query_df$Start,
query_df$End)))
names(query_df)[4:6] <- c("D1", "D2", "D3")
query_df
Person Start End D1 D2 D3
1 A Tue Thu 0.01 6.81 8.67
2 C Mon Wed 7.41 2.63 4.32
3 D Thu Sat 9.22 2.19 5.47
4 C Thu Sat 6.57 0.2 1.69
5 B Wed Fri 4.28 5.57 0.16
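Answer 1 (score: 2)
Because it relies on a non-equi join, the data.table solution below also works for a varying number of days between Start and End (not just 3 days). It uses melt() / dcast() for reshaping: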
library(data.table)
setDT(base_df)
setDT(query_df)
# reshape from wide to long
long <- melt(base_df, id.vars = "Person", variable.name = "Day")
# align factor levels
cols <- c("Start", "End")
query_df[, (cols) := lapply(.SD, factor, levels = levels(long$Day)), .SDcols = cols][
# add row id because Person is not unique
, rn := .I]
# non-equi right join, i.e., keep all rows of query_df
long[query_df, on = .(Person, Day >= Start, Day <= End),
.(rn, Person, Start = i.Start, End = i.End, value)][
# reshape from long to wide
, dcast(.SD, rn + Person + ... ~ rowid(rn, prefix = "D"))]
rn Person Start End D1 D2 D3
1: 1 A Tue Thu 0.01 6.81 8.67
2: 2 C Mon Wed 7.41 2.63 4.32
3: 3 D Thu Sat 9.22 2.19 5.47
4: 4 C Thu Sat 6.57 0.20 1.69
5: 5 B Wed Fri 4.28 5.57 0.16
Note that Day is a factor whose levels are the weekday names in order of appearance:
str(long)
Classes ‘data.table’ and 'data.frame': 30 obs. of 3 variables:
$ Person: chr "A" "B" "C" "D" ...
$ Day : Factor w/ 6 levels "Mon","Tue","Wed",..: 1 1 1 1 1 2 2 2 2 2 ...
$ value : num 5.23 6.26 7.41 6.17 1.23 0.01 2.19 2.63 1.5 9.01 ...
- attr(*, ".internal.selfref")=<externalptr>
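To illustrate why the order of the factor levels matters, here is a minimal sketch in plain base R. It only shows how the levels determine a factor's underlying integer codes. That the non-equi join compares these codes is an assumption on my part, not something the answer states. The names week, default_start and aligned_start are made up for this illustration.

```r
# Weekday order as it appears in the column names of base_df
week <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Without explicit levels, factor() sorts the levels alphabetically,
# so the integer codes do not follow the order of the week
default_start <- factor(c("Tue", "Thu"))
levels(default_start)      # "Thu" "Tue"
as.integer(default_start)  # 2 1

# With explicit levels, the codes follow the weekday order,
# so comparisons such as Day >= Start are meaningful
aligned_start <- factor(c("Tue", "Thu"), levels = week)
as.integer(aligned_start)  # 2 4
```

With the default levels, "Tue" would rank after "Thu", which is the opposite of the calendar order the join conditions rely on.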
Aligning the factor levels is essential for the non-equi join.
Data:
library(data.table)
base_df <- fread(
"Person Mon Tue Wed Thu Fri Sat
A 5.23 0.01 6.81 8.67 0.10 6.21
B 6.26 2.19 4.28 5.57 0.16 2.81
C 7.41 2.63 4.32 6.57 0.20 1.69
D 6.17 1.50 5.30 9.22 2.19 5.47
E 1.23 9.01 8.09 1.29 7.65 4.57"
)
query_df <- fread(
"Person Start End
A Tue Thu
C Mon Wed
D Thu Sat
C Thu Sat
B Wed Fri"
)
Answer 2 (score: 1)
data.table solution:
Here I use get to extract a column (e.g. Mon) from the data.table object.
library(data.table)
# Prepare data
# Order of days (taken before the Person column is added)
days <- names(base_df)
base_df$Person <- rownames(base_df)
d <- merge(query_df, base_df, "Person", sort = FALSE)
setDT(d)
# Extract mid day (day between start and end)
d[, Mid := days[which(Start == days) + 1], 1:nrow(d)]
# Extract columns using get
d[, .(Person, Start, End,
D1 = get(Start), D2 = get(Mid), D3 = get(End)), 1:nrow(d)][, nrow := NULL][]
Person Start End D1 D2 D3
1: A Tue Thu 0.01 6.81 8.67
2: C Mon Wed 7.41 2.63 4.32
3: D Thu Sat 9.22 2.19 5.47
4: C Thu Sat 6.57 0.20 1.69
5: B Wed Fri 4.28 5.57 0.16
Base R solution:
# Order of days
days <- names(base_df)
# Order of persons
subjects <- rownames(base_df)
res <- apply(query_df, 1, function(x) {
# Extract observation between start:end date
foo <- base_df[x[1] == subjects, which(x[2] == days):which(x[3] == days)]
colnames(foo) <- paste0("D", 1:3)
foo})
# Merge with original query_df
res <- cbind(query_df, do.call("rbind", res))
rownames(res) <- NULL
res
Answer 3 (score: 1)
A tidyverse answer
I reshape base_df to long format, then join and slice the right days, then reshape back to wide.
library(tidyr)
library(dplyr)
base_df <- tibble::rownames_to_column(base_df, 'Person')
days <- names(base_df)[-1]
base_df %>%
gather(day, value, -Person) %>%
right_join(mutate(query_df, i = row_number())) %>%
group_by(i) %>%
slice(which(days == Start):which(days == End)) %>%
mutate(col = c('D1', 'D2', 'D3')) %>%
select(-day, -i) %>%
spread(col, value)
Answer 4 (score: 1)
A base solution using numeric matrix indexing:
ri <- match(query_df$Person, rownames(base_df))
ci <- match(query_df$Start, names(base_df))
cbind(query_df, `dim<-`(base_df[cbind(ri, rep(ci, 3) + rep(0:2, each = nrow(query_df)))],
c(nrow(query_df), 3)))
# Person Start End 1 2 3
# 1 A Tue Thu 0.01 6.81 8.67
# 2 C Mon Wed 7.41 2.63 4.32
# 3 D Thu Sat 9.22 2.19 5.47
# 4 C Thu Sat 6.57 0.20 1.69
# 5 B Wed Fri 4.28 5.57 0.16
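For readers unfamiliar with matrix indexing, the following is a small sketch of the same idea on toy data. The data frame toy and the hard-coded ri / ci values are made up for this illustration; in the answer above they come from match(). Each row of the two-column index matrix is one (row, column) pair to pick.

```r
# Toy data (hypothetical, smaller than the question's base_df)
toy <- data.frame(
  Mon = c(5.23, 6.26),
  Tue = c(0.01, 2.19),
  Wed = c(6.81, 4.28),
  Thu = c(8.67, 5.57),
  row.names = c("A", "B")
)

# One query per element: A starts at Tue (column 2), B starts at Mon (column 1)
ri <- c(1, 2)
ci <- c(2, 1)
n  <- length(ri)

# Two-column index matrix: row positions repeated for each of the 3 days,
# column positions shifted by the offsets 0, 1, 2
idx <- cbind(rep(ri, 3), rep(ci, 3) + rep(0:2, each = n))

# Indexing with a two-column matrix returns one value per (row, column) pair;
# matrix() then puts each query's 3 values into one row
picked <- matrix(as.matrix(toy)[idx], nrow = n)
picked
```

Row 1 of picked holds A's values for Tue to Thu and row 2 holds B's values for Mon to Wed. Note that this approach assumes every window has the same fixed length (3 here), unlike the non-equi join in Answer 1.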