I have a data.frame in an interleaved format: there are two groups (A and B), and each group B row relates to the group A row immediately above it. For example:
set.seed(1)
df <- data.frame(group = c("A","B","A","B","A","B","B","A","B"),
id = c("A.1","B.1","A.2","B.2","A.3","B.3.1","B.3.2","A.4","B.4"),
score = runif(9,0,1))
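Group A never has consecutive rows. In my real data, the only way to relate groups A and B is by position: each group B row sits directly below the group A row it belongs to.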
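I would like to spread df so that the result has the following columns: idA, idB, scoreA, scoreB, with each group A row repeated once for every group B row that maps to it in df.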
So for this example, the resulting data.frame would be:
res.df <- data.frame(idA = c("A.1","A.2","A.3","A.3","A.4"),
idB = c("B.1","B.2","B.3.1","B.3.2","B.4"),
scoreA = df$score[c(1,3,5,5,8)],
scoreB = df$score[c(2,4,6,7,9)])
I imagine this can be done easily.
Any ideas?
Answer 0 (score: 2)
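You can create a sub_id column that marks which group A and group B rows belong on the same output row, split the data frame into an A data frame and a B data frame, and then join the two on the sub_id column: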
library(dplyr)
df %>%
  # each A row starts a new block, so the running count of A rows labels its B rows
  mutate(sub_id = cumsum(group == 'A')) %>%
  {full_join(
    filter(., group == 'A') %>% select(-group),
    filter(., group == 'B') %>% select(-group),
    by = 'sub_id',
    suffix = c('A', 'B')
  )} %>% select(-sub_id)
# idA scoreA idB scoreB
#1 A.1 0.2655087 B.1 0.3721239
#2 A.2 0.5728534 B.2 0.9082078
#3 A.3 0.2016819 B.3.1 0.8983897
#4 A.3 0.2016819 B.3.2 0.9446753
#5 A.4 0.6607978 B.4 0.6291140
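To see why the join keys line up, it can help to look at the intermediate sub_id values on their own. The following is a minimal base R sketch on the question's example data. The names `a`, `b` and `res` are illustrative only, and base `merge()` is swapped in for `full_join()`, so this is an equivalent-in-spirit variant rather than the answer's exact code.

```r
set.seed(1)
df <- data.frame(group = c("A","B","A","B","A","B","B","A","B"),
                 id = c("A.1","B.1","A.2","B.2","A.3","B.3.1","B.3.2","A.4","B.4"),
                 score = runif(9, 0, 1))

# Each A row starts a new block, so the running count of A rows
# labels every B row with the A row directly above it.
df$sub_id <- cumsum(df$group == "A")
df$sub_id
# [1] 1 1 2 2 3 3 3 4 4

# Split by group (a and b are hypothetical helper names)
a <- df[df$group == "A", c("sub_id", "id", "score")]
b <- df[df$group == "B", c("sub_id", "id", "score")]

# all = TRUE mirrors a full join; an A row is repeated once per matching B row
res <- merge(a, b, by = "sub_id", suffixes = c("A", "B"), all = TRUE)
res$sub_id <- NULL
res
```

Because A.3 and both B.3.x rows share sub_id 3, the merge pairs A.3 with each of them, which is the repetition asked for in the question.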
Or use data.table::dcast, which supports pivoting multiple value columns at once:
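library(data.table); library(zoo)
dcast(
  setDT(df)[,
    # create a row number column that indicates which output row the current row should go to;
    # shift() is data.table's lag, so this does not depend on dplyr::lag being attached
    rn := cumsum(!(group == 'B' & shift(group) == 'A'))
  ][],
  rn ~ group, value.var = c('id', 'score')
)[, `:=` (
  # rows holding only an extra B have NA in the A columns; carry the last A forward
  id_A = na.locf(id_A),
  score_A = na.locf(score_A),
  rn = NULL
)][]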
# id_A id_B score_A score_B
#1: A.1 B.1 0.2655087 0.3721239
#2: A.2 B.2 0.5728534 0.9082078
#3: A.3 B.3.1 0.2016819 0.8983897
#4: A.3 B.3.2 0.2016819 0.9446753
#5: A.4 B.4 0.6607978 0.6291140
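The row-number step can be checked in isolation. This is a small base R sketch of the same logic, using a manual lag instead of a library function; `prev` and `rn` are illustrative names.

```r
group <- c("A","B","A","B","A","B","B","A","B")

# Previous element of group, NA for the first row (a manual lag by one)
prev <- c(NA, head(group, -1))

# A B row directly after an A row shares that A row's output row;
# every other row (each A, and each additional B) starts a new output row.
rn <- cumsum(!(group == "B" & !is.na(prev) & prev == "A"))
rn
# [1] 1 1 2 2 3 3 4 5 5
```

Output row 4 holds only B.3.2, which is why the A columns are NA there after the pivot until the last A values are carried forward.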