数据如下所示。
time <- c('Nov 1st 2014, 17:36:50.000','Nov 1st 2014, 17:36:50.000',
'Nov 1st 2014, 17:36:50.000','Nov 1st 2014, 17:36:50.000', 'Nov 1st 2014, 17:37:50.000','Nov 1st 2014, 17:37:50.000','Nov 1st 2014, 17:37:50.000')
A <- c('20.79','NA','NA','NA','21.8','NA','NA')
B <- c('NA','97.017','94.321','85.014','NA','87.1','67.1')
C <- c('NA','C1','C2','C3','NA','C1','C2')
D <- c('L1','L1','L1','L1','L2','L2','L2')
C1 <- c('NA','NA','NA','NA','NA','NA','NA')
C2 <- c('NA','NA','NA','NA','NA','NA','NA')
C3 <- c('NA','NA','NA','NA','NA','NA','NA')
df <- data.frame(time,A,B,C,D,C1,C2,C3)
我需要以下格式的输出。
# time A B C D C1 C2 C3
# 1 Nov 1st 2014, 17:36:50.000 20.79 NA NA L1 97.02 94.321 85.014
Nov 1st 2014, 17:37:50.000 21.8 NA NA L2 87.1 67.1 47.3
由于所有行的“时间”和“ D”列都相同,我如何只在一行中获得上述格式的数据?
谢谢!
答案 0 :(得分:1)
您可以使用dplyr::gather()
将B重塑为C1,C2,C3,然后将dplyr::join()
重塑为其他列,并假定其唯一的日期/时间。
library(dplyr)
library(tidyr)
df %>%
select(time, A, B, C, D) %>%
filter(!is.na(A)) %>%
left_join(
df %>%
select(time, C, B, D) %>%
spread(C, B) %>%
select(-`<NA>`),
by = c("time", "D")
)
# time A B C D C1 C2 C3
# 1 Nov 1st 2014, 17:36:50.000 20.79 NA <NA> L1 97.017 94.321 85.014
# 2 Nov 1st 2014, 17:37:50.000 21.80 NA <NA> L2 87.100 67.100 47.300
df <- read.table(text = "time A B C D C1 C2 C3
1 'Nov 1st 2014, 17:36:50.000' 20.79 NA NA L1 NA NA NA
2 'Nov 1st 2014, 17:36:50.000' NA 97.017 C1 L1 NA NA NA
3 'Nov 1st 2014, 17:36:50.000' NA 94.321 C2 L1 NA NA NA
4 'Nov 1st 2014, 17:36:50.000' NA 85.014 C3 L1 NA NA NA
5 'Nov 1st 2014, 17:37:50.000' 21.8 NA NA L2 NA NA NA
6 'Nov 1st 2014, 17:37:50.000' NA 87.1 C1 L2 NA NA NA
7 'Nov 1st 2014, 17:37:50.000' NA 67.1 C2 L2 NA NA NA
8 'Nov 1st 2014, 17:37:50.000' NA 47.3 C3 L2 NA NA NA",
header = T,
stringsAsFactors = F)
答案 1 :(得分:0)
如果我理解正确,OP的数据集实际上包含两个混合的数据集:
df
time A B C D C1 C2 C3 1 Nov 1st 2014, 17:36:50.000 20.79 NA NA L1 NA NA NA 2 Nov 1st 2014, 17:36:50.000 NA 97.017 C1 L1 NA NA NA 3 Nov 1st 2014, 17:36:50.000 NA 94.321 C2 L1 NA NA NA 4 Nov 1st 2014, 17:36:50.000 NA 85.014 C3 L1 NA NA NA 5 Nov 1st 2014, 17:37:50.000 21.8 NA NA L2 NA NA NA 6 Nov 1st 2014, 17:37:50.000 NA 87.1 C1 L2 NA NA NA 7 Nov 1st 2014, 17:37:50.000 NA 67.1 C2 L2 NA NA NA
需要分开的地方:
library(data.table)
df1 <- setDT(df)[A != "NA", .(time, A, D)]
df1
time A D 1: Nov 1st 2014, 17:36:50.000 20.79 L1 2: Nov 1st 2014, 17:37:50.000 21.8 L2
和
df2 <- df[A == "NA", .(time, B, C, D)]
df2
time B C D 1: Nov 1st 2014, 17:36:50.000 97.017 C1 L1 2: Nov 1st 2014, 17:36:50.000 94.321 C2 L1 3: Nov 1st 2014, 17:36:50.000 85.014 C3 L1 4: Nov 1st 2014, 17:37:50.000 87.1 C1 L2 5: Nov 1st 2014, 17:37:50.000 67.1 C2 L2
标识行的唯一子集的关键列是time
和D
。删除列C1
,C2
和C3
,因为它们将在下一步中创建。
第二个数据集将从长格式更改为宽格式:
wide <- dcast(df2, time + D ~ C, value.var = "B")
wide
time D C1 C2 C3 1: Nov 1st 2014, 17:36:50.000 L1 97.017 94.321 85.014 2: Nov 1st 2014, 17:37:50.000 L2 87.1 67.1 <NA>
现在可以将两个部分结果结合在一起:
df1[wide, on = .(time, D)]
time A D C1 C2 C3 1: Nov 1st 2014, 17:36:50.000 20.79 L1 97.017 94.321 85.014 2: Nov 1st 2014, 17:37:50.000 21.8 L2 87.1 67.1 <NA>
请注意,由于B
和C
列未传达任何信息,因此已将它们从结果中删除。
上述步骤可以合并为更少的语句:
library(data.table)
setDT(df)[, (paste0("C", 1:3)) := NULL]
df[A != "NA"][dcast(df[C != "NA"], time + D ~ C, value.var = "B"), on = .(time, D)]
time A B C D C1 C2 C3 1: Nov 1st 2014, 17:36:50.000 20.79 NA NA L1 97.017 94.321 85.014 2: Nov 1st 2014, 17:37:50.000 21.8 NA NA L2 87.1 67.1 <NA>
由OP提供,NA值以字符串形式给出
time <- c('Nov 1st 2014, 17:36:50.000','Nov 1st 2014, 17:36:50.000',
'Nov 1st 2014, 17:36:50.000','Nov 1st 2014, 17:36:50.000', 'Nov 1st 2014, 17:37:50.000','Nov 1st 2014, 17:37:50.000','Nov 1st 2014, 17:37:50.000')
A <- c('20.79','NA','NA','NA','21.8','NA','NA')
B <- c('NA','97.017','94.321','85.014','NA','87.1','67.1')
C <- c('NA','C1','C2','C3','NA','C1','C2')
D <- c('L1','L1','L1','L1','L2','L2','L2')
C1 <- c('NA','NA','NA','NA','NA','NA','NA')
C2 <- c('NA','NA','NA','NA','NA','NA','NA')
C3 <- c('NA','NA','NA','NA','NA','NA','NA')
df <- data.frame(time,A,B,C,D,C1,C2,C3)