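I want to create a binary/indicator variable based on lagged observations. I have a variable X1. The raw data looks like the sample below; the real data has close to 10K records.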
X1
Diagnosis
1
2
3
4
Treatment
1
2
3
I want the output to look like this:
X1 NewVar
Diagnosis Diagnosis
1 Diagnosis
2 Diagnosis
3 Diagnosis
4 Diagnosis
Treatment Treatment
1 Treatment
2 Treatment
3 Treatment
Any help would be highly appreciated!
Answer 0 (score: 1)
You can use cumsum to achieve this. cumsum starts a new group every time Diagnosis or Treatment occurs. Then NewVar in each group takes the value of the first X1 in that group:
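library(dplyr)
dtf %>%
  # g increases by 1 at every 'Diagnosis' or 'Treatment' row
  mutate(g = cumsum(X1 == 'Diagnosis' | X1 == 'Treatment')) %>%
  group_by(g) %>%
  # the first X1 in each group is that group's header label
  mutate(NewVar = X1[1]) %>%
  # drop the helper grouping column
  ungroup() %>% select(-g)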
# # A tibble: 9 x 2
# X1 NewVar
# <fctr> <fctr>
# 1 Diagnosis Diagnosis
# 2 1 Diagnosis
# 3 2 Diagnosis
# 4 3 Diagnosis
# 5 4 Diagnosis
# 6 Treatment Treatment
# 7 1 Treatment
# 8 2 Treatment
# 9 3 Treatment
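To make the grouping step easier to see, here is a minimal base R sketch of the same idea. It assumes X1 is a plain character vector (in the question's data it is a factor), and the names x, is_header, g and new_var are illustrative only, not part of the original answer.

```r
# Toy vector mirroring the question's X1 column (assumed character, not factor)
x <- c("Diagnosis", "1", "2", "3", "4", "Treatment", "1", "2", "3")

# TRUE where a new block starts, FALSE elsewhere
is_header <- x %in% c("Diagnosis", "Treatment")

# Running count of headers seen so far gives one integer id per block
g <- cumsum(is_header)
# g is 1 1 1 1 1 2 2 2 2

# Within each block, repeat the first element (the header) for every row
new_var <- ave(x, g, FUN = function(v) v[1])

data.frame(X1 = x, g = g, NewVar = new_var)
```

Because the group id only changes on header rows, this relies on the first row of the data being a header; rows before the first header would end up in a group 0 of their own.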
The dtf used in the code above:
> dput(dtf)
structure(list(X1 = structure(c(5L, 1L, 2L, 3L, 4L, 6L, 1L, 2L,
3L), .Label = c("1", "2", "3", "4", "Diagnosis", "Treatment"), class = "factor")), .Names = "X1", class = "data.frame", row.names = c(NA,
-9L))
Answer 1 (score: 0)
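Here is an option with data.table. Convert the 'data.frame' to a 'data.table' (setDT(dtf)), group by the cumulative sum of a logical vector that marks which 'X1' values consist only of letters, and assign 'NewVar' as the first element of 'X1' (X1[1]) within each group.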
library(data.table)
setDT(dtf)[, NewVar := X1[1], cumsum(grepl('^[A-Za-z]+$', X1))]
dtf
# X1 NewVar
#1: Diagnosis Diagnosis
#2: 1 Diagnosis
#3: 2 Diagnosis
#4: 3 Diagnosis
#5: 4 Diagnosis
#6: Treatment Treatment
#7: 1 Treatment
#8: 2 Treatment
#9: 3 Treatment
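The pattern-based header detection can be checked on its own before grouping. The sketch below is only an illustration on a shortened, made-up character vector; x and is_header are hypothetical names, and it assumes header labels contain only ASCII letters while every other row is numeric.

```r
# Shortened toy input (assumed character vector)
x <- c("Diagnosis", "1", "2", "Treatment", "1")

# TRUE only for values made up entirely of letters A-Z / a-z
is_header <- grepl("^[A-Za-z]+$", x)
# is_header is TRUE FALSE FALSE TRUE FALSE

# Cumulative sum turns the header flags into block ids
g <- cumsum(is_header)
# g is 1 1 1 2 2
```

Compared with naming 'Diagnosis' and 'Treatment' explicitly, the regex picks up any alphabetic label, but a header containing spaces or digits would not match and would be folded into the preceding group.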