Creating an indicator variable based on lagged observations in R

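Time: 2017-10-07 08:57:54

Tags: r dplyr

I want to create a binary/indicator variable based on lagged observations. I have a variable X1. The raw data looks like the sample below; the actual data has close to 10K records.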

X1
Diagnosis
1
2
3
4
Treatment
1
2
3

I would like the output to look like this:

X1           NewVar
Diagnosis    Diagnosis
1            Diagnosis
2            Diagnosis
3            Diagnosis 
4            Diagnosis 
Treatment    Treatment 
1            Treatment  
2            Treatment
3            Treatment  

Any help would be highly appreciated!

2 Answers:

Answer 0: (score: 1)

You can use cumsum for this. cumsum starts a new group each time Diagnosis or Treatment appears. Then NewVar in each group takes the value of the first X1 in that group:

library(dplyr)

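dtf %>%
    # g goes up by 1 at every 'Diagnosis' or 'Treatment' row
    mutate(g = cumsum(X1 == 'Diagnosis' | X1 == 'Treatment')) %>%
    group_by(g) %>%
    # the first X1 in each group is the header label
    mutate(NewVar = X1[1]) %>%
    ungroup() %>%
    select(-g)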
# # A tibble: 9 x 2
#          X1    NewVar
# <fctr>    <fctr>
# 1 Diagnosis Diagnosis
# 2         1 Diagnosis
# 3         2 Diagnosis
# 4         3 Diagnosis
# 5         4 Diagnosis
# 6 Treatment Treatment
# 7         1 Treatment
# 8         2 Treatment
# 9         3 Treatment

The dtf used in the code above:

> dput(dtf)
structure(list(X1 = structure(c(5L, 1L, 2L, 3L, 4L, 6L, 1L, 2L, 
3L), .Label = c("1", "2", "3", "4", "Diagnosis", "Treatment"), class = "factor")), .Names = "X1", class = "data.frame", row.names = c(NA, 
-9L))
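
To see what the cumsum grouping is doing, here is a minimal base R sketch of the same idea with the intermediate vectors spelled out. It assumes X1 is stored as character rather than factor and that the first row is always a header row; the names is_header and g are only for illustration.

```r
# Sample data from the question, stored as character (an assumption;
# the dput above stores X1 as a factor)
dtf <- data.frame(
  X1 = c("Diagnosis", "1", "2", "3", "4", "Treatment", "1", "2", "3"),
  stringsAsFactors = FALSE
)

# TRUE on the header rows, FALSE on the numbered rows
is_header <- dtf$X1 %in% c("Diagnosis", "Treatment")

# Running count of headers seen so far; every row under the same
# header gets the same id: 1 1 1 1 1 2 2 2 2
g <- cumsum(is_header)

# Pick the header labels and repeat each one across its group.
# This relies on the first row being a header, so g never contains 0.
dtf$NewVar <- dtf$X1[is_header][g]

print(dtf)
```

If the data could start with rows that come before any header, g would contain 0 for those rows and the indexing step would need extra handling.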

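Answer 1: (score: 0)

Here is an option with data.table. Convert to a 'data.table' (setDT(dtf)), group by the cumulative sum of a logical vector that flags the 'X1' values made up only of letters, and assign 'NewVar' as the first element of 'X1' (X1[1]) in each group: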

library(data.table)
setDT(dtf)[,  NewVar := X1[1], cumsum(grepl('^[A-Za-z]+$', X1))]
dtf
#          X1    NewVar
#1: Diagnosis Diagnosis
#2:         1 Diagnosis
#3:         2 Diagnosis
#4:         3 Diagnosis
#5:         4 Diagnosis
#6: Treatment Treatment
#7:         1 Treatment
#8:         2 Treatment
#9:         3 Treatment
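
The difference from the first answer is that the header rows are detected by a pattern instead of being listed by name. A rough base R sketch of that step, assuming every header label consists of letters only (the vector x and the names is_header, g and new_var are made up for illustration):

```r
# The X1 values from the question as a plain character vector
x <- c("Diagnosis", "1", "2", "3", "4", "Treatment", "1", "2", "3")

# TRUE where the whole value consists of letters only
is_header <- grepl("^[A-Za-z]+$", x)

# Group id that increases at each header row
g <- cumsum(is_header)

# Within each group, repeat the first value (the header label)
new_var <- ave(x, g, FUN = function(v) v[1])

print(data.frame(X1 = x, NewVar = new_var))

# A label with a space or a digit would not be picked up by this pattern
grepl("^[A-Za-z]+$", "Follow up")
```

With real data it is worth checking table(x[is_header]) first to confirm that the pattern catches exactly the labels you expect.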