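Gathering multiple columns in pairs

Time: 2018-02-16 00:24:27

Tags: r tidyr

I'm new to R. I'm trying to tidy a dataset, part of which is shown below (the actual dataset has 10000 rows and columns). I want to tidy it by gathering the columns in pairs.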

id     start      event1    event2  event2  date1       date2       date2
 1     06/07/2011   A       B       C       06/07/2011  06/07/2011  06/07/2011           
 1     06/07/2011                           NA          NA          NA
 1     06/07/2011   -                       NA          NA          NA
 2     15/07/2011   D       E       A       18/07/2011  18/07/2011  16/07/2011
 3     15/07/2011   D       C       H       19/07/2011  19/07/2011  14/08/2011
 4     22/08/2011   G                       04/09/2011  NA          NA
 4     22/08/2011   -                       NA          NA          NA

What I want to achieve is:

start        event_date   event   
06/07/2011   06/07/2011   A
06/07/2011   06/07/2011   B
06/07/2011   06/07/2011   C
15/07/2011   18/07/2011   D

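And so on: converting to long format, keeping the temporal link between each date and its event, and dropping all "non-events".

3 Answers:

Answer 0: (score: 0)

I don't fully understand your expected result. Assuming it is incomplete, is this what you're after?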

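library(tidyverse)

df %>%
    # Drop rows whose first event is empty or the placeholder "-"
    filter(event1 != "-" & event1 != "") %>%
    group_by(start) %>%
    # Glue each date to its event, e.g. "06/07/2011_A"
    unite(tmp1, date1, event1) %>%
    unite(tmp2, date2, event2) %>%
    unite(tmp3, date3, event3) %>%
    # Stack the glued columns; the key is named "key" so it cannot clash with the id column
    gather(key, tmp, tmp1:tmp3) %>%
    separate(tmp, c("event_date", "event"), sep = "_") %>%
    select(start, event_date, event) %>%
    # Missing pairs were glued as text, so they show up as the string "NA"
    filter(event_date != "NA") %>%
    ungroup() %>%
    arrange(start, event_date)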
    ## A tibble: 10 x 3
    #   start      event_date event
    #   <chr>      <chr>      <chr>
    # 1 06/07/2011 06/07/2011 A
    # 2 06/07/2011 06/07/2011 B
    # 3 06/07/2011 06/07/2011 C
    # 4 15/07/2011 14/08/2011 H
    # 5 15/07/2011 16/07/2011 A
    # 6 15/07/2011 18/07/2011 D
    # 7 15/07/2011 18/07/2011 E
    # 8 15/07/2011 19/07/2011 D
    # 9 15/07/2011 19/07/2011 C
    #10 22/08/2011 04/09/2011 G

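Explanation: Remove the rows where event1 is empty or "-". Group by start and unite the columns date1 and event1, date2 and event2, and so on. Convert to a long table and separate the united entries into event_date and event. Clean up to match the expected result.

Sample data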
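The final filter compares against the string "NA" instead of calling is.na(). The following is a minimal sketch of why, using a made-up two-row data frame (the name `toy` and its values are illustrative and not part of the question). It assumes the default behaviour of unite(), which pastes values together as text.

```r
library(tidyr)

# Hypothetical two-row example; not the asker's data
toy <- data.frame(date1  = c("06/07/2011", NA),
                  event1 = c("A", NA),
                  stringsAsFactors = FALSE)

# unite() pastes the columns as text, so a missing pair becomes "NA_NA"
glued <- unite(toy, tmp1, date1, event1)
glued$tmp1

# separate() splits on "_" again; the missing date comes back as the text "NA"
split_again <- separate(glued, tmp1, c("event_date", "event"), sep = "_")
split_again$event_date == "NA"
```

Because the missing values have become text by this point, is.na(event_date) would not catch them in this pipeline.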

library(tidyverse)
df <- read_table(
    "id     start      event1    event2  event3  date1       date2       date3
 1     06/07/2011   A       B       C       06/07/2011  06/07/2011  06/07/2011
 1     06/07/2011                           NA          NA          NA
 1     06/07/2011   -                       NA          NA          NA
 2     15/07/2011   D       E       A       18/07/2011  18/07/2011  16/07/2011
 3     15/07/2011   D       C       H       19/07/2011  19/07/2011  14/08/2011
 4     22/08/2011   G                       04/09/2011  NA          NA
 4     22/08/2011   -                       NA          NA          NA")

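Answer 1: (score: 0)

We can use two gather calls, strip the column-name prefixes, filter out the rows whose date is NA or whose event is an empty string, sort the rows by id, and finally select the columns we need.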

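library(dplyr)
library(tidyr)

dat2 <- dat %>%
  # Stack the event columns, then the date columns (every event/date combination)
  gather(label, event, starts_with("event")) %>%
  gather(date, event_date, starts_with("date")) %>%
  # Strip the letters so only the suffix remains, e.g. "event1" -> "1"
  mutate_at(vars(label, date), function(x) sub("[A-Za-z]*", "", x)) %>%
  # Keep matching event/date pairs that have a date and a non-empty event
  filter(label == date, !is.na(event_date), !event %in% "") %>%
  arrange(id) %>%
  select(start, event_date, event)
dat2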
#         start event_date event
# 1  06/07/2011 06/07/2011     A
# 2  06/07/2011 06/07/2011     B
# 3  06/07/2011 06/07/2011     C
# 4  15/07/2011 18/07/2011     D
# 5  15/07/2011 18/07/2011     E
# 6  15/07/2011 16/07/2011     A
# 7  15/07/2011 19/07/2011     D
# 8  15/07/2011 19/07/2011     C
# 9  15/07/2011 14/08/2011     H
# 10 22/08/2011 04/09/2011     G
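The reason the label == date filter is needed can be seen on a small case. The sketch below uses a made-up one-row data frame with two event/date pairs (`toy`, `crossed` and `paired` are illustrative names, not from the answer): two successive gather calls produce every event column crossed with every date column, and the filter keeps only the combinations whose suffixes agree.

```r
library(dplyr)
library(tidyr)

# Hypothetical one-row example with two event/date pairs
toy <- data.frame(id = 1,
                  event1 = "A", event2 = "B",
                  date1 = "01/01/2011", date2 = "02/01/2011",
                  stringsAsFactors = FALSE)

# Two gathers give every event column crossed with every date column
crossed <- toy %>%
  gather(label, event, starts_with("event")) %>%
  gather(date, event_date, starts_with("date"))
nrow(crossed)

# Reduce "event1"/"date1" to "1" and keep rows where the suffixes agree
paired <- crossed %>%
  mutate(label = sub("[A-Za-z]*", "", label),
         date  = sub("[A-Za-z]*", "", date)) %>%
  filter(label == date)
paired
```

With k event/date pairs the intermediate table holds k times k rows per input row, which may be worth keeping in mind for a dataset with many paired columns.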

Data

dat <- read.table(text = "id     start      event1    event2  event2  date1       date2       date2
 1     '06/07/2011'   A       B       C       '06/07/2011'  '06/07/2011'  '06/07/2011'           
                  1     '06/07/2011'   ''      ''     ''       NA          NA          NA
                  1     '06/07/2011'   -       ''     ''          NA          NA          NA
                  2     '15/07/2011'   D       E       A       '18/07/2011'  '18/07/2011'  '16/07/2011'
                  3     '15/07/2011'   D       C       H       '19/07/2011'  '19/07/2011'  '14/08/2011'
                  4     '22/08/2011'   G       ''     ''           '04/09/2011'  NA          NA
                  4     '22/08/2011'   -       ''     ''        NA          NA          NA",
                  header = TRUE, stringsAsFactors = FALSE)

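Answer 2: (score: 0)

A common pattern when working with data of (roughly) this form is to 1) gather all the repeated columns, 2) separate the variable name from the time indicator, and 3) spread the data, putting the variables back into columns.

Your example has some non-standard aspects, so I'll first assume that you actually have three repeated measurements, and create a unique id variable.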

dat <- read.table(text = "id     start      event1    event2  event3  date1       date2       date3
1     '06/07/2011'   A       B       C       '06/07/2011'  '06/07/2011'  '06/07/2011'           
1     '06/07/2011'   ''      ''     ''       NA          NA          NA
1     '06/07/2011'   -       ''     ''          NA          NA          NA
2     '15/07/2011'   D       E       A       '18/07/2011'  '18/07/2011'  '16/07/2011'
3     '15/07/2011'   D       C       H       '19/07/2011'  '19/07/2011'  '14/08/2011'
4     '22/08/2011'   G       ''     ''           '04/09/2011'  NA          NA
4     '22/08/2011'   -       ''     ''        NA          NA          NA",
 header = TRUE, stringsAsFactors = FALSE, na.strings = c("", "NA"))

dat$rowid <- 1:nrow(dat)
names(dat) <- gsub("([a-z])([0-9])", "\\1_\\2", names(dat))
dat
##   id      start event_1 event_2 event_3     date_1     date_2     date_3 rowid
## 1  1 06/07/2011       A       B       C 06/07/2011 06/07/2011 06/07/2011     1
## 2  1 06/07/2011    <NA>    <NA>    <NA>       <NA>       <NA>       <NA>     2
## 3  1 06/07/2011       -    <NA>    <NA>       <NA>       <NA>       <NA>     3
## 4  2 15/07/2011       D       E       A 18/07/2011 18/07/2011 16/07/2011     4
## 5  3 15/07/2011       D       C       H 19/07/2011 19/07/2011 14/08/2011     5
## 6  4 22/08/2011       G    <NA>    <NA> 04/09/2011       <NA>       <NA>     6
## 7  4 22/08/2011       -    <NA>    <NA>       <NA>       <NA>       <NA>     7
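The renaming step can be checked in isolation. This small sketch applies the same gsub call to a few of the column names (the vector `old_names` is just an illustrative subset) and shows that the inserted underscore gives a clean split point between the variable name and the repetition number.

```r
# A few column names as in the sample data
old_names <- c("id", "start", "event1", "date3")

# Insert "_" between a lowercase letter and the digit that follows it
new_names <- gsub("([a-z])([0-9])", "\\1_\\2", old_names)
new_names

# The underscore is what a later split on "_" relies on
strsplit(new_names[3], "_")[[1]]
```

Names without a letter-digit boundary, such as id and start, pass through unchanged.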

From here, the process can proceed as usual, following the steps listed above:

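library(tidyr)

# 1) Gather every repeated column into key/value form
dat <- gather(dat,
              key = "var",
              value = "value",
              -id, -rowid, -start)
# 2) Split "event_1" into the variable name and the repetition number
dat <- separate(dat, var, into = c("var", "which"), sep = "_")
# 3) Spread date and event back into their own columns
dat <- spread(dat, key = var, value = value)

One last cleanup and we're done: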

dat <- na.omit(dat)[ , setdiff(names(dat), c("rowid", "which"))]
dat
##    id      start       date event
## 1   1 06/07/2011 06/07/2011     A
## 2   1 06/07/2011 06/07/2011     B
## 3   1 06/07/2011 06/07/2011     C
## 10  2 15/07/2011 18/07/2011     D
## 11  2 15/07/2011 18/07/2011     E
## 12  2 15/07/2011 16/07/2011     A
## 13  3 15/07/2011 19/07/2011     D
## 14  3 15/07/2011 19/07/2011     C
## 15  3 15/07/2011 14/08/2011     H
## 16  4 22/08/2011 04/09/2011     G
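The same gather, separate, spread sequence can also be written as a single call in newer versions of tidyr. None of the answers use pivot_longer(); it is swapped in here as an alternative sketch, assuming tidyr 1.0.0 or later. The data frame `wide` is a made-up miniature with the same column layout as the sample, and the ".value" entry in names_to tells pivot_longer() to turn the first captured part of each name (event or date) into an output column.

```r
library(tidyr)

# Hypothetical miniature with the same layout as the sample data
wide <- data.frame(
  id     = c(1, 2),
  start  = c("06/07/2011", "22/08/2011"),
  event1 = c("A", "G"),
  event2 = c("B", NA),
  date1  = c("06/07/2011", "04/09/2011"),
  date2  = c("06/07/2011", NA),
  stringsAsFactors = FALSE
)

# Split each name into variable (".value") and repetition number ("which")
long <- pivot_longer(
  wide,
  cols = matches("^(event|date)[0-9]+$"),
  names_to = c(".value", "which"),
  names_pattern = "([a-z]+)([0-9]+)"
)

# Drop incomplete pairs and keep only the columns of interest
long <- long[!is.na(long$event) & !is.na(long$date), c("start", "date", "event")]
long
```

Placeholder events such as "-" would still need their own filter; in the sample data they happen to have NA dates, so the NA check removes them.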