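I'm new to R. I'm trying to tidy a dataset, part of which is shown below (the actual dataset has 10000 rows and columns). I'm trying to tidy it by gathering the columns in pairs.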
id start event1 event2 event2 date1 date2 date2
1 06/07/2011 A B C 06/07/2011 06/07/2011 06/07/2011
1 06/07/2011 NA NA NA
1 06/07/2011 - NA NA NA
2 15/07/2011 D E A 18/07/2011 18/07/2011 16/07/2011
3 15/07/2011 D C H 19/07/2011 19/07/2011 14/08/2011
4 22/08/2011 G 04/09/2011 NA NA
4 22/08/2011 - NA NA NA
What I want to achieve is:
start event_date event
06/07/2011 06/07/2011 A
06/07/2011 06/07/2011 B
06/07/2011 06/07/2011 C
15/07/2011 18/07/2011 D
And so on: convert to long format, keep the temporal link between each date and its event, and drop all the "non-events".
Answer 0: (score: 0)
I don't fully understand your expected result. Assuming it is incomplete, is this what you're after?
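library(tidyverse)
df %>%
    # keep only rows that carry at least a first event
    filter(event1 != "-" & event1 != "") %>%
    group_by(start) %>%
    # glue each date to its event, e.g. "06/07/2011_A"
    unite(tmp1, date1, event1) %>%
    unite(tmp2, date2, event2) %>%
    unite(tmp3, date3, event3) %>%
    # stack the three glued columns (positions 3 to 5) into one;
    # the key is named `key` so it does not clash with the existing id column
    gather(key, tmp, 3:5) %>%
    separate(tmp, c("event_date", "event"), sep = "_") %>%
    select(start, event_date, event) %>%
    # unite() turns missing values into the string "NA"
    filter(event_date != "NA") %>%
    ungroup() %>%
    arrange(start, event_date)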
## A tibble: 10 x 3
# start event_date event
# <chr> <chr> <chr>
# 1 06/07/2011 06/07/2011 A
# 2 06/07/2011 06/07/2011 B
# 3 06/07/2011 06/07/2011 C
# 4 15/07/2011 14/08/2011 H
# 5 15/07/2011 16/07/2011 A
# 6 15/07/2011 18/07/2011 D
# 7 15/07/2011 18/07/2011 E
# 8 15/07/2011 19/07/2011 D
# 9 15/07/2011 19/07/2011 C
#10 22/08/2011 04/09/2011 G
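Explanation: Remove the rows where event1 is empty or "-". Group by start and unite the columns date1 and event1, date2 and event2, and so on. Convert to a long table and separate the united entries into event_date and event. Finally, clean up to match the expected result.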
# Sample data
library(tidyverse)
df <- read_table(
"id start event1 event2 event3 date1 date2 date3
1 06/07/2011 A B C 06/07/2011 06/07/2011 06/07/2011
1 06/07/2011 NA NA NA
1 06/07/2011 - NA NA NA
2 15/07/2011 D E A 18/07/2011 18/07/2011 16/07/2011
3 15/07/2011 D C H 19/07/2011 19/07/2011 14/08/2011
4 22/08/2011 G 04/09/2011 NA NA
4 22/08/2011 - NA NA NA")
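The unite/separate round trip described in the explanation can be sketched on a single date/event pair. This is only an illustration: the data frame pairs and the names united and parts are made up for the example, and it assumes unite() is called with its default settings (sep = "_", na.rm = FALSE), as in the answer. It shows why the final filter compares against the string "NA" rather than using is.na().

```r
library(tidyr)

# Hypothetical two-row example: one real event and one missing pair
pairs <- data.frame(
  date1  = c("06/07/2011", NA),
  event1 = c("A", NA),
  stringsAsFactors = FALSE
)

# Glue date and event together; missing values become the text "NA"
united <- unite(pairs, tmp1, date1, event1)
united$tmp1

# Split the glued column again; the pieces stay character strings
parts <- separate(united, tmp1, c("event_date", "event"), sep = "_")
parts

# Dropping the non-events therefore needs a string comparison
parts[parts$event_date != "NA", ]
```

One caveat with this approach: it relies on neither the dates nor the event labels containing the separator character themselves.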
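Answer 1: (score: 0)
We can use two gather calls, one for the event columns and one for the date columns. We then strip the letters from both key columns so that only the numeric suffix remains, keep the rows where the two suffixes match, filter out the rows that are NA or empty strings, arrange by id, and finally select the right columns.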
library(dplyr)
library(tidyr)
dat2 <- dat %>%
gather(label, event, starts_with("event")) %>%
gather(date, event_date, starts_with("date")) %>%
# strip the letters so only the numeric suffix is left (funs() is deprecated in recent dplyr)
mutate_at(vars(label, date), ~ sub("[A-Za-z]*", "", .)) %>%
filter(label == date, !is.na(event_date), !event %in% "") %>%
arrange(id) %>%
select(start, event_date, event)
dat2
# start event_date event
# 1 06/07/2011 06/07/2011 A
# 2 06/07/2011 06/07/2011 B
# 3 06/07/2011 06/07/2011 C
# 4 15/07/2011 18/07/2011 D
# 5 15/07/2011 18/07/2011 E
# 6 15/07/2011 16/07/2011 A
# 7 15/07/2011 19/07/2011 D
# 8 15/07/2011 19/07/2011 C
# 9 15/07/2011 14/08/2011 H
# 10 22/08/2011 04/09/2011 G
Data
dat <- read.table(text = "id start event1 event2 event2 date1 date2 date2
1 '06/07/2011' A B C '06/07/2011' '06/07/2011' '06/07/2011'
1 '06/07/2011' '' '' '' NA NA NA
1 '06/07/2011' - '' '' NA NA NA
2 '15/07/2011' D E A '18/07/2011' '18/07/2011' '16/07/2011'
3 '15/07/2011' D C H '19/07/2011' '19/07/2011' '14/08/2011'
4 '22/08/2011' G '' '' '04/09/2011' NA NA
4 '22/08/2011' - '' '' NA NA NA",
header = TRUE, stringsAsFactors = FALSE)
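To see why the label == date filter is needed, it may help to look at what the two gather calls do to a single row. The sketch below uses a made-up one-row data frame (one_row, crossed and matched are hypothetical names, not part of the answer): the second gather pairs every event with every date, and only the pairs with the same numeric suffix belong together.

```r
library(dplyr)
library(tidyr)

# One hypothetical row with two event/date pairs
one_row <- data.frame(
  id = 1, start = "15/07/2011",
  event1 = "D", event2 = "E",
  date1 = "18/07/2011", date2 = "19/07/2011",
  stringsAsFactors = FALSE
)

# After both gathers, every event is combined with every date
crossed <- one_row %>%
  gather(label, event, starts_with("event")) %>%
  gather(date, event_date, starts_with("date"))
nrow(crossed)  # 2 events x 2 dates = 4 candidate pairs

# Keep only the combinations whose suffixes agree (event1 with date1, ...)
matched <- crossed %>%
  mutate(label = sub("[A-Za-z]*", "", label),
         date  = sub("[A-Za-z]*", "", date)) %>%
  filter(label == date)
matched
```

The intermediate table grows with the number of event columns times the number of date columns, which is worth keeping in mind for a wide data set.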
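Answer 2: (score: 0)
A common pattern when working with data of (roughly) this form is to 1) gather all of the repeated columns, 2) separate the variable name from the time indicator, and 3) spread the data to put the variables back into columns.
Your example has some non-standard aspects, so I first assume that you actually have three repeated measurements, and I create a unique id variable.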
dat <- read.table(text = "id start event1 event2 event3 date1 date2 date3
1 '06/07/2011' A B C '06/07/2011' '06/07/2011' '06/07/2011'
1 '06/07/2011' '' '' '' NA NA NA
1 '06/07/2011' - '' '' NA NA NA
2 '15/07/2011' D E A '18/07/2011' '18/07/2011' '16/07/2011'
3 '15/07/2011' D C H '19/07/2011' '19/07/2011' '14/08/2011'
4 '22/08/2011' G '' '' '04/09/2011' NA NA
4 '22/08/2011' - '' '' NA NA NA",
header = TRUE, stringsAsFactors = FALSE, na.strings = c("", "NA"))
# id is not unique per row, so add a row identifier for spread() later
dat$rowid <- seq_len(nrow(dat))
names(dat) <- gsub("([a-z])([0-9])", "\\1_\\2", names(dat))
dat
## id start event_1 event_2 event_3 date_1 date_2 date_3 rowid
## 1 1 06/07/2011 A B C 06/07/2011 06/07/2011 06/07/2011 1
## 2 1 06/07/2011 <NA> <NA> <NA> <NA> <NA> <NA> 2
## 3 1 06/07/2011 - <NA> <NA> <NA> <NA> <NA> 3
## 4 2 15/07/2011 D E A 18/07/2011 18/07/2011 16/07/2011 4
## 5 3 15/07/2011 D C H 19/07/2011 19/07/2011 14/08/2011 5
## 6 4 22/08/2011 G <NA> <NA> 04/09/2011 <NA> <NA> 6
## 7 4 22/08/2011 - <NA> <NA> <NA> <NA> <NA> 7
From here, the process can proceed as usual, following the steps listed above:
library(tidyr)
dat <- gather(dat,
key = "var",
value = "value",
-id, -rowid, -start)
dat <- separate(dat, var, into = c("var", "which"), sep = "_")
dat <- spread(dat, key = var, value = value)
One last bit of cleanup and we're done:
dat <- na.omit(dat)[ , setdiff(names(dat), c("rowid", "which"))]
dat
## id start date event
## 1 1 06/07/2011 06/07/2011 A
## 2 1 06/07/2011 06/07/2011 B
## 3 1 06/07/2011 06/07/2011 C
## 10 2 15/07/2011 18/07/2011 D
## 11 2 15/07/2011 18/07/2011 E
## 12 2 15/07/2011 16/07/2011 A
## 13 3 15/07/2011 19/07/2011 D
## 14 3 15/07/2011 19/07/2011 C
## 15 3 15/07/2011 14/08/2011 H
## 16 4 22/08/2011 04/09/2011 G
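As a side note, the gather/separate/spread sequence can probably be collapsed into one call in newer versions of tidyr. The sketch below is not part of the original answer: it assumes tidyr 1.0.0 or later (where pivot_longer() is available), uses a small made-up data frame called wide, and assumes the columns are named event1, event2, date1, date2 and so on. The special name ".value" in names_to tells pivot_longer() to keep the first part of each column name (event or date) as an output column.

```r
library(tidyr)

# Small hypothetical data set with the same shape as the example
wide <- data.frame(
  id     = c(1, 1, 2),
  start  = c("06/07/2011", "06/07/2011", "15/07/2011"),
  event1 = c("A", "-", "D"),
  event2 = c("B", NA, "E"),
  date1  = c("06/07/2011", NA, "18/07/2011"),
  date2  = c("06/07/2011", NA, "18/07/2011"),
  stringsAsFactors = FALSE
)

# Split each column name into the variable (event/date) and its index
long <- pivot_longer(
  wide,
  cols = -c(id, start),
  names_to = c(".value", "which"),
  names_pattern = "([a-z]+)([0-9]+)"
)

# Drop the non-events (rows without a date) and the helper column
long <- long[!is.na(long$date), c("id", "start", "date", "event")]
long
```

Because pivot_longer() works row by row, this version should not need the extra rowid column that spread() requires when id is not unique.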