I'm having trouble converting rows to columns while keeping only the latest record (according to a timestamp). The code below generates my dataset:
df <- data.frame(id = c("123||wa", "123||wa", "123||wa", "223||sa", "223||sa", "223||sa", "123||wa"),
questions = c("dish", "car", "house", "dish", "house", "car", "dish"),
answers = c("pasta", "bmw", "yes", "pizza", "yes", "audi","ravioli" ),
timestamp = c("03JUL2014:15:38:11", "07JAN2015:15:22:54", "24MAR2018:12:24:16", "24MAR2018:12:24:16",
"04AUG2014:12:40:30", "03JUL2014:15:38:11", "05FEB2018:17:23:16"))
The desired output, and the code that generates it:
output <- data.frame(id = c("123||wa", "223||sa"), dish = c("ravioli", "pizza"),
car = c("bmw", "audi"), house = c("yes", "yes"))
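Note: as you can see in the original dataset, each id has multiple rows. More importantly, id '123||wa' has two rows for the dish question, and only the latest answer should appear in the final output.
Any help would be greatly appreciated. Thanks.
Answer 0: (score: 2)
You can use the tidyr and dplyr libraries: first summarise by taking the last answer in each group, then reshape the data.frame: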
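library(dplyr)
library(tidyr)

# Note: timestamp is still text here, so arrange() sorts it alphabetically,
# not chronologically. That happens to give the right result for this sample.
output <- df %>%
  arrange(id, timestamp) %>%
  group_by(id, questions) %>%
  summarise(last = last(answers)) %>%
  spread(questions, last)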
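To illustrate the caveat about sorting text timestamps, here is a small sketch using three of the timestamps from the question. It compares plain text ordering with the ordering obtained after parsing. The use of as.POSIXct with tz = "UTC" is an assumption made for the illustration, and %b assumes a locale with English month abbreviations.

```r
# Three timestamps taken from the question's dataset
stamps <- c("07JAN2015:15:22:54", "03JUL2014:15:38:11", "05FEB2018:17:23:16")

# Text order: compares character by character, so the day of month dominates
text_order <- sort(stamps)

# Chronological order: parse first, then order by the parsed date-time
# (assumption: English month abbreviations are recognised by %b in this locale)
parsed <- as.POSIXct(stamps, format = "%d%b%Y:%H:%M:%S", tz = "UTC")
chrono_order <- stamps[order(parsed)]

print(text_order)
print(chrono_order)
```

In text order the 2015 timestamp sorts after the 2018 one, because "07" comes after "05". That is why parsing the column before sorting is the safer choice in general.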
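Answer 1: (score: 2)
It is probably best to first convert the date-time column (timestamp) to a proper date-time type, here using ymd_hms from lubridate together with base R's strptime, because the extracted value should correspond to the latest record by date and time. After that, a few functions from dplyr and tidyr come in handy: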
library(lubridate)
library(dplyr)
library(tidyr)  # provides spread()
df %>%
mutate(timestamp = ymd_hms(strptime(timestamp, "%d%b%Y:%H:%M:%S"))) %>%
group_by(id, questions) %>%
arrange(timestamp) %>%
summarise(last = last(answers)) %>%
spread(questions, last)
#output
# A tibble: 2 x 4
# Groups: id [2]
id car dish house
* <fct> <fct> <fct> <fct>
1 123||wa bmw ravioli yes
2 223||sa audi pizza yes
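As a side note, spread() has been superseded in tidyr (1.0.0 and later) by pivot_wider(). The following is a minimal sketch of the same idea with pivot_wider(), not the answerer's own code. It assumes a recent tidyr, parses the timestamp with base R's as.POSIXct instead of lubridate, and assumes a locale with English month abbreviations for %b.

```r
library(dplyr)
library(tidyr)

# Dataset from the question
df <- data.frame(
  id = c("123||wa", "123||wa", "123||wa", "223||sa", "223||sa", "223||sa", "123||wa"),
  questions = c("dish", "car", "house", "dish", "house", "car", "dish"),
  answers = c("pasta", "bmw", "yes", "pizza", "yes", "audi", "ravioli"),
  timestamp = c("03JUL2014:15:38:11", "07JAN2015:15:22:54", "24MAR2018:12:24:16",
                "24MAR2018:12:24:16", "04AUG2014:12:40:30", "03JUL2014:15:38:11",
                "05FEB2018:17:23:16"),
  stringsAsFactors = FALSE
)

output <- df %>%
  # Parse the text timestamps so that sorting is chronological
  mutate(timestamp = as.POSIXct(timestamp, format = "%d%b%Y:%H:%M:%S", tz = "UTC")) %>%
  arrange(timestamp) %>%
  group_by(id, questions) %>%
  # Keep the most recent answer per id and question
  summarise(last = last(answers)) %>%
  ungroup() %>%
  # One column per question, filled with the latest answer
  pivot_wider(names_from = questions, values_from = last)

print(output)
```

The result has one row per id and one column per question, like the output shown above.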
The ymd_hms(strptime(... part can be replaced with:
mutate(timestamp = parse_date_time(timestamp, orders = "%d%b%Y:%H:%M:%S"))
See ?strptime for how to construct the date-time format string.
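As a small illustration of how the format string maps onto one of the timestamps from the question, here is a sketch using base R only. The tz = "UTC" argument is an assumption added for reproducibility, and %b assumes a locale with English month abbreviations.

```r
# Format codes used for "03JUL2014:15:38:11":
#   %d  day of month (03)
#   %b  abbreviated month name (JUL)
#   %Y  four-digit year (2014)
#   %H  hour, %M minute, %S second, separated by literal colons
parsed <- strptime("03JUL2014:15:38:11", format = "%d%b%Y:%H:%M:%S", tz = "UTC")

# Print in ISO-like form to check that each field was read as intended
iso <- format(parsed, "%Y-%m-%d %H:%M:%S")
print(iso)
```

If the month abbreviation is not recognised in the current locale, strptime returns NA, which is a quick sign that the format string or the locale needs adjusting.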