Turning rows into columns and keeping the latest record, using R

Time: 2018-09-03 12:56:10

Tags: r data-science data-manipulation

I'm having trouble converting rows into columns and then keeping only the latest record (based on the timestamp). The following code generates my dataset:

df <- data.frame(id = c("123||wa", "123||wa", "123||wa", "223||sa", "223||sa", "223||sa", "123||wa"),
               questions = c("dish", "car", "house", "dish", "house", "car", "dish"),
               answers = c("pasta", "bmw", "yes", "pizza", "yes", "audi","ravioli" ), 
               timestamp = c("03JUL2014:15:38:11", "07JAN2015:15:22:54", "24MAR2018:12:24:16", "24MAR2018:12:24:16",
               "04AUG2014:12:40:30", "03JUL2014:15:38:11", "05FEB2018:17:23:16"))

The desired output, and the code that generates it:

output <- data.frame(id = c("123||wa", "223||sa"), dish = c("ravioli", "pizza"), 
                 car = c("bmw", "audi"), house = c("yes", "yes"))

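Note: as you can see in the original dataset, each id appears on multiple rows. More importantly, id '123||wa' has two rows for the dish question, but only the latest answer should appear in the final output.

Any help would be appreciated. Thanks.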

2 answers:

Answer 0 (score: 2)

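You can use the tidyr and dplyr libraries: first summarise by taking the last answer per id and question, then reshape the data.frame. The timestamp has to be parsed before sorting, because sorted as text "07JAN2015" would come after "05FEB2018":

library(dplyr)
library(tidyr)
output <- df %>%
  # parse the text timestamp so that rows sort chronologically
  mutate(timestamp = as.POSIXct(strptime(timestamp, "%d%b%Y:%H:%M:%S", tz = "UTC"))) %>%
  arrange(id, timestamp) %>%
  group_by(id, questions) %>%
  summarise(last = last(answers)) %>%
  spread(questions, last)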
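As a variation on the same idea, here is a minimal sketch that uses slice_max and pivot_wider, which newer dplyr (1.0.0 and later) and tidyr (1.0.0 and later) provide; spread is superseded by pivot_wider in those versions. It assumes the df from the question, an English locale so that %b recognises month names such as "JUL", and UTC as the time zone. The name latest_wide is only illustrative.

```r
library(dplyr)
library(tidyr)

# Dataset from the question, kept as character columns
df <- data.frame(
  id = c("123||wa", "123||wa", "123||wa", "223||sa", "223||sa", "223||sa", "123||wa"),
  questions = c("dish", "car", "house", "dish", "house", "car", "dish"),
  answers = c("pasta", "bmw", "yes", "pizza", "yes", "audi", "ravioli"),
  timestamp = c("03JUL2014:15:38:11", "07JAN2015:15:22:54", "24MAR2018:12:24:16",
                "24MAR2018:12:24:16", "04AUG2014:12:40:30", "03JUL2014:15:38:11",
                "05FEB2018:17:23:16"),
  stringsAsFactors = FALSE
)

latest_wide <- df %>%
  # Parse the text timestamps so comparisons are chronological
  # (assumption: English month abbreviations, UTC time zone)
  mutate(timestamp = as.POSIXct(strptime(timestamp, "%d%b%Y:%H:%M:%S", tz = "UTC"))) %>%
  group_by(id, questions) %>%
  # Keep the single most recent row for each id/question pair
  slice_max(timestamp, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # One column per question; id_cols drops the timestamp column from the result
  pivot_wider(id_cols = id, names_from = questions, values_from = answers)

print(latest_wide)
```

Passing id_cols = id matters here: without it, pivot_wider would treat timestamp as an identifying column too and return more than one row per id.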

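Answer 1 (score: 2)

You will most likely want to convert the date-time column to a proper type first (here with ymd_hms from lubridate together with strptime), because the extracted value should correspond to the latest date-time record. After that, a few dplyr functions come in handy: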

library(lubridate)
library(dplyr)
library(tidyr)  # provides spread()
df %>%
  mutate(timestamp = ymd_hms(strptime(timestamp, "%d%b%Y:%H:%M:%S"))) %>%
  group_by(id, questions) %>%
  arrange(timestamp) %>%
  summarise(last = last(answers)) %>%
  spread(questions, last)

#output
# A tibble: 2 x 4
# Groups: id [2]
  id      car   dish    house
* <fct>   <fct> <fct>   <fct>
1 123||wa bmw   ravioli yes  
2 223||sa audi  pizza   yes  

The ymd_hms(strptime(... part can be replaced with:

mutate(timestamp = parse_date_time(timestamp,  orders = "%d%b%Y:%H:%M:%S"))

See

?strptime

for how to construct the date-time format.
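To make the format string concrete, here is a small base R sketch that parses two of the question's timestamps and compares text order with time order. It assumes an English locale (so that %b recognises "FEB" and "JAN") and uses UTC as an arbitrary time zone; the variable names are illustrative only.

```r
# Two timestamps from the question where text order and time order disagree
raw <- c("05FEB2018:17:23:16", "07JAN2015:15:22:54")

# %d = day of month, %b = abbreviated month name, %Y = four-digit year,
# %H:%M:%S = hours, minutes, seconds; the colons are literal separators
fmt <- "%d%b%Y:%H:%M:%S"
parsed <- as.POSIXct(strptime(raw, fmt, tz = "UTC"))

# Text order compares character by character, so "05..." sorts before "07..."
latest_by_text <- tail(sort(raw), 1)

# Time order compares the actual instants
latest_by_time <- tail(raw[order(parsed)], 1)

print(latest_by_text)  # "07JAN2015:15:22:54" (the older record)
print(latest_by_time)  # "05FEB2018:17:23:16" (the newer record)
```

This is why the conversion step comes before arrange and last: on the raw strings, "latest" would be decided by the day-of-month digits rather than by the date.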