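I recently posted a question asking how to reshape data from a long table to a wide table, and I found that spread()
is a very handy function for this. Now I need to take my previous post one step further.
Suppose we have a table like this: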
id1 | id2 | info | action_time | action_comment |
1 | a | info1 | time1 | comment1 |
1 | a | info1 | time2 | comment2 |
1 | a | info1 | time3 | comment3 |
2 | b | info2 | time4 | comment4 |
2 | b | info2 | time5 | comment5 |
I want to change it into something like this:
id1 | id2 | info | action_time 1 | action_comment1 | action_time 2 | action_comment2 | action_time 3 | action_comment3 |
1 | a | info1 | time1 | comment1 | time2 | comment2 | time3 | comment3 |
2 | b | info2 | time4 | comment4 | time5 | comment5 | | |
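So the difference between this question and my previous one is that I have added another column, which also needs to be reshaped.
I'm thinking of using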
library(dplyr)
library(tidyr)
df %>%
group_by(id1) %>%
mutate(action_no = paste("action_time", row_number())) %>%
spread(action_no, value = c(action_time, action_comment))
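But when I pass two columns to the value
argument, it gives an error message: Invalid column specification.
I really like using the %>%
operator to manipulate data, so I'm keen to know how to correct my code to achieve this.
Any help is much appreciated.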
Answer 0 (score: 8)
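We can do this with the devel version of data.table,
which can take multiple value.var
columns. Instructions to install the devel version are here
We convert the 'data.frame' to a 'data.table' (setDT(df)),
create a sequence variable ('ind') using the grouping variables ('id1', 'id2', 'info'), and dcast
from 'long' to 'wide' format by specifying value.var
as 'action_time' and 'action_comment'.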
library(data.table)#v1.9.5+
setDT(df)[, ind:= 1:.N, .(id1, id2, info)]
dcast(df, id1 + id2 + info ~ ind,
value.var=c('action_time', 'action_comment'), fill='')
# id1 id2 info 1_action_time 2_action_time 3_action_time 1_action_comment
#1: 1 a info1 time1 time2 time3 comment1
#2: 2 b info2 time4 time5 comment4
# 2_action_comment 3_action_comment
#1: comment2 comment3
#2: comment5
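As a variation on the steps above, here is a minimal self-contained sketch that builds the sequence number inside the dcast formula with rowid() instead of adding an 'ind' column first. It assumes a data.table release that provides rowid() and multiple value.var columns; the object names dt and wide are made up for illustration, and the column names in the result may differ from the output shown above depending on the version.

```r
library(data.table)

# Same toy data as in the question (dt is an illustrative name)
dt <- data.table(
  id1 = c(1L, 1L, 1L, 2L, 2L),
  id2 = c("a", "a", "a", "b", "b"),
  info = c("info1", "info1", "info1", "info2", "info2"),
  action_time = paste0("time", 1:5),
  action_comment = paste0("comment", 1:5)
)

# rowid() numbers the rows within each id1/id2/info group (1, 2, 3, 1, 2),
# so no helper column has to be added to dt beforehand
wide <- dcast(dt, id1 + id2 + info ~ rowid(id1, id2, info),
              value.var = c("action_time", "action_comment"))
wide
```

Missing combinations are left as NA here; passing fill = '' as in the answer replaces them with empty strings.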
Or use reshape
from base R.
We create the sequence variable ('ind') with ave
and then use reshape
to change from 'long' to 'wide' format.
df$ind <- with(df, ave(seq_along(id1), id1, id2, info, FUN=seq_along))
reshape(df, idvar=c('id1', 'id2', 'info'),timevar='ind', direction='wide')
# id1 id2 info action_time.1 action_comment.1 action_time.2 action_comment.2
#1 1 a info1 time1 comment1 time2 comment2
#4 2 b info2 time4 comment4 time5 comment5
# action_time.3 action_comment.3
#1 time3 comment3
#4 <NA> <NA>
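To make the ave() step easier to follow, this small base R sketch shows the per-group counter on its own before it is handed to reshape(). It rebuilds the example data inline so it can be run by itself; the names ind and wide are only for illustration.

```r
# Same toy data as in the question
df <- data.frame(
  id1 = c(1L, 1L, 1L, 2L, 2L),
  id2 = c("a", "a", "a", "b", "b"),
  info = c("info1", "info1", "info1", "info2", "info2"),
  action_time = paste0("time", 1:5),
  action_comment = paste0("comment", 1:5),
  stringsAsFactors = FALSE
)

# ave() applies seq_along() within each id1/id2/info group,
# so every group gets its own counter starting at 1
ind <- with(df, ave(seq_along(id1), id1, id2, info, FUN = seq_along))
print(ind)  # 1 2 3 1 2

# The counter becomes the 'time' variable that reshape() spreads across columns
df$ind <- ind
wide <- reshape(df, idvar = c("id1", "id2", "info"),
                timevar = "ind", direction = "wide")
wide
```

Every column that is neither an idvar nor the timevar is widened, which is why both action_time and action_comment are reshaped without being named explicitly.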
Data:
df <- structure(list(id1 = c(1L, 1L, 1L, 2L, 2L), id2 = c("a", "a",
"a", "b", "b"), info = c("info1", "info1", "info1", "info2",
"info2"), action_time = c("time1", "time2", "time3", "time4",
"time5"), action_comment = c("comment1", "comment2", "comment3",
"comment4", "comment5")), .Names = c("id1", "id2", "info", "action_time",
"action_comment"), class = "data.frame", row.names = c(NA, -5L))
Answer 1 (score: 6)
Try:
library(dplyr)
library(tidyr)
df %>%
group_by(id1) %>%
mutate(id = row_number()) %>%
gather(key, value, -(id1:info), -id) %>%
unite(id_key, id, key) %>%
spread(id_key, value)
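If a newer tidyr (1.0.0 or later) is available, pivot_wider() accepts several value columns at once, which is close to what the question's spread() call was attempting. The following is only a sketch under that version assumption; action_no and wide are illustrative names.

```r
library(dplyr)
library(tidyr)

# Same toy data as in the question
df <- data.frame(
  id1 = c(1L, 1L, 1L, 2L, 2L),
  id2 = c("a", "a", "a", "b", "b"),
  info = c("info1", "info1", "info1", "info2", "info2"),
  action_time = paste0("time", 1:5),
  action_comment = paste0("comment", 1:5),
  stringsAsFactors = FALSE
)

wide <- df %>%
  group_by(id1, id2, info) %>%
  mutate(action_no = row_number()) %>%  # 1, 2, 3 within each group
  ungroup() %>%
  # values_from takes both columns; names are built as <value column>_<action_no>
  pivot_wider(names_from = action_no,
              values_from = c(action_time, action_comment))
wide
```

Groups with fewer actions get NA in the unused columns; the values_fill argument can supply a different placeholder.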
Answer 2 (score: 2)
Not a direct solution, but it works
library(tidyr)
a <- spread(df, action_comment, action_time)
b <- spread(df, action_time, action_comment)
# drop NAs and shift the remaining values to the left, row by row
a[] <- t(apply(a, 1, function(x) `length<-`(na.omit(x), length(x))))
b[] <- t(apply(b, 1, function(x) `length<-`(na.omit(x), length(x))))
out <- merge(a, b, by = c('id1', 'id2', 'info'))
# drop columns that are entirely NA
# (note: the 'comment*' columns hold the times and the 'time*' columns hold the comments)
out[, colSums(is.na(out)) != nrow(out)]
# id1 id2 info comment1 comment2 comment3 time1 time2 time3
#1 1 a info1 time1 time2 time3 comment1 comment2 comment3
#2 2 b info2 time4 time5 <NA> comment4 comment5 <NA>