Reshaping a data.frame in R

Time: 2018-01-12 10:28:24

Tags: r reshape

I have a data.frame in R.

df = data.frame(custid=c(1,2,3,4),prod1=c('jeans','tshirt','jacket','tshirt'),prod1_hnode1=c(1,2,3,2),prod1_hnode2=c(6,7,8,7),prod2=c('tshirt','jeans','jacket','shirt'),prod2_hnode1=c(2,1,3,4),prod2_hnode2=c(7,6,8,7))

> df
  custid  prod1 prod1_hnode1 prod1_hnode2  prod2 prod2_hnode1 prod2_hnode2
1      1  jeans            1            6 tshirt            2            7
2      2 tshirt            2            7  jeans            1            6
3      3 jacket            3            8 jacket            3            8
4      4 tshirt            2            7  shirt            4            7

How can I reshape it into the following long format?

  custid  prod    rec hnode1 hnode2
1      1 prod1  jeans      1      6
2      1 prod2 tshirt      2      7
3      2 prod1 tshirt      2      7
4      2 prod2  jeans      1      6
5      3 prod1 jacket      3      8
6      3 prod2 jacket      3      8
7      4 prod1 tshirt      2      7
8      4 prod2  shirt      4      7

I just got an answer showing how to do this in Python, and I'm curious about an R solution as well.

2 answers:

Answer 0: (score: 3)

We can do this with melt from data.table:

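library(data.table)
# setDT() converts df to a data.table by reference (df itself is modified).
# Each pattern selects one group of columns; each group becomes one value column.
melt(setDT(df), measure.vars = patterns("^prod\\d+$", "hnode1", "hnode2"), 
    value.name = c("rec", "hnode1", "hnode2"), variable.name = 'prod')[, 
        prod := paste0("prod", prod)][order(custid)]  # relabel 1, 2 as prod1, prod2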
#    custid  prod    rec hnode1 hnode2
#1:      1 prod1  jeans      1      6
#2:      1 prod2 tshirt      2      7
#3:      2 prod1 tshirt      2      7
#4:      2 prod2  jeans      1      6
#5:      3 prod1 jacket      3      8
#6:      3 prod2 jacket      3      8
#7:      4 prod1 tshirt      2      7
#8:      4 prod2  shirt      4      7
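
To see why the paste0 step is there, here is a minimal sketch that stops right after melt. It assumes the same data as in the question; the name raw is only illustrative, and as.data.table() is used instead of setDT() so that df is copied rather than converted in place. With several patterns, the variable column should hold only the position of each column within its group, not the original column name.

```r
library(data.table)

# Same data as in the question; stringsAsFactors = FALSE keeps the products as character.
df <- data.frame(
  custid = c(1, 2, 3, 4),
  prod1 = c("jeans", "tshirt", "jacket", "tshirt"),
  prod1_hnode1 = c(1, 2, 3, 2),
  prod1_hnode2 = c(6, 7, 8, 7),
  prod2 = c("tshirt", "jeans", "jacket", "shirt"),
  prod2_hnode1 = c(2, 1, 3, 4),
  prod2_hnode2 = c(7, 6, 8, 7),
  stringsAsFactors = FALSE
)

# as.data.table() makes a copy, so df stays a plain data.frame.
raw <- melt(as.data.table(df),
            measure.vars = patterns("^prod\\d+$", "hnode1", "hnode2"),
            value.name = c("rec", "hnode1", "hnode2"),
            variable.name = "prod")

# The prod column is a factor of group positions, hence the later paste0("prod", prod).
levels(raw$prod)
```

If the prefix were something other than "prod", the same relabelling step would need to be adapted by hand, since the original names are not carried over.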

Answer 1: (score: 1)

Another option is base R's reshape function.

Try:

long <- reshape(df, direction = "long", idvar = "custid", 
                varying = list(c(2, 5), c(3, 6), c(4, 7)), 
                sep = "", times = c("prod1", "prod2"))

At this point you are mostly done, but you can also tidy up the row names and column names:

rownames(long) <- NULL    
colnames(long) <- c("custid", "prod", "rec", "hnode1", "hnode2")
long
#   custid  prod    rec hnode1 hnode2
# 1      1 prod1  jeans      1      6
# 2      2 prod1 tshirt      2      7
# 3      3 prod1 jacket      3      8
# 4      4 prod1 tshirt      2      7
# 5      1 prod2 tshirt      2      7
# 6      2 prod2  jeans      1      6
# 7      3 prod2 jacket      3      8
# 8      4 prod2  shirt      4      7
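
Note that the rows above are grouped by product rather than by customer, unlike the output requested in the question. As a rough sketch (assuming the same data as in the question), passing v.names and timevar to reshape names the columns directly, and an order() call puts each customer's rows together:

```r
# Same data as in the question; stringsAsFactors = FALSE keeps the products as character.
df <- data.frame(
  custid = c(1, 2, 3, 4),
  prod1 = c("jeans", "tshirt", "jacket", "tshirt"),
  prod1_hnode1 = c(1, 2, 3, 2),
  prod1_hnode2 = c(6, 7, 8, 7),
  prod2 = c("tshirt", "jeans", "jacket", "shirt"),
  prod2_hnode1 = c(2, 1, 3, 4),
  prod2_hnode2 = c(7, 6, 8, 7),
  stringsAsFactors = FALSE
)

# v.names and timevar set the long-format column names up front,
# so no colnames<- step is needed afterwards.
long <- reshape(df, direction = "long", idvar = "custid",
                varying = list(c(2, 5), c(3, 6), c(4, 7)),
                v.names = c("rec", "hnode1", "hnode2"),
                timevar = "prod", times = c("prod1", "prod2"))

# Sort by customer, then product, to match the layout asked for in the question.
long <- long[order(long$custid, long$prod), ]
rownames(long) <- NULL
long
```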

I can't really think of a "tidyverse" approach that doesn't involve combining two subsets of the data. Here is one that gets the desired output:

library(tidyverse)

left <- df %>% 
  select(custid, prod1, prod2) %>%
  gather(prod, rec, -custid) %>%
  arrange(custid) 

right <- df %>%
  select(custid, contains("node")) %>%
  gather(var, val, -custid) %>%
  mutate(var = sub(".*_", "", var)) %>%
  group_by(custid, var) %>%
  mutate(ind = sequence(n())) %>%
  spread(var, val) %>%
  ungroup() %>%
  select(-ind, -custid)

cbind(left, right)
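
For readers on a newer tidyr (1.0.0 or later, which postdates this answer), a single pipeline may be possible with pivot_longer and its ".value" sentinel. This is a sketch under that assumption, not part of the original answer; the _rec suffix is an illustrative name added so that every measure column follows the same <prod>_<field> pattern.

```r
library(dplyr)
library(tidyr)  # pivot_longer() needs tidyr >= 1.0.0

# Same data as in the question; stringsAsFactors = FALSE keeps the products as character.
df <- data.frame(
  custid = c(1, 2, 3, 4),
  prod1 = c("jeans", "tshirt", "jacket", "tshirt"),
  prod1_hnode1 = c(1, 2, 3, 2),
  prod1_hnode2 = c(6, 7, 8, 7),
  prod2 = c("tshirt", "jeans", "jacket", "shirt"),
  prod2_hnode1 = c(2, 1, 3, 4),
  prod2_hnode2 = c(7, 6, 8, 7),
  stringsAsFactors = FALSE
)

long <- df %>%
  # Add a suffix so prod1/prod2 split into a product part and a field part like the other columns.
  rename(prod1_rec = prod1, prod2_rec = prod2) %>%
  # The part before "_" goes to the prod column; the part after names the output column.
  pivot_longer(-custid,
               names_to = c("prod", ".value"),
               names_sep = "_")

long
```

The rename step is what makes the split on "_" uniform; without it, prod1 and prod2 have no field part to map to an output column.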