Reshaping repeated rows into column headers

Time: 2017-08-13 11:06:19

Tags: r dataframe dplyr tidyr

I am trying to reshape a data frame using tidyr. Below is the data frame:

data <- data.frame(
  class_name = c("date", "date", "educational", "qualif",
                 "date", "date", "educational", "qualif"),
  text_val = c("2000", "2003", "ILLINOIS INSTITUTE OF TECHNOLOGY",
               "Master of Science, Computer Science", "1996", "2000",
               "MAHARASHTRA INSTITUTE OF TECHNOLOGY",
               "Bachelor of Science, Mechanical Engineering"))

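I would like the data to look like the picture below (the expected output was posted as an image):

[image: expected output]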

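3 Answers:

Answer 0: (score: 3)

Here is an idea using the tidyverse. We basically group every 4 rows and spread. However, we first need to make the names in class_name unique, i.e.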

library(tidyverse)

data %>% 
    group_by(grp = rep(seq(n()/4), each = 4)) %>% 
    mutate(class_name = make.unique(as.character(class_name))) %>% 
    spread(class_name, text_val) %>% 
    ungroup() %>% 
    select(educational, qualif, date, date.1)

which gives,

# A tibble: 2 x 4
                          educational                                      qualif   date date.1
*                              <fctr>                                      <fctr> <fctr> <fctr>
1    ILLINOIS INSTITUTE OF TECHNOLOGY         Master of Science, Computer Science   2000   2003
2 MAHARASHTRA INSTITUTE OF TECHNOLOGY Bachelor of Science, Mechanical Engineering   1996   2000
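
The two helper steps in the pipeline above can be checked in isolation with base R. This is only a minimal sketch: the variable names `unique_names`, `n_rows`, `block_size`, and `grp` are illustrative and not part of the answer, and the block size of 4 is assumed from the sample data.

```r
# Names as they appear in the first block of four rows in the question.
class_name <- c("date", "date", "educational", "qualif")

# make.unique() appends .1, .2, ... to repeated values, so the second
# "date" becomes "date.1" and can serve as its own column name.
unique_names <- make.unique(class_name)
print(unique_names)

# Group index for 8 rows split into blocks of 4 (assumed block size).
n_rows <- 8
block_size <- 4
grp <- rep(seq_len(n_rows / block_size), each = block_size)
print(grp)
```

Without the unique names, spread() would find two rows with the key "date" inside the same group and could not place them in separate columns.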

Answer 1: (score: 1)

Another solution using reshape (not as elegant as Sotos' solution):

data <- data.frame(
  class_name = c("date", "date", "educational", "qualif",
                 "date", "date", "educational", "qualif"),
  text_val = c("2000", "2003", "ILLINOIS INSTITUTE OF TECHNOLOGY",
               "Master of Science, Computer Science", "1996", "2000",
               "MAHARASHTRA INSTITUTE OF TECHNOLOGY",
               "Bachelor of Science, Mechanical Engineering"))
nrec <- 4
data$id <- rep(1:2, each=nrec)
data$time <- rep(1:4, nrow(data)/nrec)

df <- reshape(data, v.names="text_val", idvar="id", direction="wide")[,-1]
names(df) <- c("id","date1","date2","educational","qualif")
df

#   id date1 date2                         educational                                      qualif
# 1  1  2000  2003    ILLINOIS INSTITUTE OF TECHNOLOGY         Master of Science, Computer Science
# 5  2  1996  2000 MAHARASHTRA INSTITUTE OF TECHNOLOGY Bachelor of Science, Mechanical Engineering
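
The id and column names above are hard-coded for two records. A possible generalisation, sketched under the assumption that every record occupies exactly `nrec` consecutive rows in the same class_name order, derives both from the data; `n_blocks` and `wide` are illustrative names, not part of the answer.

```r
data <- data.frame(
  class_name = c("date", "date", "educational", "qualif",
                 "date", "date", "educational", "qualif"),
  text_val = c("2000", "2003", "ILLINOIS INSTITUTE OF TECHNOLOGY",
               "Master of Science, Computer Science", "1996", "2000",
               "MAHARASHTRA INSTITUTE OF TECHNOLOGY",
               "Bachelor of Science, Mechanical Engineering"),
  stringsAsFactors = FALSE
)

nrec <- 4                          # rows per record (assumed fixed)
n_blocks <- nrow(data) / nrec      # number of records
data$id <- rep(seq_len(n_blocks), each = nrec)      # record index
data$time <- rep(seq_len(nrec), times = n_blocks)   # position within record

# Drop class_name before reshaping so that reshape() does not have to
# treat a column that varies within id as a constant.
wide <- reshape(data[, c("id", "time", "text_val")],
                v.names = "text_val", idvar = "id", timevar = "time",
                direction = "wide")

# Name the value columns after the first record's class_name values,
# made unique so the two date columns stay distinguishable.
names(wide) <- c("id", make.unique(data$class_name[seq_len(nrec)]))
print(wide)
```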

Answer 2: (score: 0)

For the sake of completeness, here is also a solution using dcast() from the data.table package:

library(data.table)
setDT(data)[, rn := .I + 3L][
  , dcast(.SD , rn %/% 4L ~ class_name, toString, value.var = "text_val")]
   rn       date                         educational                                      qualif
1:  1 2000, 2003    ILLINOIS INSTITUTE OF TECHNOLOGY         Master of Science, Computer Science
2:  2 1996, 2000 MAHARASHTRA INSTITUTE OF TECHNOLOGY Bachelor of Science, Mechanical Engineering

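Note that toString() is used as the aggregation function so that the duplicate dates are concatenated in one column. This is because the two date columns in the OP's expected output share the same name, which may indicate that the expected output is for display only and that no further processing of the date values is required.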
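
The grouping expression and the aggregation can be verified separately in base R. A small sketch follows, with `row_index`, `block_id`, and `joined` as illustrative names: `row_index` stands in for data.table's `.I`, and the record length of 4 rows is assumed from the sample data.

```r
# .I in data.table is the row number; 1:8 stands in for it here.
row_index <- 1:8

# Adding 3 before integer division by 4 maps rows 1-4 to block 1
# and rows 5-8 to block 2.
block_id <- (row_index + 3L) %/% 4L
print(block_id)

# toString() collapses a vector into one comma-separated string,
# which is how the two dates of a record end up in a single cell.
joined <- toString(c("2000", "2003"))
print(joined)
```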

If the column order matters and rn is not needed, the output can be tidied up to better match the OP's expected result:

lvl <- c("educational", "qualif", "date")
setDT(data)[, rn := .I + 3L][, class_name := factor(class_name, levels = lvl)][
  , dcast(.SD , rn %/% 4L ~ class_name, toString, value.var = "text_val")][, rn := NULL][]
                           educational                                      qualif       date
1:    ILLINOIS INSTITUTE OF TECHNOLOGY         Master of Science, Computer Science 2000, 2003
2: MAHARASHTRA INSTITUTE OF TECHNOLOGY Bachelor of Science, Mechanical Engineering 1996, 2000