Reshaping a data frame with multiple keys and values

Posted: 2018-03-29 05:19:59

Tags: r

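I have a wide data frame that I want to convert to a long data frame.

This is not the actual wide data frame I'm working with. The real one has more courses and more "values" per course, so it is considerably wider than this. Not every course has every value column associated with it (which is why bio1Csem is absent from the data frame below).

I decided to try solutions on a smaller data frame first, because I was running into a lot of problems with the larger one. Unfortunately, I'm still struggling.

The data I'm using: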

X <- rbind(
  c("1", "2.5", "3.7", "",    "2006 Fall",   "2007 Fall", "Smith", "Hu", ""),
  c("2", "3.7", "3.7", "3.5", "2007 Spring", "2007 Fall", "Smith", "Hu", "Langdon"),
  c("3", "4",   "3.2", "4",   "2007 Spring", "2007 Fall", "Smith", "Hu", "Langdon")
)
colnames(X) <- c('id', 'bio1Agrade', 'bio1Bgrade', 'bio1Cgrade', 'bio1Asem',
                 'bio1Bsem', 'bio1Aprof', 'bio1Bprof', 'bio1Cprof')

> X
     id  bio1Agrade bio1Bgrade bio1Cgrade bio1Asem      bio1Bsem    bio1Aprof bio1Bprof bio1Cprof
[1,] "1" "2.5"      "3.7"      ""         "2006 Fall"   "2007 Fall" "Smith"   "Hu"      ""
[2,] "2" "3.7"      "3.7"      "3.5"      "2007 Spring" "2007 Fall" "Smith"   "Hu"      "Langdon"
[3,] "3" "4"        "3.2"      "4"        "2007 Spring" "2007 Fall" "Smith"   "Hu"      "Langdon"

I would like it to look like this:

id    course  grade  semester     prof
1     bio1A   2.5    2006 Fall    Smith
1     bio1B   3.7    2007 Fall    Hu 
1     bio1C               
2     bio1A   3.7    2007 Spring  Smith
2     bio1B   3.7    2007 Fall    Hu
2     bio1C   3.5                 Langdon
3     bio1A   4      2007 Spring  Smith
3     bio1B   3.2    2007 Fall    Hu
3     bio1C   4                   Langdon

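I don't think reshape will work, because my column names are just runs of characters with no obvious separator, and not every course has all three corresponding columns.

I've also thought about using tidyr, but I'm struggling to see how to use it with multiple value columns.

Does anyone have a suggestion for how to approach this? Would it be easier to rename the columns, add empty columns for the courses that are "missing" one, and then use reshape? Or is there another, better way?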
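For what it's worth, the padding idea raised above can be sketched in base R. This is only a minimal sketch, not a definitive solution: it assumes the data is a data.frame (the X above is a character matrix), that every column name is a course prefix followed by one of grade, sem or prof, and the names courses, fields and long are introduced here purely for illustration. Passing varying as an explicit list means reshape does not need a separator in the column names.

```r
# Sample data mirroring the question; bio1Csem is deliberately absent.
df <- data.frame(
  id = 1:3,
  bio1Agrade = c(2.5, 3.7, 4),
  bio1Bgrade = c(3.7, 3.7, 3.2),
  bio1Cgrade = c(NA, 3.5, 4),
  bio1Asem = c("2006 Fall", "2007 Spring", "2007 Spring"),
  bio1Bsem = c("2007 Fall", "2007 Fall", "2007 Fall"),
  bio1Aprof = c("Smith", "Smith", "Smith"),
  bio1Bprof = c("Hu", "Hu", "Hu"),
  bio1Cprof = c("", "Langdon", "Langdon"),
  stringsAsFactors = FALSE
)

# Illustrative names (assumptions, not from the question).
courses <- c("bio1A", "bio1B", "bio1C")
fields <- c("grade", "sem", "prof")

# Pad: add an all-NA column for every course/field pair that is missing.
for (nm in setdiff(as.vector(outer(courses, fields, paste0)), names(df))) {
  df[[nm]] <- NA
}

# One vector of wide column names per field, each in the same course order.
long <- reshape(
  df,
  direction = "long",
  idvar = "id",
  varying = lapply(fields, function(f) paste0(courses, f)),
  v.names = fields,
  timevar = "course",
  times = courses
)

# Sort by id then course and drop the generated row names.
long <- long[order(long$id, long$course), ]
rownames(long) <- NULL
```

After padding, every field has one column per course, so the three vectors in varying line up position by position with times.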

2 Answers:

Answer 0 (score: 1)

Hope this helps!

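library(dplyr)
library(tidyr)

df %>%
  # Stack every column except id into (temp_col, value) pairs
  gather(temp_col, value, -id) %>%
  # Split each original name into its course prefix and its field suffix
  mutate(course = gsub("(.*)(grade|sem|prof)", "\\1", temp_col),
         column_name = gsub("(.*)(grade|sem|prof)","\\2", temp_col)) %>%
  select(-temp_col) %>%
  # Spread the field suffixes back out into one column each
  spread(column_name, value)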

Output:

  id course grade    prof         sem
1  1  bio1A   2.5   Smith   2006 Fall
2  1  bio1B   3.7      Hu   2007 Fall
3  1  bio1C  <NA>                <NA>
4  2  bio1A   3.7   Smith 2007 Spring
5  2  bio1B   3.7      Hu   2007 Fall
6  2  bio1C   3.5 Langdon        <NA>
7  3  bio1A     4   Smith 2007 Spring
8  3  bio1B   3.2      Hu   2007 Fall
9  3  bio1C     4 Langdon        <NA>

Sample data:

df <- structure(list(id = 1:3, bio1Agrade = c(2.5, 3.7, 4), bio1Bgrade = c(3.7, 
3.7, 3.2), bio1Cgrade = c(NA, 3.5, 4), bio1Asem = c("2006 Fall", 
"2007 Spring", "2007 Spring"), bio1Bsem = c("2007 Fall", "2007 Fall", 
"2007 Fall"), bio1Aprof = c("Smith", "Smith", "Smith"), bio1Bprof = c("Hu", 
"Hu", "Hu"), bio1Cprof = c("", "Langdon", "Langdon")), .Names = c("id", 
"bio1Agrade", "bio1Bgrade", "bio1Cgrade", "bio1Asem", "bio1Bsem", 
"bio1Aprof", "bio1Bprof", "bio1Cprof"), class = "data.frame", row.names = c(NA, 
-3L))
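With tidyr 1.0.0 or later (released after this answer was written), the same reshaping can probably be done in a single step using pivot_longer and the special ".value" name. The following is a hedged sketch rather than a definitive implementation: it rebuilds the sample data so it runs on its own, the result name long is illustrative, and the trailing $ in the regular expression is an addition that anchors the suffix to the end of the column name.

```r
library(tidyr)

# Same sample data as above, rebuilt so the block is self-contained.
df <- data.frame(
  id = 1:3,
  bio1Agrade = c(2.5, 3.7, 4),
  bio1Bgrade = c(3.7, 3.7, 3.2),
  bio1Cgrade = c(NA, 3.5, 4),
  bio1Asem = c("2006 Fall", "2007 Spring", "2007 Spring"),
  bio1Bsem = c("2007 Fall", "2007 Fall", "2007 Fall"),
  bio1Aprof = c("Smith", "Smith", "Smith"),
  bio1Bprof = c("Hu", "Hu", "Hu"),
  bio1Cprof = c("", "Langdon", "Langdon"),
  stringsAsFactors = FALSE
)

# The first regex group becomes the course column; the second group
# (".value") names the output column that receives the cell values.
long <- pivot_longer(
  df,
  cols = -id,
  names_to = c("course", ".value"),
  names_pattern = "(.*)(grade|sem|prof)$"
)
```

Because each field gets its own output column, the original column types are kept (grade stays numeric instead of being coerced to character), and the missing bio1Csem combination is filled with NA.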

Answer 1 (score: 1)

We can do this with melt from data.table, which can take multiple measure patterns:

library(data.table)
# Course prefixes taken from the column names; the first three are bio1A, bio1B, bio1C
nm1 <- substr(names(df)[-1], 1, 5)
melt(setDT(df), measure = patterns("grade$", "prof$", "sem$"),
     value.name = c("grade", "prof", "sem"),
     variable.name = "course")[, course := nm1[course]][order(id)]
#   id course grade    prof         sem
#1:  1  bio1A   2.5   Smith   2006 Fall
#2:  1  bio1B   3.7      Hu   2007 Fall
#3:  1  bio1C    NA                  NA
#4:  2  bio1A   3.7   Smith 2007 Spring
#5:  2  bio1B   3.7      Hu   2007 Fall
#6:  2  bio1C   3.5 Langdon          NA
#7:  3  bio1A   4.0   Smith 2007 Spring
#8:  3  bio1B   3.2      Hu   2007 Fall
#9:  3  bio1C   4.0 Langdon          NA
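One caveat worth checking on the real, wider data: as far as I understand, patterns() lines the matched columns up by position within each pattern, not by course name. That works here because the missing column (bio1Csem) belongs to the last course, but a gap earlier in the sequence could pair a semester with the wrong course. A minimal sketch of one way around it, assuming every column is a course prefix plus a field suffix, is to pad the missing columns before melting. The toy table dt and the names courses, fields, wanted and long are hypothetical and only for illustration.

```r
library(data.table)

# Hypothetical toy data: bio1Asem (the FIRST course's semester) is absent.
dt <- data.table(
  id = 1:2,
  bio1Agrade = c(2.5, 3.7),
  bio1Bgrade = c(3.7, 3.2),
  bio1Bsem = c("2007 Fall", "2007 Fall")
)

courses <- c("bio1A", "bio1B")
fields <- c("grade", "sem")

# All course/field column names, grouped by field with courses in a fixed order.
wanted <- as.vector(outer(courses, fields, paste0))

# Add an all-NA column for each missing pair (only a sem column is missing
# here, hence NA_character_), then put the columns in that fixed order.
for (nm in setdiff(wanted, names(dt))) {
  dt[, (nm) := NA_character_]
}
setcolorder(dt, c("id", wanted))

# Every pattern now matches one column per course, in the same order.
long <- melt(
  dt,
  id.vars = "id",
  measure.vars = patterns("grade$", "sem$"),
  value.name = fields,
  variable.name = "course"
)

# The melted course column is a positional index; map it back to course names.
long[, course := courses[as.integer(course)]]
setorder(long, id, course)
```

With the padding in place, the bio1A rows get NA for sem and the bio1B rows keep "2007 Fall".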