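I have a wide data frame that I want to convert to a long data frame.
This is not the actual wide data frame I'm working with. The real one has many more courses and more "values" per course, so it is much wider than this. Not every course has all of the value columns associated with it (which is why bio1Csem is not in the data frame below).
I decided to try solutions on a smaller data frame because I ran into a lot of problems with the larger one. Unfortunately, I'm still struggling.
The data frame I'm working with: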
X <- rbind(c("1", "2.5", "3.7", "", "2006 Fall", "2007 Fall", "Smith", "Hu", ""),
           c("2", "3.7", "3.7", "3.5", "2007 Spring", "2007 Fall", "Smith", "Hu", "Langdon"),
           c("3", "4", "3.2", "4", "2007 Spring", "2007 Fall", "Smith", "Hu", "Langdon"))
colnames(X) <- c("id", "bio1Agrade", "bio1Bgrade", "bio1Cgrade", "bio1Asem",
                 "bio1Bsem", "bio1Aprof", "bio1Bprof", "bio1Cprof")
X
     id  bio1Agrade bio1Bgrade bio1Cgrade bio1Asem      bio1Bsem    bio1Aprof bio1Bprof bio1Cprof
[1,] "1" "2.5"      "3.7"      ""         "2006 Fall"   "2007 Fall" "Smith"   "Hu"      ""
[2,] "2" "3.7"      "3.7"      "3.5"      "2007 Spring" "2007 Fall" "Smith"   "Hu"      "Langdon"
[3,] "3" "4"        "3.2"      "4"        "2007 Spring" "2007 Fall" "Smith"   "Hu"      "Langdon"
I want it to look like this:
id course grade semester    prof
1  bio1A  2.5   2006 Fall   Smith
1  bio1B  3.7   2007 Fall   Hu
1  bio1C
2  bio1A  3.7   2007 Spring Smith
2  bio1B  3.7   2007 Fall   Hu
2  bio1C  3.5               Langdon
3  bio1A  4     2007 Spring Smith
3  bio1B  3.2   2007 Fall   Hu
3  bio1C  4                 Langdon
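I don't think reshape will work, because my column names are just characters without any obvious separator, and not every course has all of the columns (three, in this case) that correspond to it.
I've also thought about a tidyr solution, but I'm struggling with how to use it for multiple value columns.
Does anyone have suggestions for how to go about this? Would it be easier to rename the columns, add empty columns for the courses that are "missing" columns, and then use reshape? Or is there a better way?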
Answer 0 (score: 1)
Hope this helps!
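library(dplyr)
library(tidyr)

df %>%
  # stack every column except id into name/value pairs
  gather(temp_col, value, -id) %>%
  # split each original column name into its course and its field
  mutate(course = gsub("(.*)(grade|sem|prof)", "\\1", temp_col),
         column_name = gsub("(.*)(grade|sem|prof)", "\\2", temp_col)) %>%
  select(-temp_col) %>%
  # one column per field: grade, prof, sem
  spread(column_name, value)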
The output is:
  id course grade    prof         sem
1  1  bio1A   2.5   Smith   2006 Fall
2  1  bio1B   3.7      Hu   2007 Fall
3  1  bio1C  <NA>                <NA>
4  2  bio1A   3.7   Smith 2007 Spring
5  2  bio1B   3.7      Hu   2007 Fall
6  2  bio1C   3.5 Langdon        <NA>
7  3  bio1A     4   Smith 2007 Spring
8  3  bio1B   3.2      Hu   2007 Fall
9  3  bio1C     4 Langdon        <NA>
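To see what the two gsub() calls do, here is a small base-R sketch that applies the same pattern to the column names on their own. The variable names (cols, pattern, course, field) are only for illustration and are not part of the answer's code.

```r
# Column names from the question, without the id column
cols <- c("bio1Agrade", "bio1Bgrade", "bio1Cgrade", "bio1Asem",
          "bio1Bsem", "bio1Aprof", "bio1Bprof", "bio1Cprof")

# Same pattern as in the answer: group 1 is the course, group 2 is the field
pattern <- "(.*)(grade|sem|prof)"

course <- gsub(pattern, "\\1", cols)  # keep only the course prefix
field  <- gsub(pattern, "\\2", cols)  # keep only the trailing field name

print(data.frame(column = cols, course = course, field = field))
```

Because the course part is simply "everything before the field name", this split does not need a separator in the column names.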
Sample data:
df <- structure(list(id = 1:3, bio1Agrade = c(2.5, 3.7, 4), bio1Bgrade = c(3.7,
3.7, 3.2), bio1Cgrade = c(NA, 3.5, 4), bio1Asem = c("2006 Fall",
"2007 Spring", "2007 Spring"), bio1Bsem = c("2007 Fall", "2007 Fall",
"2007 Fall"), bio1Aprof = c("Smith", "Smith", "Smith"), bio1Bprof = c("Hu",
"Hu", "Hu"), bio1Cprof = c("", "Langdon", "Langdon")), .Names = c("id",
"bio1Agrade", "bio1Bgrade", "bio1Cgrade", "bio1Asem", "bio1Bsem",
"bio1Aprof", "bio1Bprof", "bio1Cprof"), class = "data.frame", row.names = c(NA,
-3L))
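In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a sketch of the same reshape in a single step, assuming the sample data above (rebuilt here with data.frame() so the block runs on its own). The special name ".value" in names_to asks pivot_longer() to use that part of each column name as the output column; the name long is only for illustration.

```r
library(tidyr)

# Same sample data as above, rebuilt so this block is self-contained
df <- data.frame(
  id = 1:3,
  bio1Agrade = c(2.5, 3.7, 4),
  bio1Bgrade = c(3.7, 3.7, 3.2),
  bio1Cgrade = c(NA, 3.5, 4),
  bio1Asem = c("2006 Fall", "2007 Spring", "2007 Spring"),
  bio1Bsem = c("2007 Fall", "2007 Fall", "2007 Fall"),
  bio1Aprof = c("Smith", "Smith", "Smith"),
  bio1Bprof = c("Hu", "Hu", "Hu"),
  bio1Cprof = c("", "Langdon", "Langdon"),
  stringsAsFactors = FALSE
)

# First regex group becomes the course; second group names the output column
long <- pivot_longer(
  df,
  cols = -id,
  names_to = c("course", ".value"),
  names_pattern = "(.*)(grade|sem|prof)"
)
print(long)
```

One difference from the gather()/spread() version: grade should stay numeric here, since each output column is built only from input columns of the same type, whereas gather() stacks all values into a single column and coerces them to character.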
Answer 1 (score: 1)
We can do this with melt from data.table, which can take multiple measure patterns:
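library(data.table)

# course label for each column: the first five characters of its name
nm1 <- substr(names(df)[-1], 1, 5)

# melt the three groups of columns at once, then map the variable index back to a course
melt(setDT(df), measure = patterns("grade$", "prof$", "sem$"),
     value.name = c("grade", "prof", "sem"),
     variable.name = "course")[, course := nm1[course]][order(id)]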
#    id course grade    prof         sem
#1:  1  bio1A   2.5   Smith   2006 Fall
#2:  1  bio1B   3.7      Hu   2007 Fall
#3:  1  bio1C    NA                  NA
#4:  2  bio1A   3.7   Smith 2007 Spring
#5:  2  bio1B   3.7      Hu   2007 Fall
#6:  2  bio1C   3.5 Langdon          NA
#7:  3  bio1A   4.0   Smith 2007 Spring
#8:  3  bio1B   3.2      Hu   2007 Fall
#9:  3  bio1C   4.0 Langdon          NA
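One thing worth checking before relying on this on the wider data: as far as I understand it, patterns() pairs the matched columns by position within each pattern, so the k-th "grade" column is combined with the k-th "prof" and the k-th "sem" column. That works here because the only missing column (bio1Csem) belongs to the last course. Below is a base-R sketch of that check on the column names; the helper names (course_of, by_suffix, all_courses, aligned) are made up for illustration.

```r
cols <- c("bio1Agrade", "bio1Bgrade", "bio1Cgrade", "bio1Asem",
          "bio1Bsem", "bio1Aprof", "bio1Bprof", "bio1Cprof")
suffixes <- c("grade", "prof", "sem")

# Courses that have a column for a given suffix, in column order
course_of <- function(suffix) {
  re <- paste0(suffix, "$")
  sub(re, "", grep(re, cols, value = TRUE))
}
by_suffix <- lapply(setNames(suffixes, suffixes), course_of)

# Reference order: the suffix that covers the most courses
all_courses <- by_suffix[[which.max(lengths(by_suffix))]]

# TRUE when a suffix's courses line up position by position with the reference
aligned <- vapply(by_suffix,
                  function(x) identical(x, all_courses[seq_along(x)]),
                  logical(1))
print(aligned)
```

If any entry came back FALSE (for example, if bio1Bsem rather than bio1Csem were the missing column), values would likely end up attached to the wrong course. Adding the missing columns filled with NA before melting, as the question suggests, is one way around that.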