Widening data with ordinal subgroups

Time: 2014-08-21 00:07:16

Tags: r reshape dplyr reshape2

I have a data set (~200K rows) similar to this mock data:

mydf <- read.csv(url("http://pastebin.com/raw.php?i=YWTW98Pu"), header=T)

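I am trying to widen it from its current form into what I fully understand is an untidy, bad schema (external requirements, etc.).

Specifically, what I want is one row per student, with each repeated field numbered, i.e.: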

| StudentID | Major     | University | Birthday | EnrollmentDate | CourseID.1 | CourseStartDate.1 | CourseEndDate.1 | CourseDescription.1 | Instructor.1 | Hours.1 | CourseID.2 | CourseStartDate.2 | CourseEndDate.2 | CourseDescription.2 | Instructor.2   | Hours.2 | CourseID.3... (etc) |
|-----------+-----------+------------+----------+----------------+------------+-------------------+-----------------+---------------------+--------------+---------+------------+-------------------+-----------------+---------------------+----------------+---------+---------------------|
|         1 | Economics | Oxford     | 4/9/1956 | 9/1/2001       |        100 | 8/15/2014         | 8/15/2014       | Stats With Cats     | Charlie Kufs |       3 |        101 | 8/16/2014         | 8/16/2014       | Fun with Cthulhu    | James Hatfield |       1 |                     |

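The problem I'm running into is that I want the course variables numbered sequentially, i.e. 1, 2, 3, 4 ... n for each student. That is, for each course a student took, I want the column name to reflect the order in which they took it, rather than being labelled by a specific date or course ID.

The reshape examples I've seen all want to name the widened columns with the actual values, e.g. EnrollmentDate9/1/2001.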

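3 Answers:

Answer 0: (score: 2)

Your example is a bit odd because the Birthday and EnrollmentDate values vary within a StudentID. So I ended up dropping them for this transformation, since they would not allow the rows to collapse to one per student.

So basically I just needed to add an ID that makes each Student/Class combination unique. Then I used the base reshape function.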

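mydf <- read.csv(url("http://pastebin.com/raw.php?i=YWTW98Pu"), header=TRUE)
reshape(
    # StClID numbers each student's rows 1, 2, ..., n in their existing order
    transform(mydf, StClID = ave(1:nrow(mydf), StudentID, FUN=seq_along)),
    timevar = "StClID",            # the counter becomes the column suffix
    idvar = names(mydf)[1:3],      # StudentID, Major, University
    v.names = names(mydf)[6:11],   # course-level columns to widen
    drop = names(mydf)[4:5],       # Birthday, EnrollmentDate
    direction = "wide"
)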

The first few columns of the result are:

   StudentID        Major     University CourseID.1 CourseStartDate.1
1          1    Economics         Oxford        100         8/15/2014
5          2 Anthropology Phoenix Online        100         8/15/2014
15         3        Music        Harvard        100         8/15/2014
22         4  Engineering          DeVry        100         8/15/2014
25         5          Art     Bob Ross U        100         8/15/2014
   CourseEndDate.1 Course.Description.1
1        8/15/2014      Stats With Cats
5        8/15/2014      Stats With Cats
15       8/15/2014      Stats With Cats
22       8/15/2014      Stats With Cats
25       8/15/2014      Stats With Cats
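
To see the counter-then-reshape idea in isolation, here is a minimal base R sketch on toy data. The data frame toy and the column name seq are made up for illustration, and the sketch assumes the rows are already in the order in which the courses were taken (otherwise sort by something like CourseStartDate first).

```r
# Toy data (hypothetical values, not the original data set)
toy <- data.frame(
  StudentID = c(1L, 1L, 2L, 2L, 2L),
  CourseID  = c(100L, 101L, 100L, 101L, 102L),
  Hours     = c(3L, 1L, 3L, 1L, 3L)
)

# Per-student running counter: 1, 2, ... within each StudentID,
# following the current row order
toy$seq <- ave(seq_len(nrow(toy)), toy$StudentID, FUN = seq_along)

# Widen: one row per student, course columns suffixed by the counter
wide <- reshape(
  toy,
  timevar   = "seq",
  idvar     = "StudentID",
  v.names   = c("CourseID", "Hours"),
  direction = "wide"
)

print(names(wide))
```

Students with fewer courses than the maximum get NA in the trailing columns, so the column suffix reflects position in the sequence rather than any actual date or course ID.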

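Answer 1: (score: 1)

This is a bit messy, but it should be reasonably efficient for data of this size. @MrFlick's answer is better, but I had already started working on this, and it offers a different approach.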

#Set object as data.table
require(data.table)
setDT(mydf)

#Convert vars to character
mydf <- mydf[ , lapply(.SD,as.character)]

#Group up vars that change from course to course
mydf2 <- mydf[ , list(list(.SD)), by=list(StudentID,Major,University)]

#Arrange variables that change over time
oth.vars <- rbindlist(lapply(mydf2$V1,function(x) as.data.table(t(unlist(x)))),fill=TRUE)
setcolorder(oth.vars,names(oth.vars)[order(names(oth.vars))])

#Recombine data
mydf2 <- cbind(mydf2,oth.vars)
mydf2[ , V1 := NULL]

And a bit of the output:

   StudentID        Major     University  Birthday1 Birthday10  Birthday2  Birthday3  Birthday4  Birthday5
1:         1    Economics         Oxford   4/9/1956         NA  4/10/1956  4/11/1956  4/12/1956         NA
2:         2 Anthropology Phoenix Online 12/15/1970 12/24/1970 12/16/1970 12/17/1970 12/18/1970 12/19/1970
3:         3        Music        Harvard  7/30/1967         NA  7/31/1967   8/1/1967   8/2/1967   8/3/1967
4:         4  Engineering          DeVry 12/11/1978         NA 12/12/1978 12/13/1978         NA         NA
5:         5          Art     Bob Ross U  7/25/1985         NA  7/26/1985  7/27/1985  7/28/1985  7/29/1985
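
Note that ordering the columns by name alone sorts the numeric suffixes as text, which is why Birthday10 lands before Birthday2 in the output above. The following is a minimal base R sketch of one way to order such names by their numeric suffix instead. The vector nms is made-up sample input, and the approach assumes every name is a non-numeric stem followed by digits.

```r
# Hypothetical column names shaped like the ones produced above
nms <- c("Birthday1", "Birthday10", "Birthday2",
         "CourseID1", "CourseID10", "CourseID2")

# Split each name into its stem and its numeric suffix
stem <- sub("[0-9]+$", "", nms)
idx  <- as.integer(sub("^[^0-9]*", "", nms))

# Order by course number first, then by first appearance of the stem
ord    <- order(idx, match(stem, unique(stem)))
sorted <- nms[ord]

print(sorted)
```

A vector such as sorted could presumably be passed to setcolorder in place of the name-sorted one, which would group all fields for course 1, then course 2, and so on.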

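Answer 2: (score: 0)

The following may be useful, though all the details end up merged into a single column (they can be separated with strsplit if needed):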

firstpart = paste(mydf[,1],mydf[,2],mydf[,3],mydf[,4],mydf[,5],sep=",")
secondpart = paste(mydf[,6],mydf[,7],mydf[,8],mydf[,9],mydf[,10],mydf[,11],sep=",")

duplist = which(duplicated(mydf[,1]))
entrystr = ""
outdf = data.frame(ALLDETAILS=character(), stringsAsFactors=F)
for(i in 1:nrow(mydf)){
    if(i %in% duplist){
        entrystr=paste(entrystr, secondpart[i], sep=';')
    }
    else {
        if(i>1)outdf[nrow(outdf)+1,]=entrystr;
        entrystr=paste(firstpart[i], secondpart[i], sep=';')
    }
}
outdf[nrow(outdf)+1,]=entrystr;
outdf

                                                                                                                                           ALLDETAILS
1                                                                                                                                                                                                                                                                                                                                                                  1,Economics,Oxford,4/9/1956,9/1/2001;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3
2 2,Anthropology,Phoenix Online,12/15/1970,8/15/2003;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3;104,8/19/2014,8/19/2014,Flattening Ones Wang,Thomas Hendry,4;105,8/20/2014,8/20/2014,Lemon Party Home Economics,John Holmes,1;106,8/21/2014,8/21/2014,Paint By Numbers,Max Klein,1;107,8/22/2014,8/22/2014,Where IS Waldo?,Martin Handford,3;108,8/23/2014,8/23/2014,Drugs Not Hugs,Nancy Reagan,1;109,8/24/2014,8/24/2014,Whirled Peas,Bo (dog),3
3                                                                                                                                                                                3,Music,Harvard,7/30/1967,9/27/1999;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3;104,8/19/2014,8/19/2014,Flattening Ones Wang,Thomas Hendry,4;105,8/20/2014,8/20/2014,Lemon Party Home Economics,John Holmes,1;106,8/21/2014,8/21/2014,Paint By Numbers,Max Klein,1
4                                                                                                                                                                                                                                                                                                                                                                                                                      4,Engineering,DeVry,12/11/1978,1/16/1949;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3
5                                                                                                                                                                                                                                     5,Art,Bob Ross U,7/25/1985,6/5/2008;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3;104,8/19/2014,8/19/2014,Flattening Ones Wang,Thomas Hendry,4;105,8/20/2014,8/20/2014,Lemon Party Home Economics,John Holmes,1
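
As a rough illustration of the strsplit step mentioned above, this sketch splits one combined entry back into the student fields and a matrix of course fields. The string entry is a shortened, made-up example in the same ";" and "," layout, and the approach assumes no field itself contains a comma or semicolon.

```r
# Shortened hypothetical entry: student part, then one part per course
entry <- "1,Economics,Oxford;100,Stats With Cats,3;101,Fun with Cthulhu,1"

# Split on ";" into the student part and the course parts
parts <- strsplit(entry, ";", fixed = TRUE)[[1]]

# First part holds the student-level fields
student <- strsplit(parts[1], ",", fixed = TRUE)[[1]]

# Remaining parts become one matrix row per course
courses <- do.call(rbind, strsplit(parts[-1], ",", fixed = TRUE))

print(student)
print(courses)
```

Everything comes back as character, so numeric fields would need converting afterwards.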

The data is as follows:

mydf = structure(list(StudentID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 
5L, 5L, 5L, 5L, 5L, 5L), Major = structure(c(3L, 3L, 3L, 3L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Anthropology", 
"Art", "Economics", "Engineering", "Music"), class = "factor"), 
    University = structure(c(4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
    5L, 5L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 
    2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Bob Ross U", "DeVry", 
    "Harvard", "Oxford", "Phoenix Online"), class = "factor"), 
    Birthday = structure(c(17L, 14L, 15L, 16L, 4L, 5L, 6L, 7L, 
    8L, 9L, 10L, 11L, 12L, 13L, 23L, 25L, 26L, 27L, 28L, 29L, 
    30L, 1L, 2L, 3L, 18L, 19L, 20L, 21L, 22L, 24L), .Label = c("12/11/1978", 
    "12/12/1978", "12/13/1978", "12/15/1970", "12/16/1970", "12/17/1970", 
    "12/18/1970", "12/19/1970", "12/20/1970", "12/21/1970", "12/22/1970", 
    "12/23/1970", "12/24/1970", "4/10/1956", "4/11/1956", "4/12/1956", 
    "4/9/1956", "7/25/1985", "7/26/1985", "7/27/1985", "7/28/1985", 
    "7/29/1985", "7/30/1967", "7/30/1985", "7/31/1967", "8/1/1967", 
    "8/2/1967", "8/3/1967", "8/4/1967", "8/5/1967"), class = "factor"), 
    EnrollmentDate = structure(c(23L, 24L, 29L, 30L, 13L, 14L, 
    15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 25L, 26L, 27L, 28L, 
    1L, 2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 11L, 12L, 7L), .Label = c("10/1/1999", 
    "10/2/1999", "10/3/1999", "1/16/1949", "1/17/1949", "1/18/1949", 
    "6/10/2008", "6/5/2008", "6/6/2008", "6/7/2008", "6/8/2008", 
    "6/9/2008", "8/15/2003", "8/16/2003", "8/17/2003", "8/18/2003", 
    "8/19/2003", "8/20/2003", "8/21/2003", "8/22/2003", "8/23/2003", 
    "8/24/2003", "9/1/2001", "9/2/2001", "9/27/1999", "9/28/1999", 
    "9/29/1999", "9/30/1999", "9/3/2001", "9/4/2001"), class = "factor"), 
    CourseID = c(100L, 101L, 102L, 103L, 100L, 101L, 102L, 103L, 
    104L, 105L, 106L, 107L, 108L, 109L, 100L, 101L, 102L, 103L, 
    104L, 105L, 106L, 100L, 101L, 102L, 100L, 101L, 102L, 103L, 
    104L, 105L), CourseStartDate = structure(c(1L, 2L, 3L, 4L, 
    1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 
    5L, 6L, 7L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("8/15/2014", 
    "8/16/2014", "8/17/2014", "8/18/2014", "8/19/2014", "8/20/2014", 
    "8/21/2014", "8/22/2014", "8/23/2014", "8/24/2014"), class = "factor"), 
    CourseEndDate = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 
    5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 
    2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("8/15/2014", 
    "8/16/2014", "8/17/2014", "8/18/2014", "8/19/2014", "8/20/2014", 
    "8/21/2014", "8/22/2014", "8/23/2014", "8/24/2014"), class = "factor"), 
    Course.Description = structure(c(7L, 4L, 8L, 1L, 7L, 4L, 
    8L, 1L, 3L, 5L, 6L, 9L, 2L, 10L, 7L, 4L, 8L, 1L, 3L, 5L, 
    6L, 7L, 4L, 8L, 7L, 4L, 8L, 1L, 3L, 5L), .Label = c("Cake for Breakfast", 
    "Drugs Not Hugs", "Flattening Ones Wang", "Fun with Cthulhu", 
    "Lemon Party Home Economics", "Paint By Numbers", "Stats With Cats", 
    "The Spaghetti Monster and U", "Where IS Waldo?", "Whirled Peas"
    ), class = "factor"), Instructor = structure(c(4L, 5L, 2L, 
    1L, 4L, 5L, 2L, 1L, 10L, 6L, 8L, 7L, 9L, 3L, 4L, 5L, 2L, 
    1L, 10L, 6L, 8L, 4L, 5L, 2L, 4L, 5L, 2L, 1L, 10L, 6L), .Label = c("Bill Cosby", 
    "Bobby Henderson", "Bo (dog)", "Charlie Kufs", "James Hatfield", 
    "John Holmes", "Martin Handford", "Max Klein", "Nancy Reagan", 
    "Thomas Hendry"), class = "factor"), Hours = c(3L, 1L, 3L, 
    3L, 3L, 1L, 3L, 3L, 4L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 
    4L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 4L, 1L)), .Names = c("StudentID", 
"Major", "University", "Birthday", "EnrollmentDate", "CourseID", 
"CourseStartDate", "CourseEndDate", "Course.Description", "Instructor", 
"Hours"), class = "data.frame", row.names = c(NA, -30L))