我有一个类似于这个模拟数据的数据集(~200K行):
mydf <- read.csv(url("http://pastebin.com/raw.php?i=YWTW98Pu"), header=T)
我试图将它从当前的形式扩展到我完全理解的是一个不整洁,糟糕的模式(外部要求等)。
具体来说,我要做的是每个学生有一行,每个重复的字段编号 - 即。
| StudentID | Major | University | Birthday | EnrollmentDate | CourseID.1 | CourseStartDate.1 | CourseEndDate.1 | CourseDescription.1 | Instructor.1 | Hours.1 | CourseID.2 | CourseStartDate.2 | CourseEndDate.2 | CourseDescription.2 | Instructor.2 | Hours.2 | CourseID.3... (etc) |
|-----------+-----------+------------+----------+----------------+------------+-------------------+-----------------+---------------------+--------------+---------+------------+-------------------+-----------------+---------------------+----------------+---------+---------------------|
| 1 | Economics | Oxford | 4/9/1956 | 9/1/2001 | 100 | 8/15/2014 | 8/15/2014 | Stats With Cats | Charlie Kufs | 3 | 101 | 8/16/2014 | 8/16/2014 | Fun with Cthulhu | James Hatfield | 1 | |
我遇到的问题是我希望课程变量按顺序编号 - 即。每位学生1,2,3,4 ... n。也就是说,对于他们所采用的每门课程,我希望列名与他们参加课程的顺序相关,而不是由特定日期或课程ID标记的顺序。
我看到的重塑示例都想用实际值来命名加宽的列 - 例如。 EnrollmentDate9 / 1/2001
答案 0 :(得分:2)
您的示例有点奇怪,因为Birthday和EnrollmentDate值随StudentID而变化。所以我最终放弃了它们进行这种转换,因为它不允许崩溃。
所以基本上我只需要添加一个ID来使Student / Class独一无二。然后我只使用了基础reshape
函数。
mydf <- read.csv(url("http://pastebin.com/raw.php?i=YWTW98Pu"), header=T)
reshape(
transform(mydf, StClID = ave(1:nrow(mydf), StudentID, FUN=seq_along)),
timevar = "StClID",
idvar = names(mydf)[1:3],
v.names=names(mydf)[6:11],
drop=names(mydf)[4:5],
direction = "wide"
)
此结果的前几列是
StudentID Major University CourseID.1 CourseStartDate.1
1 1 Economics Oxford 100 8/15/2014
5 2 Anthropology Phoenix Online 100 8/15/2014
15 3 Music Harvard 100 8/15/2014
22 4 Engineering DeVry 100 8/15/2014
25 5 Art Bob Ross U 100 8/15/2014
CourseEndDate.1 Course.Description.1
1 8/15/2014 Stats With Cats
5 8/15/2014 Stats With Cats
15 8/15/2014 Stats With Cats
22 8/15/2014 Stats With Cats
25 8/15/2014 Stats With Cats
答案 1 :(得分:1)
这有点乱,但是对于数据的大小,它应该相当有效。 @MrFlicks答案更好,但我已经开始研究这个问题了,它提供了一种不同的方法。
#Set object as data.table
require(data.table)
setDT(mydf)
#Convert vars to character
mydf <- mydf[ , lapply(.SD,as.character)]
#Group up vars that change from course to course
mydf2 <- mydf[ , list(list(.SD)), by=list(StudentID,Major,University)]
#Arrange variables that change over time
oth.vars <- rbindlist(lapply(mydf2$V1,function(x) as.data.table(t(unlist(x)))),fill=TRUE)
setcolorder(oth.vars,names(oth.vars)[order(names(oth.vars))])
#Recombine data
mydf2 <- cbind(mydf2,oth.vars)
mydf2[ , V1 := NULL]
还有一点输出:
StudentID Major University Birthday1 Birthday10 Birthday2 Birthday3 Birthday4 Birthday5
1: 1 Economics Oxford 4/9/1956 NA 4/10/1956 4/11/1956 4/12/1956 NA
2: 2 Anthropology Phoenix Online 12/15/1970 12/24/1970 12/16/1970 12/17/1970 12/18/1970 12/19/1970
3: 3 Music Harvard 7/30/1967 NA 7/31/1967 8/1/1967 8/2/1967 8/3/1967
4: 4 Engineering DeVry 12/11/1978 NA 12/12/1978 12/13/1978 NA NA
5: 5 Art Bob Ross U 7/25/1985 NA 7/26/1985 7/27/1985 7/28/1985 7/29/1985
答案 2 :(得分:0)
以下内容可能有用,但所有细节都合并在一列中(如果需要,可以使用strsplit分隔):
firstpart = paste(mydf[,1],mydf[,2],mydf[,3],mydf[,4],mydf[,5],sep=",")
secondpart = paste(mydf[,6],mydf[,7],mydf[,8],mydf[,9],mydf[,10],mydf[,11],sep=",")
duplist = which(duplicated(mydf[,1]))
entrystr = ""
outdf = data.frame(ALLDETAILS=character(), stringsAsFactors=F)
for(i in 1:nrow(mydf)){
if(i %in% duplist){
entrystr=paste(entrystr, secondpart[i], sep=';')
}
else {
if(i>1)outdf[nrow(outdf)+1,]=entrystr;
entrystr=paste(firstpart[i], secondpart[i], sep=';')
}
}
outdf[nrow(outdf)+1,]=entrystr;
outdf
ALLDETAILS
1 1,Economics,Oxford,4/9/1956,9/1/2001;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3
2 2,Anthropology,Phoenix Online,12/15/1970,8/15/2003;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3;104,8/19/2014,8/19/2014,Flattening Ones Wang,Thomas Hendry,4;105,8/20/2014,8/20/2014,Lemon Party Home Economics,John Holmes,1;106,8/21/2014,8/21/2014,Paint By Numbers,Max Klein,1;107,8/22/2014,8/22/2014,Where IS Waldo?,Martin Handford,3;108,8/23/2014,8/23/2014,Drugs Not Hugs,Nancy Reagan,1;109,8/24/2014,8/24/2014,Whirled Peas,Bo (dog),3
3 3,Music,Harvard,7/30/1967,9/27/1999;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3;104,8/19/2014,8/19/2014,Flattening Ones Wang,Thomas Hendry,4;105,8/20/2014,8/20/2014,Lemon Party Home Economics,John Holmes,1;106,8/21/2014,8/21/2014,Paint By Numbers,Max Klein,1
4 4,Engineering,DeVry,12/11/1978,1/16/1949;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3
5 5,Art,Bob Ross U,7/25/1985,6/5/2008;100,8/15/2014,8/15/2014,Stats With Cats,Charlie Kufs,3;101,8/16/2014,8/16/2014,Fun with Cthulhu,James Hatfield,1;102,8/17/2014,8/17/2014,The Spaghetti Monster and U,Bobby Henderson,3;103,8/18/2014,8/18/2014,Cake for Breakfast,Bill Cosby,3;104,8/19/2014,8/19/2014,Flattening Ones Wang,Thomas Hendry,4;105,8/20/2014,8/20/2014,Lemon Party Home Economics,John Holmes,1
数据如下:
mydf = structure(list(StudentID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L,
5L, 5L, 5L, 5L, 5L, 5L), Major = structure(c(3L, 3L, 3L, 3L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 4L, 4L, 4L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Anthropology",
"Art", "Economics", "Engineering", "Music"), class = "factor"),
University = structure(c(4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Bob Ross U", "DeVry",
"Harvard", "Oxford", "Phoenix Online"), class = "factor"),
Birthday = structure(c(17L, 14L, 15L, 16L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 11L, 12L, 13L, 23L, 25L, 26L, 27L, 28L, 29L,
30L, 1L, 2L, 3L, 18L, 19L, 20L, 21L, 22L, 24L), .Label = c("12/11/1978",
"12/12/1978", "12/13/1978", "12/15/1970", "12/16/1970", "12/17/1970",
"12/18/1970", "12/19/1970", "12/20/1970", "12/21/1970", "12/22/1970",
"12/23/1970", "12/24/1970", "4/10/1956", "4/11/1956", "4/12/1956",
"4/9/1956", "7/25/1985", "7/26/1985", "7/27/1985", "7/28/1985",
"7/29/1985", "7/30/1967", "7/30/1985", "7/31/1967", "8/1/1967",
"8/2/1967", "8/3/1967", "8/4/1967", "8/5/1967"), class = "factor"),
EnrollmentDate = structure(c(23L, 24L, 29L, 30L, 13L, 14L,
15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 25L, 26L, 27L, 28L,
1L, 2L, 3L, 4L, 5L, 6L, 8L, 9L, 10L, 11L, 12L, 7L), .Label = c("10/1/1999",
"10/2/1999", "10/3/1999", "1/16/1949", "1/17/1949", "1/18/1949",
"6/10/2008", "6/5/2008", "6/6/2008", "6/7/2008", "6/8/2008",
"6/9/2008", "8/15/2003", "8/16/2003", "8/17/2003", "8/18/2003",
"8/19/2003", "8/20/2003", "8/21/2003", "8/22/2003", "8/23/2003",
"8/24/2003", "9/1/2001", "9/2/2001", "9/27/1999", "9/28/1999",
"9/29/1999", "9/30/1999", "9/3/2001", "9/4/2001"), class = "factor"),
CourseID = c(100L, 101L, 102L, 103L, 100L, 101L, 102L, 103L,
104L, 105L, 106L, 107L, 108L, 109L, 100L, 101L, 102L, 103L,
104L, 105L, 106L, 100L, 101L, 102L, 100L, 101L, 102L, 103L,
104L, 105L), CourseStartDate = structure(c(1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("8/15/2014",
"8/16/2014", "8/17/2014", "8/18/2014", "8/19/2014", "8/20/2014",
"8/21/2014", "8/22/2014", "8/23/2014", "8/24/2014"), class = "factor"),
CourseEndDate = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L,
2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("8/15/2014",
"8/16/2014", "8/17/2014", "8/18/2014", "8/19/2014", "8/20/2014",
"8/21/2014", "8/22/2014", "8/23/2014", "8/24/2014"), class = "factor"),
Course.Description = structure(c(7L, 4L, 8L, 1L, 7L, 4L,
8L, 1L, 3L, 5L, 6L, 9L, 2L, 10L, 7L, 4L, 8L, 1L, 3L, 5L,
6L, 7L, 4L, 8L, 7L, 4L, 8L, 1L, 3L, 5L), .Label = c("Cake for Breakfast",
"Drugs Not Hugs", "Flattening Ones Wang", "Fun with Cthulhu",
"Lemon Party Home Economics", "Paint By Numbers", "Stats With Cats",
"The Spaghetti Monster and U", "Where IS Waldo?", "Whirled Peas"
), class = "factor"), Instructor = structure(c(4L, 5L, 2L,
1L, 4L, 5L, 2L, 1L, 10L, 6L, 8L, 7L, 9L, 3L, 4L, 5L, 2L,
1L, 10L, 6L, 8L, 4L, 5L, 2L, 4L, 5L, 2L, 1L, 10L, 6L), .Label = c("Bill Cosby",
"Bobby Henderson", "Bo (dog)", "Charlie Kufs", "James Hatfield",
"John Holmes", "Martin Handford", "Max Klein", "Nancy Reagan",
"Thomas Hendry"), class = "factor"), Hours = c(3L, 1L, 3L,
3L, 3L, 1L, 3L, 3L, 4L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 3L,
4L, 1L, 1L, 3L, 1L, 3L, 3L, 1L, 3L, 3L, 4L, 1L)), .Names = c("StudentID",
"Major", "University", "Birthday", "EnrollmentDate", "CourseID",
"CourseStartDate", "CourseEndDate", "Course.Description", "Instructor",
"Hours"), class = "data.frame", row.names = c(NA, -30L))