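I am trying to reshape my data into wide format, but I am fairly new to this, and so far none of the usual reshape functions seem to work. When I run them on the full data (this excerpt happens to work), every value comes back as NA and the column names turn into odd numeric vectors.
My data looks like the excerpt below. PNR is the id variable that identifies each observation.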
PNR ZIPCODE SEL_CRITERION QUAL_RATING QUEUENUMBER UPSEC_ID UPSEC_COURSE_ID MARK COURSEOFFERING_ID ADMISSIONROUND_ID RESULT WITHIN_PROGRAM SUMMA
1234567890 46395 HB 55 0 HRF SV203 G 97116 HT2019 20 0 67.5
1234567890 46395 HB 55 0 HRF EN200 VG 97116 HT2019 20 0 67.5
1234567890 46395 HB 55 0 HRF MA200 VG 97116 HT2019 20 0 67.5
1234567890 46395 HB 55 0 HRF <null> <null> 97116 HT2019 20 0 67.5
2345678901 42332 B5 2645 0 3SB EN1201 VG 97116 HT2019 20 0 70.5
2345678901 42332 B5 2645 0 3SB MA1201 VG 97113 HT2019 20 0 70.5
2345678901 42332 B5 2645 0 2SM SV1201 VG 97113 HT2019 20 0 70.5
I would like it to look like this:
PNR ZIPCODE HB B5 QUEUENUMBER UPSEC_ID SV203 EN200 MA200 <null> EN1201 MA1201 SV1201
1234567890 46395 55 NA 0 HRF G VG VG NA NA NA NA
2345678901 42332 NA 2645 0 3SB NA NA NA NA VG VG VG
Is there a way to do this?
I tried the base reshape() function, which fails badly (though not necessarily on this small excerpt):
test<-reshape(HT2018, idvar="PNR",timevar=c("SEL_CRITERION", "UPSEC_COURSE_ID"), v.names=c("QUAL_RATING","MARK"), direction = "wide")
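I also tried the melt and cast functions from the reshape2 package. They returned rows with three values, none of them correct, though I may well have done something wrong: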
test<-melt(HT2018, id="PNR")
test<-cast(test, QUAL_RATING + MARK ~ PNR)
structure(list(PNR = c(1234567890, 1234567890, 1234567890, 1234567890,
2345678901, 2345678901, 2345678901), ZIPCODE = c(46395L, 46395L,
46395L, 46395L, 42332L, 42332L, 42332L), SEL_CRITERION = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L), .Label = c("B5", "HB "), class = "factor"),
QUAL_RATING = c(55L, 55L, 55L, 55L, 2645L, 2645L, 2645L),
QUEUENUMBER = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), UPSEC_ID = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L), .Label = c("3SB", "HRF"), class = "factor"),
UPSEC_COURSE_ID = structure(c(7L, 3L, 5L, 1L, 2L, 4L, 6L), .Label = c("<null>",
"EN1201 ", "EN200 ", "MA1201 ", "MA200 ",
"SV1201 ", "SV203 "), class = "factor"), MARK = structure(c(2L,
3L, 3L, 1L, 3L, 3L, 3L), .Label = c("<null>", "G ", "VG "
), class = "factor"), COURSEOFFERING_ID = c(97113L, 97113L,
97113L, 97113L, 97113L, 97113L, 97113L), ADMISSIONROUND_ID = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "HT2018 ", class = "factor"),
RESULT = c(20L, 20L, 20L, 20L, 20L, 20L, 20L), WITHIN_PROGRAM = c(0L,
0L, 0L, 0L, 0L, 0L, 0L), SUMMA = structure(c(1L, 1L, 1L,
1L, 2L, 2L, 2L), .Label = c("67.5", "70.5"), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
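Answer 0 (score: 1)
Before summarising over the HB / B5 values (see the comment), you can use dcast() from data.table to cast the data into wide format, with one column per level of UPSEC_COURSE_ID filled with the corresponding MARK.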
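library(data.table)
dt <- as.data.table(HT2018)  # HT2018 is the data frame from the question
# One row per applicant, one column per UPSEC_COURSE_ID, filled with MARK
dt_betyg <- dcast(dt, PNR + ZIPCODE + UPSEC_ID + QUEUENUMBER + SEL_CRITERION + QUAL_RATING ~ UPSEC_COURSE_ID, value.var = "MARK")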
Result:
> dt_betyg
PNR ZIPCODE UPSEC_ID QUEUENUMBER SEL_CRITERION QUAL_RATING <null> EN1201 EN200 MA1201 MA200 SV1201 SV203
1: 1234567890 46395 HRF 0 HB 55 <null> <NA> VG <NA> VG <NA> G
2: 2345678901 42332 3SB 0 B5 2645 <NA> VG <NA> VG <NA> VG <NA>
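
The dcast() call above leaves SEL_CRITERION and QUAL_RATING as ordinary columns, whereas the question asks for one column per criterion (HB, B5). One possible way to get there, offered only as a sketch, is to cast twice and merge the two results on the id columns. The toy table below is typed in by hand from the dput() excerpt, and the names chr_cols, id_cols, marks, ratings and wide are made up for this illustration. The sketch assumes that each PNR has a single SEL_CRITERION / QUAL_RATING pair and that each course appears at most once per PNR; if that does not hold in the full data, dcast() will fall back to aggregating with length.

```r
library(data.table)

# Toy data mirroring the relevant columns of the question's dput() excerpt
# (note the trailing spaces in some values, as in the original factor levels)
dt <- data.table(
  PNR             = c(rep(1234567890, 4), rep(2345678901, 3)),
  ZIPCODE         = c(rep(46395L, 4), rep(42332L, 3)),
  SEL_CRITERION   = c(rep("HB ", 4), rep("B5", 3)),
  QUAL_RATING     = c(rep(55L, 4), rep(2645L, 3)),
  QUEUENUMBER     = 0L,
  UPSEC_ID        = c(rep("HRF", 4), rep("3SB", 3)),
  UPSEC_COURSE_ID = c("SV203 ", "EN200 ", "MA200 ", "<null>",
                      "EN1201 ", "MA1201 ", "SV1201 "),
  MARK            = c("G ", "VG ", "VG ", "<null>", "VG ", "VG ", "VG ")
)

# Strip stray whitespace so the new column names and values come out clean;
# as.character() also covers the case where these columns are factors
chr_cols <- c("SEL_CRITERION", "UPSEC_COURSE_ID", "MARK")
dt[, (chr_cols) := lapply(.SD, function(x) trimws(as.character(x))),
   .SDcols = chr_cols]

id_cols <- c("PNR", "ZIPCODE", "QUEUENUMBER", "UPSEC_ID")

# Cast 1: one column per course, filled with the mark
marks <- dcast(dt, PNR + ZIPCODE + QUEUENUMBER + UPSEC_ID ~ UPSEC_COURSE_ID,
               value.var = "MARK")

# Cast 2: one column per selection criterion, filled with the rating.
# unique() collapses the repeated criterion rows to one per applicant.
ratings <- dcast(
  unique(dt[, c(id_cols, "SEL_CRITERION", "QUAL_RATING"), with = FALSE]),
  PNR + ZIPCODE + QUEUENUMBER + UPSEC_ID ~ SEL_CRITERION,
  value.var = "QUAL_RATING"
)

# Join the two wide tables back together on the id columns
wide <- merge(ratings, marks, by = id_cols)
print(wide)
```

Rows whose course is "<null>" are kept here because the desired output in the question includes a <null> column; filtering them out before the first cast would drop that column instead.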