What is the best way to reshape data into wide format?

Time: 2019-10-14 14:30:11

Tags: r reshape reshape2

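I am trying to reshape my data into wide format, but I am fairly new to this, and so far none of the usual reshape functions seem to work. When I try it on the full data (this excerpt does seem to work, though), I get NA for every value, and the column names turn into strange numeric vectors.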

My data looks like this; PNR is the id variable that is unique to each observation.

PNR         ZIPCODE  SEL_CRITERION  QUAL_RATING  QUEUENUMBER  UPSEC_ID  UPSEC_COURSE_ID  MARK    COURSEOFFERING_ID  ADMISSIONROUND_ID  RESULT  WITHIN_PROGRAM  SUMMA
1234567890  46395    HB             55           0            HRF       SV203            G       97116              HT2019             20      0               67.5
1234567890  46395    HB             55           0            HRF       EN200            VG      97116              HT2019             20      0               67.5
1234567890  46395    HB             55           0            HRF       MA200            VG      97116              HT2019             20      0               67.5
1234567890  46395    HB             55           0            HRF       <null>           <null>  97116              HT2019             20      0               67.5
2345678901  42332    B5             2645         0            3SB       EN1201           VG      97116              HT2019             20      0               70.5
2345678901  42332    B5             2645         0            3SB       MA1201           VG      97113              HT2019             20      0               70.5
2345678901  42332    B5             2645         0            2SM       SV1201           VG      97113              HT2019             20      0               70.5

I would like it to look like this:

PNR ZIPCODE      HB  B5   QUEUENUMBER UPSEC_ID SV203 EN200 MA200 <null> EN1201 MA1201 SV1201
1234567890 46395 95  NA    0           HRF      G     VG    VG    NA     NA     NA     NA
2345678901 42332 NA  1645  0           3SB      NA    NA    NA    NA     VG     VG     VG

Is there any way to do this?

I tried the ordinary reshape function, which failed miserably (though not necessarily on this small excerpt):

test<-reshape(HT2018, idvar="PNR",timevar=c("SEL_CRITERION", "UPSEC_COURSE_ID"), v.names=c("QUAL_RATING","MARK"), direction = "wide")

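I also tried the melt and cast functions from the reshape2 package. That returned rows with three values, none of them correct, but I may well have done something wrong: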

test<-melt(HT2018, id="PNR")
test<-cast(test, QUAL_RATING + MARK ~ PNR)

structure(list(PNR = c(1234567890, 1234567890, 1234567890, 1234567890, 
2345678901, 2345678901, 2345678901), ZIPCODE = c(46395L, 46395L, 
46395L, 46395L, 42332L, 42332L, 42332L), SEL_CRITERION = structure(c(2L, 
2L, 2L, 2L, 1L, 1L, 1L), .Label = c("B5", "HB   "), class = "factor"), 
    QUAL_RATING = c(55L, 55L, 55L, 55L, 2645L, 2645L, 2645L), 
    QUEUENUMBER = c(0L, 0L, 0L, 0L, 0L, 0L, 0L), UPSEC_ID = structure(c(2L, 
    2L, 2L, 2L, 1L, 1L, 1L), .Label = c("3SB", "HRF"), class = "factor"), 
    UPSEC_COURSE_ID = structure(c(7L, 3L, 5L, 1L, 2L, 4L, 6L), .Label = c("<null>", 
    "EN1201     ", "EN200      ", "MA1201     ", "MA200      ", 
    "SV1201     ", "SV203      "), class = "factor"), MARK = structure(c(2L, 
    3L, 3L, 1L, 3L, 3L, 3L), .Label = c("<null>", "G  ", "VG "
    ), class = "factor"), COURSEOFFERING_ID = c(97113L, 97113L, 
    97113L, 97113L, 97113L, 97113L, 97113L), ADMISSIONROUND_ID = structure(c(1L, 
    1L, 1L, 1L, 1L, 1L, 1L), .Label = "HT2018    ", class = "factor"), 
    RESULT = c(20L, 20L, 20L, 20L, 20L, 20L, 20L), WITHIN_PROGRAM = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L), SUMMA = structure(c(1L, 1L, 1L, 
    1L, 2L, 2L, 2L), .Label = c("67.5", "70.5"), class = "factor")), class = "data.frame", row.names = c(NA, 
-7L))

1 answer:

Answer 0 (score: 1)

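Before doing any aggregation on the HB / B5 values (see comment), you can use dcast() from data.table to convert the data to wide format, spreading the MARK levels across one column per UPSEC_COURSE_ID.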

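library(data.table)
dt <- as.data.table(HT2018)  # assumes HT2018 is the data frame from the question
# id columns on the left of ~ ; one new column per UPSEC_COURSE_ID, filled with MARK
dt_betyg <- dcast(dt, PNR + ZIPCODE + UPSEC_ID + QUEUENUMBER + SEL_CRITERION + QUAL_RATING ~ UPSEC_COURSE_ID, value.var = "MARK")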

Result

> dt_betyg
          PNR ZIPCODE UPSEC_ID QUEUENUMBER SEL_CRITERION QUAL_RATING <null> EN1201      EN200       MA1201      MA200       SV1201      SV203      
1: 1234567890   46395      HRF           0         HB             55 <null>        <NA>         VG         <NA>         VG         <NA>         G  
2: 2345678901   42332      3SB           0            B5        2645   <NA>         VG         <NA>         VG         <NA>         VG         <NA>
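
To get closer to the layout requested in the question, the same dcast() idea can be applied a second time to SEL_CRITERION / QUAL_RATING and the two wide tables joined on PNR. The following is only a sketch under some assumptions: the small `dt` below is a hand-typed stand-in for the question's data, the padded labels are trimmed with trimws(), the "<null>" mark is treated as a missing value, and the names `chr_cols`, `marks`, `ratings` and `wide` are made up for illustration.

```r
library(data.table)

# Stand-in for the question's data (only the columns used here); the trailing
# spaces mimic the padded factor labels visible in the dput() output.
dt <- data.table(
  PNR             = c(1234567890, 1234567890, 1234567890, 1234567890,
                      2345678901, 2345678901, 2345678901),
  ZIPCODE         = c(46395L, 46395L, 46395L, 46395L, 42332L, 42332L, 42332L),
  SEL_CRITERION   = c("HB   ", "HB   ", "HB   ", "HB   ", "B5", "B5", "B5"),
  QUAL_RATING     = c(55L, 55L, 55L, 55L, 2645L, 2645L, 2645L),
  QUEUENUMBER     = 0L,
  UPSEC_ID        = c("HRF", "HRF", "HRF", "HRF", "3SB", "3SB", "3SB"),
  UPSEC_COURSE_ID = c("SV203      ", "EN200      ", "MA200      ", "<null>",
                      "EN1201     ", "MA1201     ", "SV1201     "),
  MARK            = c("G  ", "VG ", "VG ", "<null>", "VG ", "VG ", "VG ")
)

# 1. Strip the padding so the new column names come out clean, and turn the
#    "<null>" mark into a real NA (assumption: "<null>" means "no mark").
chr_cols <- c("SEL_CRITERION", "UPSEC_ID", "UPSEC_COURSE_ID", "MARK")
dt[, (chr_cols) := lapply(.SD, function(x) trimws(as.character(x))),
   .SDcols = chr_cols]
dt[MARK == "<null>", MARK := NA_character_]

# 2. One column per course, holding the mark.
marks <- dcast(dt, PNR + ZIPCODE + QUEUENUMBER + UPSEC_ID ~ UPSEC_COURSE_ID,
               value.var = "MARK")

# 3. One column per selection criterion, holding the rating. unique() drops
#    the rows that are repeated once per course.
ratings <- dcast(unique(dt[, .(PNR, SEL_CRITERION, QUAL_RATING)]),
                 PNR ~ SEL_CRITERION, value.var = "QUAL_RATING")

# 4. Join the two wide tables on the id column.
wide <- merge(ratings, marks, by = "PNR")
print(wide)
```

Note that any column placed on the left-hand side of the formula in step 2 must be constant within a PNR to get one row per PNR. If, for example, UPSEC_ID varies within a person in the full data, dcast() returns one row per PNR / UPSEC_ID combination, and that column would have to be dropped or spread separately.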