需要'重塑'数据框架

时间:2015-07-02 20:44:37

标签: r reshape

数据集:

      zip acs.pop napps    pperct cgrp zgrp      perc  
 1: 12007      97     2 2.0618557    2    1 25.000000  
 2: 12007      97     2 2.0618557   NA    2 50.000000  
 3: 12007      97     2 2.0618557    1    1 25.000000  
 4: 12008     485     2 0.4123711    2    1 33.333333  
 5: 12008     485     2 0.4123711    4    1 33.333333  
 6: 12008     485     2 0.4123711   NA    1 33.333333  
 7: 12009    7327   187 2.5522042    4   76 26.206897  
 8: 12009    7327   187 2.5522042    1   41 14.137931  
 9: 12009    7327   187 2.5522042    2   23  7.931034  
10: 12009    7327   187 2.5522042   NA  103 35.517241  
11: 12009    7327   187 2.5522042    3   47 16.206897  
12: 12010   28802   580 2.0137490   NA  275 32.163743  
13: 12010   28802   580 2.0137490    4  122 14.269006  
14: 12010   28802   580 2.0137490    1  269 31.461988  
15: 12010   28802   580 2.0137490    2   96 11.228070  
16: 12010   28802   580 2.0137490    3   93 10.877193  
17: 12018    7608   126 1.6561514    3   30 16.129032  
18: 12018    7608   126 1.6561514   NA   60 32.258065  
19: 12018    7608   126 1.6561514    2   14  7.526882  
20: 12018    7608   126 1.6561514    4   57 30.645161  
21: 12018    7608   126 1.6561514    1   25 13.440860  
22: 12019   14841   144 0.9702850   NA   62 30.097087  
23: 12019   14841   144 0.9702850    4   73 35.436893  
24: 12019   14841   144 0.9702850    3   30 14.563107  
25: 12019   14841   144 0.9702850    1   23 11.165049  
26: 12019   14841   144 0.9702850    2   18  8.737864  
27: 12020   31403   343 1.0922523    3   76 14.960630  
28: 12020   31403   343 1.0922523    1   88 17.322835  
29: 12020   31403   343 1.0922523    2   38  7.480315  
30: 12020   31403   343 1.0922523    4  141 27.755906  
31: 12020   31403   343 1.0922523   NA  165 32.480315  
32: 12022    1002     5 0.4990020   NA    4 44.444444  
33: 12022    1002     5 0.4990020    4    2 22.222222  
34: 12022    1002     5 0.4990020    3    1 11.111111  
35: 12022    1002     5 0.4990020    1    1 11.111111 

我知道reshape2或reshape包可以处理这个问题,但我不确定如何处理。我需要最终输出看起来像这样:

zip  acs.pop    napps   pperct  zgrp4   zgrp3   zgrp2   zgrp1   perc4   perc3   perc2   perc1  
12009   7327    187 2.5522042   76  47  23  41  26.206897   16.206897   7.931034    14.137931

zip是id
acs.pop,napps,pperct对于每个zip组都是相同的 zgrp4 ... zgrp1是cgrp
每个值的zgrp值 perc4 ... perc1是cgrp

的每个值的perc值

2 个答案:

答案 0 :(得分:5)

我们可以从dcast的devel版本尝试data.table,它可以包含多个value.var列。在这种情况下,我们有' zgrp'和' perc'是值列。使用分组变量,我们创建一个序列变量(' ind'),然后使用dcast转换来自' long'广泛的'格式。

安装devel版本的说明是here

 library(data.table)#v1.9.5
 setDT(df1)[, ind:= 1:.N, .(zip, acs.pop, napps, pperct)]
 dcast(df1, zip+acs.pop + napps+pperct~ind, value.var=c('zgrp', 'perc'))
 #     zip acs.pop napps    pperct 1_zgrp 2_zgrp 3_zgrp 4_zgrp 5_zgrp   1_perc
#1: 12007      97     2 2.0618557      1      2      1     NA     NA 25.00000
#2: 12008     485     2 0.4123711      1      1      1     NA     NA 33.33333
#3: 12009    7327   187 2.5522042     76     41     23    103     47 26.20690
#4: 12010   28802   580 2.0137490    275    122    269     96     93 32.16374
#5: 12018    7608   126 1.6561514     30     60     14     57     25 16.12903
#6: 12019   14841   144 0.9702850     62     73     30     23     18 30.09709
#7: 12020   31403   343 1.0922523     76     88     38    141    165 14.96063
#8: 12022    1002     5 0.4990020      4      2      1      1     NA 44.44444
#     2_perc    3_perc   4_perc    5_perc
#1: 50.00000 25.000000       NA        NA
#2: 33.33333 33.333333       NA        NA
#3: 14.13793  7.931034 35.51724 16.206897
#4: 14.26901 31.461988 11.22807 10.877193
#5: 32.25807  7.526882 30.64516 13.440860
#6: 35.43689 14.563107 11.16505  8.737864
#7: 17.32284  7.480315 27.75591 32.480315
#8: 22.22222 11.111111 11.11111        NA

或者我们可以使用ave/reshape

中的base R
 df2 <- transform(df1, ind=ave(seq_along(zip), zip,
                 acs.pop, napps, pperct, FUN=seq_along))
 reshape(df2, idvar=c('zip', 'acs.pop', 'napps', 'pperct'),
                               timevar='ind', direction='wide')

答案 1 :(得分:1)

这是spread()tidyr用途。

df %>% filter(!is.na(cgrp)) %>% # if cgrp is missing I don't know where to put the obs
  gather(Var, Val,6:7) %>% # one row per measure (zgrp OR perc) observed
  group_by(zip, acs.pop, napps, pperct) %>% # unique combos of these will define rows in output
  unite(Var1,Var,cgrp) %>% # indentify which obs for which measure
  spread(Var1, Val) # make columns for zgrp_1, _2, etc., perc1,2, etc

示例输出:

> df2[df2$zip==12009,]
Source: local data frame [1 x 12]

    zip acs.pop napps   pperct   perc_1   perc_2  perc_3  perc_4 zgrp_1 zgrp_2 zgrp_3 zgrp_4
1 12009    7327   187 2.552204 14.13793 7.931034 16.2069 26.2069     41     23     47     76

感谢@akrun的帮助