data.table dcast column headings

Time: 2017-09-14 16:37:38

Tags: r data.table heading dcast

I have a data.table of the form

ID  REGION  INCOME_BAND RESIDENCY_YEARS
1   SW  Under 5,000 10-15
2   Wales   Over 70,000 1-5
3   Center  15,000-19,999   6-9
4   SE  15,000-19,999   15-19
5   North   15,000-19,999   10-15
6   North   15,000-19,999   6-9

created with

library(data.table)

exp = data.table(
  ID = c(1,2,3,4,5,6),
  REGION=c("SW", "Wales", "Center", "SE", "North", "North"),
  INCOME_BAND = c("Under ?5,000", "Over ?70,000", "?15,000-?19,999", "?15,000-?19,999", "?15,000-?19,999","?15,000-?19,999"),
  RESIDENCY_YEARS = c("10-15","1-5","6-9","15-19","10-15", "6-9"))

I would like to convert it to

[Image: Example of the result of any data table manipulation]

I have managed to do most of the work with dcast:

exp.dcast = dcast(exp,ID~REGION+INCOME_BAND+RESIDENCY_YEARS, fun=length,
  value.var=c('REGION', 'INCOME_BAND', 'RESIDENCY_YEARS'))

But I need some help creating sensible column headings. Currently I have

  

"ID"
"REGION.1_Center_?15,000-?19,999_6-9"
"REGION.1_North_?15,000-?19,999_10-15"
"REGION.1_North_?15,000-?19,999_6-9"
"REGION.1_SE_?15,000-?19,999_15-19"
"REGION.1_SW_Under ?5,000_10-15"
"REGION.1_Wales_Over ?70,000_1-5"
"INCOME_BAND.1_Center_?15,000-?19,999_6-9"
"INCOME_BAND.1_North_?15,000-?19,999_10-15"
"INCOME_BAND.1_North_?15,000-?19,999_6-9"
"INCOME_BAND.1_SE_?15,000-?19,999_15-19"
"INCOME_BAND.1_SW_Under ?5,000_10-15"
"INCOME_BAND.1_Wales_Over ?70,000_1-5"
"RESIDENCY_YEARS.1_Center_?15,000-?19,999_6-9"
"RESIDENCY_YEARS.1_North_?15,000-?19,999_10-15"
"RESIDENCY_YEARS.1_North_?15,000-?19,999_6-9"
"RESIDENCY_YEARS.1_SE_?15,000-?19,999_15-19"
"RESIDENCY_YEARS.1_SW_Under ?5,000_10-15"
"RESIDENCY_YEARS.1_Wales_Over ?70,000_1-5"

I would like the column headings to be

ID  SW  Wales   Center  SE  North   Under 5,000 Over 70,000 15,000-19,999   1-5 6-9 10-15   15-19

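Can anyone advise?

1 Answer:

Answer 0 (score: 0)

This seemingly simple question is not easy to answer, so we will proceed step by step.

First, the OP is trying to reshape multiple value columns at once, which creates an unwanted cross product of all available combinations.

To treat all values the same way, we need to melt() all value columns before reshaping: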

melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)]
   ID 1-5 10-15 15-19 6-9 ?15,000-?19,999 Center North Over ?70,000 SE SW Under ?5,000 Wales
1:  1   0     1     0   0               0      0     0            0  0  1            1     0
2:  2   1     0     0   0               0      0     0            1  0  0            0     1
3:  3   0     0     0   1               1      1     0            0  0  0            0     0
4:  4   0     0     1   0               1      0     0            0  1  0            0     0
5:  5   0     1     0   0               1      0     1            0  0  0            0     0
6:  6   0     0     0   1               1      0     1            0  0  0            0     0

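Now the result has 13 columns instead of 19, and each column is named after the value it counts.

Unfortunately, the columns appear in the wrong order because they are sorted alphabetically. There are two ways to change the order:

Changing the column order after reshaping

The setcolorder() function reorders the columns of a data.table by reference, i.e., without copying: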
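
The column count can be checked directly. The following is a minimal sketch that rebuilds the sample data from the question under the name `dt` (an assumption made here to avoid masking base R's `exp()` function) and counts the columns after melting and casting:

```r
library(data.table)

# Sample data from the question; `dt` is a name chosen for this sketch
dt <- data.table(
  ID = 1:6,
  REGION = c("SW", "Wales", "Center", "SE", "North", "North"),
  INCOME_BAND = c("Under ?5,000", "Over ?70,000", "?15,000-?19,999",
                  "?15,000-?19,999", "?15,000-?19,999", "?15,000-?19,999"),
  RESIDENCY_YEARS = c("10-15", "1-5", "6-9", "15-19", "10-15", "6-9")
)

# Long format: one row per ID and measure column (6 IDs x 3 columns)
long <- melt(dt, id.vars = "ID")

# Wide format: one indicator column per distinct value
wide <- dcast(long, ID ~ value, fun.aggregate = length)

ncol(wide)  # ID plus 5 regions, 3 income bands, and 4 residency ranges
```

Each row of `wide` should contain exactly three ones, one for each of the original measure columns.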

# define column order = order of values
col_order <- c("North", "Wales", "Center", "SW", "SE", "Under ?5,000", "?15,000-?19,999", "Over ?70,000", "1-5", "6-9", "10-15", "15-19")

melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)][
  # reorder columns
  , setcolorder(.SD, c("ID", col_order))]
   ID North Wales Center SW SE Under ?5,000 ?15,000-?19,999 Over ?70,000 1-5 6-9 10-15 15-19
1:  1     0     0      0  1  0            1               0            0   0   0     1     0
2:  2     0     1      0  0  0            0               0            1   1   0     0     0
3:  3     0     0      1  0  0            0               1            0   0   1     0     0
4:  4     0     0      0  0  1            0               1            0   0   0     0     1
5:  5     1     0      0  0  0            0               1            0   0   0     1     0
6:  6     1     0      0  0  0            0               1            0   0   1     0     0

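Now all REGION columns appear first, followed by the INCOME_BAND and RESIDENCY_YEARS columns in the specified order.

Setting factor levels before reshaping

If value is converted to a factor with appropriately ordered factor levels, dcast() will use the factor levels to order the columns: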
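
As an alternative to chaining, setcolorder() can also be applied to the stored result. This is a sketch of that variant rather than the answer's own code; the names `dt` and `wide` are assumptions introduced for illustration:

```r
library(data.table)

# Sample data from the question; `dt` is a name chosen for this sketch
dt <- data.table(
  ID = 1:6,
  REGION = c("SW", "Wales", "Center", "SE", "North", "North"),
  INCOME_BAND = c("Under ?5,000", "Over ?70,000", "?15,000-?19,999",
                  "?15,000-?19,999", "?15,000-?19,999", "?15,000-?19,999"),
  RESIDENCY_YEARS = c("10-15", "1-5", "6-9", "15-19", "10-15", "6-9")
)

# Desired column order, as defined in the answer
col_order <- c("North", "Wales", "Center", "SW", "SE",
               "Under ?5,000", "?15,000-?19,999", "Over ?70,000",
               "1-5", "6-9", "10-15", "15-19")

wide <- dcast(melt(dt, id.vars = "ID"), ID ~ value, fun.aggregate = length)

# Reorders the columns of `wide` by reference; no assignment is needed
setcolorder(wide, c("ID", col_order))
names(wide)
```

Storing the intermediate result keeps the reshaping and the reordering as two separate, inspectable steps.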

melt(exp, id.vars = "ID")[, value := factor(value, col_order)][
  , dcast(.SD, ID ~ value, length)]
   ID North Wales Center SW SE Under ?5,000 ?15,000-?19,999 Over ?70,000 1-5 6-9 10-15 15-19
1:  1     0     0      0  1  0            1               0            0   0   0     1     0
2:  2     0     1      0  0  0            0               0            1   1   0     0     0
3:  3     0     0      1  0  0            0               1            0   0   1     0     0
4:  4     0     0      0  0  1            0               1            0   0   0     0     1
5:  5     1     0      0  0  0            0               1            0   0   0     1     0
6:  6     1     0      0  0  0            0               1            0   0   1     0     0
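
One caveat with this approach: factor() turns any value that is missing from the given levels into NA. A small sketch of a guard that checks col_order against the data before converting, assuming the sample data from the question stored as `dt` (a name chosen here):

```r
library(data.table)

# Sample data from the question; `dt` is a name chosen for this sketch
dt <- data.table(
  ID = 1:6,
  REGION = c("SW", "Wales", "Center", "SE", "North", "North"),
  INCOME_BAND = c("Under ?5,000", "Over ?70,000", "?15,000-?19,999",
                  "?15,000-?19,999", "?15,000-?19,999", "?15,000-?19,999"),
  RESIDENCY_YEARS = c("10-15", "1-5", "6-9", "15-19", "10-15", "6-9")
)

col_order <- c("North", "Wales", "Center", "SW", "SE",
               "Under ?5,000", "?15,000-?19,999", "Over ?70,000",
               "1-5", "6-9", "10-15", "15-19")

long <- melt(dt, id.vars = "ID")

# Values present in the data but absent from col_order would become NA
missing_levels <- setdiff(unique(long$value), col_order)
stopifnot(length(missing_levels) == 0L)

long[, value := factor(value, levels = col_order)]
wide <- dcast(long, ID ~ value, fun.aggregate = length)
```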

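Setting factor levels before reshaping - lazy version

If it is sufficient to have the columns grouped by REGION, INCOME_BAND, and RESIDENCY_YEARS, we can use a shortcut to avoid specifying each value in col_order. The fct_inorder() function from the forcats package orders factor levels by their first appearance in a vector:

melt(exp, id.vars = "ID")[, value := forcats::fct_inorder(value)][
  , dcast(.SD, ID ~ value, length)]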
   ID SW Wales Center SE North Under ?5,000 Over ?70,000 ?15,000-?19,999 10-15 1-5 6-9 15-19
1:  1  1     0      0  0     0            1            0               0     1   0   0     0
2:  2  0     1      0  0     0            0            1               0     0   1   0     0
3:  3  0     0      1  0     0            0            0               1     0   0   1     0
4:  4  0     0      0  1     0            0            0               1     0   0   0     1
5:  5  0     0      0  0     1            0            0               1     1   0   0     0
6:  6  0     0      0  0     1            0            0               1     0   0   1     0

This works because the output of melt() is ordered by variable:

melt(exp, id.vars = "ID")
    ID        variable           value
 1:  1          REGION              SW
 2:  2          REGION           Wales
 3:  3          REGION          Center
 4:  4          REGION              SE
 5:  5          REGION           North
 6:  6          REGION           North
 7:  1     INCOME_BAND    Under ?5,000
 8:  2     INCOME_BAND    Over ?70,000
 9:  3     INCOME_BAND ?15,000-?19,999
10:  4     INCOME_BAND ?15,000-?19,999
11:  5     INCOME_BAND ?15,000-?19,999
12:  6     INCOME_BAND ?15,000-?19,999
13:  1 RESIDENCY_YEARS           10-15
14:  2 RESIDENCY_YEARS             1-5
15:  3 RESIDENCY_YEARS             6-9
16:  4 RESIDENCY_YEARS           15-19
17:  5 RESIDENCY_YEARS           10-15
18:  6 RESIDENCY_YEARS             6-9
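
If adding forcats as a dependency is undesirable, the same first-appearance ordering can be obtained in base R with factor(x, levels = unique(x)). This is a sketch of that substitution, not part of the original answer; `dt`, `long`, and `wide` are names chosen here:

```r
library(data.table)

# Sample data from the question; `dt` is a name chosen for this sketch
dt <- data.table(
  ID = 1:6,
  REGION = c("SW", "Wales", "Center", "SE", "North", "North"),
  INCOME_BAND = c("Under ?5,000", "Over ?70,000", "?15,000-?19,999",
                  "?15,000-?19,999", "?15,000-?19,999", "?15,000-?19,999"),
  RESIDENCY_YEARS = c("10-15", "1-5", "6-9", "15-19", "10-15", "6-9")
)

long <- melt(dt, id.vars = "ID")

# unique() keeps values in order of first appearance, so the levels follow
# the row order of the melted table (REGION, INCOME_BAND, RESIDENCY_YEARS)
long[, value := factor(value, levels = unique(value))]

wide <- dcast(long, ID ~ value, fun.aggregate = length)
names(wide)
```

Within each group the columns follow the order in which the values occur in the data, so 10-15 comes before 1-5 here, just as in the fct_inorder() output.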