根据数据中其他变量的分布自动创建新变量

时间:2020-06-16 10:41:57

标签: r data.table

我的数据如下:

编辑:

原始数据样本

DT <- structure(list(Abbreviation = "AK", date = "1/31/2011", month = "01", 
year = "2011", c1 = "P", male = 12288, female = 6107, c4 = 2, 
upto22 = 870, from22to24 = 1441, from25to34 = 5320, from35to44 = 3568, 
from45to54 = 4322, from55to59 = 1539, from60to64 = 886, over65 = 451, 
c20 = 0, hispanic = 771, non_hispanic = 17458, c42 = 168, 
native = 4856, asian = 791, black = 611, hawaii = 289, white = 11209, 
c48 = 641), row.names = c(NA, -1L), class = c("data.table", 
"data.frame"))

原始数据的熔融样本

DT <- structure(list(Abbreviation = c("AK", "AK", "AK", "AK", "AK", 
"AK", "AK", "AK", "AK", "AK"), date = c("1/31/2011", "10/31/2011", 
"11/30/2011", "12/31/2010", "4/30/2005", "2/28/2011", "3/31/2011", 
"4/30/2011", "5/31/2011", "6/30/2011"), year = c("2011", "2011", 
"2011", "2010", "2005", "2011", "2011", "2011", "2011", "2011"
), c1 = c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P"), 
    State = c("Alaska", "Alaska", "Alaska", "Alaska", "Alaska", 
    "Alaska", "Alaska", "Alaska", "Alaska", "Alaska"), month = c("01", 
    "10", "11", "12", "04", "02", "03", "04", "05", "06"), total = c(18395, 
    10654, 14113, 16248, 14029, 17915, 17152, 15543, 13325, 11637
    ), variable = structure(c(1L, 2L, 4L, 5L, 13L, 17L, 18L, 20L, 
    1L, 1L), .Label = c("male", "female", "c4", "upto22", "from22to24", 
    "from25to34", "from35to44", "from45to54", "from55to59", "from60to64", 
    "over65", "c20", "hispanic", "non_nispanic", "c42", "native", 
    "asian", "black", "hawaii", "white", "c48", "c49", "c50", 
    "c87", "c88", "c89", "c90", "c91", "c92", "c93"), class = "factor"), 
    value = c(12288, 5863, 8500, 10508, 8860, 12060, 11594, 9997, 
    8158, 6294)), row.names = c(NA, -10L), class = c("data.table", 
"data.frame"))

    Abbreviation       date year c1  State month total   variable value
 1:           AK  1/31/2011 2011  P Alaska    01 18395       male 12288
 2:           AK 10/31/2011 2011  P Alaska    10 10654     female  5863
 3:           AK 11/30/2011 2011  P Alaska    11 14113     upto22  8500
 4:           AK 12/31/2010 2010  P Alaska    12 16248 from22to24 10508
 5:           AK  4/30/2005 2005  P Alaska    04 14029   hispanic  8860
 6:           AK  2/28/2011 2011  P Alaska    02 17915      asian 12060
 7:           AK  3/31/2011 2011  P Alaska    03 17152      black 11594
 8:           AK  4/30/2011 2011  P Alaska    04 15543      white  9997
 9:           AK  5/31/2011 2011  P Alaska    05 13325       male  8158
10:           AK  6/30/2011 2011  P Alaska    06 11637       male  6294

variable具有三组变量。它们是sexageethnicity。所有组的总数总计(或多或少)相同。因此,male + female == totalblack + white + asian == total等。我想创建新变量,例如asian_male_upto22,该变量将是亚洲​​人的数量,乘以男性与总数的比例以及各个年龄段占总数的比例。

我正在寻找一种使该过程自动化的方法,但是我一直想办法。

我在考虑先将变量分配给组(A <- c("male", "female")),然后从中计算每个组的比率,但这似乎有些混乱。

有人能指出我正确的方向吗?

1 个答案:

答案 0 :(得分:1)

这是一个具有挑战性的问题。这就是我想出的(但是我确信还有改进的空间)。

如果我理解正确,则宽格式的数据集包含4个变量 sexageraceethnicity,例如,sex可以假定 femalemaleNA,依此类推。第6到26列包含每个value counts 变量不包括在内,但需要添加以构造的组。 As mentioned by the OP,列c4c20c42c48包含NA个计数,这些计数加到前面各列的计数中。 / p>

该处理包括几个步骤。步骤的第一部分将预处理数据,第二部分将创建新变量。

创建新变量有两种方法:

  • cross join
  • 或使用Reduce()递归。

预处理

(1)创建一个查找表,以将列和与它们所属的变量相关联。

lut <- data.table(value = names(DT))[
  , variable := value %>% 
    shift() %>% 
    like("c\\d{1,2}") %>% 
    cumsum() %>% 
    add(1L) %>% 
    extract(c("id", "sex", "age", "race", "ethn"),. )][]
lut
           value variable
 1: Abbreviation       id
 2:         date       id
 3:        month       id
 4:         year       id
 5:           c1       id
 6:         male      sex
 7:       female      sex
 8:           c4      sex
 9:       upto22      age
10:   from22to24      age
11:   from25to34      age
12:   from35to44      age
13:   from45to54      age
14:   from55to59      age
15:   from60to64      age
16:       over65      age
17:          c20      age
18:     hispanic     race
19: non_hispanic     race
20:          c42     race
21:       native     ethn
22:        asian     ethn
23:        black     ethn
24:       hawaii     ethn
25:        white     ethn
26:          c48     ethn
           value variable

(2)将字符date强制转换为数字日期,以省去多余的yearmonth列。此外,数字日期更便于订购或绘图。

DT[, date := as.IDate(date, "%m/%d/%Y")]

(3)将数据集从宽格式重整为长格式,从而删除列yearmonthc1。 (请不要将value.name函数的参数variable.namemelt()与我对变量 value 的定义混淆。)< br /> (4)通过 update join 将匹配的变量附加到每个
(5)将模糊的 c4c20等替换为NA
(6)将每个组的总数追加(包括NA个计数)。

long <- 
  melt(DT[, !c("year", "month", "c1")], id.vars = c("Abbreviation", "date"),
       value.name = "count", variable.name = "value")[
         lut, on = .(value), variable := i.variable][
           value %like% "c\\d{1,2}", value := NA][
             , total := sum(count), by = .(Abbreviation, date, variable)][]
long
    Abbreviation       date        value count variable total
 1:           AK 2011-01-31         male 12288      sex 18397
 2:           ZZ 2011-01-31         male 12298      sex 18427
 3:           AK 2011-01-31       female  6107      sex 18397
 4:           ZZ 2011-01-31       female  6117      sex 18427
 5:           AK 2011-01-31         <NA>     2      sex 18397
 6:           ZZ 2011-01-31         <NA>    12      sex 18427
 7:           AK 2011-01-31       upto22   870      age 18397
 8:           ZZ 2011-01-31       upto22   880      age 18487
 9:           AK 2011-01-31   from22to24  1441      age 18397
10:           ZZ 2011-01-31   from22to24  1451      age 18487
11:           AK 2011-01-31   from25to34  5320      age 18397
12:           ZZ 2011-01-31   from25to34  5330      age 18487
... 
31:           AK 2011-01-31       native  4856     ethn 18397
32:           ZZ 2011-01-31       native  4866     ethn 18457
33:           AK 2011-01-31        asian   791     ethn 18397
34:           ZZ 2011-01-31        asian   801     ethn 18457
35:           AK 2011-01-31        black   611     ethn 18397
36:           ZZ 2011-01-31        black   621     ethn 18457
37:           AK 2011-01-31       hawaii   289     ethn 18397
38:           ZZ 2011-01-31       hawaii   299     ethn 18457
39:           AK 2011-01-31        white 11209     ethn 18397
40:           ZZ 2011-01-31        white 11219     ethn 18457
41:           AK 2011-01-31         <NA>   641     ethn 18397
42:           ZZ 2011-01-31         <NA>   651     ethn 18457
    Abbreviation       date        value count variable total

通过交叉连接

创建新变量

(7)通过交叉连接 CJ()创建新变量的名称。交叉联接还将包括Abbreviationdate

new_vars <- 
  long[!is.na(value), CJ(Abbreviation, 
                         date, 
                         ethn = .SD[variable == "ethn", value], 
                         sex = .SD[variable == "sex", value], 
                         age = .SD[variable == "age", value],
                         unique = TRUE)][
                           , new.var := paste(ethn, sex, age, sep = "_")][]
new_vars
     Abbreviation       date   ethn    sex        age                 new.var
  1:           AK 2011-01-31 native   male     upto22      native_male_upto22
  2:           AK 2011-01-31 native   male from22to24  native_male_from22to24
  3:           AK 2011-01-31 native   male from25to34  native_male_from25to34
  4:           AK 2011-01-31 native   male from35to44  native_male_from35to44
  5:           AK 2011-01-31 native   male from45to54  native_male_from45to54
 ---                                                                         
156:           ZZ 2011-01-31  white female from35to44 white_female_from35to44
157:           ZZ 2011-01-31  white female from45to54 white_female_from45to54
158:           ZZ 2011-01-31  white female from55to59 white_female_from55to59
159:           ZZ 2011-01-31  white female from60to64 white_female_from60to64
160:           ZZ 2011-01-31  white female     over65     white_female_over65

(8)将new_vars重塑为长格式。这是必需的,因为原始数据集也已重塑为长格式。

lnv <- melt(new_vars, id.vars = c("Abbreviation", "date", "new.var"))
lnv
     Abbreviation       date                 new.var variable      value
  1:           AK 2011-01-31      native_male_upto22     ethn     native
  2:           AK 2011-01-31  native_male_from22to24     ethn     native
  3:           AK 2011-01-31  native_male_from25to34     ethn     native
  4:           AK 2011-01-31  native_male_from35to44     ethn     native
  5:           AK 2011-01-31  native_male_from45to54     ethn     native
 ---                                                                    
476:           ZZ 2011-01-31 white_female_from35to44      age from35to44
477:           ZZ 2011-01-31 white_female_from45to54      age from45to54
478:           ZZ 2011-01-31 white_female_from55to59      age from55to59
479:           ZZ 2011-01-31 white_female_from60to64      age from60to64
480:           ZZ 2011-01-31     white_female_over65      age     over65

(9)通过将new.varlnv右连接来追加long
(10)通过Abbreviationdatenew.var进行汇总,从而将计数相乘并除以总数(以获得份额)。

long[lnv, on = .(Abbreviation, date, variable, value)][
  , .(new.count = prod(count)/first(total)^2), by = .(Abbreviation, date, new.var)]
     Abbreviation       date                 new.var new.count
  1:           AK 2011-01-31      native_male_upto22 153.38579
  2:           AK 2011-01-31  native_male_from22to24 254.05623
  3:           AK 2011-01-31  native_male_from25to34 937.94527
  4:           AK 2011-01-31  native_male_from35to44 629.05803
  5:           AK 2011-01-31  native_male_from45to54 761.99238
 ---                                                          
156:           ZZ 2011-01-31 white_female_from35to44 720.79330
157:           ZZ 2011-01-31 white_female_from45to54 872.68769
158:           ZZ 2011-01-31 white_female_from55to59 312.04830
159:           ZZ 2011-01-31 white_female_from60to64 180.50050
160:           ZZ 2011-01-31     white_female_over65  92.86912

或者:使用Reduce()

递归地创建新变量

步骤(7)至(10)可以通过递归联接子集来代替。

如果手动完成,则外观如下:

long[!is.na(value) & variable == "ethn"][
  long[!is.na(value) & variable == "sex"], on = .(Abbreviation, date), allow.cartesian = TRUE,
  .(Abbreviation, date, value = paste(value, i.value, sep ="_"), count = count * i.count / i.total)][
    long[!is.na(value) & variable == "age"], on = .(Abbreviation, date), allow.cartesian = TRUE,
    .(Abbreviation, date, value = paste(value, i.value, sep ="_"), count = count * i.count / i.total)]

首先,将变量 ethn的子集与变量 sex的子集(外部联接)结合起来,从而计算出新变量名和新计数的一部分。然后将临时结果与变量 age的te子集合并,从而最终计算出新的变量名和新的计数。

这可以更一般地写为

join_fct <-   function(x, y) {
  x[y, on = .(Abbreviation, date), allow.cartesian = TRUE,
    .(Abbreviation, 
      date, 
      value = paste(value, i.value, sep ="_"), 
      count = count * i.count / i.total)]
}

Reduce(join_fct, 
       lapply(c("ethn", "sex", "age"), 
              function(x) long[!is.na(value) & variable == x])
)
     Abbreviation       date                value      count
  1:           AK 2011-01-31   native_male_upto22 153.385786
  2:           AK 2011-01-31    asian_male_upto22  24.985205
  3:           AK 2011-01-31    black_male_upto22  19.299571
  4:           AK 2011-01-31   hawaii_male_upto22   9.128602
  5:           AK 2011-01-31    white_male_upto22 354.057100
 ---                                                        
156:           ZZ 2011-01-31 native_female_over65  40.280090
157:           ZZ 2011-01-31  asian_female_over65   6.630570
158:           ZZ 2011-01-31  black_female_over65   5.140554
159:           ZZ 2011-01-31 hawaii_female_over65   2.475082
160:           ZZ 2011-01-31  white_female_over65  92.869365

此方法非常灵活,因为变量的数量和顺序可以轻松更改,例如,

Reduce(join_fct, lapply(c(“ race”,“ sex”), function(x)long [!is.na(value)&variable == x]) )

   Abbreviation       date               value      count
1:           AK 2011-01-31       hispanic_male   514.9779
2:           AK 2011-01-31   non_hispanic_male 11660.8090
3:           ZZ 2011-01-31       hispanic_male   521.2318
4:           ZZ 2011-01-31   non_hispanic_male 11657.9728
5:           AK 2011-01-31     hispanic_female   255.9383
6:           AK 2011-01-31 non_hispanic_female  5795.2930
7:           ZZ 2011-01-31     hispanic_female   259.2596
8:           ZZ 2011-01-31 non_hispanic_female  5798.6518

数据

OP已在原始数据集的行上提供(采用宽格式)

DT <- structure(list(Abbreviation = "AK", date = "1/31/2011", month = "01", 
                     year = "2011", c1 = "P", male = 12288, female = 6107, c4 = 2, 
                     upto22 = 870, from22to24 = 1441, from25to34 = 5320, from35to44 = 3568, 
                     from45to54 = 4322, from55to59 = 1539, from60to64 = 886, over65 = 451, 
                     c20 = 0, hispanic = 771, non_hispanic = 17458, c42 = 168, 
                     native = 4856, asian = 791, black = 611, hawaii = 289, white = 11209, 
                     c48 = 641), row.names = c(NA, -1L), class = c("data.table", "data.frame"))

但是,要验证代码是否按预期工作,我需要第二行。所以我在第二行附加了:

library(data.table)
DT <- rbind(DT, DT)
DT[2, (6:ncol(DT)) := lapply(.SD, `+`, y = 10), .SDcols = 6:ncol(DT)]
DT[2, Abbreviation := "ZZ"]
DT
   Abbreviation      date month year c1  male female c4 upto22 from22to24 from25to34
1:           AK 1/31/2011    01 2011  P 12288   6107  2    870       1441       5320
2:           ZZ 1/31/2011    01 2011  P 12298   6117 12    880       1451       5330
   from35to44 from45to54 from55to59 from60to64 over65 c20 hispanic non_hispanic c42 native
1:       3568       4322       1539        886    451   0      771        17458 168   4856
2:       3578       4332       1549        896    461  10      781        17468 178   4866
   asian black hawaii white c48
1:   791   611    289 11209 641
2:   801   621    299 11219 651