我的数据如下:
DT <- structure(list(Abbreviation = "AK", date = "1/31/2011", month = "01",
year = "2011", c1 = "P", male = 12288, female = 6107, c4 = 2,
upto22 = 870, from22to24 = 1441, from25to34 = 5320, from35to44 = 3568,
from45to54 = 4322, from55to59 = 1539, from60to64 = 886, over65 = 451,
c20 = 0, hispanic = 771, non_hispanic = 17458, c42 = 168,
native = 4856, asian = 791, black = 611, hawaii = 289, white = 11209,
c48 = 641), row.names = c(NA, -1L), class = c("data.table",
"data.frame"))
DT <- structure(list(Abbreviation = c("AK", "AK", "AK", "AK", "AK",
"AK", "AK", "AK", "AK", "AK"), date = c("1/31/2011", "10/31/2011",
"11/30/2011", "12/31/2010", "4/30/2005", "2/28/2011", "3/31/2011",
"4/30/2011", "5/31/2011", "6/30/2011"), year = c("2011", "2011",
"2011", "2010", "2005", "2011", "2011", "2011", "2011", "2011"
), c1 = c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P"),
State = c("Alaska", "Alaska", "Alaska", "Alaska", "Alaska",
"Alaska", "Alaska", "Alaska", "Alaska", "Alaska"), month = c("01",
"10", "11", "12", "04", "02", "03", "04", "05", "06"), total = c(18395,
10654, 14113, 16248, 14029, 17915, 17152, 15543, 13325, 11637
), variable = structure(c(1L, 2L, 4L, 5L, 13L, 17L, 18L, 20L,
1L, 1L), .Label = c("male", "female", "c4", "upto22", "from22to24",
"from25to34", "from35to44", "from45to54", "from55to59", "from60to64",
"over65", "c20", "hispanic", "non_nispanic", "c42", "native",
"asian", "black", "hawaii", "white", "c48", "c49", "c50",
"c87", "c88", "c89", "c90", "c91", "c92", "c93"), class = "factor"),
value = c(12288, 5863, 8500, 10508, 8860, 12060, 11594, 9997,
8158, 6294)), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
Abbreviation date year c1 State month total variable value
1: AK 1/31/2011 2011 P Alaska 01 18395 male 12288
2: AK 10/31/2011 2011 P Alaska 10 10654 female 5863
3: AK 11/30/2011 2011 P Alaska 11 14113 upto22 8500
4: AK 12/31/2010 2010 P Alaska 12 16248 from22to24 10508
5: AK 4/30/2005 2005 P Alaska 04 14029 hispanic 8860
6: AK 2/28/2011 2011 P Alaska 02 17915 asian 12060
7: AK 3/31/2011 2011 P Alaska 03 17152 black 11594
8: AK 4/30/2011 2011 P Alaska 04 15543 white 9997
9: AK 5/31/2011 2011 P Alaska 05 13325 male 8158
10: AK 6/30/2011 2011 P Alaska 06 11637 male 6294
列variable
具有三组变量。它们是sex
,age
和ethnicity
。所有组的总数总计(或多或少)相同。因此,male + female == total
,black + white + asian == total
等。我想创建新变量,例如asian_male_upto22
,该变量将是亚洲人的数量,乘以男性与总数的比例以及各个年龄段占总数的比例。
我正在寻找一种使该过程自动化的方法,但是我一直想办法。
我在考虑先将变量分配给组(A <- c("male", "female")
),然后从中计算每个组的比率,但这似乎有些混乱。
有人能指出我正确的方向吗?
答案 0 :(得分:1)
这是一个具有挑战性的问题。这就是我想出的(但是我确信还有改进的空间)。
如果我理解正确,则宽格式的数据集包含4个变量 sex
,age
,race
和ethnicity
,例如,sex
可以假定值 female
,male
或NA
,依此类推。第6到26列包含每个value
的 counts 。 变量不包括在内,但需要添加以构造值的组。 As mentioned by the OP,列c4
,c20
,c42
,c48
包含NA
个计数,这些计数加到前面各列的计数中。 / p>
该处理包括几个步骤。步骤的第一部分将预处理数据,第二部分将创建新变量。
创建新变量有两种方法:
cross join
Reduce()
递归。(1)创建一个查找表,以将列和值与它们所属的变量相关联。
lut <- data.table(value = names(DT))[
, variable := value %>%
shift() %>%
like("c\\d{1,2}") %>%
cumsum() %>%
add(1L) %>%
extract(c("id", "sex", "age", "race", "ethn"),. )][]
lut
value variable 1: Abbreviation id 2: date id 3: month id 4: year id 5: c1 id 6: male sex 7: female sex 8: c4 sex 9: upto22 age 10: from22to24 age 11: from25to34 age 12: from35to44 age 13: from45to54 age 14: from55to59 age 15: from60to64 age 16: over65 age 17: c20 age 18: hispanic race 19: non_hispanic race 20: c42 race 21: native ethn 22: asian ethn 23: black ethn 24: hawaii ethn 25: white ethn 26: c48 ethn value variable
(2)将字符date
强制转换为数字日期,以省去多余的year
和month
列。此外,数字日期更便于订购或绘图。
DT[, date := as.IDate(date, "%m/%d/%Y")]
(3)将数据集从宽格式重整为长格式,从而删除列year
,month
和c1
。 (请不要将value.name
函数的参数variable.name
和melt()
与我对变量和 value 的定义混淆。)< br />
(4)通过 update join 将匹配的变量附加到每个值。
(5)将模糊的值 c4
,c20
等替换为NA
(6)将每个组的总数追加(包括NA
个计数)。
long <-
melt(DT[, !c("year", "month", "c1")], id.vars = c("Abbreviation", "date"),
value.name = "count", variable.name = "value")[
lut, on = .(value), variable := i.variable][
value %like% "c\\d{1,2}", value := NA][
, total := sum(count), by = .(Abbreviation, date, variable)][]
long
Abbreviation date value count variable total 1: AK 2011-01-31 male 12288 sex 18397 2: ZZ 2011-01-31 male 12298 sex 18427 3: AK 2011-01-31 female 6107 sex 18397 4: ZZ 2011-01-31 female 6117 sex 18427 5: AK 2011-01-31 <NA> 2 sex 18397 6: ZZ 2011-01-31 <NA> 12 sex 18427 7: AK 2011-01-31 upto22 870 age 18397 8: ZZ 2011-01-31 upto22 880 age 18487 9: AK 2011-01-31 from22to24 1441 age 18397 10: ZZ 2011-01-31 from22to24 1451 age 18487 11: AK 2011-01-31 from25to34 5320 age 18397 12: ZZ 2011-01-31 from25to34 5330 age 18487 ... 31: AK 2011-01-31 native 4856 ethn 18397 32: ZZ 2011-01-31 native 4866 ethn 18457 33: AK 2011-01-31 asian 791 ethn 18397 34: ZZ 2011-01-31 asian 801 ethn 18457 35: AK 2011-01-31 black 611 ethn 18397 36: ZZ 2011-01-31 black 621 ethn 18457 37: AK 2011-01-31 hawaii 289 ethn 18397 38: ZZ 2011-01-31 hawaii 299 ethn 18457 39: AK 2011-01-31 white 11209 ethn 18397 40: ZZ 2011-01-31 white 11219 ethn 18457 41: AK 2011-01-31 <NA> 641 ethn 18397 42: ZZ 2011-01-31 <NA> 651 ethn 18457 Abbreviation date value count variable total
(7)通过交叉连接 CJ()
创建新变量的名称。交叉联接还将包括Abbreviation
和date
new_vars <-
long[!is.na(value), CJ(Abbreviation,
date,
ethn = .SD[variable == "ethn", value],
sex = .SD[variable == "sex", value],
age = .SD[variable == "age", value],
unique = TRUE)][
, new.var := paste(ethn, sex, age, sep = "_")][]
new_vars
Abbreviation date ethn sex age new.var 1: AK 2011-01-31 native male upto22 native_male_upto22 2: AK 2011-01-31 native male from22to24 native_male_from22to24 3: AK 2011-01-31 native male from25to34 native_male_from25to34 4: AK 2011-01-31 native male from35to44 native_male_from35to44 5: AK 2011-01-31 native male from45to54 native_male_from45to54 --- 156: ZZ 2011-01-31 white female from35to44 white_female_from35to44 157: ZZ 2011-01-31 white female from45to54 white_female_from45to54 158: ZZ 2011-01-31 white female from55to59 white_female_from55to59 159: ZZ 2011-01-31 white female from60to64 white_female_from60to64 160: ZZ 2011-01-31 white female over65 white_female_over65
(8)将new_vars
重塑为长格式。这是必需的,因为原始数据集也已重塑为长格式。
lnv <- melt(new_vars, id.vars = c("Abbreviation", "date", "new.var"))
lnv
Abbreviation date new.var variable value 1: AK 2011-01-31 native_male_upto22 ethn native 2: AK 2011-01-31 native_male_from22to24 ethn native 3: AK 2011-01-31 native_male_from25to34 ethn native 4: AK 2011-01-31 native_male_from35to44 ethn native 5: AK 2011-01-31 native_male_from45to54 ethn native --- 476: ZZ 2011-01-31 white_female_from35to44 age from35to44 477: ZZ 2011-01-31 white_female_from45to54 age from45to54 478: ZZ 2011-01-31 white_female_from55to59 age from55to59 479: ZZ 2011-01-31 white_female_from60to64 age from60to64 480: ZZ 2011-01-31 white_female_over65 age over65
(9)通过将new.var
与lnv
右连接来追加long
。
(10)通过Abbreviation
,date
和new.var
进行汇总,从而将计数相乘并除以总数(以获得份额)。
long[lnv, on = .(Abbreviation, date, variable, value)][
, .(new.count = prod(count)/first(total)^2), by = .(Abbreviation, date, new.var)]
Abbreviation date new.var new.count 1: AK 2011-01-31 native_male_upto22 153.38579 2: AK 2011-01-31 native_male_from22to24 254.05623 3: AK 2011-01-31 native_male_from25to34 937.94527 4: AK 2011-01-31 native_male_from35to44 629.05803 5: AK 2011-01-31 native_male_from45to54 761.99238 --- 156: ZZ 2011-01-31 white_female_from35to44 720.79330 157: ZZ 2011-01-31 white_female_from45to54 872.68769 158: ZZ 2011-01-31 white_female_from55to59 312.04830 159: ZZ 2011-01-31 white_female_from60to64 180.50050 160: ZZ 2011-01-31 white_female_over65 92.86912
Reduce()
步骤(7)至(10)可以通过递归联接子集来代替。
如果手动完成,则外观如下:
long[!is.na(value) & variable == "ethn"][
long[!is.na(value) & variable == "sex"], on = .(Abbreviation, date), allow.cartesian = TRUE,
.(Abbreviation, date, value = paste(value, i.value, sep ="_"), count = count * i.count / i.total)][
long[!is.na(value) & variable == "age"], on = .(Abbreviation, date), allow.cartesian = TRUE,
.(Abbreviation, date, value = paste(value, i.value, sep ="_"), count = count * i.count / i.total)]
首先,将变量 ethn
的子集与变量 sex
的子集(外部联接)结合起来,从而计算出新变量名和新计数的一部分。然后将临时结果与变量 age
的te子集合并,从而最终计算出新的变量名和新的计数。
这可以更一般地写为
join_fct <- function(x, y) {
x[y, on = .(Abbreviation, date), allow.cartesian = TRUE,
.(Abbreviation,
date,
value = paste(value, i.value, sep ="_"),
count = count * i.count / i.total)]
}
Reduce(join_fct,
lapply(c("ethn", "sex", "age"),
function(x) long[!is.na(value) & variable == x])
)
Abbreviation date value count 1: AK 2011-01-31 native_male_upto22 153.385786 2: AK 2011-01-31 asian_male_upto22 24.985205 3: AK 2011-01-31 black_male_upto22 19.299571 4: AK 2011-01-31 hawaii_male_upto22 9.128602 5: AK 2011-01-31 white_male_upto22 354.057100 --- 156: ZZ 2011-01-31 native_female_over65 40.280090 157: ZZ 2011-01-31 asian_female_over65 6.630570 158: ZZ 2011-01-31 black_female_over65 5.140554 159: ZZ 2011-01-31 hawaii_female_over65 2.475082 160: ZZ 2011-01-31 white_female_over65 92.869365
此方法非常灵活,因为变量的数量和顺序可以轻松更改,例如,
Reduce(join_fct, lapply(c(“ race”,“ sex”), function(x)long [!is.na(value)&variable == x]) )
Abbreviation date value count 1: AK 2011-01-31 hispanic_male 514.9779 2: AK 2011-01-31 non_hispanic_male 11660.8090 3: ZZ 2011-01-31 hispanic_male 521.2318 4: ZZ 2011-01-31 non_hispanic_male 11657.9728 5: AK 2011-01-31 hispanic_female 255.9383 6: AK 2011-01-31 non_hispanic_female 5795.2930 7: ZZ 2011-01-31 hispanic_female 259.2596 8: ZZ 2011-01-31 non_hispanic_female 5798.6518
OP已在原始数据集的行上提供(采用宽格式)
DT <- structure(list(Abbreviation = "AK", date = "1/31/2011", month = "01",
year = "2011", c1 = "P", male = 12288, female = 6107, c4 = 2,
upto22 = 870, from22to24 = 1441, from25to34 = 5320, from35to44 = 3568,
from45to54 = 4322, from55to59 = 1539, from60to64 = 886, over65 = 451,
c20 = 0, hispanic = 771, non_hispanic = 17458, c42 = 168,
native = 4856, asian = 791, black = 611, hawaii = 289, white = 11209,
c48 = 641), row.names = c(NA, -1L), class = c("data.table", "data.frame"))
但是,要验证代码是否按预期工作,我需要第二行。所以我在第二行附加了:
library(data.table)
DT <- rbind(DT, DT)
DT[2, (6:ncol(DT)) := lapply(.SD, `+`, y = 10), .SDcols = 6:ncol(DT)]
DT[2, Abbreviation := "ZZ"]
DT
Abbreviation date month year c1 male female c4 upto22 from22to24 from25to34 1: AK 1/31/2011 01 2011 P 12288 6107 2 870 1441 5320 2: ZZ 1/31/2011 01 2011 P 12298 6117 12 880 1451 5330 from35to44 from45to54 from55to59 from60to64 over65 c20 hispanic non_hispanic c42 native 1: 3568 4322 1539 886 451 0 771 17458 168 4856 2: 3578 4332 1549 896 461 10 781 17468 178 4866 asian black hawaii white c48 1: 791 611 289 11209 641 2: 801 621 299 11219 651