我有一个data.frame结构如下:
location gender 15.19 20.30 31.40 41.64 65.
New York Female 2 41 13 19 1
New York Male 1 23 15 17 2
San Francisco Female 1 27 14 14 3
San Francisco Male 4 24 14 10 1
Mexico City Female 1 40 26 11 3
Mexico City Male 4 23 35 8 3
Paris Female 2 12 10 6 0
Paris Male 1 20 13 11 1
...并且需要将其转换为比例表,每个单元格代表其给定城市的两行比例。 This是一种解决方案,但有一种更简单的方法可以为多个列执行此操作(并且只需转换它们而不需要创建新列吗?)
EDIT 正确的输出将使每个单元格成为该城市内所有单元格的一部分,以便所有共享位置的单元格“纽约”。最多可加1,所有共享位置的单元格,旧金山等等,即:
location gender 15.19 20.30 31.40 41.64 65.
New York Female .01 .31 .1 .14 .01
New York Male .01 .17 .11 .13 .02
答案 0 :(得分:2)
library("data.table")
selected_cols <- colnames(df)[3:7]
setDT(df)[, prop.table(.SD), by = location, .SDcols = selected_cols]
# location X15.19 X20.30 X31.40 X41.64 X65.
# 1: New York 0.014925373 0.3059701 0.09701493 0.14179104 0.007462687
# 2: New York 0.007462687 0.1716418 0.11194030 0.12686567 0.014925373
# 3: San Francisco 0.008928571 0.2410714 0.12500000 0.12500000 0.026785714
# 4: San Francisco 0.035714286 0.2142857 0.12500000 0.08928571 0.008928571
# 5: Mexico City 0.006493506 0.2597403 0.16883117 0.07142857 0.019480519
# 6: Mexico City 0.025974026 0.1493506 0.22727273 0.05194805 0.019480519
# 7: Paris 0.026315789 0.1578947 0.13157895 0.07894737 0.000000000
# 8: Paris 0.013157895 0.2631579 0.17105263 0.14473684 0.013157895
验证结果:每个城市的总和是1
a1 <- setDT(df)[, prop.table(.SD), by = location, .SDcols = selected_cols]
sum(subset(a1, location == "New York", select = selected_cols))
# [1] 1
然后,性别列可以与a1
do.call(cbind, list(gender = df$gender, a1))
数据:强>
dput(df)
structure(list(location = c("New York", "New York", "San Francisco",
"San Francisco", "Mexico City", "Mexico City", "Paris", "Paris"
), gender = c("Female", "Male", "Female", "Male", "Female", "Male",
"Female", "Male"), X15.19 = c(2L, 1L, 1L, 4L, 1L, 4L, 2L, 1L),
X20.30 = c(41L, 23L, 27L, 24L, 40L, 23L, 12L, 20L), X31.40 = c(13L,
15L, 14L, 14L, 26L, 35L, 10L, 13L), X41.64 = c(19L, 17L,
14L, 10L, 11L, 8L, 6L, 11L), X65. = c(1L, 2L, 3L, 1L, 3L,
3L, 0L, 1L)), .Names = c("location", "gender", "X15.19",
"X20.30", "X31.40", "X41.64", "X65."), row.names = c(NA, -8L), class = c("data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000000200788>)
修改强>
根据@ Sumedh在以下评论中的建议,性别列不会丢失,所有selected_cols
都必须为double
类型。这是因为prop.table
属于double
类型的比例。否则,在RHS = LHS data type coercion error
上按引用:=
进行分配时,会出现selected_cols
。
setDT(df)[, (selected_cols) := prop.table(.SD), by = location, .SDcols = selected_cols]