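I have a dataset, data_prep, with several columns (all factors), and I want to maximize the sum of the column "Value_to_Dollar_Finance". To do that, I need to find the combination of the other columns and their levels that maximizes "Value_to_Dollar_Finance". If I explicitly fix the columns, I can find the best combination of levels: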
dt <-data.table(data_prep[,-1])
setkeyv(dt,c("Age","Gender"))
dt[,sum(Value_to_Dollar_Finance),by=key(dt)]
Here I have fixed the columns to "Age" and "Gender", so I get this output:
         Age Gender   V1
 1: 18 to 24 Female 3914
 2: 18 to 24   Male 1648
 3: 25 to 34 Female 5356
 4: 25 to 34   Male 5356
 5: 35 to 44 Female 2266
 6: 35 to 44   Male 2060
 7: 45 to 54 Female 5562
 8: 45 to 54   Male 1236
 9: 55 to 64 Female  206
10: 55 to 64   Male  412
11:      65+   Male 1030
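Sorting this output, I can say that Age: 45 to 54 and Gender: Female maximizes my variable "Value_to_Dollar_Finance". I would like to know how to work out which columns should go into the key (for example, whether adding HHI_Income to the key increases "Value_to_Dollar_Finance") in order to maximize "Value_to_Dollar_Finance".
Here is the dput output of the dataset: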
structure(list(user_id_64 = c(9.07342e+17, 4.63301e+18, 1.25057e+18,
3.7687e+17, 4.38708e+18, 6.06174e+18, 6.56232e+18, 3.7804e+18,
7.20452e+18, 8.0051e+18, 8.65669e+17, 8.4059e+18, 6.26951e+18,
1.04779e+18, 3.79093e+18, 4.55963e+18, 5.54535e+18, 2.8122e+18,
8.56233e+18, 7.29827e+17, 7.94754e+18, 2.15311e+17, 8.10245e+18,
8.39761e+18, 7.82216e+18, 1.74928e+18, 8.88483e+17, 7.31851e+18,
1.83674e+18, 5.31255e+18, 8.12717e+18, 4.89047e+17, 6.85394e+18,
3.93333e+18, 1.10588e+18, 1.17008e+18, 4.20943e+18, 5.79288e+18,
1.71195e+18, 8.37821e+18, 3.31628e+18, 3.1075e+18, 1.43078e+18,
8.78603e+18, 4.69163e+18, 7.30254e+18, 6.21261e+18, 2.27262e+18,
6.02168e+18, 3.49317e+18, 3.16143e+18, 1.61317e+18, 6.08074e+18,
8.50853e+18, 7.6479e+18, 8.13491e+18, 7.32427e+18, 8.02574e+18,
6.93734e+18, 4.81579e+17, 1.25689e+18, 1.26517e+18, 3.33812e+18,
4.12716e+18, 2.31695e+18, 3.77893e+18, 8.32529e+18, 7.89111e+18,
6.30124e+17, 5.84101e+17, 8.94783e+18, 7.88774e+18, 2.30005e+17,
1.6935e+18, 4.66029e+18, 4.63604e+17, 4.18723e+18, 9.07208e+18,
7.5426e+18, 4.41737e+18, 4.61709e+18, 5.87117e+18, 5.24036e+18,
6.57733e+18, 7.16735e+18, 3.2182e+18, 2.7689e+17, 3.42698e+18,
1.35236e+18, 6.62158e+17, 4.3897e+18, 2.8965e+18, 3.54381e+18,
4.67134e+18, 6.08533e+18, 4.74586e+18, 6.33812e+18, 3.17199e+18
), Age = structure(c(3L, 5L, 4L, 6L, 3L, 5L, 5L, 4L, 4L, 2L,
7L, 3L, 3L, 7L, 2L, 5L, 2L, 7L, 5L, 4L, 4L, 2L, 5L, 3L, 2L, 3L,
5L, 2L, 3L, 3L, 5L, 2L, 3L, 3L, 3L, 2L, 4L, 2L, 3L, 3L, 4L, 3L,
5L, 4L, 3L, 2L, 3L, 3L, 5L, 4L, 3L, 6L, 3L, 3L, 5L, 3L, 2L, 5L,
3L, 3L, 5L, 2L, 4L, 7L, 5L, 3L, 5L, 4L, 5L, 2L, 3L, 5L, 2L, 5L,
5L, 3L, 3L, 2L, 5L, 3L, 5L, 7L, 5L, 5L, 4L, 3L, 2L, 5L, 3L, 3L,
4L, 5L, 4L, 3L, 3L, 3L, 4L, 2L), .Label = c("", "18 to 24", "25 to 34",
"35 to 44", "45 to 54", "55 to 64", "65+"), class = "factor"),
Gender = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L,
3L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 3L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L,
2L, 3L, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
2L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L), .Label = c("",
"Female", "Male"), class = "factor"), Relationship_Status = structure(c(3L,
3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 4L, 3L, 4L, 4L, 3L, 4L, 3L,
4L, 4L, 3L, 4L, 4L, 4L, 3L, 2L, 4L, 2L, 3L, 2L, 4L, 3L, 4L,
4L, 3L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L,
3L, 4L, 3L, 2L, 4L, 3L, 4L, 4L, 3L, 4L, 4L, 3L, 2L, 4L, 3L,
4L, 3L, 3L, 3L, 4L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 3L, 4L,
4L, 4L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 2L, 3L,
3L, 3L, 4L, 4L, 2L, 3L, 4L), .Label = c("", "In a Relationship",
"Married", "Single"), class = "factor"), HHI_Income = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 2L, 2L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
7L, 9L, 9L, 9L, 9L, 9L, 7L, 7L, 7L, 7L, 8L, 8L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 3L, 3L, 3L), .Label = c("", "£10k-£15k",
"£15-£20k", "£20-£25k", "£30-£40k", "£40-£50k", "£50-£75k",
"£75k+", "Under £10k"), class = "factor"), Value_to_Dollar_Finance = c(206L,
412L, 618L, 206L, 824L, 412L, 206L, 206L, 206L, 206L, 206L,
206L, 824L, 206L, 206L, 206L, 206L, 206L, 206L, 206L, 206L,
206L, 206L, 206L, 412L, 206L, 206L, 206L, 412L, 206L, 206L,
824L, 206L, 206L, 206L, 206L, 412L, 618L, 206L, 206L, 206L,
206L, 206L, 206L, 206L, 206L, 206L, 206L, 412L, 206L, 618L,
412L, 824L, 206L, 824L, 206L, 206L, 206L, 206L, 206L, 206L,
206L, 412L, 206L, 206L, 412L, 206L, 206L, 206L, 1030L, 206L,
412L, 206L, 206L, 206L, 206L, 824L, 206L, 206L, 206L, 412L,
206L, 206L, 206L, 412L, 412L, 206L, 206L, 206L, 206L, 206L,
206L, 206L, 412L, 206L, 206L, 412L, 206L)), .Names = c("user_id_64",
"Age", "Gender", "Relationship_Status", "HHI_Income", "Value_to_Dollar_Finance"
), class = "data.frame", row.names = c(30L, 31L, 39L, 43L, 69L,
86L, 118L, 143L, 169L, 183L, 195L, 196L, 236L, 247L, 259L, 279L,
304L, 313L, 347L, 359L, 461L, 472L, 525L, 585L, 705L, 887L, 910L,
920L, 931L, 946L, 1003L, 1010L, 1016L, 1019L, 1026L, 1030L, 1032L,
1040L, 1050L, 1105L, 1159L, 1174L, 1179L, 1257L, 1259L, 1268L,
1286L, 1302L, 1376L, 1458L, 1762L, 1874L, 1989L, 2116L, 2123L,
2129L, 2137L, 2138L, 2144L, 2179L, 2193L, 2225L, 2471L, 2522L,
2548L, 2595L, 2596L, 2788L, 2796L, 2797L, 2855L, 2926L, 2938L,
2984L, 3015L, 3050L, 3055L, 3065L, 3068L, 3073L, 3075L, 3078L,
3082L, 3084L, 3085L, 3101L, 3103L, 3105L, 3114L, 3122L, 3140L,
3157L, 3206L, 3244L, 3273L, 3384L, 3395L, 3405L))
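Answer 0 (score: 1)
Depending on what you want to see (for example, only the combination that maximizes "Value_to_Dollar_Finance" for a given set of grouping columns), you can use the following solution: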
require(data.table)
dt <-data.table(data_prep[,-1])
setkeyv(dt,c("Age","Gender","HHI_Income"))
data.aggregated <- dt[,sum(Value_to_Dollar_Finance),by=key(dt)]
data.aggregated[which(data.aggregated$V1==max(data.aggregated$V1)),]
# Age Gender HHI_Income V1
# 1: 25 to 34 Female £10k-£15k 2060
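Hope this is what you are looking for.
Edit:
Alternatively, if you want to know which combination of columns gives which maximum, you can go for a "brute force" solution: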
require(data.table)
dt <-data.table(data_prep[,-1])
variables <- c("Age","Gender","Relationship_Status","HHI_Income")
combinations <- lapply(seq_along(variables), function(m) combn(x = variables, m = m))
for (i in seq_along(combinations)) {
  for (j in seq_len(ncol(combinations[[i]]))) {
    # Key the table by the j-th combination of i variables, then sum per group
    setkeyv(dt, combinations[[i]][, j])
    data.aggregated <- dt[, sum(Value_to_Dollar_Finance), by = key(dt)]
    print(combinations[[i]][, j])
    print(data.aggregated[which(data.aggregated$V1 == max(data.aggregated$V1)), ])
  }
}
Then compare the outputs.
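To make that comparison easier, the brute-force loop can collect each combination's best group into one table instead of printing it. The following is a minimal sketch, not part of the original answer: the helper name best_by_combination and the toy data are hypothetical stand-ins, and the toy values are not the question's data.

```r
library(data.table)

# Toy stand-in for data_prep (hypothetical values, NOT the question's data)
toy <- data.table(
  Age    = c("18 to 24", "18 to 24", "25 to 34", "25 to 34", "25 to 34"),
  Gender = c("Female", "Male", "Female", "Female", "Male"),
  Value_to_Dollar_Finance = c(206L, 412L, 206L, 824L, 206L)
)

# Hypothetical helper: for every non-empty subset of `variables`, sum `target`
# per group and keep the group with the largest total.
best_by_combination <- function(dt, variables, target) {
  subsets <- unlist(
    lapply(seq_along(variables),
           function(m) combn(variables, m, simplify = FALSE)),
    recursive = FALSE
  )
  rows <- lapply(subsets, function(vars) {
    agg <- dt[, .(total = sum(get(target))), by = vars]
    top <- agg[which.max(total)]  # first group on ties
    data.table(
      columns = paste(vars, collapse = " + "),
      levels  = paste(unlist(lapply(top[, vars, with = FALSE], as.character)),
                      collapse = " / "),
      total   = top$total
    )
  })
  # One row per combination, largest total first
  rbindlist(rows)[order(-total)]
}

result <- best_by_combination(toy, c("Age", "Gender"), "Value_to_Dollar_Finance")
print(result)
```

One thing to keep in mind when reading such a table: when all values are positive, as they are in the question's data, adding a column to the key can only split a group's sum into smaller pieces, so the maximum total never increases with a finer grouping. In the toy data, the best Age + Gender group totals 1030, below the 1236 reached by grouping on Age or Gender alone.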
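Answer 1 (score: 0)
I would suggest a simpler approach: why not use regression to help you?
You can regress the variable of interest on all the factors and interpret the coefficients to maximize the outcome: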
require(data.table)
dt <-data.table(data_prep[,-1])
setkeyv(dt,c("Age","Gender"))
dt[,sum(Value_to_Dollar_Finance),by=key(dt)]
glm(Value_to_Dollar_Finance~. , data = dt)
This tells me that, to maximize the outcome, I would choose:
18 to 24
Female
Single
£30-£40k
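Rather than reading the coefficients by eye, one option is to predict the outcome for every combination of factor levels and take the largest prediction. The following is a minimal sketch under stated assumptions: the toy data are hypothetical, and lm is used in place of glm (the two give the same fit under glm's default gaussian family). Note that a regression like this models the expected value per row, not the group sum used in the question, so it can point to a different combination than the data.table aggregation.

```r
# Toy stand-in for data_prep (hypothetical values, NOT the question's data)
toy <- data.frame(
  Age    = factor(c("18 to 24", "18 to 24", "25 to 34",
                    "25 to 34", "35 to 44", "35 to 44")),
  Gender = factor(c("Female", "Male", "Female", "Male", "Female", "Male")),
  Value_to_Dollar_Finance = c(400, 200, 800, 600, 300, 100)
)

# Additive linear model on the factors
fit <- lm(Value_to_Dollar_Finance ~ Age + Gender, data = toy)

# Predict for every combination of levels and keep the largest prediction
grid <- expand.grid(Age = levels(toy$Age), Gender = levels(toy$Gender))
grid$predicted <- predict(fit, newdata = grid)
best <- grid[which.max(grid$predicted), ]
print(best)
```

Using predictions avoids having to account for the baseline level of each factor by hand, since the baseline has an implicit coefficient of zero and does not appear in the printed coefficients.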