Question

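I have a data set with several columns (all factors), and I want to maximize the sum of the column "Value_to_Dollar_Finance". To do that, I need to find the combination of the other columns and their levels that maximizes "Value_to_Dollar_Finance". I can find the combination of levels if I explicitly fix the columns: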

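library(data.table)
# data_prep is the data frame given by the dput() output below; drop the user_id_64 column
dt <- data.table(data_prep[,-1])
setkeyv(dt, c("Age","Gender"))
# Sum Value_to_Dollar_Finance within each Age/Gender group
dt[, sum(Value_to_Dollar_Finance), by=key(dt)]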

Here I have fixed the columns to "Age" and "Gender", so I get this output:

       Age Gender   V1
 1: 18 to 24 Female 3914
 2: 18 to 24   Male 1648
 3: 25 to 34 Female 5356
 4: 25 to 34   Male 5356
 5: 35 to 44 Female 2266
 6: 35 to 44   Male 2060
 7: 45 to 54 Female 5562
 8: 45 to 54   Male 1236
 9: 55 to 64 Female  206
10: 55 to 64   Male  412
11:      65+   Male 1030

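After sorting, I can say that Age: 45 to 54 and Gender: Female maximizes my variable "Value_to_Dollar_Finance". I would like to know how I can find out which of all the columns to use (suppose adding HHI_Income to the key increases "Value_to_Dollar_Finance") in order to maximize "Value_to_Dollar_Finance".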
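For reference, the sorting step can be sketched as follows. This is only a minimal illustration on made-up toy values (not the data set from this question), and it uses base R's aggregate() instead of data.table so that it runs without extra packages. The names `toy`, `sums` and `ranked` are placeholders.

```r
# Toy data (hypothetical values, not the data set from the question)
toy <- data.frame(
  Age    = c("18 to 24", "18 to 24", "45 to 54", "45 to 54", "45 to 54"),
  Gender = c("Female", "Male", "Female", "Female", "Male"),
  Value_to_Dollar_Finance = c(206, 412, 824, 618, 206)
)

# Sum the value within each Age/Gender group
sums <- aggregate(Value_to_Dollar_Finance ~ Age + Gender, data = toy, FUN = sum)

# Put the group with the largest sum first
ranked <- sums[order(-sums$Value_to_Dollar_Finance), ]
print(ranked)
```

The first row of `ranked` is the Age/Gender combination with the largest total.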

Here is the data set as dput() output:

structure(list(user_id_64 = c(9.07342e+17, 4.63301e+18, 1.25057e+18, 
3.7687e+17, 4.38708e+18, 6.06174e+18, 6.56232e+18, 3.7804e+18, 
7.20452e+18, 8.0051e+18, 8.65669e+17, 8.4059e+18, 6.26951e+18, 
1.04779e+18, 3.79093e+18, 4.55963e+18, 5.54535e+18, 2.8122e+18, 
8.56233e+18, 7.29827e+17, 7.94754e+18, 2.15311e+17, 8.10245e+18, 
8.39761e+18, 7.82216e+18, 1.74928e+18, 8.88483e+17, 7.31851e+18, 
1.83674e+18, 5.31255e+18, 8.12717e+18, 4.89047e+17, 6.85394e+18, 
3.93333e+18, 1.10588e+18, 1.17008e+18, 4.20943e+18, 5.79288e+18, 
1.71195e+18, 8.37821e+18, 3.31628e+18, 3.1075e+18, 1.43078e+18, 
8.78603e+18, 4.69163e+18, 7.30254e+18, 6.21261e+18, 2.27262e+18, 
6.02168e+18, 3.49317e+18, 3.16143e+18, 1.61317e+18, 6.08074e+18, 
8.50853e+18, 7.6479e+18, 8.13491e+18, 7.32427e+18, 8.02574e+18, 
6.93734e+18, 4.81579e+17, 1.25689e+18, 1.26517e+18, 3.33812e+18, 
4.12716e+18, 2.31695e+18, 3.77893e+18, 8.32529e+18, 7.89111e+18, 
6.30124e+17, 5.84101e+17, 8.94783e+18, 7.88774e+18, 2.30005e+17, 
1.6935e+18, 4.66029e+18, 4.63604e+17, 4.18723e+18, 9.07208e+18, 
7.5426e+18, 4.41737e+18, 4.61709e+18, 5.87117e+18, 5.24036e+18, 
6.57733e+18, 7.16735e+18, 3.2182e+18, 2.7689e+17, 3.42698e+18, 
1.35236e+18, 6.62158e+17, 4.3897e+18, 2.8965e+18, 3.54381e+18, 
4.67134e+18, 6.08533e+18, 4.74586e+18, 6.33812e+18, 3.17199e+18
), Age = structure(c(3L, 5L, 4L, 6L, 3L, 5L, 5L, 4L, 4L, 2L, 
7L, 3L, 3L, 7L, 2L, 5L, 2L, 7L, 5L, 4L, 4L, 2L, 5L, 3L, 2L, 3L, 
5L, 2L, 3L, 3L, 5L, 2L, 3L, 3L, 3L, 2L, 4L, 2L, 3L, 3L, 4L, 3L, 
5L, 4L, 3L, 2L, 3L, 3L, 5L, 4L, 3L, 6L, 3L, 3L, 5L, 3L, 2L, 5L, 
3L, 3L, 5L, 2L, 4L, 7L, 5L, 3L, 5L, 4L, 5L, 2L, 3L, 5L, 2L, 5L, 
5L, 3L, 3L, 2L, 5L, 3L, 5L, 7L, 5L, 5L, 4L, 3L, 2L, 5L, 3L, 3L, 
4L, 5L, 4L, 3L, 3L, 3L, 4L, 2L), .Label = c("", "18 to 24", "25 to 34", 
"35 to 44", "45 to 54", "55 to 64", "65+"), class = "factor"), 
    Gender = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    3L, 3L, 2L, 2L, 3L, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 
    3L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 
    2L, 3L, 3L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    2L, 3L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 2L, 2L, 
    2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L), .Label = c("", 
    "Female", "Male"), class = "factor"), Relationship_Status = structure(c(3L, 
    3L, 3L, 3L, 2L, 3L, 3L, 2L, 3L, 4L, 3L, 4L, 4L, 3L, 4L, 3L, 
    4L, 4L, 3L, 4L, 4L, 4L, 3L, 2L, 4L, 2L, 3L, 2L, 4L, 3L, 4L, 
    4L, 3L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 
    3L, 4L, 3L, 2L, 4L, 3L, 4L, 4L, 3L, 4L, 4L, 3L, 2L, 4L, 3L, 
    4L, 3L, 3L, 3L, 4L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 
    4L, 4L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 2L, 3L, 
    3L, 3L, 4L, 4L, 2L, 3L, 4L), .Label = c("", "In a Relationship", 
    "Married", "Single"), class = "factor"), HHI_Income = structure(c(2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 
    4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
    4L, 4L, 4L, 4L, 2L, 2L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
    7L, 9L, 9L, 9L, 9L, 9L, 7L, 7L, 7L, 7L, 8L, 8L, 9L, 9L, 9L, 
    9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
    9L, 9L, 9L, 9L, 3L, 3L, 3L), .Label = c("", "£10k-£15k", 
    "£15-£20k", "£20-£25k", "£30-£40k", "£40-£50k", "£50-£75k", 
    "£75k+", "Under £10k"), class = "factor"), Value_to_Dollar_Finance = c(206L, 
    412L, 618L, 206L, 824L, 412L, 206L, 206L, 206L, 206L, 206L, 
    206L, 824L, 206L, 206L, 206L, 206L, 206L, 206L, 206L, 206L, 
    206L, 206L, 206L, 412L, 206L, 206L, 206L, 412L, 206L, 206L, 
    824L, 206L, 206L, 206L, 206L, 412L, 618L, 206L, 206L, 206L, 
    206L, 206L, 206L, 206L, 206L, 206L, 206L, 412L, 206L, 618L, 
    412L, 824L, 206L, 824L, 206L, 206L, 206L, 206L, 206L, 206L, 
    206L, 412L, 206L, 206L, 412L, 206L, 206L, 206L, 1030L, 206L, 
    412L, 206L, 206L, 206L, 206L, 824L, 206L, 206L, 206L, 412L, 
    206L, 206L, 206L, 412L, 412L, 206L, 206L, 206L, 206L, 206L, 
    206L, 206L, 412L, 206L, 206L, 412L, 206L)), .Names = c("user_id_64", 
"Age", "Gender", "Relationship_Status", "HHI_Income", "Value_to_Dollar_Finance"
), class = "data.frame", row.names = c(30L, 31L, 39L, 43L, 69L, 
86L, 118L, 143L, 169L, 183L, 195L, 196L, 236L, 247L, 259L, 279L, 
304L, 313L, 347L, 359L, 461L, 472L, 525L, 585L, 705L, 887L, 910L, 
920L, 931L, 946L, 1003L, 1010L, 1016L, 1019L, 1026L, 1030L, 1032L, 
1040L, 1050L, 1105L, 1159L, 1174L, 1179L, 1257L, 1259L, 1268L, 
1286L, 1302L, 1376L, 1458L, 1762L, 1874L, 1989L, 2116L, 2123L, 
2129L, 2137L, 2138L, 2144L, 2179L, 2193L, 2225L, 2471L, 2522L, 
2548L, 2595L, 2596L, 2788L, 2796L, 2797L, 2855L, 2926L, 2938L, 
2984L, 3015L, 3050L, 3055L, 3065L, 3068L, 3073L, 3075L, 3078L, 
3082L, 3084L, 3085L, 3101L, 3103L, 3105L, 3114L, 3122L, 3140L, 
3157L, 3206L, 3244L, 3273L, 3384L, 3395L, 3405L))

Answer 1

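Depending on what you want to see (for example, only the combination that maximizes "Value_to_Dollar_Finance" for a given set of grouping columns), you can use the following solution: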

require(data.table)
dt <- data.table(data_prep[,-1])
setkeyv(dt, c("Age","Gender","HHI_Income"))
# Sum per Age/Gender/HHI_Income group; the unnamed result column is called V1
data.aggregated <- dt[, sum(Value_to_Dollar_Finance), by=key(dt)]
# Keep the row(s) with the largest group sum
data.aggregated[V1 == max(V1)]

#         Age Gender HHI_Income   V1
# 1: 25 to 34 Female  £10k-£15k 2060

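Hope this is what you were looking for.

Edit:

Alternatively, if you want to know which combination gives which maximum, you can go for a "brute force" solution: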

require(data.table)
dt <- data.table(data_prep[,-1])

variables <- c("Age","Gender","Relationship_Status","HHI_Income")
# One matrix per subset size m; each column is one combination of m variables
combinations <- lapply(seq_along(variables), function(m) combn(x=variables, m=m))
for (i in seq_along(combinations)) {
  for (j in seq_len(ncol(combinations[[i]]))) {
    setkeyv(dt, combinations[[i]][,j])
    data.aggregated <- dt[, sum(Value_to_Dollar_Finance), by=key(dt)]
    print(combinations[[i]][,j])
    print(data.aggregated[V1 == max(V1)])
  }
}

Then compare the outputs.
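To make the comparison easier, the best group of every combination can be collected into one table instead of being printed. The sketch below is one possible way to do that, not the code from this answer: it uses base R's aggregate() rather than data.table, `best_per_combination` is a hypothetical helper name, the toy data is made up, and which.max() keeps only the first group when several tie for the maximum.

```r
# Best group (largest sum) for every non-empty subset of grouping columns.
# `best_per_combination` is a hypothetical helper, not part of any package.
best_per_combination <- function(data, variables, target) {
  results <- list()
  for (m in seq_along(variables)) {
    # All subsets of size m, each as a character vector of column names
    combos <- combn(variables, m, simplify = FALSE)
    for (cols in combos) {
      # Sum the target within each group defined by `cols`
      sums <- aggregate(data[target], by = data[cols], FUN = sum)
      # First group with the largest sum (ties are not reported)
      top <- sums[which.max(sums[[target]]), ]
      results[[length(results) + 1]] <- data.frame(
        columns = paste(cols, collapse = " + "),
        levels  = paste(unlist(lapply(top[cols], as.character)), collapse = " / "),
        max_sum = top[[target]],
        stringsAsFactors = FALSE
      )
    }
  }
  out <- do.call(rbind, results)
  # Largest maximum first
  out[order(-out$max_sum), ]
}

# Toy data (hypothetical values)
toy <- data.frame(
  Age    = c("A", "A", "B", "B"),
  Gender = c("F", "M", "F", "F"),
  Value  = c(100, 300, 150, 200)
)
res <- best_per_combination(toy, c("Age", "Gender"), "Value")
print(res)
```

One thing to keep in mind when comparing: if the values are all non-negative, adding a column to the grouping only splits existing groups, so the largest group sum can never increase as columns are added. Raw sums are therefore mainly comparable between combinations of the same size.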

Answer 2

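I'd suggest a simpler approach: why not use regression to help you?

You can regress the variable of interest on all the factors and interpret the coefficients to maximize the outcome: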

require(data.table)
dt <-data.table(data_prep[,-1])
setkeyv(dt,c("Age","Gender"))
dt[,sum(Value_to_Dollar_Finance),by=key(dt)]

glm(Value_to_Dollar_Finance~. , data = dt)

This tells me that, to maximize the outcome, I would choose:

Age 18 to 24
Gender Female
Single
HHI_Income £30-£40k
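Instead of reading the best level of each factor off the coefficient table by hand, one option is to predict the outcome for every combination of levels and take the largest prediction. The following is a minimal sketch under stated assumptions: the toy data is made up, `grid` and `best` are placeholder names, and the model is the same additive glm() (default gaussian family) as above. Note that this ranks combinations by the expected value per row, which is not the same thing as the group sum asked about in the question.

```r
# Toy data (hypothetical): two observations per Age/Gender cell
toy <- data.frame(
  Age    = factor(rep(c("18 to 24", "25 to 34"), times = 4)),
  Gender = factor(rep(c("Female", "Female", "Male", "Male"), times = 2)),
  Value_to_Dollar_Finance = c(310, 210, 260, 160, 290, 190, 240, 140)
)

# Additive model on all factors (gaussian family by default)
fit <- glm(Value_to_Dollar_Finance ~ ., data = toy)

# Every combination of factor levels
grid <- expand.grid(lapply(toy[c("Age", "Gender")], levels))

# Predicted outcome for each combination, then the largest one
grid$predicted <- predict(fit, newdata = grid)
best <- grid[which.max(grid$predicted), ]
print(best)
```

For a purely additive model this gives the same answer as picking the level with the largest coefficient within each factor, with the baseline level counting as zero.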

Maximize the value of a variable based on a set of columns