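I am working on a deep learning project with a large dataset in the customers collection (close to 10 million documents). Depending on the requirement, I filter on any of the "customer" columns. Almost every filtered column is a string. I can't put an index on each of the 35 columns, since that is not a good idea. There are also some complex queries, such as aggregations with group.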
{
"_id" : ObjectId("5ca35824a7ad6a17e9c6eeb7"),
"batchId" : 1,
"demographicsState" : "Minnesota",
"demographicsGender" : "Female",
"jobCount" : "0 to 6",
"jobCreated" : "No",
"callResolution" : "No",
"customerEffortScore" : 2,
"phoneAccessibility" : "90 to 100",
"callRepTime" : "Just right",
"hadPriorCallsPastThirtyFiveDays" : "Yes",
"autoDebitFlag" : "No",
"servcoName" : "Monitronics",
"demographicsAge" : "45 to 54",
"checkedWebsiteFirst" : "No",
"alarmRelated" : "12-Sensor",
"reasonPrimary" : "19-Alarm, system or equipment related reason",
"inInitialTerm" : "Yes",
"callDuration" : "10 to 19",
"siteKind" : "Residential",
"customerSiteTenureDays" : "326",
"highRisk" : "No",
"monthsLeftUntilContractRenewal" : "26",
"nielsen" : "Savvy suburbs",
"callReason" : "Customer tech support",
"serviceScheduled" : "-",
"hadPriorCallsPastFiveDays" : "Yes",
"dropped" : "No",
"serviceResolution" : "80 to 89",
"dept" : 190,
"serviceRepresentative" : "90 to 100",
"demographicsIncome" : "50,000 - 74,999",
"aarpMember" : "No",
"rmr" : 44.99,
"satisfactionOverall" : 9,
"dropYes" : 1,
"dropNo" : 0,
"cltv" : 4146.578333333334
}
This is the query I need to fetch the data:
db.customers.aggregate([
  // $match takes a single document (not an array); $and takes the array of conditions
  {"$match": {
    "$and": [
      {"demographicsState": "Minnesota"},
      {"demographicsGender": "Female"},
      {"jobCount": "0 to 6"},
      {"jobCreated": "Yes"},
      {"callResolution": "No"},
      {"customerEffortScore": {"$gt": 0, "$lt": 8}},
      {"phoneAccessibility": "50 to 60"},
      {"hadPriorCallsPastThirtyFiveDays": "No"},
      {"autoDebitFlag": "Yes"},
      {"alarmRelated": "10-Sensor"},
      {"callDuration": "20 to 29"},
      {"hadPriorCallsPastFiveDays": "Yes"},
      {"demographicsIncome": "50,000-74,999"},
      {"aarpMember": "Yes"},
      {"rmr": {"$gt": 30, "$lt": 50}},
      {"dropYes": 1}
    ]
  }},
  // count the matching documents per gender
  {"$group": {"_id": "$demographicsGender", "count": {"$sum": 1}}}
])
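Because the set of filtered columns changes from request to request, the pipeline could be assembled from a plain object of filters rather than written out by hand. This is only a minimal sketch: `buildPipeline` and the `{gt, lt}` range shape are hypothetical names chosen for illustration and are not part of MongoDB. It relies on the fact that several fields in one `$match` document are combined with an implicit AND, so the explicit `$and` is not needed.

```javascript
// Hypothetical helper (not a MongoDB API): builds a [$match, $group] pipeline
// from whichever filters are supplied for a given request.
// Assumed filter shape: an exact value, or a {gt, lt} object for a numeric range.
function buildPipeline(filters, groupField) {
  const match = {};
  for (const [field, cond] of Object.entries(filters)) {
    if (cond !== null && typeof cond === "object") {
      // Range filter: translate {gt, lt} into MongoDB's $gt / $lt operators.
      const range = {};
      if (cond.gt !== undefined) range.$gt = cond.gt;
      if (cond.lt !== undefined) range.$lt = cond.lt;
      match[field] = range;
    } else {
      // Exact-match filter (string or number).
      match[field] = cond;
    }
  }
  return [
    { $match: match },
    { $group: { _id: "$" + groupField, count: { $sum: 1 } } }
  ];
}

// Example: three of the filters from the query above, grouped by gender.
const pipeline = buildPipeline(
  { demographicsState: "Minnesota", rmr: { gt: 30, lt: 50 }, dropYes: 1 },
  "demographicsGender"
);
```

In the mongo shell the result would then be passed along as `db.customers.aggregate(pipeline)`.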
I am filtering and grouping on every column in the customer schema above. Please let me know if anyone has any ideas.