所以这更像是一个设计问题。
现在,我有一份患者ID列表,我需要将它们放入3个桶中。
他们进入的桶完全基于以下RDD
case class Diagnostic(patientID:String, date: Date, code: String)
case class LabResult(patientID: String, date: Date, testName: String, value: Double)
case class Medication(patientID: String, date: Date, medicine: String)
现在我基本上每个RDD每次RDD 3-4次,以查看它是否进入水桶。这运行得非常慢,我能做些什么来改善这个?
示例是针对存储桶1,我必须检查是否存在诊断,对于patient_id 1(即使存在多个),代码为1并且patient_id 1具有药物,其中药物为foo
现在我将此作为两个过滤器(每个RDD上一个)....
丑陋的代码示例
if (labResult.filter({ lab =>
val testName = lab.testName
testName.contains("glucose")
}).count == 0) {
return false
} else if (labResult.filter({ lab =>
val testName = lab.testName
val testValue = lab.value
// all the built in rules
(testName == "hba1c" && testValue >= 6.0) ||
(testName == "hemoglobin a1c" && testValue >= 6.0) ||
(testName == "fasting glucose" && testValue >= 110) ||
(testName == "fasting blood glucose" && testValue >= 110) ||
(testName == "glucose" && testValue >= 110) ||
(testName == "glucose, serum" && testValue >= 110)
}).count > 0) {
return false
} else if (diagnostic.filter({ diagnosis =>
val code = diagnosis.code
(code == "790.21") ||
(code == "790.22") ||
(code == "790.2") ||
(code == "790.29") ||
(code == "648.81") ||
(code == "648.82") ||
(code == "648.83") ||
(code == "648.84") ||
(code == "648.0") ||
(code == "648.01") ||
(code == "648.02") ||
(code == "648.03") ||
(code == "648.04") ||
(code == "791.5") ||
(code == "277.7") ||
(code == "v77.1") ||
(code == "256.4") ||
(code == "250.*")
}).count > 0) {
return false
}
true