Apache Spark: Optimizing filters

Date: 2016-02-23 03:57:26

Tags: apache-spark mapreduce

So this is more of a design question.

Right now I have a list of patient IDs, and I need to place each of them into one of 3 buckets.

The bucket a patient goes into is based entirely on the following RDDs:

case class Diagnostic(patientID:String, date: Date, code: String)
case class LabResult(patientID: String, date: Date, testName: String, value: Double)
case class Medication(patientID: String, date: Date, medicine: String)
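
For reference, here is a minimal sketch of how these RDDs could be stood up for testing; the SparkContext sc and the sample records are assumptions for illustration, not the real ingest code.

import java.util.Date
import org.apache.spark.rdd.RDD

// Hypothetical sample data; in practice the RDDs are loaded from the source datasets.
val diagnostic: RDD[Diagnostic] = sc.parallelize(Seq(
  Diagnostic("patient-1", new Date(), "250.01"),
  Diagnostic("patient-2", new Date(), "401.9")))
val labResult: RDD[LabResult] = sc.parallelize(Seq(
  LabResult("patient-1", new Date(), "glucose", 120.0),
  LabResult("patient-2", new Date(), "hba1c", 5.4)))
val medication: RDD[Medication] = sc.parallelize(Seq(
  Medication("patient-1", new Date(), "foo")))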

Right now I am basically scanning each RDD 3-4 times per patient to work out which bucket they fall into. This runs very slowly; what can I do to improve it?

As an example, for bucket 1 I have to check that a Diagnostic exists for patient_id 1 with code 1 (even if more than one exists), and that patient_id 1 has a Medication where the medicine is foo.

Right now I have this as two filters (one on each RDD)....
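
For the bucket-1 rule above, that shape is roughly the following; this is a sketch only, and the patient ID, the code "1", and the medicine "foo" are placeholders taken from the description, as is the helper name inBucket1.

// Hypothetical bucket-1 check for a single patient: a diagnosis with code "1"
// must exist AND a medication record with medicine "foo" must exist.
def inBucket1(patientID: String): Boolean = {
  val hasDiagnosis = diagnostic
    .filter(d => d.patientID == patientID && d.code == "1")
    .count() > 0
  val hasMedication = medication
    .filter(m => m.patientID == patientID && m.medicine == "foo")
    .count() > 0
  hasDiagnosis && hasMedication
}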

Ugly code example:

// (fragment from inside a method that returns Boolean for a single patient)

// Check 1: the patient must have at least one glucose-related lab result.
if (labResult.filter({ lab =>
      val testName = lab.testName

      testName.contains("glucose")
    }).count == 0) {
  return false
// Check 2: no lab value may trip the abnormal-glucose rules.
} else if (labResult.filter({ lab =>
      val testName = lab.testName
      val testValue = lab.value

      // all the built-in rules
      (testName == "hba1c" && testValue >= 6.0) ||
      (testName == "hemoglobin a1c" && testValue >= 6.0) ||
      (testName == "fasting glucose" && testValue >= 110) ||
      (testName == "fasting blood glucose" && testValue >= 110) ||
      (testName == "glucose" && testValue >= 110) ||
      (testName == "glucose, serum" && testValue >= 110)
    }).count > 0) {
  return false
// Check 3: none of the diabetes-related diagnosis codes may be present.
} else if (diagnostic.filter({ diagnosis =>
      val code = diagnosis.code

      (code == "790.21")  ||
      (code == "790.22")  ||
      (code == "790.2")   ||
      (code == "790.29")  ||
      (code == "648.81")  ||
      (code == "648.82")  ||
      (code == "648.83")  ||
      (code == "648.84")  ||
      (code == "648.0")   ||
      (code == "648.01")  ||
      (code == "648.02")  ||
      (code == "648.03")  ||
      (code == "648.04")  ||
      (code == "791.5")   ||
      (code == "277.7")   ||
      (code == "v77.1")   ||
      (code == "256.4")   ||
      (code == "250.*")   // note: plain equality; if "250.*" is meant as a wildcard, a prefix check such as code.startsWith("250.") is needed
    }).count > 0) {
  return false
}

true
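
For clarity, here are the same three checks collapsed into one expression. This is a sketch only: the diabetesCodes set is simply the list above, "250.*" is read as a prefix match, and this restatement does not reduce the number of passes over the RDDs.

// Restated logic: keep the patient only if a glucose lab exists,
// no lab value trips the threshold rules, and no listed diagnosis code is present.
val hasGlucoseLab = labResult.filter(_.testName.contains("glucose")).count() > 0

val hasAbnormalLab = labResult.filter { lab =>
  (lab.testName == "hba1c" && lab.value >= 6.0) ||
  (lab.testName == "hemoglobin a1c" && lab.value >= 6.0) ||
  (lab.testName == "fasting glucose" && lab.value >= 110) ||
  (lab.testName == "fasting blood glucose" && lab.value >= 110) ||
  (lab.testName == "glucose" && lab.value >= 110) ||
  (lab.testName == "glucose, serum" && lab.value >= 110)
}.count() > 0

val diabetesCodes = Set(
  "790.21", "790.22", "790.2", "790.29",
  "648.81", "648.82", "648.83", "648.84",
  "648.0", "648.01", "648.02", "648.03", "648.04",
  "791.5", "277.7", "v77.1", "256.4")

val hasDiagnosisCode = diagnostic
  .filter(d => diabetesCodes.contains(d.code) || d.code.startsWith("250."))
  .count() > 0

hasGlucoseLab && !hasAbnormalLab && !hasDiagnosisCode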

0 Answers