Question

I want to implement an SQL merge join for different Akka streams. For example, I have three classes:

case class A(id: String, as: String)
case class B(a_id: String, bs: String)
case class C(id: String, as: String, bs: String)

I have two sources, Source[A] and Source[B], sorted by id and a_id respectively, and I want a Sink[C] that receives the elements joined on id = a_id. I can't figure out how to implement this.

Example streams:

Source[A] contains: A(1, "a1"), A(2, "a2"), A(3, "a3_1"), A(3, "a3_2"), A(4, "a4")

Source[B] contains: B(2, "b2"), B(3, "b3")

Sink[C] must receive: C(2, "a2", "b2"), C(3, "a3_1", "b3"), C(3, "a3_2", "b3")

Answer 1

The example above does not fit the OneToManyMergeJoin specification. A correct one is:

import akka.NotUsed
import akka.stream.scaladsl.Source

case class A(id: Int, as: String)
case class B(a_id: Int, bs: String)
case class C(id: Int, as: String, bs: String)

val source1: Source[A, NotUsed] = Source(
  List(A(1, "a1"), A(2, "a2"), A(3, "a3"), A(4, "a4"))
)
val source2: Source[B, NotUsed] = Source(
  List(B(2, "b2"), B(3, "b3_1"), B(3, "b3_2"))
)

One of the sources must be DISTINCT, and both must be sorted by the join attribute. So we can use

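import akka.stream.{Attributes, FanInShape2}
import akka.stream.stage.{GraphStage, GraphStageLogic, StageLogging}

// Fan-in stage: in0 carries the distinct (unique-key) side, in1 the side whose keys may repeat.
class OneToManyMergeJoin[Distinct, Duplicated, O](
    val zipper: (Distinct, Duplicated) => O,
    val comparator: (Distinct, Duplicated) => Int
) extends GraphStage[FanInShape2[Distinct, Duplicated, O]] {
  override val shape: FanInShape2[Distinct, Duplicated, O] = new FanInShape2("OneToManyMergeJoin")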

  private val left = shape.in0
  private val right = shape.in1
  private val out = shape.out

  override def createLogic(inheritedAttributes: Attributes) = new GraphStageLogic(shape) with StageLogging {
    setHandler(left, ignoreTerminateInput)
    setHandler(right, ignoreTerminateInput)
    setHandler(out, eagerTerminateOutput)

    var leftValue: Distinct = _
    var rightValue: Duplicated = _

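    // Compare the current pair: emit on equal keys, otherwise advance the side with the smaller key.
    def dispatch(l: Distinct, r: Duplicated): Unit = {
      val c = comparator(l, r)

      if (c == 0) {
        // Keys match: emit the joined element, then read the next right element
        // (the left side is distinct, so only the right side can repeat this key).
        emit(out, zipper(l, r), readR)
      } else {
        if (c < 0) readL() else readR()
      }
    }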

    private val dispatchR = { v: Duplicated =>
      rightValue = v
      dispatch(leftValue, rightValue)
    }

    private val dispatchL = { v: Distinct =>
      leftValue = v
      dispatch(leftValue, rightValue)
    }

    lazy val readR: () => Unit = () => read(right)(dispatchR, () => if (comparator(leftValue, rightValue) < 0) readL() else completeStage())
    lazy val readL: () => Unit = () => read(left)(dispatchL, () => if (comparator(leftValue, rightValue) > 0) readR() else completeStage())

    override def preStart(): Unit = {
      // all fan-in stages need to eagerly pull all inputs to get cycles started
      pull(right)
      read(left)(
        l => {
          leftValue = l
          read(right)(dispatchR, () => completeStage())
        },
        () => completeStage()
      )
    }

  }

}

as an implementation of a merge join with an equality predicate.
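To make the comparison logic easier to follow outside of a GraphStage, here is a minimal sketch of the same one-to-many merge join over plain sorted Lists. The names `MergeJoinSketch` and `mergeJoin` are hypothetical and are not part of Akka. The sketch assumes that the first list has unique keys and that both lists are sorted ascending by the join key. It illustrates the algorithm only and is not a replacement for the stage above.

```scala
import scala.annotation.tailrec

object MergeJoinSketch {
  // Same shapes as the answer's case classes.
  final case class A(id: Int, as: String)
  final case class B(a_id: Int, bs: String)
  final case class C(id: Int, as: String, bs: String)

  // Hypothetical helper: one-to-many merge join over sorted Lists.
  // Assumption: `distinct` has unique keys; both lists are sorted by the join key.
  def mergeJoin[D, M, O](distinct: List[D], duplicated: List[M])(
      zipper: (D, M) => O,
      comparator: (D, M) => Int
  ): List[O] = {
    @tailrec
    def loop(ls: List[D], rs: List[M], acc: List[O]): List[O] = (ls, rs) match {
      case (l :: lTail, r :: rTail) =>
        val c = comparator(l, r)
        if (c == 0) loop(ls, rTail, zipper(l, r) :: acc) // match: emit, keep left, advance right
        else if (c < 0) loop(lTail, rs, acc)             // left key is smaller: advance left
        else loop(ls, rTail, acc)                        // right key is smaller: advance right
      case _ => acc.reverse                              // one side is exhausted: stop
    }
    loop(distinct, duplicated, Nil)
  }

  // The answer's data: A is the distinct side, B may repeat keys.
  val joined: List[C] = mergeJoin(
    List(A(1, "a1"), A(2, "a2"), A(3, "a3"), A(4, "a4")),
    List(B(2, "b2"), B(3, "b3_1"), B(3, "b3_2"))
  )((a, b) => C(a.id, a.as, b.bs), (a, b) => Integer.compare(a.id, b.a_id))

  // The question's data has the repeated keys on the A side, so the roles are swapped:
  // B is passed as the distinct side and A as the duplicated side.
  val joinedSwapped: List[C] = mergeJoin(
    List(B(2, "b2"), B(3, "b3")),
    List(A(1, "a1"), A(2, "a2"), A(3, "a3_1"), A(3, "a3_2"), A(4, "a4"))
  )((b, a) => C(a.id, a.as, b.bs), (b, a) => Integer.compare(b.a_id, a.id))

  def main(args: Array[String]): Unit = {
    joined.foreach(println)
    joinedSwapped.foreach(println)
  }
}
```

The second call suggests that the question's original data could also be joined by treating Source[B] as the distinct input and Source[A] as the duplicated one, provided each a_id appears only once in B.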

It takes a Distinct source first, a Duplicated source second, a zipper function, and a comparator function:

import akka.stream.SourceShape
import akka.stream.scaladsl.{GraphDSL, Sink}

def oneToManyMergeJoin[Distinct, Duplicated, O](s1: Source[Distinct, NotUsed], s2: Source[Duplicated, NotUsed])(zipper: (Distinct, Duplicated) => O, comparator: (Distinct, Duplicated) => Int): Source[O, NotUsed] = {
  Source.fromGraph( GraphDSL.create() {implicit builder =>
    import GraphDSL.Implicits._

    val m = builder.add(new OneToManyMergeJoin[Distinct, Duplicated, O]( zipper, comparator))

    s1 ~> m.in0
    s2 ~> m.in1

    SourceShape(m.out)
  })
}

oneToManyMergeJoin(source1, source2)( (a, b) => C(a.id, a.as, b.bs), (a, b) => Ordering.Int.compare(a.id, b.a_id))
  .runWith( Sink.foreach[C](p => log.debug(s"Sink: $p") ))

Result:

DEBUG Sink: C(2,a2,b2)
DEBUG Sink: C(3,a3,b3_1)
DEBUG Sink: C(3,a3,b3_2)

Why is this needed? One use case is merging Cassandra tables that are not fully denormalized but are already sorted.
