Scoobi: problem concatenating DLists

Date: 2015-03-24 10:30:56

Tags: scala persist partition either

I have Scala scoobi code with the following structure (I am currently running it locally):

def run() = {
  // "mypath/inputData/" has several subfolders with input files to process.
  // Some lines in these files are rejected for various reasons and written to a log.
  val myvalue1 = processSource.source1("mypath/inputData/", references)
  val (partitionedVal1, partitionedVal2) = myvalue1.partition {
    case Left(x) => true
    case _ => false
  }
  val output = processData.process(partitionedVal1) // some processing
  val getRidOfRight = partitionedVal2.map(x => x.right.get)
  persist(getRidOfRight.toTextFile("output/log", overwrite = true), output.toTextFile("output/processed_data", overwrite = true))
}

processSource is an object:

object processSource {
  def source1(a: DList[Either[Class1, String]], b: DList[Either[Class2, String]]) = {
    val (a1, a2) = a.partition {
      case Left(x) => true
      case _ => false
    }
    val (b1, b2) = b.partition {
      case Left(x) => true
      case _ => false
    }
    val AB1: DList[Either[Class1, String]] = a1.by(_.myKey).joinLeft(b1.by(_.myOtherKey)).map {
      case (key, (val1, Some(val2))) => Left(val1.withCommon(val2))
      case (key, (val1, None)) => Left(val1)
    }

    val AB2: DList[Either[Class1, String]] = b2.map { x => Right(x.right.get) }

    AB1 ++ AB2 ++ b1
  }
}

Now, my problem is that getRidOfRight returns only part of the output it should return. Why?

1 answer:

Answer 0 (score: 0)

This is a bug in scoobi: github.com/NICTA/scoobi/issues/328. The solution is to use scoobi 0.9.1 and "collect" instead of "partition" ("partition" has not been fixed).
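
To illustrate the "collect" pattern, here is a minimal sketch on plain Scala collections rather than on a DList. It assumes that scoobi's DList.collect accepts a partial function in the same way List.collect does, and that WireFormats are available for the element types; the Record class is a hypothetical stand-in for Class1. It has not been checked against scoobi 0.9.1 itself.

```scala
// Sketch on standard Scala Lists only; no scoobi dependency is used here.
// Assumption: DList.collect takes a PartialFunction just like List.collect.
object CollectInsteadOfPartition {
  // Hypothetical stand-in for the asker's Class1.
  final case class Record(id: Int)

  // Keep only the Left values and unwrap them in one pass.
  def lefts(in: List[Either[Record, String]]): List[Record] =
    in.collect { case Left(r) => r }

  // Keep only the Right values (the rejected lines) and unwrap them.
  def rights(in: List[Either[Record, String]]): List[String] =
    in.collect { case Right(msg) => msg }

  def main(args: Array[String]): Unit = {
    val input: List[Either[Record, String]] =
      List(Left(Record(1)), Right("bad line A"), Left(Record(2)), Right("bad line B"))
    println(lefts(input))
    println(rights(input))
  }
}
```

In the question's run() this would presumably mean replacing the partition call with two collect calls on myvalue1, one matching Left and one matching Right, which also removes the need for x.right.get.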