I have Scala Scoobi code with the following structure (for now I am running it locally):
def run() = {
  val myvalue1 = processSource.source1("mypath/inputData/", references) // has several subfolders with input files that should be processed and some lines in these files are to be rejected due to some reasons and written to a log
  val (partitionedVal1, partitionedVal2) = myvalue1.partition {
    case Left(x) => true
    case _ => false
  }
  val output = processData.process(partitionedVal1) // some processing
  val getRidOfRight = partitionedVal2.map(x => x.right.get)
  persist(getRidOfRight.toTextFile("output/log", overwrite = true), output.toTextFile("output/processed_data", overwrite = true))
}
processSource is an object:
object processSource {
  def source1(a: DList[Either[Class1, String]], b: DList[Either[Class2, String]]) = {
    val (a1, a2) = a.partition {
      case Left(x) => true
      case _ => false
    }
    val (b1, b2) = b.partition {
      case Left(x) => true
      case _ => false
    }
    val joinedAB1: DList[Either[Class1, String]] = a1.by(_.myKey).joinLeft(b1.by(_.myOtherKey)).map {
      case (key, (val1, Some(val2))) => Left(val1.withCommon(val2))
      case (key, (val1, None)) => Left(val1)
    }
    val AB2: DList[Either[Class1, String]] = b2.map { x => Right(x.right.get) }
    joinedAB1 ++ AB2 ++ b1
  }
}
Now, my problem is that getRidOfRight returns only part of the output it should return. Why?
Answer 0 (score: 0)
This is a bug in Scoobi: github.com/NICTA/scoobi/issues/328. The workaround is to use Scoobi 0.9.1 and "collect" instead of "partition" ("partition" itself has not been fixed).
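As a rough sketch of what replacing "partition" with "collect" could look like: each side of the Either is extracted by its own collect pass, which also removes the need for x.right.get. The example below uses a plain Scala List so that it runs without Scoobi; the assumption is that the same collect calls are applied to the DList in the question. Record and split are hypothetical names used only for illustration.

```scala
object CollectInsteadOfPartition {
  // Hypothetical stand-in for the question's Class1.
  final case class Record(id: Int)

  // Split a mixed collection into accepted records and rejected log lines
  // with two `collect` passes instead of a single `partition`.
  // Assumption: on a Scoobi DList the two calls would be written the same way.
  def split(input: List[Either[Record, String]]): (List[Record], List[String]) = {
    val accepted = input.collect { case Left(record) => record }
    val rejected = input.collect { case Right(message) => message }
    (accepted, rejected)
  }

  def main(args: Array[String]): Unit = {
    val input: List[Either[Record, String]] =
      List(Left(Record(1)), Right("bad line 7"), Left(Record(2)), Right("bad line 9"))
    val (accepted, rejected) = split(input)
    println(accepted)
    println(rejected)
  }
}
```

In the question's run method this would mean deriving the log output directly from myvalue1 with a collect on the Right case, and the data to process with a collect on the Left case, rather than destructuring the result of partition.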