I have this job:
    import com.twitter.scalding.{Args, Csv, Job}

    class ManagersAndTeams(args: Args) extends Job(args)
    {
      val managersPipe = Csv(args("managers"), skipHeader = true)
        .project('managerID, 'teamID)

      val teamsPipe = Csv(args("teams"), skipHeader = true)
        .project('teamID, 'name)
        .rename('teamID, 'teamID_)

      managersPipe.joinWithLarger(('teamID, 'teamID_), teamsPipe)
        .project('teamID, 'name, 'managerID)
        .write(Csv(args("output"), writeHeader = true))
    }
I am trying to test it, but during the test it does not seem to read the CSV header:
    Caused by: cascading.tuple.TupleException: unable to select from: [UNKNOWN], using selector: ['managerID', 'teamID']
        at cascading.tuple.Tuple.get(Tuple.java:364)
        at cascading.flow.stream.OperatorStage$1.makeResult(OperatorStage.java:92)
        at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:95)
        at cascading.flow.stream.FunctionEachStage.receive(FunctionEachStage.java:39)
        at cascading.flow.stream.SourceStage.map(SourceStage.java:102)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:53)
        at cascading.flow.stream.SourceStage.call(SourceStage.java:38)
        ... 4 more
    Caused by: cascading.tuple.FieldsResolverException: could not select fields: [{1}:'managerID'], from: [{?}:UNKNOWN]
        at cascading.tuple.Fields.indexOf(Fields.java:1016)
        at cascading.tuple.Fields.translatePos(Fields.java:957)
        at cascading.tuple.Fields.getPos(Fields.java:939)
        at cascading.tuple.Tuple.getPos(Tuple.java:373)
        at cascading.tuple.Tuple.get(Tuple.java:360)
        ... 10 more
This is my test class:
    import com.twitter.scalding.{Csv, JobTest}
    import org.scalatest.FunSuite
    import org.scalatest.Matchers._

    class ManagersAndTeamsSuite extends FunSuite
    {
      test("joins") {
        createJob(
          List(
            ("managerID", "teamID", "x"),
            ("man1", "team1", "x1"),
            ("man2", "team2", "x2")
          ),
          List(
            ("teamID", "name", "y"),
            ("team1", "the team 1", "y1"),
            ("team2", "the team 2", "y2")
          )
        ) should be(List(
        ))
      }

      def createJob(
        managers: List[(String, String, String)],
        teams: List[(String, String, String)]
      ) = {
        var r = List.empty[(String, String, String)]
        new JobTest(new ManagersAndTeams(_))
          .arg("managers", "managers-arg")
          .arg("teams", "teams-arg")
          .arg("output", "output-arg")
          .source(Csv("managers-arg", skipHeader = true), managers)
          .source(Csv("teams-arg", skipHeader = true), teams)
          .sink[(String, String, String)](Csv("output-arg", writeHeader = true)) {
            buffer =>
              r = buffer.toList
          }.run.finish
        r
      }
    }
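As you can see, I have skipHeader = true in both the job and the test (I also tried the test without it and got the same problem). Stepping through the Scalding/Cascading code in a debugger, it looks like the header of the CSV defined in the test is never parsed. Any ideas on how to solve this?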
Answer 0 (score: 0)
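It doesn't work in test mode for now. This must be a bug; I haven't had enough time to debug it. You can see that it does work with a Scalding script in local mode: https://gist.github.com/ceteri/4371896, and the same applies to HDFS mode. This needs to be filed as a bug, along with a fix.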
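Until that is fixed, one possible workaround is to stop relying on the header line and declare the column names explicitly. The following is an untested sketch, not a confirmed fix. It assumes that Csv accepts a fields parameter of type cascading.tuple.Fields, and that each input file has exactly the three columns used in the test data from the question ('x and 'y are just the placeholder names from that test). The companion object and its two values are hypothetical names introduced for illustration.

```scala
import cascading.tuple.Fields
import com.twitter.scalding.{Args, Csv, Job}

// Hypothetical helper: column layouts shared by the job and its test.
// Assumption: these match the real files, column for column, in order.
object ManagersAndTeams {
  val managerFields = new Fields("managerID", "teamID", "x")
  val teamFields = new Fields("teamID", "name", "y")
}

class ManagersAndTeams(args: Args) extends Job(args) {
  import ManagersAndTeams._

  // Field names are declared up front, so they no longer depend on
  // the header line being parsed; skipHeader still drops it from real files.
  val managersPipe = Csv(args("managers"), fields = managerFields, skipHeader = true)
    .project('managerID, 'teamID)

  val teamsPipe = Csv(args("teams"), fields = teamFields, skipHeader = true)
    .project('teamID, 'name)
    .rename('teamID, 'teamID_)

  managersPipe.joinWithLarger(('teamID, 'teamID_), teamsPipe)
    .project('teamID, 'name, 'managerID)
    .write(Csv(args("output"), writeHeader = true))
}
```

The test would then need to register sources built the same way, for example .source(Csv("managers-arg", fields = ManagersAndTeams.managerFields, skipHeader = true), managers), since JobTest matches sources by equality. The header tuples would probably also have to be removed from the test lists, on the assumption that in-memory test sources do not apply skipHeader.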