Scala MapReduce framework gives a type mismatch

Asked: 2014-11-05 21:00:09

Tags: java scala hadoop mapreduce

I have a MapReduce framework in Scala based on the org.apache.hadoop libraries. It works fine for a simple word-count program. However, I want to apply it to something useful and have hit a roadblock. I want to take a csv file (or any delimited file), pass whatever is in the first column as the key, and then count the occurrences of each key.

The mapper code looks like this:

class WordCountMapper extends Mapper[LongWritable, Text, Text, LongWritable] with HImplicits {
  protected override def map(lnNumber: LongWritable, line: Text, context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    line.split(",", -1)(0) foreach (context.write(_, 1))  // Splits data
  }
}

The problem is in the line.split code. When I try to compile it, I get an error that says:

found   : Char
required: org.apache.hadoop.io.Text

line.split... should return a String that gets passed to the _ in write(_,1), but for some reason it thinks it is a Char. I even added .toString to explicitly turn it into a String, but that didn't work either.

Any ideas are appreciated. Let me know what other details I can provide.

Update

Here is the list of imports:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{Reducer, Job, Mapper}
import org.apache.hadoop.conf.{Configured}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConversions._
import org.apache.hadoop.util.{ToolRunner, Tool}

Here is the build.sbt code:

import AssemblyKeys._ // put this at the top of the file

assemblySettings

organization := "scala"

name := "WordCount"

version := "1.0"

scalaVersion:= "2.11.2"

scalacOptions ++= Seq("-no-specialization", "-deprecation")

libraryDependencies ++= Seq("org.apache.hadoop" % "hadoop-client" % "1.2.1",
                        "org.apache.hadoop" % "hadoop-core" % "latest.integration" exclude ("hadoop-core", "org/apache/hadoop/hdfs/protocol/ClientDatanodeProtocol.class") ,
                        "org.apache.hadoop" % "hadoop-common" % "2.5.1",
                        "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.5.1",
                        "commons-configuration" % "commons-configuration" % "1.9",
                        "org.apache.hadoop" % "hadoop-hdfs" % "latest.integration")


jarName in assembly := "WordCount.jar"

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case s if s.endsWith(".class") => MergeStrategy.last
    case s if s.endsWith(".xsd") => MergeStrategy.last
    case s if s.endsWith(".dtd") => MergeStrategy.last
    case s if s.endsWith(".xml") => MergeStrategy.last
    case s if s.endsWith(".properties") => MergeStrategy.last
    case x => old(x)
  }
}

2 Answers:

Answer 0 (score: 0)

I guess line is implicitly converted to a String here (thanks to HImplicits?). Then we have

line.split(",", -1)(0) foreach somethigOrOther
  • 将字符串拆分为多个字符串 - .split(...)
  • 取这些字符串的第0个字段 - (0)
  • 然后对该字符串的字符重复somethingOrOther - foreach

Hence, you end up with a Char.
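
To see this concretely, here is a minimal sketch in plain Scala (no Hadoop types; the values are illustrative):

val line = "alpha,beta,gamma"
val first: String = line.split(",", -1)(0)  // "alpha" - a single String

// foreach on a String iterates over its Chars, so the function
// passed to it receives a Char, not a String:
first foreach { c => println(c) }           // prints a, l, p, h, a (one per line)

So first foreach (context.write(_, 1)) tries to pass each Char to context.write, which expects a Text key, hence the type mismatch.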

Answer 1 (score: 0)

I actually solved this by not using the _ notation and instead specifying the value directly in context.write. So instead of:

line.split(",", -1)(0) foreach (context.write(_,1))

I used:

context.write(line.split(",", -1)(0), 1)

I found a post online saying that Scala sometimes gets confused about data types when _ is used, and it suggested just defining the value explicitly. I'm not sure whether that's true, but in this case it solved the problem.
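
For reference, the whole mapper with this fix might look like the sketch below. It assumes, as in the question, that the HImplicits trait supplies the implicit Text-to-String, String-to-Text, and Int-to-LongWritable conversions that make these calls compile:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class WordCountMapper extends Mapper[LongWritable, Text, Text, LongWritable] with HImplicits {
  protected override def map(lnNumber: LongWritable, line: Text, context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    // Emit the first column of the delimited line as the key, with a count of 1.
    // line.split relies on an implicit Text -> String conversion from HImplicits.
    context.write(line.split(",", -1)(0), 1)
  }
}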