I have a MapReduce framework in Scala based on the org.apache.hadoop library. It works fine for a simple word-count program. However, I want to apply it to something useful and have hit a roadblock. I want to take a csv file (or any delimited file), pass whatever is in the first column as the key, and then count the occurrences of each key.
The mapper code looks like this:
class WordCountMapper extends Mapper[LongWritable, Text, Text, LongWritable] with HImplicits {
  protected override def map(lnNumber: LongWritable, line: Text, context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    line.split(",", -1)(0) foreach (context.write(_, 1)) // Splits data
  }
}
The problem is with the 'line.split' code. When I try to compile it, I get an error that reads:
found   : Char
required: org.apache.hadoop.io.Text
line.split... should return a String, which gets passed to the _ in write(_, 1), but for some reason it thinks it is a Char. I even added .toString to explicitly make it a String, but that didn't work either.
Any ideas are appreciated. Let me know what other details I can provide.
Update
Here is the import list:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{Reducer, Job, Mapper}
import org.apache.hadoop.conf.{Configured}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConversions._
import org.apache.hadoop.util.{ToolRunner, Tool}
And here is the build.sbt code:
import AssemblyKeys._ // put this at the top of the file
assemblySettings
organization := "scala"
name := "WordCount"
version := "1.0"
scalaVersion:= "2.11.2"
scalacOptions ++= Seq("-no-specialization", "-deprecation")
libraryDependencies ++= Seq("org.apache.hadoop" % "hadoop-client" % "1.2.1",
"org.apache.hadoop" % "hadoop-core" % "latest.integration" exclude ("hadoop-core", "org/apache/hadoop/hdfs/protocol/ClientDatanodeProtocol.class") ,
"org.apache.hadoop" % "hadoop-common" % "2.5.1",
"org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.5.1",
"commons-configuration" % "commons-configuration" % "1.9",
"org.apache.hadoop" % "hadoop-hdfs" % "latest.integration")
jarName in assembly := "WordCount.jar"
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case s if s.endsWith(".class")      => MergeStrategy.last
    case s if s.endsWith(".xsd")        => MergeStrategy.last
    case s if s.endsWith(".dtd")        => MergeStrategy.last
    case s if s.endsWith(".xml")        => MergeStrategy.last
    case s if s.endsWith(".properties") => MergeStrategy.last
    case x => old(x)
  }
}
Answer 0 (score: 0)
I suppose line is implicitly converted to a String here (thanks to HImplicits?). Then we have

line.split(",", -1)(0) foreach somethingOrOther

.split(...) returns an Array[String], and (0) picks out its first element, which is a String. Calling foreach on a String iterates over it character by character, so what gets handed to somethingOrOther (here, context.write(_, 1)) is each individual Char, not the String.

Hence, you get a char.
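The mismatch can be reproduced in miniature with plain Scala, no Hadoop involved (the sample values below are illustrative):

```scala
// Plain Scala: the same expression as in the mapper, in miniature.
val line = "fruit,apple,3"

// split(",", -1) returns an Array[String]; (0) picks the first column.
val firstColumn: String = line.split(",", -1)(0)

// foreach on a String iterates over its Chars: the parameter below must
// be typed as Char, which is exactly the type the compiler reported.
firstColumn.foreach { (c: Char) => () }
```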
Answer 1 (score: 0)
I actually solved this by not using the _ notation and specifying the value in context.write directly. So instead of:
line.split(",", -1)(0) foreach (context.write(_,1))
I used:
context.write(line.split(",", -1)(0), 1)
I found a post online that said Scala can get confused about data types when _ is used, and it suggested just defining the value explicitly. Not sure whether that is the real explanation, but in this case it solved the problem.
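The key-extraction part of the fix can be exercised on its own in plain Scala, independent of Hadoop; the helper name below is mine, purely for illustration:

```scala
// Hypothetical helper: extract the key (first column) from a delimited line.
// split(sep, -1) keeps trailing empty fields, so even an empty line yields
// one (empty) key instead of an empty array.
def csvKey(line: String, sep: String = ","): String =
  line.split(sep, -1)(0)
```

Inside the mapper this corresponds to context.write(csvKey(line), 1), still relying on whatever implicit conversions HImplicits provides from String to Text and from Int to LongWritable.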