Scala MapReduce framework gives a type mismatch

Asked: 2014-11-05 21:00:09

Tags: java scala hadoop mapreduce

I have a MapReduce framework in Scala based on the org.apache.hadoop libraries. It works fine for a simple word-count program. However, I want to apply it to something useful and have hit a roadblock. I want to take a csv file (or any delimited file), pass whatever is in the first column as the key, and then count the occurrences of each key.

The mapper code looks like this:

class WordCountMapper extends Mapper[LongWritable, Text, Text, LongWritable] with HImplicits {
  protected override def map(lnNumber: LongWritable, line: Text, context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    line.split(",", -1)(0) foreach (context.write(_, 1))  // Splits data
  }
}

The problem is in the line.split code. When I try to compile it, I get an error that says:

found   : Char
required: org.apache.hadoop.io.Text

line.split... should return a String that gets passed to the _ in write(_,1), but for some reason it thinks it is a Char. I even added .toString to explicitly turn it into a String, but that didn't work either.

Any ideas are appreciated. Let me know what other details I can provide.

Update

Here is the list of imports:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{Reducer, Job, Mapper}
import org.apache.hadoop.conf.{Configured}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConversions._
import org.apache.hadoop.util.{ToolRunner, Tool}

Here is the build.sbt code:

import AssemblyKeys._ // put this at the top of the file

assemblySettings

organization := "scala"

name := "WordCount"

version := "1.0"

scalaVersion:= "2.11.2"

scalacOptions ++= Seq("-no-specialization", "-deprecation")

libraryDependencies ++= Seq("org.apache.hadoop" % "hadoop-client" % "1.2.1",
                        "org.apache.hadoop" % "hadoop-core" % "latest.integration" exclude ("hadoop-core", "org/apache/hadoop/hdfs/protocol/ClientDatanodeProtocol.class") ,
                        "org.apache.hadoop" % "hadoop-common" % "2.5.1",
                        "org.apache.hadoop" % "hadoop-mapreduce-client-core" % "2.5.1",
                        "commons-configuration" % "commons-configuration" % "1.9",
                        "org.apache.hadoop" % "hadoop-hdfs" % "latest.integration")


jarName in assembly := "WordCount.jar"

mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case s if s.endsWith(".class") => MergeStrategy.last
    case s if s.endsWith(".xsd") => MergeStrategy.last
    case s if s.endsWith(".dtd") => MergeStrategy.last
    case s if s.endsWith(".xml") => MergeStrategy.last
    case s if s.endsWith(".properties") => MergeStrategy.last
    case x => old(x)
  }
}

2 Answers:

Answer 0 (score: 0)

I guess line is implicitly converted to a String here (thanks to HImplicits?). Then we have

line.split(",", -1)(0) foreach somethigOrOther
  • 将字符串拆分为多个字符串 - .split(...)
  • 取这些字符串的第0个字段 - (0)
  • 然后对该字符串的字符重复somethingOrOther - foreach

Hence, you end up with a Char.
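
To see this concretely, here is a minimal sketch in plain Scala (no Hadoop types; the values are illustrative):

val line = "alpha,beta,gamma"
val first: String = line.split(",", -1)(0)  // "alpha" - a single String

// foreach on a String iterates over its Chars, so the function
// passed to it receives a Char, not a String:
first foreach { c => println(c) }           // prints a, l, p, h, a (one per line)

So first foreach (context.write(_, 1)) tries to pass each Char to context.write, which expects a Text key, hence the type mismatch.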

Answer 1 (score: 0)

I actually solved this by not using the _ notation and instead specifying the value directly in context.write. So instead of:

line.split(",", -1)(0) foreach (context.write(_,1))

I used:

context.write(line.split(",", -1)(0), 1)

I found a post online saying that Scala sometimes gets confused about data types when _ is used, and it suggested just defining the value explicitly. I'm not sure whether that's true, but in this case it solved the problem.
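
For reference, the whole mapper with this fix might look like the sketch below. It assumes, as in the question, that the HImplicits trait supplies the implicit Text-to-String, String-to-Text, and Int-to-LongWritable conversions that make these calls compile:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class WordCountMapper extends Mapper[LongWritable, Text, Text, LongWritable] with HImplicits {
  protected override def map(lnNumber: LongWritable, line: Text, context: Mapper[LongWritable, Text, Text, LongWritable]#Context): Unit = {
    // Emit the first column of the delimited line as the key, with a count of 1.
    // line.split relies on an implicit Text -> String conversion from HImplicits.
    context.write(line.split(",", -1)(0), 1)
  }
}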