Comparing the contents of two files in Scala and Spark

Time: 2015-05-12 14:40:06

Tags: scala text apache-spark

I want to compare every word in a file against an external word list. Please look at this example:

My data file is:

surprise heard thump opened door small seedy man clasping package wrapped.

upgrading system found review spring 2008 issue moody audio mortgage backed.

omg left gotta wrap review order asap . understand hand delivered dali lama

speak hands wear earplugs lives . listen maintain link long .

buffered lightning thousand volts cables burned revivification place .

cables cables finally able hear auditory gem long rumored music .
...

And the external word file is:

thump,1
man,-1
small,-1
surprise,-1
system,1
wrap,1
left,1
lives,-1
place,-1
lightning,-1
long,1
...

When comparing the words, if a word in a document also appears in the external word list, its value is added to that document's total, so each document ends up with a single score (a minimal sketch of this rule follows the expected output). The expected output is:

 -2 ; surprise heard thump opened door small seedy man clasping package wrapped.

 1 ; upgrading system found review spring 2008 issue moody audio mortgage backed.

 2 ; omg left gotta wrap review order asap . understand hand delivered dali lama

 0 ; speak hands wear earplugs lives . listen maintain link long .

 -2 ; buffered lightning thousand volts cables burned revivification place .

 1 ; cables cables finally able hear auditory gem long rumored music .
 ...
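To make the scoring rule concrete, here is a minimal plain-Scala sketch; the lexicon literal below is just an illustrative subset of the word file, and score is a hypothetical helper, not code from the question:

// Illustrative subset of the word file, as a word -> value map
val lexicon = Map("surprise" -> -1, "thump" -> 1, "small" -> -1, "man" -> -1)

// Each word found in the lexicon contributes its value; any other word contributes 0
def score(line: String): Int =
  line.split(" ").map(word => lexicon.getOrElse(word, 0)).sum

// score("surprise heard thump opened door small seedy man clasping package wrapped.")
// => -1 + 1 + (-1) + (-1) = -2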

I have tried:

import org.apache.spark.{SparkConf, SparkContext}

object test {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    val searchList = sc.textFile("data/words.txt")

    // Parse each "word,value" line of the word file into a (word, value) pair
    val sentilex = searchList.map { line =>
      val Array(a, b) = line.split(",").map(_.trim)
      (a, b.toInt)
    }.collect().toVector

    val lex = sentilex.map(a => a._1)
    val lab = sentilex.map(b => b._2)
    val sample1 = sc.textFile("data/data.txt")
    val sample2 = sample1.map(line => line.split(" "))
    val sample3 = sample2.map(elem => if (lex.contains(elem)) ("1") else elem)
    sample3.foreach(println)
  }
}

Can anyone help me?

1 Answer:

Answer 0: (score: 4)

Hi, I think the best way to do what you want is to use a broadcast variable to send sentilex to the workers, and then compute each line's score with a map. In code it would look like this:

import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    val searchList = sc.textFile("data/words.txt")

    // Collect the word -> value lexicon on the driver and broadcast it to the workers
    val sentilex = sc.broadcast(searchList.map { line =>
      val Array(a, b) = line.split(",").map(_.trim)
      (a, b.toInt)
    }.collect().toMap)

    // Score each line: sum the lexicon values of its words (0 for words not in the lexicon)
    val sample1 = sc.textFile("data/data.txt")
    val sample2 = sample1.map(line =>
      (line.split(" ").map(word => sentilex.value.getOrElse(word, 0)).reduce(_ + _), line))
    sample2.collect.foreach(println)
  }
}
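For reference, this prints (score, line) tuples such as (-2,surprise heard thump opened door small seedy man clasping package wrapped.). If you want the exact "score ; line" layout shown in the question, one possible extra step (a sketch, not part of the original answer) is to format each pair before printing:

// Hypothetical formatting step to reproduce the "score ; line" layout from the question
val formatted = sample2.map { case (score, line) => s"$score ; $line" }
formatted.collect.foreach(println)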

I hope this is useful.