I want to compare every word in a file against an external word list. Consider this example.
My data file is:
surprise heard thump opened door small seedy man clasping package wrapped.
upgrading system found review spring 2008 issue moody audio mortgage backed.
omg left gotta wrap review order asap . understand hand delivered dali lama
speak hands wear earplugs lives . listen maintain link long .
buffered lightning thousand volts cables burned revivification place .
cables cables finally able hear auditory gem long rumored music .
...
And the external word file is:
thump,1
man,-1
small,-1
surprise,-1
system,1
wrap,1
left,1
lives,-1
place,-1
lightning,-1
long,1
...
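When comparing, if a word in a document (a line of the data file) matches one of the external words, that word's value is added to the document's total, so that in the end each document has a score. The expected output is: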
-2 ; surprise heard thump opened door small seedy man clasping package wrapped.
1 ; upgrading system found review spring 2008 issue moody audio mortgage backed.
2 ; omg left gotta wrap review order asap . understand hand delivered dali lama
0 ; speak hands wear earplugs lives . listen maintain link long .
-2 ; buffered lightning thousand volts cables burned revivification place .
1 ; cables cables finally able hear auditory gem long rumored music .
...
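To make the scoring rule concrete, here is a minimal sketch in plain Scala, without Spark. The names `ScoreSketch`, `lexicon` and `score` are made up for illustration, and the map holds only the sample entries listed above. It assumes words are separated by whitespace and must match exactly, so a token such as `wrapped.` does not match `wrap`.

```scala
object ScoreSketch {
  // Sample lexicon entries copied from the external word file above.
  val lexicon: Map[String, Int] = Map(
    "thump" -> 1, "man" -> -1, "small" -> -1, "surprise" -> -1,
    "system" -> 1, "wrap" -> 1, "left" -> 1, "lives" -> -1,
    "place" -> -1, "lightning" -> -1, "long" -> 1
  )

  // Sum the lexicon value of every whitespace-separated token;
  // tokens that are not in the lexicon contribute 0.
  def score(line: String): Int =
    line.split("\\s+").map(w => lexicon.getOrElse(w, 0)).sum

  def main(args: Array[String]): Unit = {
    val line = "surprise heard thump opened door small seedy man clasping package wrapped."
    // surprise (-1) + thump (+1) + small (-1) + man (-1) = -2
    println(s"${score(line)} ; $line")
  }
}
```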
Here is what I tried:
object test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    val searchList = sc.textFile("data/words.txt")
    val sentilex = searchList.map({ (line) =>
      val Array(a,b) = line.split(",").map(_.trim)
      (a,b.toInt)
    }).collect().toVector
    val lex=sentilex.map(a=>a._1)
    val lab=sentilex.map(b=>b._2)
    val sample1 = sc.textFile("data/data.txt")
    val sample2 = sample1.map(line=>line.split(" "))
    val sample3 = sample2.map(elem => if (lex.contains(elem)) ("1") else elem)
    sample3.foreach(println)
  }
}
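Can anyone help me?
Answer 0: (score: 4)
Hi, I think the best way to do what you want is to use a broadcast variable to ship sentilex to the executors, and then compute the sum per line in a map function. In code it would look something like this: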
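import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)

    // Load the "word,score" lexicon, collect it to the driver as a Map, and broadcast it
    val searchList = sc.textFile("data/words.txt")
    val sentilex = sc.broadcast(searchList.map { line =>
      val Array(word, value) = line.split(",").map(_.trim)
      (word, value.toInt)
    }.collect().toMap)

    // Score each line: sum the lexicon values of its words (0 for words not in the lexicon)
    val sample1 = sc.textFile("data/data.txt")
    val sample2 = sample1.map { line =>
      (line.split(" ").map(word => sentilex.value.getOrElse(word, 0)).sum, line)
    }

    // Prints one (score,line) tuple per input line
    sample2.collect().foreach(println)
    sc.stop()
  }
}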
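One caveat with the code above: the `val Array(...) = line.split(",")` destructuring throws a `scala.MatchError` on any line that does not split into exactly two fields, for example a blank line. As a hedged sketch, the lexicon parsing could be pulled out into a pure function that skips such lines. The names `LexiconSketch` and `parseLexicon` are hypothetical, and silently skipping bad lines is an assumption about the desired behaviour, not something the question specifies.

```scala
import scala.util.Try

object LexiconSketch {
  // Parse "word,score" lines into a Map. Lines that do not have exactly
  // two comma-separated fields, or whose score is not an integer, are skipped.
  def parseLexicon(lines: Seq[String]): Map[String, Int] =
    lines.flatMap { line =>
      line.split(",").map(_.trim) match {
        case Array(word, value) => Try(value.toInt).toOption.map(word -> _)
        case _                  => None
      }
    }.toMap

  def main(args: Array[String]): Unit = {
    // The last two entries are malformed on purpose and are dropped.
    val lexicon = parseLexicon(Seq("left,1", " wrap , 1 ", "lives,-1", "", "no score here"))
    println(lexicon.getOrElse("wrap", 0))
    println(lexicon.size)
  }
}
```

In the Spark job, the same function could be applied to the collected lines before broadcasting, e.g. `sc.broadcast(LexiconSketch.parseLexicon(searchList.collect().toSeq))`.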
I hope this helps.