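I want to extract particular bigrams of the form (not, word2) from each document.
If word2 appears in the file words.txt, the two words should be replaced with the number 1;
otherwise they should be left unchanged.
This is my data (data.txt):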
fit perfectly clie . purchased not instructions install helpful . improvement battery life not hoped .
product returned not fit nor solve problem ordered . company honest credited account .
cable good not work . cable extremely hot not recognize devices .
...
And the words.txt file:
hoped
instructions
work
fit
...
This is what I tried:
import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("test").setMaster("local")
    val sc = new SparkContext(conf1)
    val searchList = sc.textFile("data/words.txt")
    val searchBigram = searchList.map(word => ("not", word)).collect.toSet
    val sample1 = sc.textFile("data/data.txt")
    val sample2 = sample1.map(s => s.split("""\.""") // split on .
      .map(_.split(" ") // split on space
        .sliding(2) // take continuous pairs
        .map { case Array(a, b) => (a, b) }
      ).map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)
      .map { case (e1, e2) => e1 }.mkString(" "))
    sample2.foreach(println)
  }
}
The expected output is:
fit perfectly clie . purchased 1 install helpful . improvement battery life 1 .
product returned 1 nor solve problem ordered . company honest credited account .
cable good 1 . cable extremely hot 1 devices .
...
My code above is incomplete and does not work. Can anyone help me?
Answer 0 (score: 0)
If you want to stick with the bigram approach, I think it works better if we also build bigrams from the search terms.
val searchList = sc.textFile("input_file")
// Turn the search words into bigrams as well and collect them as a set,
// assuming the word list is relatively small and fits in memory
val searchBigram = searchList.map(word => ("not", word)).collect.toSet
Now, starting from the result of `.sliding(2)`, we can convert the arrays into tuples:
val sample = "improvement battery life not hoped".split(" ")
// bigrams is an iterator of (improvement,battery), (battery,life), (life,not), (not,hoped)
val bigrams = sample.sliding(2).map { case Array(e1, e2) => (e1, e2) }
// Now we use our bigram search set to find and replace the matching bigrams
// -> (improvement,battery), (battery,life), (life,not), (1,1)
val replaced = bigrams.map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)
// We undo the tuples (keeping the first element of each) to obtain the modified string
val result = replaced.map { case (e1, e2) => e1 }.mkString(" ")
// result:String = improvement battery life 1
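One thing to be aware of with the steps above: keeping only the first element of each tuple works when the matching bigram ends the sentence, but it behaves differently when the match sits in the middle. The small sketch below wraps the same steps in a function (the names `BigramCaveat` and `replaceViaTuples` are made up for illustration) and runs it on both cases, so the difference is easy to see.

```scala
object BigramCaveat {
  // Same steps as above, wrapped in a function (hypothetical name).
  // Assumes the sentence has at least two space-separated tokens.
  def replaceViaTuples(sentence: String, searchBigram: Set[(String, String)]): String =
    sentence.split(" ")
      .sliding(2)                                   // consecutive pairs
      .map { case Array(e1, e2) => (e1, e2) }       // arrays -> tuples
      .map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)
      .map { case (e1, _) => e1 }                   // keep first element only
      .mkString(" ")

  def main(args: Array[String]): Unit = {
    val searchBigram = Set(("not", "hoped"), ("not", "instructions"))

    // Match at the end of the sentence: gives the wanted result.
    println(replaceViaTuples("improvement battery life not hoped", searchBigram))
    // prints: improvement battery life 1

    // Match in the middle: word2 survives via the following bigram,
    // and the last word of the sentence is dropped.
    println(replaceViaTuples("purchased not instructions install helpful", searchBigram))
    // prints: purchased 1 instructions install
  }
}
```

So for data like yours, where "not word2" can appear anywhere in a sentence, the last step probably needs to skip the bigram that follows a match and to re-append the final word.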
Integrating this idea into the larger program should give you a working process.
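For reference, here is a minimal sketch of what the per-line step could look like. It is not the tuple-based code from this answer but a swapped-in variant that walks the token list directly, so a match in the middle of a sentence consumes both "not" and word2 and the last token is kept. The names `NotBigramReplace` and `replaceNotBigrams` are hypothetical, and "recognize" is added to the word set only because the expected output replaces "not recognize" (the visible part of words.txt ends in "...").

```scala
import scala.annotation.tailrec

object NotBigramReplace {
  // Hypothetical helper: replaces every "not <word>" pair with "1"
  // when <word> is in `words`; all other tokens are kept unchanged.
  def replaceNotBigrams(line: String, words: Set[String]): String = {
    @tailrec
    def loop(tokens: List[String], acc: List[String]): List[String] = tokens match {
      case "not" :: w :: rest if words.contains(w) => loop(rest, "1" :: acc) // consume both words
      case t :: rest                               => loop(rest, t :: acc)   // keep token as-is
      case Nil                                     => acc.reverse
    }
    // "." is already a separate token in the data, so no sentence split is needed.
    loop(line.split(" ").toList, Nil).mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    // Assumed word set: the four visible entries plus "recognize" (see note above).
    val words = Set("hoped", "instructions", "work", "fit", "recognize")
    val lines = Seq(
      "fit perfectly clie . purchased not instructions install helpful . improvement battery life not hoped .",
      "product returned not fit nor solve problem ordered . company honest credited account .",
      "cable good not work . cable extremely hot not recognize devices ."
    )
    lines.map(replaceNotBigrams(_, words)).foreach(println)
  }
}
```

With Spark, the same function could presumably be applied per line, for example by collecting words.txt into a set with `sc.textFile("data/words.txt").collect.toSet` and then calling `sc.textFile("data/data.txt").map(line => NotBigramReplace.replaceNotBigrams(line, words))`, again assuming the word list is small enough to fit in memory.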