Comparing the bigrams of a text with the words in an input file

Time: 2015-05-09 20:17:24

Tags: scala text apache-spark

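I want to extract a particular kind of bigram, (not, word2), from each document. If word2 appears in the (words.txt) file, the two words should be replaced with the number 1; otherwise they should be left unchanged.

Here is my data (data.txt):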

fit perfectly clie . purchased not instructions install helpful . improvement battery life not hoped .

product returned not fit nor solve problem ordered . company honest credited account .

cable good not work . cable extremely hot not recognize devices .
...

And the (words.txt) file:

hoped
instructions
work
fit
...

This is what I tried:

    import org.apache.spark.{SparkConf, SparkContext}

    object test {

      def main(args: Array[String]): Unit = {
        val conf1 = new SparkConf().setAppName("test").setMaster("local")
        val sc = new SparkContext(conf1)
        val searchList = sc.textFile("data/words.txt")
        val searchBigram = searchList.map(word => ("not", word)).collect.toSet
        val sample1 = sc.textFile("data/data.txt")
        val sample2 = sample1.map(s => s.split("""\.""") // split on .
          .map(_.split(" ") // split on space
            .sliding(2) // take continuous pairs
            .map { case Array(a, b) => (a, b) }
          ).map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)
          .map { case (e1, e2) => e1 }.mkString(" "))
        sample2.foreach(println)
      }
    }

The expected output is:

fit perfectly clie . purchased 1 install helpful . improvement battery life 1 .

product returned 1 nor solve problem ordered . company honest credited account .

cable good 1 . cable extremely hot 1 devices . 
...

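My code above is incomplete and does not work. Can anyone help me?

1 Answer:

Answer 0: (score: 0)

If you want to stick with the bigram approach, I think it works better if we also build bigrams from the search terms.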

val searchList = sc.textFile("input_file")
// Turn the search words into ("not", word) bigrams and collect them as a set,
// assuming the list is relatively small and fits in memory
val searchBigram = searchList.map(word => ("not", word)).collect.toSet

Now, starting from the result of '.sliding(2)', we can convert the arrays into tuples:

val sample = "improvement battery life not hoped".split(" ")
// bigrams is an iterator of (improvement,battery), (battery,life), (life,not), (not,hoped)
val bigrams = sample.sliding(2).map { case Array(e1, e2) => (e1, e2) }

// Now we use our bigram search set to find and replace the matching bigrams
// -> (improvement,battery), (battery,life), (life,not), (1,1)
val replaced = bigrams.map(elem => if (searchBigram.contains(elem)) ("1", "1") else elem)

// We undo the tuples, keeping the first element of each, to obtain the modified string
val result = replaced.map { case (e1, e2) => e1 }.mkString(" ")
// result: String = improvement battery life 1

Integrating this idea into the larger program should give you a working process.
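One caveat when integrating: keeping only the first element of each pair works when the match is the last bigram of the sentence. For a match in the middle, such as "purchased not instructions install helpful", it appears to leave word2 in place and drop the final token. A minimal sketch of one possible alternative, walking the tokens instead of sliding over them, is shown below. The names NotBigramReplacer and replaceNotBigrams are hypothetical, and the word set is just the four words visible in words.txt.

```scala
object NotBigramReplacer {

  // Replace every "not <word>" pair with "1" when <word> is in `words`.
  // Tokens are split on single spaces, so a standalone "." stays its own token.
  def replaceNotBigrams(line: String, words: Set[String]): String = {
    @annotation.tailrec
    def loop(tokens: List[String], acc: List[String]): List[String] = tokens match {
      // "not" followed by a listed word: emit "1" and skip both tokens
      case "not" :: w :: rest if words.contains(w) => loop(rest, "1" :: acc)
      // any other token is copied through unchanged
      case t :: rest => loop(rest, t :: acc)
      case Nil => acc.reverse
    }
    loop(line.split(" ").toList, Nil).mkString(" ")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: only the four words shown in words.txt
    val words = Set("hoped", "instructions", "work", "fit")
    val line = "fit perfectly clie . purchased not instructions install helpful . " +
      "improvement battery life not hoped ."
    println(replaceNotBigrams(line, words))
    // prints: fit perfectly clie . purchased 1 install helpful . improvement battery life 1 .
  }
}
```

In the Spark job from the question, this could be applied per line with something like sample1.map(line => NotBigramReplacer.replaceNotBigrams(line, words)), where words is the collected set searchList.collect.toSet, again assuming the word list fits in memory.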