apache-spark regex: extracting words from an RDD

Date: 2015-03-03 17:15:57

Tags: regex scala apache-spark rdd

I am trying to extract words from a text file.

TEXTFILE:

"Line1 with words to extract"
"Line2 with words to extract"
"Line3 with words to extract"

The following works well:

val data = sc.textFile(file_in).map(_.toLowerCase).cache()
val all = data.flatMap(a => "[a-zA-Z]+".r findAllIn a)


scala> data.count
res14: Long = 3

scala> all.count
res11: Long = 1419

But I want to extract the words separately for each line. If I enter

val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))

I get

scala> val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
<console>:17: error: type mismatch;
 found   : Char
 required: CharSequence
       val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))

What am I doing wrong?

Thanks in advance

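2 answers:

Answer 0 (score: 2)

Thanks for your answers.

The goal is to count, for each line, how many of its words appear in a positive and in a negative word list.

This seems to work: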

// load inputfile 
val file_in = "/path/to/teststring.txt"
val data = sc.textFile(file_in).map(_.toLowerCase).cache()

// load wordlists and collect them into local Sets on the driver
val pos_file = "/path/to/pos_list.txt"
val neg_file = "/path/to/neg_list.txt"
val pos_words = sc.textFile(pos_file).collect().toSet
val neg_words = sc.textFile(neg_file).collect().toSet

// RegEx
val regexpr = """[a-zA-Z]+""".r


val separated = data.map(line => regexpr.findAllIn(line).toList) 

// per line: (number of words, number of positive words, number of negative words)
val counts = separated.map(list => (list.size, list.count(pos_words.contains), list.count(neg_words.contains)))
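The per-line counting can be tried without a cluster. The following is only a minimal sketch in plain Scala, assuming small made-up word sets and sample lines in place of the real files; `WordListCounts`, `posWords`, `negWords`, `countLine` and `sampleLines` are hypothetical names, not part of the code above.

```scala
object WordListCounts {
  // Hypothetical stand-ins for the contents of pos_list.txt and neg_list.txt
  val posWords: Set[String] = Set("good", "great")
  val negWords: Set[String] = Set("bad")

  private val wordPattern = "[a-zA-Z]+".r

  // Returns (number of words, number of positive words, number of negative words) for one line
  def countLine(line: String): (Int, Int, Int) = {
    val words = wordPattern.findAllIn(line.toLowerCase).toList
    (words.size, words.count(posWords.contains), words.count(negWords.contains))
  }

  // Made-up sample lines, one tuple is produced per line
  val sampleLines: List[String] = List("Good food but bad service", "Great great place")
  val counts: List[(Int, Int, Int)] = sampleLines.map(countLine)
}
```

The same `countLine` function could be passed to `map` on an RDD of lines, since it only depends on the two local sets.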

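Answer 1 (score: 0)

Your problem is not really specific to Apache Spark. Your first map hands you one line at a time, but the flatMap you then call on that line iterates over the characters of the line String, so the regex receives a Char instead of a String. Spark or not, your code will not compile. For example, in the Scala REPL: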

> val lines = List("Line1 with words to extract", 
                   "Line2 with words to extract", 
                   "Line3 with words to extract")

> lines.map(line => line.flatMap("[a-zA-Z]+".r findAllIn _))

  <console>:9: error: type mismatch;
    found   : Char
    required: CharSequence
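A small sketch of why the compiler sees a Char there, in plain Scala with no Spark involved; `StringAsChars`, `firstChars` and `words` are names made up for this illustration.

```scala
object StringAsChars {
  val line = "Line1 with words to extract"

  // Collection methods on a String visit one Char at a time
  val firstChars: List[Char] = line.toList.take(3)

  // The regex API expects the whole String (a CharSequence), not a single Char
  val words: List[String] = "[a-zA-Z]+".r.findAllIn(line).toList
}
```

Here `firstChars` holds individual characters, while `words` is the result of handing the complete line to `findAllIn`.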

So if you really want to use a regex to get all the words in the lines, just use flatMap once:

 scala> lines.flatMap("[a-zA-Z]+".r findAllIn _)
        res: List[String] = List(Line, with, words, to, extract, Line, with, words, to, extract, Line, with, words, to, extract)
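If the words should stay grouped by line rather than flattened into one list, one possible approach is to use map on the outer collection and materialize the matches of each line. This is a sketch on a local List, under the assumption that the same function would be passed to map on an RDD; `PerLineWords`, `wordsOf` and `perLine` are hypothetical names.

```scala
object PerLineWords {
  private val wordPattern = "[a-zA-Z]+".r

  // All regex matches of one line, materialized as a List
  def wordsOf(line: String): List[String] = wordPattern.findAllIn(line).toList

  val lines: List[String] = List(
    "Line1 with words to extract",
    "Line2 with words to extract",
    "Line3 with words to extract")

  // map keeps one result per input line, flatMap would merge them into a single list
  val perLine: List[List[String]] = lines.map(wordsOf)
}
```

Calling `toList` matters because `findAllIn` returns an iterator, which can only be traversed once.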

Regards