Question

I am trying to extract words from a text file.

Text file:

"Line1 with words to extract"
"Line2 with words to extract"
"Line3 with words to extract"

The following works fine:

val data = sc.textFile(file_in).map(_.toLowerCase).cache()
val all = data.flatMap(a => "[a-zA-Z]+".r findAllIn a)


scala> data.count
res14: Long = 3

scala> all.count
res11: Long = 1419

But I want to extract the words for each line separately. If I enter

val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))

I get

scala> val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
<console>:17: error: type mismatch;
 found   : Char
 required: CharSequence
       val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))

What am I doing wrong?

Thanks in advance

Answer 1

Thank you for your answers.

The goal is to count, for each line, how many of its words appear in a positive and a negative word list.

This seems to work:

// load inputfile 
val file_in = "/path/to/teststring.txt"
val data = sc.textFile(file_in).map(_.toLowerCase).cache()

// load wordlists
val pos_file = "/path/to/pos_list.txt"
val neg_file = "/path/to/neg_list.txt"
val pos_words = sc.textFile(pos_file).cache().collect().toSet
val neg_words = sc.textFile(neg_file).cache().collect().toSet

// RegEx
val regexpr = """[a-zA-Z]+""".r


val separated = data.map(line => regexpr.findAllIn(line).toList) 

// per line: (#_of_words, #_of_pos_words, #_of_neg_words)
val counts = separated.map(list => (list.size, list.count(pos_words.contains), list.count(neg_words.contains)))
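
The same per-line counting can be tried without a cluster. The following is only a minimal sketch on plain Scala collections; the sample lines and the two word sets are made up for illustration and stand in for the contents of the input file, pos_list.txt and neg_list.txt.

```scala
// Hypothetical sample data (assumption): stands in for the input file and the word lists
val sampleLines = List("good movie but bad ending", "great great fun")
val posWords = Set("good", "great", "fun")
val negWords = Set("bad")

val wordRegex = "[a-zA-Z]+".r

// One (total words, positive words, negative words) triple per line
val sampleCounts = sampleLines.map { line =>
  val words = wordRegex.findAllIn(line).toList
  (words.size, words.count(posWords.contains), words.count(negWords.contains))
}
// sampleCounts == List((5, 1, 1), (3, 3, 0))
```

Because the word lists are collected into ordinary Sets, the lookup inside the map is a local membership test; for very large lists a Spark broadcast variable might be worth considering instead.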

Answer 2

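Your problem is not really specific to Apache Spark. Your first map hands you one line at a time, but calling flatMap on that line makes you iterate over the characters of the line String, so the regex receives a Char instead of a CharSequence. Spark or not, your code will not work, for example in the Scala REPL: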

> val lines = List("Line1 with words to extract", 
                   "Line2 with words to extract", 
                   "Line3 with words to extract")

> lines.map(line => line.flatMap("[a-zA-Z]+".r findAllIn _))

  <console>:9: error: type mismatch;
    found   : Char
    required: CharSequence

So, if you really want to use a regular expression to get all the words in the lines, just use flatMap once:

 scala> lines.flatMap("[a-zA-Z]+".r findAllIn _)
        res: List[String] = List(Line, with, words, to, extract, Line, with, words, to, extract, Line, with, words, to, extract)
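
If the words should stay grouped by line instead of being flattened into one list, a possible variant is to keep the outer map and apply the regex to the whole line. This is a sketch under that assumption, in plain Scala; the names lineWords and wordPattern are only illustrative:

```scala
val lines = List("Line1 with words to extract",
                 "Line2 with words to extract",
                 "Line3 with words to extract")

val wordPattern = "[a-zA-Z]+".r

// map keeps one entry per line; findAllIn is applied to the whole line String,
// not to its individual characters
val lineWords = lines.map(line => wordPattern.findAllIn(line).toList)
// lineWords.head == List("Line", "with", "words", "to", "extract")
```

The same expression works inside an RDD's map, since findAllIn only needs the line as a String.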

Regards

apache-spark regex: extract words from an RDD
