I am trying to extract the words from a text file.
TEXTFILE:
"Line1 with words to extract"
"Line2 with words to extract"
"Line3 with words to extract"
The following works fine:
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
val all = data.flatMap(a => "[a-zA-Z]+".r findAllIn a)
scala> data.count
res14: Long = 3
scala> all.count
res11: Long = 1419
But I want to extract the words for each line separately. If I enter
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
I get
scala> val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
<console>:17: error: type mismatch;
found : Char
required: CharSequence
val separated = data.map(line => line.flatMap(a => "[a-zA-Z]+".r findAllIn a))
What am I doing wrong?
Thanks in advance.
Answer 0 (score: 2)
Thanks for your answer.
The goal is to count the occurrences of the words from a pos/neg word list.
This seems to work:
// load inputfile
val file_in = "/path/to/teststring.txt"
val data = sc.textFile(file_in).map(_.toLowerCase).cache()
// load wordlists
val pos_file = "/path/to/pos_list.txt"
val neg_file = "/path/to/neg_list.txt"
val pos_words = sc.textFile(pos_file).cache().collect().toSet
val neg_words = sc.textFile(neg_file).cache().collect().toSet
// RegEx
val regexpr = """[a-zA-Z]+""".r
val separated = data.map(line => regexpr.findAllIn(line).toList)
// (#_of_words, #_of_pos_words, #_of_neg_words) per line
val counts = separated.map(list =>
  (list.size,
   list.count(pos_words.contains),
   list.count(neg_words.contains)))
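The counting logic above can be checked without Spark on a plain Scala collection. This is a minimal sketch with hypothetical sample data and word lists (the names `lines`, `posWords`, and `negWords` are illustrative, not from the original files):

```scala
// Hypothetical sample data standing in for the RDD contents.
val lines = List("good movie but bad ending", "great plot")
val posWords = Set("good", "great")
val negWords = Set("bad")

val regexpr = "[a-zA-Z]+".r
val separated = lines.map(line => regexpr.findAllIn(line).toList)

// (total words, positive words, negative words) per line
val counts = separated.map(list =>
  (list.size, list.count(posWords.contains), list.count(negWords.contains)))
// counts == List((5, 1, 1), (2, 1, 0))
```

The same lambdas drop into the Spark version unchanged, since `RDD.map` takes ordinary Scala functions.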
Answer 1 (score: 0)
Your problem is not really specific to Apache Spark: your first map gives you a single line, but the flatMap on that line makes you iterate over the characters of the line String. So, Spark or not, your code cannot work, e.g. in the Scala REPL:
> val lines = List("Line1 with words to extract",
"Line2 with words to extract",
"Line3 with words to extract")
> lines.map( line => line.flatMap("[a-zA-Z]+".r findAllIn _))
<console>:9: error: type mismatch;
found : Char
required: CharSequence
So, if you really want all the words in the lines using a regex, just use flatMap once:
scala> lines.flatMap("[a-zA-Z]+".r findAllIn _)
res: List[String] = List(Line, with, words, to, extract, Line, with, words, to, extract, Line, with, words, to, extract)
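If, as in the question, the words should stay grouped by line, keep the outer map and apply the regex to the whole line instead of flatMapping over its characters. A minimal sketch on the same sample data:

```scala
val lines = List("Line1 with words to extract",
                 "Line2 with words to extract",
                 "Line3 with words to extract")

// map (not flatMap) keeps one List[String] per line;
// the regex is applied to the line, not to each Char
val perLine = lines.map(line => "[a-zA-Z]+".r.findAllIn(line).toList)
// perLine(0) == List("Line", "with", "words", "to", "extract")
```

The same expression works on a Spark RDD, since `findAllIn` only needs a CharSequence, which each line already is.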
Regards