Suppose these are my files:
very pleased product . phone lightweight comfortable sound quality good house yard .
quality construction phone base unit good . ample supply cable adapter . plug computer soundcard .
shop unit mail rebate . unit battery pack hold play time strap carr headphone adapter cable perfect digital copy optical. component micro plug stereo connector cable micro plug rca cable .
unit primarily record guitar jam session . input plug provide power plug microphone . decent stereo mic need digital recording performance . mono mode double recording time .
admit like new electronic toy . digital camera not impress .
I want to extract all bigrams and trigrams from every sentence in each document, together with their occurrence counts.
I tried:
case class trigram(first: String, second: String,third: String) {
def mkReplacement(s: String) = s.replaceAll(first + " " + second + " " + third, first + "-" + second + "-" + third)
}
def stringToTrigrams(s: String) = {
val words = s.split(".")
if (words.size >= 3) {
words.sliding(3).map(a => trigram(a(0),a(1),a(2)))
}
else
Iterator[trigram]()
}
val conf = new SparkConf()
val sc = new SparkContext(conf)
val data = sc.textFile("docs")
val trigrams = data.flatMap {
stringToTrigrams
}.collect()
val trigramCounts = trigrams.groupBy(identity).mapValues(_.size)
But it doesn't show any trigrams?
Answer 0 (score: 3)
def stringToTrigrams(s: String) = {
val words = s.split(".")
if (words.size >= 3) {
words.sliding(3).map(a => trigram(a(0),a(1),a(2)))
} else Iterator[trigram]()
}
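IIUC, this function takes in the text of the file above and splits it on ".". That is your first problem: calling split(".") does not do what you think. split takes a regular expression, so you are actually splitting on the wildcard that matches any character rather than on a literal "." as you intended. Every character becomes a delimiter, the resulting array is empty, and no trigrams are produced. Change it to "\\." and you will split the text into sentences.
Once that is done, we need to break each sentence into words by simply splitting on whitespace. I recommend _.split("\\s+"), which splits on any run of whitespace. Now you should be able to parse the words and create trigrams using a function like this: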
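To make the difference concrete, here is a minimal sketch in plain Scala (no Spark needed). The object name SplitDemo and the sample line are made up for illustration; the sample is stitched together from the data in the question.

```scala
object SplitDemo {
  // Sample text assembled from the question's data (illustrative only).
  val line = "very pleased product . phone lightweight comfortable ."

  // "." is a regex wildcard: every character is treated as a delimiter,
  // only empty strings remain, and split drops trailing empty strings.
  val wildcard: Array[String] = line.split(".")

  // "\\." matches a literal period, giving one element per sentence.
  val literal: Array[String] = line.split("\\.")

  def main(args: Array[String]): Unit = {
    println(s"split on wildcard: ${wildcard.length} pieces")
    println(s"split on literal period: ${literal.length} pieces")
    literal.foreach(s => println(s.trim))
  }
}
```

Since the wildcard version returns an empty array, the words.size >= 3 check in the question always fails, which would explain why no trigrams show up.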
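def stringToTrigrams(s: String) = {
  // Split on a literal period to get sentences.
  val sentences = s.split("\\.")
  sentences flatMap { sent =>
    // Split on whitespace and drop the empty token left by leading spaces.
    val words = sent.split("\\s+").filter(_ != "")
    if (words.length >= 3)
      words.sliding(3).map(a => trigram(a(0), a(1), a(2)))
    else Iterator[trigram]()
  }
}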
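The question also asks for bigrams and for counts. One way to cover both is to generalise the window size. This is only a sketch on plain Scala collections: the names NGramCounts, ngrams and countNGrams are hypothetical, and it represents each n-gram as a Seq of words rather than using the trigram case class.

```scala
object NGramCounts {
  // Hypothetical helper: all n-grams of one sentence, each as a Seq of words.
  def ngrams(sentence: String, n: Int): Iterator[Seq[String]] = {
    val words = sentence.split("\\s+").filter(_.nonEmpty)
    // sliding(n) emits one short window when there are fewer than n words,
    // so sentences that are too short are skipped explicitly.
    if (words.length >= n) words.sliding(n).map(_.toSeq) else Iterator.empty
  }

  // Counts bigrams and trigrams across every sentence of one document.
  def countNGrams(doc: String): Map[Seq[String], Int] = {
    val all = for {
      sentence <- doc.split("\\.").toSeq
      n        <- Seq(2, 3)
      gram     <- ngrams(sentence, n)
    } yield gram
    all.groupBy(identity).map { case (gram, hits) => gram -> hits.size }
  }

  def main(args: Array[String]): Unit = {
    val doc = "admit like new electronic toy . digital camera not impress ."
    countNGrams(doc).foreach { case (gram, count) =>
      println(s"${gram.mkString(" ")}\t$count")
    }
  }
}
```

On an RDD the same counting could probably be done without collect(), for example by mapping each n-gram to (gram, 1) and calling reduceByKey(_ + _), but that variant is not shown or tested here.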
Hope this helps.