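We are building a sentiment analysis application and have converted our tweets DataFrame into an array. We also created a second array containing positive words. However, we cannot count the number of tweets that contain at least one of these positive words. We tried the code below and got 1 as the result. The real number must be greater than 1, so apparently the matches are not being counted: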
import scala.io.Source
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
var tweetDF = sqlContext.read.json("hdfs:///sandbox/tutorial-files/770/tweets_staging/*")
tweetDF.show()
var messages = tweetDF.select("msg").collect.map(_.toSeq)
println("Total messages: " + messages.size)
val positive = Source.fromFile("/home/teslavm/positive.txt").getLines.toArray
var happyCount=0
for (e <- 0 until messages.size) {
  for (f <- 0 until positive.size) {
    if (messages(e).contains(positive(f))){
      happyCount=happyCount+1
    }
  }
}
print("\nNumber of happy messages: " +happyCount)
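Answer 0 (score: 0)
This should work. It has the advantage that you don't have to collect the results on the driver, and it is written in a more functional style.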
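import scala.io.Source
// .as[String] needs an implicit Encoder: spark-shell imports the implicits automatically,
// otherwise add `import spark.implicits._` first.
val messages = tweetDF.select("msg").as[String]
// Load the positive words once on the driver and lower-case them.
val positiveWords =
  Source
    .fromFile("/home/teslavm/positive.txt")
    .getLines
    .toList
    .map(word => word.toLowerCase)
// True if the message contains at least one positive word (case-insensitive substring match).
def hasPositiveWords(message: String): Boolean = {
  val _message = message.toLowerCase
  positiveWords.exists(word => _message.contains(word))
}
// Each message is counted once, no matter how many positive words it contains.
val positiveMessages = messages.filter(hasPositiveWords _)
println(positiveMessages.count())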
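As for why the loop in the question reports 1: `collect.map(_.toSeq)` turns each row into a `Seq[Any]` that holds the whole message as a single element, so `contains` tests whether the entire message equals a positive word, not whether the word occurs inside the text. The loop also adds one per matching (message, word) pair, so a substring version of it would count a message once for every positive word it contains. A minimal plain-Scala sketch of the difference, using made-up sample values and no Spark:

```scala
// Hypothetical sample values standing in for one collected row and one positive word.
val row: Seq[Any] = Seq("Yes I am happy")   // shape of Row.toSeq for a single "msg" column
val word = "happy"

// Seq.contains compares whole elements: is some element equal to "happy"?
val seqMatch = row.contains(word)

// String.contains looks for a substring inside the message text.
val textMatch = row.head.toString.contains(word)

println(s"Seq.contains: $seqMatch, String.contains: $textMatch")
// prints "Seq.contains: false, String.contains: true"
```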
I tested this code locally with the following setup:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
val tweetDF = List(
  (1, "Yes I am happy"),
  (2, "Sadness is a way of life"),
  (3, "No, no, no, no, yes")
).toDF("id", "msg")
val positiveWords = List("yes", "happy")
It worked.
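The count to expect from that sample can be worked out without Spark. The sketch below repeats the same case-insensitive substring check on a plain `List`; the names `sample` and `matches` are only for illustration. Keep in mind that substring matching is loose: a word such as "yes" would also match inside "eyes", so a stricter variant might split each message into words first.

```scala
// Same three messages as the local test data, without the id column.
val sample = List(
  "Yes I am happy",
  "Sadness is a way of life",
  "No, no, no, no, yes"
)
val positiveWords = List("yes", "happy")

// Keep a message if its lower-cased text contains at least one positive word.
val matches = sample.filter { msg =>
  val lower = msg.toLowerCase
  positiveWords.exists(word => lower.contains(word))
}

println(matches.size)  // prints 2: the first and third messages match
```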