Check whether the elements of a tweet array contain any element of a positive-word array, and count them

Time: 2019-03-22 14:48:01

Tags: scala apache-spark

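We are building a sentiment analysis application and have converted a DataFrame of tweets into an array. We created another array made up of positive words. However, we cannot count the number of tweets that contain at least one of these positive words. We tried the code below and the result is 1. It should be greater than 1, so apparently the matches are not being counted: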

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
var tweetDF = sqlContext.read.json("hdfs:///sandbox/tutorial-files/770/tweets_staging/*")
tweetDF.show()
var messages = tweetDF.select("msg").collect.map(_.toSeq) 
println("Total messages: " + messages.size)
val positive = Source.fromFile("/home/teslavm/positive.txt").getLines.toArray
var happyCount=0
for (e <- 0 until messages.size) {
    for (f <- 0 until positive.size) {
        if (messages(e).contains(positive(f))) {
            happyCount = happyCount + 1
        }
    }
}
print("\nNumber of happy messages: " +happyCount) 


1 Answer:

Answer 0 (score: 0)

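This should work. Its advantages are that you do not have to collect the results to the driver, and that it is written in a more functional style.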

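import scala.io.Source

// Keep the messages as a distributed Dataset[String] instead of collecting them.
// .as[String] needs the session's implicits in scope (spark-shell imports them
// automatically; otherwise import them as in the local test below).
val messages = tweetDF.select("msg").as[String]

// Load the positive words once on the driver, lowercased for case-insensitive matching.
val positiveWords =
  Source
    .fromFile("/home/teslavm/positive.txt")
    .getLines
    .toList
    .map(word => word.toLowerCase)

// True if the message contains at least one positive word as a substring.
def hasPositiveWords(message: String): Boolean = {
  val _message = message.toLowerCase
  positiveWords.exists(word => _message.contains(word))
}

// Each message is counted at most once, however many positive words it contains.
val positiveMessages = messages.filter(hasPositiveWords _)

println(positiveMessages.count())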
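
As for why the loop in the question reports 1: collect.map(_.toSeq) turns each row into a Seq[Any] with one element per column, so messages(e).contains(...) compares whole elements for equality instead of searching inside the tweet text. A minimal plain-Scala sketch of the difference, with a made-up sample string and illustrative variable names:

```scala
// Row.toSeq yields a Seq[Any]; with a single "msg" column, each message
// becomes a one-element sequence holding the full tweet text.
val asSeq: Seq[Any] = Seq("Yes I am happy")
val asString: String = "Yes I am happy"

// Seq.contains checks whether some element EQUALS the argument.
val seqMatch = asSeq.contains("happy")        // no element equals "happy"

// String.contains checks whether the argument occurs as a substring.
val stringMatch = asString.contains("happy")  // "happy" occurs inside the text

println(s"Seq.contains: $seqMatch, String.contains: $stringMatch")
```

With the original loop, a tweet is therefore only counted when its entire text is identical to one of the positive words. Note also that the nested loop adds one per matching word rather than one per message, which exists in the answer avoids.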

I tested this code locally with the following:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

val tweetDF = List(
  (1, "Yes I am happy"),
  (2, "Sadness is a way of life"),
  (3, "No, no, no, no, yes")
).toDF("id", "msg")

val positiveWords = List("yes", "happy")

It worked.
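
The matching predicate can also be sanity-checked without Spark. The sketch below reuses the three sample messages and two positive words from the local test on plain Scala collections; the names sampleMessages, sampleWords and containsPositiveWord are illustrative only and not part of the answer's code.

```scala
// Same sample data as the local test, but as a plain List (no Spark needed).
val sampleMessages = List(
  "Yes I am happy",
  "Sadness is a way of life",
  "No, no, no, no, yes"
)

// Positive words, already lowercased.
val sampleWords = List("yes", "happy")

// Same logic as hasPositiveWords: case-insensitive substring match
// against at least one positive word.
def containsPositiveWord(message: String): Boolean = {
  val lower = message.toLowerCase
  sampleWords.exists(word => lower.contains(word))
}

// Count messages, not word hits: a message with several positive words counts once.
val happyCount = sampleMessages.count(containsPositiveWord)

println("Number of happy messages: " + happyCount) // prints 2
```

The first and third messages match, so the count is 2 even though the first message contains both words. Because this is substring matching, a word such as "yes" would also match inside "eyes"; splitting each message into tokens before comparing would be one way to avoid that if whole-word matches are wanted.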