I am working on Hortonworks. I have stored tweets from Twitter in a Kafka topic. I am using Kafka as the producer and Spark as the consumer, doing sentiment analysis on the tweets from the Spark shell in Scala. But I want to extract specific parts of each tweet — the text, the hashtags, whether the tweet is positive or negative, and the words from the tweet that I have classified as positive or negative. My training data is Data.txt.

Data.txt contains words and their polarity (positive/negative), separated by a tab, for example:

like	positive
doom	negative
doomed	negative
doubt	positive

I added the dependencies: org.apache.spark:spark-streaming-kafka_2.10:1.6.2 and org.apache.spark:spark-streaming_2.10:1.6.2.
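For reference, a line of Data.txt in the format above can be parsed with a plain tab split — a minimal sketch in plain Scala, independent of Spark (the sample lines are placeholders matching the format described):

```scala
// Parse one tab-separated line of Data.txt into a (word, polarity) pair.
def parseLine(line: String): (String, String) = {
  val Array(word, polarity) = line.split("\t")
  (word, polarity)
}

val sample = Seq("like\tpositive", "doom\tnegative", "doomed\tnegative", "doubt\tpositive")
val wordSentiments = sample.map(parseLine).toMap
println(wordSentiments("doom"))  // prints: negative
```

The same `split("\t")` pattern match is what the streaming job below uses when it loads the file from HDFS with `textFile`.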
Here is my code:
import org.apache.spark._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka._
val conf = new SparkConf().setMaster("local[4]").setAppName("KafkaReceiver")
val ssc = new StreamingContext(conf, Seconds(5))
val zkQuorum="sandbox.hortonworks.com:2181"
val group="test-consumer-group"
val topics="test"
val numThreads=5
val args=Array(zkQuorum, group, topics, numThreads)
val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
val hashTags = lines.flatMap(_.split(" ")).filter(_.startsWith("#"))
val wordSentimentFilePath = "hdfs://sandbox.hortonworks.com:8020/TwitterData/Data.txt"
val wordSentiments = ssc.sparkContext.textFile(wordSentimentFilePath).map { line =>
  val Array(word, happiness) = line.split("\t")
  (word, happiness)
}.cache()
// Note: happiness is a String here, so `tuple._1 * tuple._2` below repeats the
// polarity string count times rather than computing a numeric score.
val happiest60 = hashTags.map(hashTag => (hashTag.tail, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60))
  .transform { topicCount => wordSentiments.join(topicCount) }
  .map { case (topic, tuple) => (topic, tuple._1 * tuple._2) }
  .map { case (topic, happinessValue) => (happinessValue, topic) }
  .transform(_.sortByKey(false))
happiest60.print()
ssc.start()
I get output like this:

(negative, fear) (positive, adapt)

But I want output like this:

(#sports, text from the tweet, fitness, positive)

That is, I also want the tweet text and the hashtag stored along with the matched word and its polarity, but I have not found a way to do that.
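One way to keep the tweet text together with its hashtags is to emit (hashtag, text) pairs instead of discarding the text in the first flatMap, then look each word up in the sentiment map. A minimal sketch of that pairing logic in plain Scala (no Spark; the sample tweet and the `wordSentiments` entries are placeholders — in the streaming job, `lines.flatMap(hashTagTextPairs)` would produce the same pairs as a DStream):

```scala
// Hypothetical sentiment lookup, as would be loaded from Data.txt.
val wordSentiments = Map("fitness" -> "positive", "doom" -> "negative")

// Pair every hashtag in a tweet with the full tweet text.
def hashTagTextPairs(tweet: String): Seq[(String, String)] =
  tweet.split(" ").filter(_.startsWith("#")).map(tag => (tag, tweet)).toSeq

// For each hashtag, find the words of the tweet that appear in the
// sentiment map, yielding (hashtag, word, polarity) triples.
def tweetSentiments(tweet: String): Seq[(String, String, String)] =
  for {
    (tag, text) <- hashTagTextPairs(tweet)
    word        <- text.split(" ").toSeq
    polarity    <- wordSentiments.get(word.toLowerCase)
  } yield (tag, word, polarity)

println(tweetSentiments("#sports fitness is great"))
```

This keeps the text available at every step, so the final tuple can carry (hashtag, text, word, polarity) rather than only (polarity, word); the same join-by-word idea carries over to `transform { ... wordSentiments.join(...) }` on the DStream.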