Twitter popular hashtags using Scala and Apache Spark

Asked: 2014-11-07 05:02:03

Tags: scala twitter apache-spark

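I am trying to get the popular Twitter hashtags using Apache Spark and Scala. I can print the hashtags, but as soon as I start counting them with a reduce function, I get the following error:

network.ConnectionManager: Selector thread was interrupted!

My code is below. Please help me solve this problem.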

import java.io._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import StreamingContext._
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._

object TwitterPopularTags {

  def main(args: Array[String]) {


    val (master, filters) = (args(0), args.slice(5, args.length))

    // Twitter Authentication credentials
    System.setProperty("twitter4j.oauth.consumerKey", "****")
    System.setProperty("twitter4j.oauth.consumerSecret","****")
    System.setProperty("twitter4j.oauth.accessToken", "****")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "****")


    val ssc = new StreamingContext(master, "TwitterPopularTags", Seconds(10),
      System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))

    val tweets = TwitterUtils.createStream(ssc, None)

    val statuses = tweets.map(status => status.getText())

    val words = statuses.flatMap(status => status.split(" "))
    val hashTags = words.filter(word => word.startsWith("#"))

    val counts = hashTags.map(tag => (tag, 1))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60 * 5), Seconds(10))

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
  

[error] (run-main) java.lang.AssertionError: assertion failed: The checkpoint directory has not been set. Please use StreamingContext.checkpoint() or SparkContext.checkpoint() to set the checkpoint directory.
java.lang.AssertionError: assertion failed: The checkpoint directory has not been set. Please use StreamingContext.checkpoint() or SparkContext.checkpoint() to set the checkpoint directory.
    at scala.Predef$.assert(Predef.scala:179)
    at org.apache.spark.streaming.dstream.DStream.validate(DStream.scala:181)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$validate$10.apply(DStream.scala:227)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$validate$10.apply(DStream.scala:227)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.streaming.dstream.DStream.validate(DStream.scala:227)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$start$3.apply(DStreamGraph.scala:47)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$start$3.apply(DStreamGraph.scala:47)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.streaming.DStreamGraph.start(DStreamGraph.scala:47)
    at org.apache.spark.streaming.scheduler.JobGenerator.startFirstTime(JobGenerator.scala:114)
    at org.apache.spark.streaming.scheduler.JobGenerator.start(JobGenerator.scala:75)
    at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:67)
    at org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:410)
    at TwitterPopularTags$.main(TwitterPopularTags.scala:77)
    at TwitterPopularTags.main(TwitterPopularTags.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
[trace] Stack trace suppressed: run last compile:run for the full output.
14/11/07 20:07:43 INFO dstream.NetworkReceiver$BlockGenerator: Block pushing thread interrupted
14/11/07 20:07:43 INFO network.ConnectionManager: Selector thread was interrupted!
java.lang.RuntimeException: Nonzero exit code: 1
    at scala.sys.package$.error(package.scala:27)
[trace] Stack trace suppressed: run last compile:run for the full output.
[error] (compile:run) Nonzero exit code: 1
[error] Total time: 41 s, completed Nov 7, 2014 8:07:43 PM

This is the error I get when I try to run the code above.

1 Answer:

Answer 0 (score: 0)

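You are using reduceByKeyAndWindow with an inverse reduce function (the `_ - _` argument), which requires checkpointing to be enabled in Spark. That is what the AssertionError in your stack trace is telling you; the ConnectionManager message is only a side effect of the shutdown that follows. You can see how to enable checkpointing in a single line here.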
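
As a rough illustration of that one-line fix, here is a minimal sketch of the question's program with a checkpoint directory set before the stream is started. It assumes the same Spark Streaming version and constructor as the question; the object name `TwitterPopularTagsWithCheckpoint` and the directory `/tmp/twitter-popular-tags-checkpoint` are placeholders chosen for this example, not values from the original post.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.twitter._

object TwitterPopularTagsWithCheckpoint {

  def main(args: Array[String]): Unit = {
    val master = args(0)

    // Twitter credentials are assumed to be set through the
    // twitter4j.oauth.* system properties, exactly as in the question.

    val ssc = new StreamingContext(master, "TwitterPopularTags", Seconds(10),
      System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))

    // The fix: set a checkpoint directory before calling ssc.start().
    // The path below is a hypothetical local directory used for illustration.
    ssc.checkpoint("/tmp/twitter-popular-tags-checkpoint")

    val tweets = TwitterUtils.createStream(ssc, None)

    // Split each tweet into words and keep only the hashtags.
    val hashTags = tweets
      .flatMap(status => status.getText.split(" "))
      .filter(word => word.startsWith("#"))

    // Windowed count over 5 minutes, sliding every 10 seconds. Passing the
    // inverse function (_ - _) is what makes checkpointing mandatory.
    val counts = hashTags.map(tag => (tag, 1))
      .reduceByKeyAndWindow(_ + _, _ - _, Seconds(60 * 5), Seconds(10))

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

A local path is probably fine for testing on a single machine; on a cluster the checkpoint directory should live on a fault-tolerant file system such as HDFS. If checkpointing is not wanted at all, the overload of reduceByKeyAndWindow that takes no inverse function does not require it, at the cost of recomputing the whole window on every slide.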