I need to develop a Spark application that uses streaming to create RDDs, converts the DStream's RDDs to DataFrames, and runs SQL queries over them, similar to this Spark example. The challenge is that I need to run multiple SQL queries rather than just one. I'm not sure whether DataFrame/SQLContext is thread-safe, so that I could create threads and run the queries and their actions in parallel, or whether there is a cleaner way to do this in Spark. Below is how I do it sequentially, but I would like to run the queries concurrently.
DataFrame wordsDataFrame = sqlContext.createDataFrame(rowRDD, JavaRecord.class);

// Register as table
wordsDataFrame.registerTempTable("words");

// Do word count on table using SQL and print it
DataFrame wordCountsDataFrame =
    sqlContext.sql("select word, count(*) as total from words group by word");
System.out.println("========= " + time + " =========");
wordCountsDataFrame.show();
// Do word count AGAIN on table using SQL, this time only for longer words, and print it
DataFrame wordCountsDataFrame2 =
    sqlContext.sql("select word, count(*) as total from words where length(word) > 10 group by word");
System.out.println("========= " + time + " =========");
wordCountsDataFrame2.show();
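
For reference, this is a minimal sketch of the threaded approach I am considering, assuming the shared SQLContext and the registered "words" temp table can safely be used from multiple threads at once (exactly the part I am unsure about). The runConcurrently helper name and the thread-pool sizing are just illustrative, not an established pattern:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.spark.sql.SQLContext;

public class ParallelQueries {

    // Hypothetical helper: submit each SQL statement as its own task so the
    // queries and their actions run in parallel against the shared SQLContext.
    public static void runConcurrently(final SQLContext sqlContext, List<String> queries)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(queries.size());
        for (final String query : queries) {
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    // Each task runs its own query and triggers its own action.
                    sqlContext.sql(query).show();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}

Inside the foreachRDD block, after registering the temp table, this would be called with the two query strings above, e.g. runConcurrently(sqlContext, Arrays.asList(query1, query2)).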