Finding the top-k records in Python Spark Structured Streaming

Date: 2017-11-19 16:53:16

Tags: apache-spark apache-spark-sql pyspark-sql spark-structured-streaming

I want to compute the top-k records in PySpark Structured Streaming. I can already sort the data by word frequency, but no matter what I try I cannot get it to output only the top k records. Is there an elegant way to do this?

Here is my code:

## imports used by the snippets below (added for completeness)
from pyspark.sql.functions import explode, split, window

## split each tweet into words, retaining timestamps;
## split() turns each tweet into an array of words, and explode() turns
## the array into multiple rows, one per word
words = jsonoutput.select(
    explode(split(jsonoutput.tweet, ' ')).alias('word'),
    jsonoutput.state,
    jsonoutput.time_create
)
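
## a minimal static illustration of the split() + explode() step above,
## using hypothetical demo data: one row per tweet in, one row per word out
demo = spark.createDataFrame([("they broke up", "AL")], ["tweet", "state"])
demo.select(explode(split(demo.tweet, " ")).alias("word"), demo.state).show()
## prints three rows: (they, AL), (broke, AL), (up, AL)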

## count words per state per 1-minute tumbling window
windowedCounts = words.groupBy(
    window(words.time_create, "1 minutes", "1 minutes"),
    words.state,
    words.word
).count().orderBy('window', 'state', 'count')

## launch the query (query_json) that writes the windowed counts to an
## in-memory table named wc_test
query_json = windowedCounts.writeStream \
                       .outputMode("complete") \
                       .format("memory") \
                       .queryName("wc_test") \
                       .start()

When I run this Spark SQL query:

spark.sql("select * from wc_test limit 15").show(15,False)

the output is:

+---------------------------------------------+-----+---------+-----+
|window                                       |state|word     |count|
+---------------------------------------------+-----+---------+-----+
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|AL   |broke    |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|AL   |up       |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|AL   |they     |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|AL   |Guess    |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|AL   |lol      |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |wrestling|1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |to       |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |save     |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |money    |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |Time     |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |for      |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |some     |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |fresh    |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |shoes    |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|FL   |skype    |1    |
+---------------------------------------------+-----+---------+-----+

jsonoutput is created from spark.readStream, and its schema is DataFrame[value: string, parsed_field: struct<lat:float,lon:float,tweet:string,time_create:string>, lat: float, lon: float, tweet: string, time_create: string, state: string].
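
So far the closest I have gotten is to rank the rows in a follow-up query against the in-memory snapshot rather than in the stream itself. Below is a minimal sketch, assuming k = 3; row_number() works here because wc_test is a static table (such window functions are not supported directly on a streaming DataFrame):

## rank words within each (window, state) group by descending count and
## keep only the top 3 rows per group; this queries the static snapshot
## written by the memory sink, not the stream itself
spark.sql("""
    SELECT window, state, word, `count`
    FROM (
        SELECT *,
               row_number() OVER (PARTITION BY window, state
                                  ORDER BY `count` DESC) AS rn
        FROM wc_test
    )
    WHERE rn <= 3
    ORDER BY window, state, rn
""").show(15, False)

What I am still looking for is a way to push this top-k step into the streaming query itself.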

0 Answers:

No answers yet.