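I want to compute the top k records in PySpark Structured Streaming. I can already sort the data by word frequency, but after trying many approaches I still cannot output only the top k records. Is there an elegant way to do this?
Here is my code: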
from pyspark.sql.functions import explode, split, window

## split the tweet into words, retaining state and timestamp
## split() splits each line into an array, and explode() turns the array into multiple rows
words = jsonoutput.select(
    explode(split(jsonoutput.tweet, ' ')).alias('word'),
    jsonoutput.state,
    jsonoutput.time_create
)
## count each word per 1-minute window and state, then sort the result
windowedCounts = words.groupBy(
    window(words.time_create, "1 minutes", "1 minutes"),
    words.state,
    words.word
).count().orderBy('window', 'state', 'count')
## launch the query (query_json) that writes the counts to the in-memory table wc_test
query_json = windowedCounts.writeStream \
    .outputMode("complete") \
    .format("memory") \
    .queryName("wc_test") \
    .start()
When I run the following Spark SQL:
spark.sql("select * from wc_test limit 15").show(15,False)
the output is:
+---------------------------------------------+-----+---------+-----+
|window                                       |state|word     |count|
+---------------------------------------------+-----+---------+-----+
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|AL   |broke    |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|AL   |up       |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|AL   |they     |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|AL   |Guess    |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|AL   |lol      |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |wrestling|1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |to       |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |save     |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |money    |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |Time     |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |for      |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |some     |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |fresh    |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|CA   |shoes    |1    |
|[2017-11-17 16:51:00.0,2017-11-17 16:52:00.0]|FL   |skype    |1    |
+---------------------------------------------+-----+---------+-----+
jsonoutput is created from spark.readStream, and its type is DataFrame[value: string, parsed_field: struct<lat:float,lon:float,tweet:string,time_create:string>, lat: float, lon: float, tweet: string, time_create: string, state: string]
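To make the desired result concrete, here is a minimal sketch in plain Python of what "top k per window and state" would mean for rows shaped like the output above. It is only an illustration under assumptions: the helper name top_k_per_group and the sample rows are hypothetical, and it presumes the rows have already been collected from the wc_test table as (window, state, word, count) tuples. Within Spark itself, a similar ranking is often expressed on the batch DataFrame returned by spark.sql with row_number() over a Window partitioned by window and state and ordered by descending count, but that variant is not shown or tested here.

```python
import heapq
from collections import defaultdict


def top_k_per_group(rows, k):
    """Return the k highest-count words for each (window, state) pair.

    rows: iterable of (window, state, word, count) tuples (assumed shape).
    Result: {(window, state): [(word, count), ...]}, highest count first.
    """
    groups = defaultdict(list)
    for window, state, word, count in rows:
        groups[(window, state)].append((word, count))
    # nlargest returns at most k items, sorted by count in descending order
    return {
        key: heapq.nlargest(k, pairs, key=lambda pair: pair[1])
        for key, pairs in groups.items()
    }


# Hypothetical sample rows, not taken from the real stream
sample_rows = [
    ("16:51-16:52", "CA", "to", 3),
    ("16:51-16:52", "CA", "save", 1),
    ("16:51-16:52", "CA", "money", 2),
    ("16:51-16:52", "AL", "lol", 1),
]
top = top_k_per_group(sample_rows, 2)
```

Collecting rows to the driver only makes sense when the table is small, so this is a way to check the expected output rather than a replacement for doing the ranking in Spark.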