I have some Twitter data in Kafka, and now I am trying to use PySpark Streaming to analyze the top-k word frequencies for each state. The data looks like this:
"AK", "hello", 2
"AK", "world", 2
"AK", "cool", 1
"MN", "hello", 1
"MN", "world", 1
"MN", "cruel", 1
The output I want to generate is the top-k most frequent words for each state. My code so far is:
def get_word_count(line):
    # Emit one (state, token) pair per word, so that flatMap produces
    # individual pairs instead of flattening the state and the token list.
    tokens = get_tokens(line['tweet'])
    state = line['state']
    return [(state, token) for token in tokens]

dstream_tweets.flatMap(lambda line: get_word_count(line)) \
    .map(lambda line: ((line[0], line[1]), 1)) \
    .reduceByKey(lambda x, y: x + y)
By the way, the class of dstream_tweets is pyspark.streaming.dstream.TransformedDStream.
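For context, dstream_tweets comes from Kafka along these lines (a simplified sketch; the topic name, broker address, and JSON record layout are placeholders):

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="TopKWordsPerState")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Each Kafka message value is assumed to be a JSON object such as
# {"state": "AK", "tweet": "hello world"}.
raw = KafkaUtils.createDirectStream(
    ssc, ["tweets"], {"metadata.broker.list": "localhost:9092"})
dstream_tweets = raw.map(lambda kv: json.loads(kv[1]))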
This code cannot compute the top-k Twitter word frequencies per state from the streaming data. Is there a way to do this?
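One approach I am considering (just a sketch, and I am not sure it is idiomatic) is to regroup the counts by state inside transform and keep the k largest with heapq.nlargest; the value of K below is a placeholder:

import heapq

K = 3  # placeholder: number of top words to keep per state

def top_k_per_state(rdd):
    # rdd holds ((state, word), count) pairs from reduceByKey.
    # Re-key by state, collect that state's (word, count) pairs,
    # and keep only the K pairs with the highest counts.
    return rdd.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))) \
              .groupByKey() \
              .mapValues(lambda wcs: heapq.nlargest(K, wcs, key=lambda wc: wc[1]))

top_k = dstream_tweets.flatMap(get_word_count) \
    .map(lambda line: ((line[0], line[1]), 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .transform(top_k_per_state)
top_k.pprint()

On the sample data above this would print something like ('AK', [('hello', 2), ('world', 2), ('cool', 1)]) per batch. I realize groupByKey shuffles every pair, and an aggregation with a bounded heap per partition would probably scale better, but is this the right general direction?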