Python Spark streaming word count per group

Time: 2017-11-19 16:40:18

Tags: apache-spark mapreduce pyspark spark-streaming

I have some Twitter data in Kafka, and now I am trying to use pyspark streaming to analyze the top-k word frequencies for each state. The output I want to generate looks like this:

"AK", "hello", 2
"AK", "world", 2
"AK", "cool", 1
"MN", "hello", 1
"MN", "world", 1
"MN", "cruel", 1

My code looks like this:

def get_word_count(line):
    # emit one (state, token) pair per token in the tweet,
    # so that flatMap produces records the downstream map can key on
    tokens = get_tokens(line['tweet'])
    state = line['state']
    return [(state, token) for token in tokens]

dstream_tweets.flatMap(lambda line: get_word_count(line)) \
              .map(lambda pair: (pair, 1)) \
              .reduceByKey(lambda x, y: x + y)
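
The helper get_tokens is not defined in the post; a minimal stand-in, assuming it just splits a tweet into lowercase word tokens, might look like:

import re

def get_tokens(tweet):
    # hypothetical tokenizer standing in for the post's undefined get_tokens:
    # lowercase the text and split on runs of non-word characters
    return [t for t in re.split(r"\W+", tweet.lower()) if t]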

The class of dstream_tweets is pyspark.streaming.dstream.TransformedDStream.

This code cannot compute the top-k Twitter word frequencies for each state from the streaming data. Is there any way to do that?
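
A sketch of one possible approach, assuming the ((state, word), count) records produced above are correct: use DStream.transform to apply a per-batch RDD operation that re-keys the counts by state, groups them, and keeps the k largest with heapq.nlargest. The name top_k_per_state and the value k=10 are illustrative, not from the original post.

import heapq

def top_k_per_state(rdd, k):
    # rdd holds ((state, word), count) records; re-key by state and
    # keep the k words with the highest counts for each state
    return (rdd.map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))
               .groupByKey()
               .mapValues(lambda pairs: heapq.nlargest(k, pairs, key=lambda p: p[1])))

word_counts = dstream_tweets.flatMap(lambda line: get_word_count(line)) \
                            .map(lambda pair: (pair, 1)) \
                            .reduceByKey(lambda x, y: x + y)

# transform applies top_k_per_state to every micro-batch RDD
word_counts.transform(lambda rdd: top_k_per_state(rdd, 10)).pprint()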

0 Answers:

No answers