I am struggling with a PySpark assignment. I need to compute the total number of views per channel. I have two sets of files: one showing each show and its view counts, the other showing each show and the channel(s) it airs on (a show can appear on multiple channels).
I have performed a join operation on the two files, and the result looks like this (first few records):
[(u'Surreal_News', (u'BAT', u'11')),
(u'Hourly_Sports', (u'CNO', u'79')),
(u'Hourly_Sports', (u'CNO', u'3')),
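For context, here is a minimal sketch of how a join like this might be built; the file names, field order, and comma delimiter are assumptions, not taken from the assignment:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# hypothetical views file, one "show,views" line per record -> (show, views)
views = sc.textFile("views.csv").map(lambda line: tuple(line.split(",")))

# hypothetical channels file, one "show,channel" line per record -> (show, channel)
channels = sc.textFile("channels.csv").map(lambda line: tuple(line.split(",")))

# join on the show name, giving records of the form (show, (channel, views))
joined = channels.join(views)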
I now need to extract the channel as the key and then, I think, use reduceByKey to get the sum of views per channel.
I have written the function below to extract the channel as the key with the views alongside, so that I can then apply reduceByKey to sum the results. However, when I try to display its results with collect(), I get an "AttributeError: 'tuple' object has no attribute 'split'" error.
def extract_chan_views(show_chan_views):
    key_value = show_chan_views.split(",")
    chan_views = key_value[1].split(",")
    chan = chan_views[0]
    views = int(chan_views[1])
    return (chan, views)
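For reference, the call that triggers the error looks like this (assuming the joined RDD is named joined, as in the sketch above):

joined.map(extract_chan_views).collect()
# AttributeError: 'tuple' object has no attribute 'split'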
Answer (score: 1)
I haven't fully parsed your code, but I ran into the same error when I applied a join transformation on two datasets.
Let's say A and B are two RDDs.
c = A.join(B)
We might expect to be able to call split(",") on the records in c, but that fails: c is an RDD whose elements are (key, (value1, value2)) tuples, not strings, so string operations like split(",") raise an AttributeError. The tuples need to be indexed instead of split.
If we want to access the values, say D is one of those tuples.
E = D[1]  # instead of E = D.split(",")[1]
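Applied to the original question, no split is needed at all: each joined record is already (show, (channel, views)), so the channel and views can be pulled out by index and summed with reduceByKey. A sketch under that assumption:

# re-key each (show, (channel, views)) record by channel
channel_views = joined.map(lambda rec: (rec[1][0], int(rec[1][1])))

# sum the views per channel
totals = channel_views.reduceByKey(lambda a, b: a + b)

print(totals.collect())  # e.g. [('CNO', 82), ('BAT', 11)] for the sample rows above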