I am struggling with a PySpark assignment. I need to compute the total number of views per channel. I have two sets of files: one showing each show and its view counts, the other showing each show and the channel(s) it airs on (a show can appear on multiple channels).
I have performed a join operation on the two files, and the result looks like this (first few records):
[(u'Surreal_News', (u'BAT', u'11')),
(u'Hourly_Sports', (u'CNO', u'79')),
(u'Hourly_Sports', (u'CNO', u'3')),
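For context, here is a minimal sketch of how a join like this might be built; the file names, field order, and comma delimiter are assumptions, not taken from the assignment:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# hypothetical views file, one "show,views" line per record -> (show, views)
views = sc.textFile("views.csv").map(lambda line: tuple(line.split(",")))

# hypothetical channels file, one "show,channel" line per record -> (show, channel)
channels = sc.textFile("channels.csv").map(lambda line: tuple(line.split(",")))

# join on the show name, giving records of the form (show, (channel, views))
joined = channels.join(views)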
I now need to extract the channel as the key and then, I think, use reduceByKey to get the sum of views per channel.
I have written the function below to extract the channel as the key with the views alongside, so that I can then apply reduceByKey to sum the results. However, when I try to display its results with collect(), I get an "AttributeError: 'tuple' object has no attribute 'split'" error.
def extract_chan_views(show_chan_views):
    key_value = show_chan_views.split(",")
    chan_views = key_value[1].split(",")
    chan = chan_views[0]
    views = int(chan_views[1])
    return (chan, views)
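For reference, the call that triggers the error looks like this (assuming the joined RDD is named joined, as in the sketch above):

joined.map(extract_chan_views).collect()
# AttributeError: 'tuple' object has no attribute 'split'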
Answer (score: 1)
I haven't fully parsed your code, but I ran into the same error when I applied a join transformation on two datasets.
Let's say A and B are two RDDs.
c = A.join(B)
We might expect to be able to call split(",") on the records in c, but that fails: c is an RDD whose elements are (key, (value1, value2)) tuples, not strings, so string operations like split(",") raise an AttributeError. The tuples need to be indexed instead of split.
If we want to access the values, say D is one of those tuples.
E = D[1]  # instead of E = D.split(",")[1]
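Applied to the original question, no split is needed at all: each joined record is already (show, (channel, views)), so the channel and views can be pulled out by index and summed with reduceByKey. A sketch under that assumption:

# re-key each (show, (channel, views)) record by channel
channel_views = joined.map(lambda rec: (rec[1][0], int(rec[1][1])))

# sum the views per channel
totals = channel_views.reduceByKey(lambda a, b: a + b)

print(totals.collect())  # e.g. [('CNO', 82), ('BAT', 11)] for the sample rows above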