Question

num_of_words = (doc_title,num) #number of words in a document
lines = (doc_title,word,num_of_occurrences) #number of occurrences of a specific word in a document

When I call lines.join(num_of_words), I expect to get something like:

(doc_title,(word,num_of_occurrences,num))

But instead I get:

(doc_title,(word,num))

which omits num_of_occurrences. What am I doing wrong here? How can I join these two RDDs to get the result I expect?

Answer 1

From the Spark API docs for the join method:

join(other, numPartitions=None)

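Return an RDD containing all pairs of elements with matching keys in self and other.

Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in self and (k, v2) is in other.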

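So the join method can only be used on pairs, that is, on (key, value) tuples (or at least it only returns results of the form described).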
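To make the pair semantics concrete, here is a minimal plain-Python sketch of an inner join on key-value records. It is not Spark's implementation: join_pairs is a hypothetical helper, and the choice to read only element 0 as the key and element 1 as the value is an assumption made to match the output reported in the question.

```python
from collections import defaultdict


def join_pairs(left, right):
    """Plain-Python stand-in for an inner join on (key, value) records.

    Assumption: only element 0 (the key) and element 1 (the value) of each
    record are read, so any further elements are silently ignored.
    """
    right_by_key = defaultdict(list)
    for record in right:
        right_by_key[record[0]].append(record[1])
    return [
        (record[0], (record[1], v2))
        for record in left
        for v2 in right_by_key.get(record[0], [])
    ]


num_of_words = [("harry potter", 4242)]

# Flat 3-tuple: the third element (num_of_occurrences) is never looked at.
flat_lines = [("harry potter", "wand", 100)]
flat_result = join_pairs(flat_lines, num_of_words)

# Nested pair: the whole (word, num_of_occurrences) tuple travels as the value.
nested_lines = [("harry potter", ("wand", 100))]
nested_result = join_pairs(nested_lines, num_of_words)

print(flat_result)
print(nested_result)
```

Under that assumption, the flat record loses its third element, while the nested record keeps both word and count because they are carried together as a single value.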

A way around this is to use (doc_title, (word, num_occurrences)) tuples instead of (doc_title, word, num_occurrences). Working example:

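from pyspark import SparkContext

sc = SparkContext.getOrCreate()  # in the pyspark shell, sc is already defined

num_of_words = sc.parallelize([("harry potter", 4242)])
lines = sc.parallelize([("harry potter", ("wand", 100))])
result = lines.join(num_of_words)
print(result.collect())
# [('harry potter', (('wand', 100), 4242))]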

(Note that sc.parallelize only turns a local Python collection into a Spark RDD, and collect() does exactly the opposite.)
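If the data already exists as flat 3-tuples, one possible way to reach the exact shape asked for in the question is to reshape each record before the join and flatten it afterwards. The sketch below uses two hypothetical helpers, to_pair and flatten_joined, which are not part of the original answer. They are ordinary Python functions, so they can be checked locally and then passed to RDD.map.

```python
def to_pair(record):
    """(doc_title, word, num_of_occurrences) -> (doc_title, (word, num_of_occurrences))"""
    doc_title, word, num_of_occurrences = record
    return (doc_title, (word, num_of_occurrences))


def flatten_joined(joined):
    """(doc_title, ((word, num_of_occurrences), num)) -> (doc_title, (word, num_of_occurrences, num))"""
    doc_title, ((word, num_of_occurrences), num) = joined
    return (doc_title, (word, num_of_occurrences, num))


# Local check on single records, no Spark needed.
paired = to_pair(("harry potter", "wand", 100))
flattened = flatten_joined(("harry potter", (("wand", 100), 4242)))
print(paired)
print(flattened)

# With RDDs named as in the question, the same helpers would be applied as:
#   result = lines.map(to_pair).join(num_of_words).map(flatten_joined)
```

Keeping the reshaping in named functions makes the record shape before and after the join explicit, which is where the original confusion came from.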

pyspark join() does not produce the expected result
