python spark distinct command fails

Date: 2016-02-24 18:11:27

Tags: python apache-spark key distinct

print(googleRecToToken.take(5))

The above command returns

[('http://www.google.com/base/feeds/snippets/11448761432933644608', ['spanish', 'vocabulary', 'builder', 'expand', 'vocabulary', 'contains', 'fun', 'lessons', 'teach', 'entertain', 'll', 'quickly', 'find', 'mastering', 'new', 'terms', 'includes', 'games']), ('http://www.google.com/base/feeds/snippets/8175198959985911471', ['topics', 'presents', 'museums', 'world', '5', 'cd', 'rom', 'set', 'step', 'behind', 'velvet', 'rope', 'examine', 'treasured', 'collections', 'antiquities', 'art', 'inventions', 'includes', 'following', 'louvre', 'virtual', 'visit', '25', 'rooms', 'full', 'screen', 'interactive', 'video', 'detailed', 'map', 'louvre']), ('http://www.google.com/base/feeds/snippets/18445827127704822533', ['sierrahome', 'hse', 'hallmark', 'card', 'studio', 'special', 'edition', 'win', '98', '2000', 'xp', 'hallmark', 'card', 'studio', 'special', 'edition', 'win', '98', '2000', 'xp', 'sierrahome']), ('http://www.google.com/base/feeds/snippets/18274317756231697680', ['adobe', 'cs3', 'production', 'premium', 'academic', 'system', 'requirements', 'multicore', 'intel', 'processor', 'adobe', 'photoshop', 'extended', 'illustrator', 'flash', 'professional', 'effects', 'professional', 'universal', 'binary', 'also', 'work', 'powerpc', 'g4', 'g5', 'processor', 'adobe', 'onlocation', 'windows']), ('http://www.google.com/base/feeds/snippets/18409551702230917208', ['equisys', 'premium', 'support', 'zetafax', '2007', 'technical', 'support', '1', 'year', 'equisys', 'premium', 'support', 'zetafax', '2007', 'upgrade', 'license', '10', 'users', 'technical', 'support', 'phone', 'consulting', '1', 'year', '2', 'h'])]

The command

vendorRDD.map(lambda x: len(x[1])).reduce(lambda x, y : x + y)

works on that RDD. But the command below fails. Why?

print (RDD.distinct().countByKey())

The error is as follows:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-58-dae073faec09> in <module>()
     10 print (googleRecToToken.take(5))
     11 print ('\n')
---> 12 print (googleRecToToken.distinct().countByKey())
     13 
     14 

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in countByKey(self)
   1516         [('a', 2), ('b', 1)]
   1517         """
-> 1518         return self.map(lambda x: x[0]).countByValue()
   1519 
   1520     def join(self, other, numPartitions=None):

/usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/rdd.py in countByValue(self)
   1135                 m1[k] += v
   1136             return m1
-> 1137         return self.mapPartitions(countPartition).reduce(mergeMaps)
   1138 
   1139     def top(self, num, key=None):

1 Answer:

Answer 0 (score: 0)

I don't guarantee this is the only problem you'll run into, but the fundamental issue is that distinct cannot handle the data you've shown.

It needs to hash-partition the complete (key, value) pairs, and Python lists (the values in your data) are not hashable.
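
To see the constraint directly, a quick check in plain Python (no Spark required) shows that a tuple can be hashed while a list cannot:

# Tuples are immutable, so Python can hash them.
hash(('spanish', 'vocabulary'))   # returns an int

# Lists are mutable and unhashable; this is exactly what trips up distinct().
hash(['spanish', 'vocabulary'])   # TypeError: unhashable type: 'list'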

You can convert the values to a hashable type first, for example:

vendorRDD.mapValues(tuple).distinct()
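
As a minimal end-to-end sketch (the SparkContext setup and the toy (url, token-list) records below are illustrative stand-ins, not the question's actual data), converting the values first makes the failing chain succeed:

from pyspark import SparkContext

sc = SparkContext('local', 'distinct-demo')

# Toy stand-in for googleRecToToken: (url, list-of-tokens) pairs,
# including an exact duplicate so distinct() has something to drop.
pairs = sc.parallelize([
    ('url1', ['spanish', 'vocabulary']),
    ('url1', ['spanish', 'vocabulary']),
    ('url2', ['adobe', 'cs3']),
])

# mapValues(tuple) turns each value into a hashable tuple, so the full
# (key, value) pairs can be hash-partitioned by distinct().
print(pairs.mapValues(tuple).distinct().countByKey())
# defaultdict(<class 'int'>, {'url1': 1, 'url2': 1})

mapValues is a good fit here because it rewrites only the values, leaving the keys (and any existing partitioner) untouched.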

See also A list as a key for PySpark's reduceByKey