Removing duplicates from rows in a Spark RDD

Asked: 2016-09-06 15:25:17

Tags: apache-spark machine-learning pyspark data-science

I am working with the PySpark MLlib FPGrowth algorithm and have an RDD of transactions in which some rows contain duplicate items. Model training raises an error because of these duplicates. I am very new to Spark and would like to know how to remove the duplicates within each row of the RDD. As an example:

    # simple example
    from pyspark.mllib.fpm import FPGrowth

    data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
    rdd = sc.parallelize(data)
    model = FPGrowth.train(rdd, 0.6, 2)
    freqit = model.freqItemsets()
    freqit.collect()
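(This is the run that fails: MLlib's FPGrowth requires the items within each transaction to be unique, so the train call errors out with a message along the lines of "Items in a transaction must be unique".)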

I would like the data to look like this instead:

    # simple example
    from pyspark.mllib.fpm import FPGrowth

    data_dedup = [["a", "b", "c"], ["a", "b", "d", "e"], ["a", "c", "e"], ["a", "c", "f"]]
    rdd = sc.parallelize(data_dedup)
    model = FPGrowth.train(rdd, 0.6, 2)
    freqit = model.freqItemsets()
    freqit.collect()

And it will run without error.
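For what it's worth, with minSupport=0.6 over four transactions an itemset has to appear in at least ceil(0.6 * 4) = 3 of them, so collecting the frequent itemsets on the deduplicated data should return something like:

    [FreqItemset(items=['a'], freq=4), FreqItemset(items=['c'], freq=3),
     FreqItemset(items=['a', 'c'], freq=3)]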

Thanks in advance!

1 answer:

Answer 0 (score: 1)

Use something like this:

    rdd = rdd.map(lambda x: list(set(x)))

This will remove the duplicates within each row.
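Putting it together with the data from the question, here is a minimal end-to-end sketch (assuming an active SparkContext `sc`, as in the question; the variable names are just for illustration):

    # simple example: dedupe items within each row, then train
    from pyspark.mllib.fpm import FPGrowth

    data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
    rdd = sc.parallelize(data)

    # set() drops duplicate items within a transaction (and its order)
    rdd_dedup = rdd.map(lambda x: list(set(x)))

    model = FPGrowth.train(rdd_dedup, minSupport=0.6, numPartitions=2)
    print(model.freqItemsets().collect())

Note that converting through a set does not preserve the item order within a row, but that is harmless here: FPGrowth treats each transaction as a set of items, so the frequent itemsets are unaffected.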