I'm doing some work with the PySpark MLlib FPGrowth algorithm and have an RDD in which some rows contain duplicate items within a single transaction. This causes the model training function to throw an error because of those duplicates. I'm fairly new to Spark and would like to know how to remove the duplicates within an RDD's rows. As an example:
#simple example
from pyspark.mllib.fpm import FPGrowth
data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data)
model = FPGrowth.train(rdd, 0.6, 2)
freqit = model.freqItemsets()
freqit.collect()
fails when training the model, because FP-growth requires the items within each transaction to be unique, whereas the deduplicated version:
#deduplicated example
from pyspark.mllib.fpm import FPGrowth
data_dedup = [["a", "b", "c"], ["a", "b", "d", "e"], ["a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data_dedup)
model = FPGrowth.train(rdd, 0.6, 2)
freqit = model.freqItemsets()
freqit.collect()
runs without error.
Thanks in advance!
Answer 0 (score: 1)
Use something like this:
rdd = rdd.map(lambda x: list(set(x)))
This removes the duplicate items within each row. Note that set() does not preserve the original order of the items, but that is fine here: FP-growth treats each transaction as a set, so the order of items within a transaction does not matter.
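For completeness, here is a minimal end-to-end sketch applying this fix to the data from the question (assuming, as in the original snippets, that sc is an existing SparkContext):

from pyspark.mllib.fpm import FPGrowth

data = [["a", "a", "b", "c"], ["a", "b", "d", "e"], ["a", "a", "c", "e"], ["a", "c", "f"]]
rdd = sc.parallelize(data)

# Deduplicate the items within each transaction before training;
# FPGrowth requires every transaction to contain unique items.
dedup_rdd = rdd.map(lambda x: list(set(x)))

# Same parameters as in the question: minSupport=0.6, numPartitions=2.
model = FPGrowth.train(dedup_rdd, 0.6, 2)
print(model.freqItemsets().collect())

Note that rdd.distinct() would not help here: it removes duplicate rows across the whole RDD, not duplicate items within a row, which is why the per-row map with set() is needed.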