I am trying to use the FPGrowth function on some data in Spark. I tested the example here without any problems: https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
However, my dataset comes from Hive:
data = hiveContext.sql('select transactionid, itemid from transactions')
model = FPGrowth.train(data, minSupport=0.1, numPartitions=100)

This fails with a method-does-not-exist error:

py4j.protocol.Py4JError: An error occurred while calling o764.trainFPGrowthModel. Trace:
py4j.Py4JException: Method trainFPGrowthModel([class org.apache.spark.sql.DataFrame, class java.lang.Double, class java.lang.Integer]) does not exist

So I converted it to an RDD:

data = data.rdd

Now I am getting some strange pickle serializer errors.
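For context (this detail is not in the original post): `FPGrowth.train` in `pyspark.mllib` expects an RDD whose elements are each one transaction's list of distinct items, not a DataFrame of (transactionid, itemid) rows. A minimal plain-Python sketch of the required shape, using hypothetical sample rows as they might come back from the query above (one row per item):

```python
from collections import defaultdict

# Hypothetical rows from 'select transactionid, itemid from transactions':
# one (transactionid, itemid) pair per row, NOT one list per transaction.
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "eggs"), (2, "milk"),
    (3, "eggs"),
]

# Group the items by transaction id to build the per-transaction item
# lists that FPGrowth.train requires. In Spark this grouping would be a
# groupByKey on the RDD, or a collect_set done in the SQL itself.
baskets = defaultdict(list)
for txn_id, item in rows:
    baskets[txn_id].append(item)

# Each element is now one transaction's list of distinct items.
transactions = [sorted(set(items)) for items in baskets.values()]
print(transactions)
# [['bread', 'milk'], ['bread', 'eggs', 'milk'], ['eggs']]
```

Incidentally, `pyspark.rdd.PipelinedRDD` is a subclass of `pyspark.rdd.RDD` that is produced whenever a Python-side transformation such as `map` or `flatMap` is applied, so the type difference observed below is a symptom rather than the cause: what matters is the shape of the elements.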
Then I started looking at the types. In the example, the data is run through flatMap, which returns a different type than a plain RDD:

Type returned by flatMap: pyspark.rdd.PipelinedRDD
Type returned by hiveContext: pyspark.rdd.RDD

FPGrowth only seems to work with a PipelinedRDD. Is there any way to convert a regular RDD to a PipelinedRDD?
Thanks!
Answer 0 (score: 0)
Well, my query was wrong. After switching to collect_set, I managed to get past the type error with:

data = data.map(lambda row: row[0])