pyspark FPGrowth doesn't work with RDD

Time: 2016-04-29 00:30:51

Tags: pyspark apache-spark-mllib

I am trying to use the FPGrowth function on some data in Spark. I tested the example here without any problems: https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html

However, my dataset comes from Hive:

data = hiveContext.sql('select transactionid, itemid from transactions')
model = FPGrowth.train(data, minSupport=0.1, numPartitions=100)

This failed with a "method does not exist" error:

py4j.protocol.Py4JError: An error occurred while calling o764.trainFPGrowthModel. Trace:
py4j.Py4JException: Method trainFPGrowthModel([class org.apache.spark.sql.DataFrame, class java.lang.Double, class java.lang.Integer]) does not exist

So I converted it to an RDD:

data = data.rdd

Now I started getting some strange pickle serializer errors.

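Then I started looking at the types. In the example, the data is run through a flatmap, which returns a different type than the RDD I get from Hive.

RDD type returned by flatmap: pyspark.rdd.PipelinedRDD

RDD type returned by hiveContext: pyspark.rdd.RDD

FPGrowth only seems to work with the PipelinedRDD. Is there some way I can convert a regular RDD to a PipelinedRDD?

Thanks!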

1 Answer:

Answer 0: (score: 0)

Well, my query was wrong. After changing it to use collect_set, I managed to get around the type error with:

data=data.map(lambda row: row[0])
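
For readers wondering what shape of data this fix produces, here is a minimal plain-Python sketch. It assumes, as in the linked docs example, that FPGrowth.train wants one collection of distinct items per transaction rather than one (transactionid, itemid) record per row. The helper name rows_to_transactions and the sample rows are hypothetical and only stand in for the Hive result; in Spark the grouping would be done by the collect_set query, not in local Python.

```python
from collections import OrderedDict


def rows_to_transactions(rows):
    """Group (transaction_id, item_id) pairs into one list of unique items
    per transaction, preserving first-seen order."""
    grouped = OrderedDict()
    for transaction_id, item_id in rows:
        items = grouped.setdefault(transaction_id, [])
        if item_id not in items:  # keep each item once per transaction
            items.append(item_id)
    return list(grouped.values())


# Hypothetical sample rows standing in for the Hive query result.
rows = [(1, "a"), (1, "b"), (1, "a"), (2, "b"), (2, "c")]
transactions = rows_to_transactions(rows)
print(transactions)
```

The answer does not show the corrected query. Presumably it grouped by transactionid and selected collect_set(itemid), so that row[0] of each result row is already the item list for one transaction; treat that as an assumption rather than the author's exact code.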