Can print the contents of a Spark DataFrame, but (for example) not count it

Asked: 2015-10-19 13:52:09

Tags: scala apache-spark dataframe

Strangely, this does not work. Can someone explain what is going on here? I want to understand why it behaves this way.

The input is a set of Parquet files spread across several folders. When I print the result, the rows have exactly the structure I want. But when I call dataframe.count() on the joined DataFrame, the job runs forever. Can anyone explain in detail what is happening? (A small diagnostic sketch follows the code below.)

    import org.apache.spark.{SparkContext, SparkConf}

    object TEST {

      def main(args: Array[String]): Unit = {

        val appName = args(0)
        val threadMaster = args(1)
        val inputPathSent = args(2)
        val inputPathClicked = args(3)

        // Pass the Spark configuration
        val conf = new SparkConf()
          .setMaster(threadMaster)
          .setAppName(appName)

        // Create a new Spark context
        val sc = new SparkContext(conf)

        // Specify a SQL context and pass in the Spark context we created
        val sqlContext = new org.apache.spark.sql.SQLContext(sc)

        // Create two DataFrames for the sent and clicked files
        val dfSent = sqlContext.read.parquet(inputPathSent)
        val dfClicked = sqlContext.read.parquet(inputPathClicked)

        // Join them on customer_id and campaign_id
        val dfJoin = dfSent.join(
          dfClicked,
          dfSent.col("customer_id") === dfClicked.col("customer_id") &&
            dfSent.col("campaign_id") === dfClicked.col("campaign_id"),
          "left_outer")

        dfJoin.show(20) // perfectly shows the first 20 rows
        dfJoin.count()  // Here we run into trouble and it runs forever
      }
    }
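
For reference, the plan that both show(20) and count() execute can be inspected without running the job. This is a minimal diagnostic sketch, added here for illustration, using the standard DataFrame.explain() method; the comments state my understanding of where the time difference comes from:

    // Print the physical plan of the join without triggering execution.
    // show(20) only runs this plan until 20 output rows are produced,
    // whereas count() has to run it over every partition of both inputs.
    dfJoin.explain()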

1 Answer:

Answer 0 (score: 0)

Use println(dfJoin.count()) and you will see the count on the screen.
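
A minimal sketch of how this fits into the program above, reusing dfJoin from the question's code. One caveat: count() is an action, so the full join is still computed before anything is printed:

    // count() returns the number of rows as a Long on the driver;
    // wrapping it in println makes that value visible on the console.
    val total: Long = dfJoin.count()
    println(s"Joined row count: $total")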