Strangely, this does not work. Can someone explain the background? I would like to understand why it behaves this way.

The input files are Parquet files spread across multiple folders. When I print the results, they come out in the structure I want. But when I call dataframe.count() on the joined DataFrame, the job runs forever. Can anyone help me understand in detail what is going on?
import org.apache.spark.{SparkContext, SparkConf}

object TEST {
  def main(args: Array[String]) {
    val appName = args(0)
    val threadMaster = args(1)
    val inputPathSent = args(2)
    val inputPathClicked = args(3)

    // Pass the Spark configuration
    val conf = new SparkConf()
      .setMaster(threadMaster)
      .setAppName(appName)

    // Create a new Spark context
    val sc = new SparkContext(conf)

    // Create a SQL context from the Spark context we just created
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Create two DataFrames for the sent and clicked files
    val dfSent = sqlContext.read.parquet(inputPathSent)
    val dfClicked = sqlContext.read.parquet(inputPathClicked)

    // Join them on customer_id and campaign_id
    val dfJoin = dfSent.join(dfClicked,
      dfSent.col("customer_id") === dfClicked.col("customer_id") &&
      dfSent.col("campaign_id") === dfClicked.col("campaign_id"),
      "left_outer")

    dfJoin.show(20) // perfectly shows the first 20 rows
    dfJoin.count()  // Here we run into trouble and it runs forever
  }
}
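For context on why show(20) can return quickly while count() does not: show only needs to fetch a handful of rows, whereas count() has to read every Parquet file and run the full shuffle join across all partitions. A minimal diagnostic sketch, assuming the same dfJoin as above (these calls are standard DataFrame methods, not part of the original program):

// Sketch: inspect the join before counting (assumes the dfJoin built above)
dfJoin.explain()                       // print the physical plan; look for a large shuffle
println(dfJoin.rdd.partitions.length)  // how many partitions count() will have to scan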
Answer 0 (score: 0)
Use println(dfJoin.count()) and you will be able to see the count on the screen.
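A minimal sketch of that suggestion, appended to the end of main from the question (the variable name dfJoin comes from the code above):

// As suggested: wrap the count in println so the result is visible in the driver output
val rowCount = dfJoin.count()   // still an action that executes the full join
println(s"Joined row count: $rowCount")

Note that count() runs the complete join whether or not the result is printed, so wrapping it in println only makes the number visible; it does not make the job itself any faster.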