Spark dataset count is taking a lot of time

Date: 2018-01-09 15:39:19

Tags: apache-spark apache-spark-sql

I am using the count function to check whether the count is greater than 0, but it is taking more than 5 minutes for a particular column with a size of 40,000,000 records.

Below is my code snippet.

// Filter source and target down to the rows for one manufacturer
specficManufacturerdetailsSource = source.filter(col("ManufacturerSource").equalTo(individualManufacturerName));
specficManufacturerdetailsTarget = target.filter(col("ManufacturerTarget").equalTo(individualManufacturerName));

// Each count() runs a full Spark job over the filtered data
manufacturerSourceCount = specficManufacturerdetailsSource.count();
manufacturerTargetCount = specficManufacturerdetailsTarget.count();

System.out.println("Size of specfic manufacturer source ML :" + manufacturerSourceCount
        + " Size of specfic manufacturer target " + manufacturerTargetCount);

if (manufacturerSourceCount > 0 && manufacturerTargetCount > 0) {
}

1 Answer:

Answer 0 (score: 0)

Based on the requirement you describe above, you do not need the full count.

Instead of counting, you can use findFirst(): if any value is found (manufacturerSourceCount.isPresent()), that already means count > 0.
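A note on that suggestion: findFirst() and isPresent() belong to Java's Stream/Optional API rather than to Spark's Dataset API. On a Dataset, the same "does at least one row exist" check can be expressed with takeAsList(1) or, on Spark 2.4+, isEmpty(). A minimal sketch reusing the variable names from the question above:

// take(1)/takeAsList(1) lets Spark stop as soon as one matching row is found;
// if nothing matches, it still has to scan the remaining partitions.
boolean sourceHasRows = !specficManufacturerdetailsSource.takeAsList(1).isEmpty();
boolean targetHasRows = !specficManufacturerdetailsTarget.takeAsList(1).isEmpty();

// On Spark 2.4+ the same check is available directly:
// boolean sourceHasRows = !specficManufacturerdetailsSource.isEmpty();

if (sourceHasRows && targetHasRows) {
    // proceed with the manufacturer-specific comparison
}

Unlike count(), which has to read every matching row before it can answer, an existence check only needs to find a single row, which is much cheaper on a 40,000,000-row dataset when you only care whether the count is greater than 0.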