Question

joined_data_filtered = spark.sql("""
    SELECT T.TransactionId, T.CustomerId, T.StartClusterId, T.EndCentermostClusterId,
           T.EndClusterId, T.StartCellId, T.EndCellId, T.EndCentermostCellId,
           T.EndCentermostLatitude, T.EndCentermostLongitude
    FROM joined_trip_data AS T
    LEFT JOIN (
        SELECT CustomerId, StartClusterId, EndClusterId
        FROM joined_trip_data
        WHERE EndCellId = StartCellId
        GROUP BY CustomerId, StartClusterId, EndClusterId
    ) AS D
      ON T.CustomerId = D.CustomerId
     AND T.StartClusterId = D.StartClusterId
     AND T.EndClusterId = D.EndClusterId
    WHERE D.CustomerId IS NULL
""")

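In my PySpark script, I first cluster the start and end locations and then remove the trips that start and end at the same location. To do that, I select the rows whose StartCellId equals EndCellId, take their CustomerId, StartClusterId and EndClusterId, left join that result back to the full dataset, and keep only the rows with no match, i.e. the trips that do not start and end at the same location.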
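The intent of the query can be sketched in plain Python on a few made-up rows. This is only an illustration of the anti-join logic, not the Spark job itself: the sample rows and the helper name `filter_round_trips` are hypothetical, and unlike SQL the sketch does not treat NULL keys specially.

```python
def filter_round_trips(rows):
    """Mimic LEFT JOIN ... WHERE D.CustomerId IS NULL (an anti-join)."""
    # Subquery D: distinct (customer, start cluster, end cluster) keys of
    # trips that start and end in the same cell.
    same_cell_keys = {
        (r["CustomerId"], r["StartClusterId"], r["EndClusterId"])
        for r in rows
        if r["StartCellId"] == r["EndCellId"]
    }
    # Keep only the rows of T whose key has no match in D.
    return [
        r for r in rows
        if (r["CustomerId"], r["StartClusterId"], r["EndClusterId"])
        not in same_cell_keys
    ]


# Hypothetical sample rows, not taken from the real dataset.
rows = [
    {"TransactionId": 1, "CustomerId": 590, "StartClusterId": 1,
     "EndClusterId": 1, "StartCellId": 11126, "EndCellId": 11126},
    {"TransactionId": 2, "CustomerId": 590, "StartClusterId": 1,
     "EndClusterId": 1, "StartCellId": 11126, "EndCellId": 11127},
    {"TransactionId": 3, "CustomerId": 590, "StartClusterId": 1,
     "EndClusterId": 2, "StartCellId": 11127, "EndCellId": 20001},
]

kept = filter_round_trips(rows)
kept_ids = [r["TransactionId"] for r in kept]
print(kept_ids)
```

Note that transaction 2 is dropped as well, even though its own cells differ, because it shares the (CustomerId, StartClusterId, EndClusterId) key with transaction 1. The result therefore depends directly on which cluster IDs each row receives.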

I ran this query several times with the YARN command below, and I got a different result each time.

yarn command - spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 6g --executor-cores 3 --num-executors 4

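So I ran the two datasets without the join condition, and I got the same count every time. When I join the two datasets on StartClusterId and EndClusterId, the results differ between runs. But when I run the script with spark-submit --master local[4], the results do not change.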

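I am using the DBSCAN algorithm to cluster the start and end latitudes and longitudes; it returns a cluster ID and a cluster cell ID. When I run it on YARN several times, the cluster ID assigned to a given cluster differs between runs, although we expected it not to change for a given cluster.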

Start cluster for location "A" - first run on YARN

 startclusterid    startcellid

  1                11126
  1                11127

Start cluster for location "A" - second run on YARN

 startclusterid    startcellid

  5                11126
  5                11127
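The two listings above contain the same cells under different cluster IDs, so the numbering looks arbitrary from run to run. One way to compare runs independently of that numbering is sketched below in plain Python. It is an assumption of mine, not part of the script: the helper `canonical_cluster_ids` is hypothetical, and it relies on the cell membership of a cluster staying the same between runs, as in the listings.

```python
def canonical_cluster_ids(assignments):
    """Replace each run-specific cluster ID with the smallest cell ID
    found in that cluster, so the label no longer depends on the run."""
    smallest = {}
    for cluster_id, cell_id in assignments:
        smallest[cluster_id] = min(cell_id, smallest.get(cluster_id, cell_id))
    return [(smallest[cluster_id], cell_id) for cluster_id, cell_id in assignments]


# (startclusterid, startcellid) pairs from the two YARN runs shown above.
first_run = [(1, 11126), (1, 11127)]
second_run = [(5, 11126), (5, 11127)]

print(canonical_cluster_ids(first_run))
print(canonical_cluster_ids(second_run))
```

With a relabeling like this, both runs map location "A" to the same identifier, which makes it easier to check whether only the numbering changes or whether the cluster membership itself changes too.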

Initial dataset before clustering:

 TransactionId  CustomerId  StartLat     StartLon    EndLat        EndLon

 17471146       590         41.890334    12.854832   41.91075183   12.86703281
 17540917       590         41.890347    12.854828   41.91041441   12.86689
 18972483       590         41.890389    12.854123   41.91134124   12.86684897
 19037116       590         41.890358    12.854846   41.9107199    12.8671107
 20315292       590         41.8903541   12.85485    41.9107082    12.8672354
 20422794       590         41.890337    12.854812   41.91074152   12.867081
 20458932       590         41.8904      12.854815   41.9107416    12.86717336
 25902100       590         41.890329    12.854836   41.91074148   12.86704109
 29829078       590         41.89034     12.8548     41.91074      12.867
 30024741       590         41.89035     12.8548     41.91078      12.867

Can someone tell me what the problem is?

Can the results differ between running Spark in local mode and running it on YARN?

0 answers: