
Time: 2018-07-31 09:48:44

Tags: pyspark yarn data-science

joined_data_filtered = spark.sql("""
    SELECT T.TransactionId, T.CustomerId, T.StartClusterId, T.EndCentermostClusterId,
           T.EndClusterId, T.StartCellId, T.EndCellId, T.EndCentermostCellId,
           T.EndCentermostLatitude, T.EndCentermostLongitude
    FROM joined_trip_data AS T
    LEFT JOIN (
        SELECT CustomerId, StartClusterId, EndClusterId
        FROM joined_trip_data
        WHERE EndCellId = StartCellId
        GROUP BY CustomerId, StartClusterId, EndClusterId
    ) AS D
      ON T.CustomerId = D.CustomerId
     AND T.StartClusterId = D.StartClusterId
     AND T.EndClusterId = D.EndClusterId
    WHERE D.CustomerId IS NULL
""")
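
Read as a left anti-join, the query keeps only those trips whose (CustomerId, StartClusterId, EndClusterId) combination never occurs on a trip that starts and ends in the same cell. Below is a minimal plain-Python sketch of that logic. The sample rows and the function name anti_join_filter are hypothetical and only illustrate the filter; the sketch also ignores SQL NULL handling.

```python
# Hypothetical sample rows; only the columns the filter uses are included.
trips = [
    {"TransactionId": 1, "CustomerId": 590, "StartClusterId": 1,
     "EndClusterId": 2, "StartCellId": 11126, "EndCellId": 11126},
    {"TransactionId": 2, "CustomerId": 590, "StartClusterId": 1,
     "EndClusterId": 2, "StartCellId": 11126, "EndCellId": 20001},
    {"TransactionId": 3, "CustomerId": 590, "StartClusterId": 1,
     "EndClusterId": 3, "StartCellId": 11127, "EndCellId": 20002},
]


def anti_join_filter(rows):
    """Mimic the LEFT JOIN ... WHERE D.CustomerId IS NULL pattern."""

    def key(row):
        return (row["CustomerId"], row["StartClusterId"], row["EndClusterId"])

    # Subquery D: key triples with at least one trip where EndCellId == StartCellId.
    excluded = {key(r) for r in rows if r["EndCellId"] == r["StartCellId"]}
    # Outer query: keep rows whose key triple found no match in D.
    return [r for r in rows if key(r) not in excluded]


kept_ids = [r["TransactionId"] for r in anti_join_filter(trips)]
print(kept_ids)  # [3]
```

Because the filter keys on cluster ids, any run-to-run change in how cells are assigned to clusters changes which triples are excluded, and with them the resulting row count.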



yarn command - spark-submit --master yarn --deploy-mode client --driver-memory 4g --executor-memory 6g --executor-cores 3 --num-executors 4

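I ran the two datasets without the join condition and got the same count every time. When I join the two datasets on startclusterid and endclusterid, the result differs from run to run. But when I run the script with spark-submit --master local[4], the result does not change.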


Start cluster for location "A", first run on YARN

 startclusterid    startcellid

  1                11126
  1                11127

Start cluster for location "A", second run on YARN

 startclusterid    startcellid

  5                11126
  5                11127
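
The two listings above show the same cells under different cluster ids (1 versus 5). One thing worth checking is whether the YARN runs differ only in how clusters are numbered, or also in which cells end up grouped together. A minimal sketch in plain Python, assuming each run's startcellid to startclusterid mapping has been collected into a dict; the function name same_grouping and the dict layout are hypothetical:

```python
def same_grouping(run_a, run_b):
    """Return True if two cell -> cluster mappings group cells identically,
    ignoring the numeric cluster ids themselves."""
    if run_a.keys() != run_b.keys():
        return False
    a_to_b = {}
    b_to_a = {}
    for cell, a_id in run_a.items():
        b_id = run_b[cell]
        # Each id in run A must map to exactly one id in run B, and vice versa.
        if a_to_b.setdefault(a_id, b_id) != b_id:
            return False
        if b_to_a.setdefault(b_id, a_id) != a_id:
            return False
    return True


# Values taken from the two listings above.
first_run = {11126: 1, 11127: 1}
second_run = {11126: 5, 11127: 5}
print(same_grouping(first_run, second_run))  # True: only the label changed
```

If the check returns False on the full mapping, cluster membership itself changes between runs, which would be consistent with the join producing different counts.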


 TransactionId  CustomerId  StartLat    StartLon   EndLat       EndLon

 17471146       590         41.890334   12.854832  41.91075183  12.86703281
 17540917       590         41.890347   12.854828  41.91041441  12.86689
 18972483       590         41.890389   12.854123  41.91134124  12.86684897
 19037116       590         41.890358   12.854846  41.9107199   12.8671107
 20315292       590         41.8903541  12.85485   41.9107082   12.8672354
 20422794       590         41.890337   12.854812  41.91074152  12.867081
 20458932       590         41.8904     12.854815  41.9107416   12.86717336
 25902100       590         41.890329   12.854836  41.91074148  12.86704109
 29829078       590         41.89034    12.8548    41.91074     12.867
 30024741       590         41.89035    12.8548    41.91078     12.867


0 answers:
