Spark Scala: join two DataFrames on proximity of position and range of time

Asked: 2018-04-04 09:42:31

Tags: scala apache-spark spark-dataframe

I have two DataFrames:

  1. A DataFrame DF1 with the structure (ID, StartDate, EndDate, Position)

  2. A DataFrame DF2 with the structure (DateTime, Position)

  3. I want to use those DataFrames to create a new one containing, for each DF1(ID), the number of rows of DF2 where DF2(DateTime) is between DF1(StartDate) and DF1(EndDate) and DF2(Position) is near DF1(Position).

    We can assume I have a UDF, isNearUDF(pos1, pos2), that handles the position comparison.

    I am currently trying to do this with a join between my two DataFrames, but it does not seem to be the right solution.

    EDIT 2:

    Here is an MVCE:

    The date conditions work, but the join takes a very long time. When I add the isInRadius condition (which stands in for my real isNearUDF(pos1, pos2)), I get errors:

    import org.apache.spark.sql.functions.{udf, lit}
    import spark.implicits._

    def isInRadius(lat1: Double, lon1: Double, lat2: Double, lon2: Double, dist: Double): Boolean = {
      val distance = 0 // calculate distance between lon/lat positions

      distance <= dist
    }

    val DF1 = sc.parallelize(Array(
      ("ID1", "2018-02-27T13:47:59.416+01:00", "2018-03-01T16:02:00.632+01:00", "25.13297154663", "55.13297154663"),
      ("ID2", "2018-02-25T13:47:59.416+01:00", "2018-02-07T16:02:00.632+01:00", "26.13297154663", "55.13297154663"),
      ("ID3", "2018-02-24T13:47:59.416+01:00", "2018-02-02T16:02:00.632+01:00", "25.13297154663", "55.13297154663")
      // ...
    )).toDF("ID", "CreationDate", "EndDate", "Lat1", "Lon1")

    val DF2 = sc.parallelize(Array(
      ("2018-02-27T13:47:59.416+01:00", "25.13297154663", "55.13297154663"),
      ("2018-02-27T13:47:59.416+01:00", "25.1304663", "54.10663"),
      ("2018-02-27T13:47:59.416+01:00", "25.1354663", "55.132904663")
      // ...
    )).toDF("DateTime", "Lat2", "Lon2")

    val isInRadiusUdf = udf(isInRadius _)

    val DF3 = DF1.join(DF2, $"DateTime" >= $"CreationDate" && $"DateTime" <= $"EndDate" /*&& isInRadiusUdf($"Lat1", $"Lon1", $"Lat2", $"Lon2", lit(10))*/)

    display(DF3)
    
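    For reference, the distance placeholder above could be filled with the standard haversine formula. This is only a sketch, not the asker's actual isNearUDF: the 6371 km Earth radius constant and the assumption that dist is expressed in kilometres are mine.

    import scala.math.{atan2, cos, pow, sin, sqrt, toRadians}

    // Haversine great-circle distance; assumes dist is in kilometres
    def isInRadius(lat1: Double, lon1: Double, lat2: Double, lon2: Double, dist: Double): Boolean = {
      val earthRadiusKm = 6371.0
      val dLat = toRadians(lat2 - lat1)
      val dLon = toRadians(lon2 - lon1)
      val a = pow(sin(dLat / 2), 2) +
        cos(toRadians(lat1)) * cos(toRadians(lat2)) * pow(sin(dLon / 2), 2)
      val distance = earthRadiusKm * 2 * atan2(sqrt(a), sqrt(1 - a))
      distance <= dist
    }

    And since the stated goal is a count per DF1(ID) rather than the joined rows themselves, the join would presumably be followed by an aggregation along these lines (note that an inner join silently drops IDs with no matching DF2 rows):

    // Number of matching DF2 rows for each DF1 ID
    val countsPerId = DF3.groupBy($"ID").count()
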

2 Answers:

Answer 0 (score: 0)

Try changing your function definition to:

    def isInRadius: Double => Double => Double => Double => Double => Boolean =
      lat1 => long1 => lat2 => long2 => dist => {
        val distance = 0 // calculate distance between lon/lat positions

        distance <= dist
      }
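
A common alternative to currying here, sketched below under my own assumption that a fixed radius is acceptable, is to close over the radius so that the UDF takes only column arguments and the lit(10) literal disappears from the join condition:

    import org.apache.spark.sql.functions.udf

    // The radius is captured in the closure; the returned UDF takes only the four coordinates.
    // Note the MVCE builds the Lat/Lon columns as strings, so an explicit .cast("double") may be needed.
    def isInRadiusUdf(dist: Double) = udf((lat1: Double, lon1: Double, lat2: Double, lon2: Double) => {
      val distance = 0.0 // calculate distance between lon/lat positions
      distance <= dist
    })

    val DF3 = DF1.join(DF2,
      $"DateTime" >= $"CreationDate" && $"DateTime" <= $"EndDate" &&
        isInRadiusUdf(10.0)($"Lat1", $"Lon1", $"Lat2", $"Lon2"))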

Answer 1 (score: 0)

After trying every possible solution and getting strange results, I finally fixed my problem by simply restarting my Spark cluster (a Databricks notebook). I have no idea what the underlying issue was, but the MVCE code now works.