PySpark - different behavior with and without .cache()

Asked: 2018-08-22 21:57:28

Tags: apache-spark pyspark apache-spark-sql user-defined-functions pyspark-sql

The main code (where sdf stands for "Spark DataFrame"):

from typing import Tuple

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

# `box` and `reproject_and_grid_schema` are defined elsewhere in my code.
left_bound, bottom_bound, _, _ = box.bounds

def grid_lon_lat_to_1_nautical_mile(lon_utm: float, lat_utm: float) -> Tuple[int, int]:
    # 1852 m = 1 nautical mile
    lon_grid = int((lon_utm - left_bound) // 1852)
    lat_grid = int((lat_utm - bottom_bound) // 1852)

    return lon_grid, lat_grid

def grid_sdf(sdf: DataFrame) -> DataFrame:

    udf_grid = F.udf(lambda lon_utm, lat_utm: grid_lon_lat_to_1_nautical_mile(lon_utm, lat_utm),
                     reproject_and_grid_schema)

    return (sdf
            .withColumn("grid", udf_grid("lon_utm", "lat_utm"))
            .select("grid.lon_grid", "grid.lat_grid"))

Bizarrely, I get different output depending on which of two ways I run it:

  • grid_sdf(sdf.cache()) produces the correct (DataFrame) output.
  • grid_sdf(sdf) produces:

    ...
    Py4JJavaError: An error occurred while calling o250.showString.
    ...
    File "grid.py", line 139, in grid_ais_to_1_nautical_mile
      lon_grid = int((lon_utm - left_bound) // 1852)
    ValueError: cannot convert float NaN to integer
    

I think I understand what .cache() does, but I don't see why it should prevent this error.
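The traceback itself points at a plain-Python issue independent of caching: `int()` raises `ValueError` when given a float NaN, so any row where a coordinate is NaN will crash the UDF when it is actually evaluated. A minimal sketch of a guard (the `-1` sentinel and the `safe_grid` helper name are my own assumptions, not part of the original code):

```python
import math


def safe_grid(value: float, origin: float) -> int:
    """Grid a coordinate into 1-nautical-mile cells, tolerating missing values.

    int(float('nan')) raises "ValueError: cannot convert float NaN to integer",
    which matches the traceback above, so we check before converting.
    """
    if value is None or math.isnan(value):
        return -1  # hypothetical sentinel for a missing coordinate
    return int((value - origin) // 1852)  # 1852 m = 1 nautical mile
```

Dropping or flagging NaN rows before applying the UDF (e.g. with a `dropna` on the coordinate columns) would be an alternative to a sentinel.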

0 Answers:

There are no answers yet.