The main code (sdf stands for "Spark DataFrame"):
from typing import Tuple

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

left_bound, bottom_bound, _, _ = box.bounds

def grid_lon_lat_to_1_nautical_mile(lon_utm: float, lat_utm: float) -> Tuple[int, int]:
    # Floor-divide the offset from the box corner by 1852 m (1 nautical mile)
    lon_grid = int((lon_utm - left_bound) // 1852)
    lat_grid = int((lat_utm - bottom_bound) // 1852)
    return lon_grid, lat_grid

def grid_sdf(sdf: DataFrame) -> DataFrame:
    # reproject_and_grid_schema is the UDF's return schema (not shown here)
    udf_grid = F.udf(lambda lon_utm, lat_utm: grid_lon_lat_to_1_nautical_mile(lon_utm, lat_utm),
                     reproject_and_grid_schema)
    return (sdf
            .withColumn("grid", udf_grid("lon_utm", "lat_utm"))
            .select("grid.lon_grid", "grid.lat_grid"))
Bizarrely, I get different output depending on which of two ways I run it:
grid_sdf(sdf.cache())
produces the correct (DataFrame) output, whereas
grid_sdf(sdf)
produces:
...
Py4JJavaError: An error occurred while calling o250.showString.
...
File "grid.py", line 139, in grid_ais_to_1_nautical_mile
lon_grid = int((lon_utm - left_bound) // 1852)
ValueError: cannot convert float NaN to integer
I thought I understood what .cache() does, but I can't see why it should prevent this error.
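
For what it's worth, the ValueError itself can be reproduced in plain Python without Spark: floor-dividing NaN gives NaN, and int() refuses to convert it. The sketch below shows that, plus a NaN-tolerant variant of the gridding function. The names grid_cell and raises_value_error are hypothetical, and passing left_bound and bottom_bound as arguments is my assumption for making the example self-contained; it does not explain why .cache() changes which rows reach the UDF.

```python
import math
from typing import Optional, Tuple

NAUTICAL_MILE_M = 1852  # metres in one nautical mile


def raises_value_error(x: float) -> bool:
    """Return True if converting the floor-divided value to int fails."""
    try:
        int(x // NAUTICAL_MILE_M)
    except ValueError:
        return True
    return False


def grid_cell(lon_utm: Optional[float],
              lat_utm: Optional[float],
              left_bound: float,
              bottom_bound: float) -> Optional[Tuple[int, int]]:
    """Hypothetical NaN-tolerant variant of the gridding function.

    Returns (lon_grid, lat_grid), or None when either coordinate
    is missing or NaN instead of raising ValueError.
    """
    if any(v is None or math.isnan(v) for v in (lon_utm, lat_utm)):
        return None
    lon_grid = int((lon_utm - left_bound) // NAUTICAL_MILE_M)
    lat_grid = int((lat_utm - bottom_bound) // NAUTICAL_MILE_M)
    return lon_grid, lat_grid


print(raises_value_error(float("nan")))           # NaN input triggers the error
print(grid_cell(float("nan"), 0.0, 0.0, 0.0))     # guarded variant returns None
print(grid_cell(3705.0, 1852.0, 0.0, 0.0))        # ordinary input is gridded
```

If a function like this were wrapped in the UDF, I would expect a returned None to show up as a null grid value in Spark rather than as an exception, but I have not verified that against my data.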