PySpark: inferring timezone from location

Date: 2018-05-10 07:03:07

Tags: apache-spark pyspark timezone

Given an event's longitude and latitude, I am trying to infer its timezone in PySpark. I came across the timezonefinder library, which works locally. I wrapped it in a user-defined function in an attempt to use it as the timezone inferrer:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def get_timezone(longitude, latitude):
    # Import inside the function so each executor's Python worker loads the library.
    from timezonefinder import TimezoneFinder
    tzf = TimezoneFinder()
    return tzf.timezone_at(lng=longitude, lat=latitude)

udf_timezone = F.udf(get_timezone, StringType())

df = sqlContext.read.parquet(INPUT)
df.withColumn("local_timezone", udf_timezone(df.longitude, df.latitude))\
  .write.parquet(OUTPUT)

This code works when I run it on a single node. However, when it runs in parallel across the cluster, I get the following error:

  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 104, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
    return lambda *a: f(*a)
  File "/tmp/c95422912bfb4079b64b88427991552a/enrich_data.py", line 64, in get_timezone
  File "/opt/conda/lib/python2.7/site-packages/timezonefinder/__init__.py", line 3, in <module>
    from .timezonefinder import TimezoneFinder
  File "/opt/conda/lib/python2.7/site-packages/timezonefinder/timezonefinder.py", line 59, in <module>
    from .helpers_numba import coord2int, int2coord, distance_to_polygon_exact, distance_to_polygon, inside_polygon, \
  File "/opt/conda/lib/python2.7/site-packages/timezonefinder/helpers_numba.py", line 17, in <module>
    @jit(b1(i4, i4, i4[:, :]), nopython=True, cache=True)
  File "/opt/conda/lib/python2.7/site-packages/numba/decorators.py", line 191, in wrapper
    disp.enable_caching()
  File "/opt/conda/lib/python2.7/site-packages/numba/dispatcher.py", line 529, in enable_caching
    self._cache = FunctionCache(self.py_func)
  File "/opt/conda/lib/python2.7/site-packages/numba/caching.py", line 614, in __init__
    self._impl = self._impl_class(py_func)
  File "/opt/conda/lib/python2.7/site-packages/numba/caching.py", line 349, in __init__
    "for file %r" % (qualname, source_path))
RuntimeError: cannot cache function 'inside_polygon': no locator available for file '/opt/conda/lib/python2.7/site-packages/timezonefinder/helpers_numba.py'

I can import the library locally on the nodes where I got the error. Any solution along the following lines would be appreciated:

  • Is there a native Spark way to accomplish the task?
  • Is there another way to load the library?
  • Is there a way to avoid the caching that numba does?

2 answers:

Answer 0 (score: 0)

I eventually solved this by abandoning timezonefinder entirely and instead using the geospatial timezone dataset from timezone-boundary-builder, queried with magellan, the geospatial SQL query library for Spark.

One caveat I will note is that the Point and other objects in that library are not wrapped for Python. I ended up writing my own Scala functions for the timezone matching and dropped the magellan objects before returning the DataFrame.
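
The answerer's magellan/Scala code is not shown. As a rough, hypothetical sketch of the same idea in Python, matching each point against the polygons shipped by timezone-boundary-builder, one could use shapely for the point-in-polygon test instead of magellan; the file name combined.json and the tzid property are assumptions based on that project's release format, not something stated in the answer:

# Hypothetical sketch, not the answerer's magellan/Scala code: match a point
# against the timezone-boundary-builder polygons with shapely.
import json
from shapely.geometry import Point, shape

# Load the timezone polygons once (e.g. once per executor inside mapPartitions).
# Assumes combined.json from a timezone-boundary-builder release, whose features
# carry the zone name in properties["tzid"].
with open("combined.json") as f:
    features = json.load(f)["features"]

zones = [(feat["properties"]["tzid"], shape(feat["geometry"])) for feat in features]

def lookup_timezone(longitude, latitude):
    # Unoptimized linear scan over all zone polygons; magellan would instead
    # perform an indexed spatial join.
    point = Point(longitude, latitude)
    for tzid, polygon in zones:
        if polygon.contains(point):
            return tzid
    return None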

Answer 1 (score: 0)

I hit this error when running timezonefinder on a Spark cluster.

RuntimeError: cannot cache function 'inside_polygon': no locator available for file '/disk-1/hadoop/yarn/local/usercache/timezonefinder1.zip/timezonefinder/helpers_numba.py'

The problem was that the numpy versions differed between the cluster and the timezonefinder package we shipped to Spark: the cluster had numpy 1.13.3, while the numpy inside timezonefinder.zip was 1.17.2.
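
As a quick way to confirm this kind of mismatch, a small diagnostic like the following (a hypothetical sketch, not part of the original answer, assuming an existing SparkContext named sc) compares the numpy version the driver sees with the versions reported by the executors' Python workers:

# Hypothetical diagnostic: compare the driver's numpy version with the
# versions seen inside the executors' Python workers.
import numpy as np

def worker_numpy_version(_):
    import numpy
    return numpy.__version__

executor_versions = set(
    sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
      .map(worker_numpy_version)
      .collect()
)

print("driver numpy:  ", np.__version__)
print("executor numpy:", executor_versions)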

To resolve the version mismatch, we created a custom conda environment with timezonefinder and numpy 1.17.2 and submitted the Spark job using that environment.

Creating a custom conda environment with the timezonefinder package installed:

  1. conda create --name timezone-conda python timezonefinder

  2. source activate timezone-conda

  3. conda install -y conda-pack

  4. conda pack -o timezonecondaevnv.tar.gz -d ./MY_CONDA_ENV

https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands

Submitting the Spark job with the custom conda environment:

!spark-submit --name app_name \
--master yarn \
--deploy-mode cluster \
--driver-memory 1024m \
--executor-memory 1GB \
--executor-cores 5 \
--num-executors 10 \
--queue QUEUE_NAME \
--archives ./timezonecondaevnv.tar.gz#MY_CONDA_ENV \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MY_CONDA_ENV/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./MY_CONDA_ENV/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=./MY_CONDA_ENV/bin/python \
--conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=./MY_CONDA_ENV/bin/python \
./main.py
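
Once the job runs with the packed environment, a small smoke test (a hypothetical sketch that could sit near the top of main.py, not part of the original answer) can confirm that the UDF from the question now works on the executors before processing the real data:

# Hypothetical smoke test: exercise the timezonefinder UDF on a tiny DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def get_timezone(longitude, latitude):
    from timezonefinder import TimezoneFinder
    return TimezoneFinder().timezone_at(lng=longitude, lat=latitude)

spark = SparkSession.builder.getOrCreate()
udf_timezone = F.udf(get_timezone, StringType())

sample = spark.createDataFrame(
    [(2.35, 48.85), (-122.42, 37.77)],  # Paris, San Francisco
    ["longitude", "latitude"],
)
sample.withColumn("local_timezone",
                  udf_timezone(sample.longitude, sample.latitude)).show()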