Given the longitude and latitude of an event, I am trying to infer the timezone in PySpark. I came across the timezonefinder library, which works locally. I wrapped it in a user-defined function in an attempt to use it as a timezone inferrer:
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def get_timezone(longitude, latitude):
    # Import inside the function so each executor resolves its own copy
    from timezonefinder import TimezoneFinder
    tzf = TimezoneFinder()
    return tzf.timezone_at(lng=longitude, lat=latitude)

# Wrap the lookup as a UDF that returns the timezone name as a string
udf_timezone = F.udf(get_timezone, StringType())

df = sqlContext.read.parquet(INPUT)
df.withColumn("local_timezone", udf_timezone(df.longitude, df.latitude))\
    .write.parquet(OUTPUT)
This code works when I run it on a single node. However, when run in parallel, I get the following error:
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 104, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File "/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1525907011747_0007/container_1525907011747_0007_01_000062/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
return lambda *a: f(*a)
File "/tmp/c95422912bfb4079b64b88427991552a/enrich_data.py", line 64, in get_timezone
File "/opt/conda/lib/python2.7/site-packages/timezonefinder/__init__.py", line 3, in <module>
from .timezonefinder import TimezoneFinder
File "/opt/conda/lib/python2.7/site-packages/timezonefinder/timezonefinder.py", line 59, in <module>
from .helpers_numba import coord2int, int2coord, distance_to_polygon_exact, distance_to_polygon, inside_polygon, \
File "/opt/conda/lib/python2.7/site-packages/timezonefinder/helpers_numba.py", line 17, in <module>
@jit(b1(i4, i4, i4[:, :]), nopython=True, cache=True)
File "/opt/conda/lib/python2.7/site-packages/numba/decorators.py", line 191, in wrapper
disp.enable_caching()
File "/opt/conda/lib/python2.7/site-packages/numba/dispatcher.py", line 529, in enable_caching
self._cache = FunctionCache(self.py_func)
File "/opt/conda/lib/python2.7/site-packages/numba/caching.py", line 614, in __init__
self._impl = self._impl_class(py_func)
File "/opt/conda/lib/python2.7/site-packages/numba/caching.py", line 349, in __init__
"for file %r" % (qualname, source_path))
RuntimeError: cannot cache function 'inside_polygon': no locator available for file '/opt/conda/lib/python2.7/site-packages/timezonefinder/helpers_numba.py'
I can import the library locally on the nodes where I got the error. Any solution along these lines would be appreciated.

Answer 0 (score: 0)
Eventually, I solved the problem by abandoning timezonefinder entirely, and instead using the geospatial timezone dataset from timezone-boundary-builder, queried with magellan, a geospatial SQL query library for Spark.
One caveat worth noting is that Point and the other objects in the library are not wrapped for Python. I ended up writing my own Scala function for the timezone matching and dropped the magellan objects before returning the dataframe.
Answer 1 (score: 0)
I ran into this error when running timezonefinder on a Spark cluster:
RuntimeError: cannot cache function 'inside_polygon': no locator available for file '/disk-1/hadoop/yarn/local/usercache/timezonefinder1.zip/timezonefinder/helpers_numba.py'
The problem was that the numpy versions differed between the cluster and the timezonefinder package we shipped to Spark: the cluster had numpy 1.13.3, while the numpy bundled in timezonefinder.zip was 1.17.2.
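To confirm a mismatch like this, a minimal diagnostic sketch (assuming an active SparkSession named spark; not part of the original answer) compares the driver's numpy version with what an executor actually imports:

import numpy as np

# numpy version seen by the driver
print("driver numpy:", np.__version__)

# numpy version seen by an executor: run a one-element job that imports numpy remotely
executor_version = (
    spark.sparkContext
    .parallelize([0], 1)
    .map(lambda _: __import__("numpy").__version__)
    .collect()[0]
)
print("executor numpy:", executor_version)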
To work around the version mismatch, we created a custom conda environment with timezonefinder and numpy 1.17.2, and submitted the Spark job using that custom environment.
Create a custom conda environment with the timezonefinder package installed:
conda create --name timezone-conda python timezonefinder
source activate timezone-conda
conda install -y conda-pack
conda pack -o timezonecondaevnv.tar.gz -d ./MY_CONDA_ENV
Submit the Spark job using the custom conda environment:
!spark-submit --name app_name \
--master yarn \
--deploy-mode cluster \
--driver-memory 1024m \
--executor-memory 1GB \
--executor-cores 5 \
--num-executors 10 \
--queue QUEUE_NAME \
--archives ./timezonecondaevnv.tar.gz#MY_CONDA_ENV \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MY_CONDA_ENV/bin/python \
--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./MY_CONDA_ENV/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=./MY_CONDA_ENV/bin/python \
--conf spark.executorEnv.PYSPARK_DRIVER_PYTHON=./MY_CONDA_ENV/bin/python \
./main.py
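As a usage note beyond the answers: once the versions line up, the UDF from the question should work unchanged. A common PySpark refinement, sketched here under the assumption that the input Parquet has longitude and latitude columns (paths are placeholders), is to construct the relatively expensive TimezoneFinder once per partition via mapPartitions rather than once per row:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("timezone-enrichment").getOrCreate()

def add_timezone(rows):
    # Build TimezoneFinder once per partition; it loads its polygon data on construction
    from timezonefinder import TimezoneFinder
    tzf = TimezoneFinder()
    for row in rows:
        d = row.asDict()
        d["local_timezone"] = tzf.timezone_at(lng=row.longitude, lat=row.latitude)
        yield Row(**d)

df = spark.read.parquet("INPUT")  # placeholder input path
df.rdd.mapPartitions(add_timezone).toDF().write.parquet("OUTPUT")  # placeholder output path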