PySpark job using sklearn.DBSCAN fails after submitting the Spark job locally

Date: 2017-11-27 18:11:01

Tags: python apache-spark pyspark

I am using sklearn.DBSCAN in my PySpark job; see the code snippet below. I also zipped all the dependent modules into a deps.zip file, which is added to the SparkContext.
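
A deps.zip of this kind can be produced by archiving the folder the dependencies were installed into. The sketch below is only an illustration; the source folder path is an assumption, not the exact command used here.

# Rough sketch of building deps.zip from a folder of installed packages
# (the ./deps_build path is an assumption, not taken from the question).
import shutil

# creates /content/airflow/dags/deps.zip from everything under ./deps_build
shutil.make_archive('/content/airflow/dags/deps', 'zip', root_dir='./deps_build')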

from sklearn.cluster import DBSCAN
import numpy as np
import pandas as pd
#import pyspark
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType
from pyspark.sql import Row

def dbscan_latlng(lat_lngs, mim_distance_km, min_points=10):
    # Cluster (lat, lng) pairs with haversine DBSCAN and return the largest cluster
    coords = np.asmatrix(lat_lngs)
    kms_per_radian = 6371.0088
    epsilon = mim_distance_km / kms_per_radian
    db = DBSCAN(eps=epsilon, min_samples=min_points, algorithm='ball_tree',
                metric='haversine').fit(np.radians(coords))
    cluster_labels = db.labels_
    num_clusters = len(set(cluster_labels))
    clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
    maxClusters = clusters.map(len).max()
    if (maxClusters > 3):
        dfClusters = clusters.to_frame('coords')
        dfClusters['length'] = dfClusters.apply(lambda x: len(x['coords']), axis=1)
        custCluster = dfClusters[dfClusters['length'] == maxClusters].reset_index()
        return custCluster['coords'][0].tolist()

sc = SparkContext()
sc.addPyFile('/content/airflow/dags/deps.zip')
sqlContext = SQLContext(sc)
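
The error shows up when the function actually runs on the executors, which is where sklearn has to be unpickled from deps.zip. A hypothetical sketch of that step (the sample points, the 'u1' key, the 0.5 km threshold and min_points=3 below are only placeholders, not from the real job):

# Hypothetical usage, not part of the original job: run dbscan_latlng per key
# on the executors; shipping the function to the workers forces them to
# import sklearn from deps.zip.
points = sc.parallelize([
    ('u1', (40.7128, -74.0060)),
    ('u1', (40.7130, -74.0055)),
    ('u1', (40.7127, -74.0061)),
    ('u1', (40.7131, -74.0058)),
])
largest_clusters = (points.groupByKey()
                          .mapValues(lambda pts: dbscan_latlng(list(pts), 0.5, min_points=3))
                          .collect())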

However, after I submit the job with spark-submit --master local[4] FindOutliers.py, I get the Python error below saying that sklearn/__check_build is not a directory. Can anyone help me with this? Thanks a lot!

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 166, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 55, in read_command
    command = serializer._read_with_length(file)
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 454, in loads
    return pickle.loads(obj)
  File "/tmp/pip-build-MEDICnWWw/scikit-learn/sklearn/__init__.py", line 133, in <module>
  File "/tmp/pip-build-0qnWWw/scikit-learn/sklearn/__check_build/__init__.py", line 46, in <module>
  File "/tmp/pip-build-0qnWWw/scikit-learn/sklearn/__check_build/__init__.py", line 26, in raise_build_error
OSError: [Errno 20] Not a directory: '/tmp/spark-beb8777f-b7d5-40be-a72b-c16e10264a50/userFiles-3762d9c0-6674-467a-949b-33968420bae1/deps.zip/sklearn/__check_build'

1 Answer:

Answer 0 (score: 0)

Try:

import pyspark as ps

sc = ps.SparkContext()
sc.addPyFile('/content/airflow/dags/deps.zip')
sqlContext = ps.SQLContext(sc)