
pyspark: error after submitting a Spark job locally that uses sklearn.DBSCAN


I am using sklearn.DBSCAN in my PySpark job; see the code snippet below. I have also zipped all of the dependent modules into a deps.zip file, which is added to the SparkContext.

from sklearn.cluster import DBSCAN
import numpy as np
import pandas as pd
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import DoubleType
from pyspark.sql import Row

def dbscan_latlng(lat_lngs, min_distance_km, min_points=10):
    # Cluster (lat, lng) pairs with DBSCAN using the haversine metric;
    # eps must be expressed in radians, so convert the km threshold.
    coords = np.asmatrix(lat_lngs)
    kms_per_radian = 6371.0088
    epsilon = min_distance_km / kms_per_radian
    db = DBSCAN(eps=epsilon, min_samples=min_points, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
    cluster_labels = db.labels_
    num_clusters = len(set(cluster_labels))
    clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
    maxClusters = clusters.map(len).max()
    if maxClusters > 3:
        # Return the coordinates of the largest cluster only.
        dfClusters = clusters.to_frame('coords')
        dfClusters['length'] = dfClusters.apply(lambda x: len(x['coords']), axis=1)
        custCluster = dfClusters[dfClusters['length'] == maxClusters].reset_index()
        return custCluster['coords'][0].tolist()

sc = SparkContext()
sc.addPyFile('/content/airflow/dags/deps.zip')
sqlContext = SQLContext(sc)
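
The snippet above defines dbscan_latlng but does not show where it is actually invoked. For context only, a minimal sketch of how such a helper is typically applied per key from PySpark might look like the following; the input path, the column names cust_id/lat/lng, and the 0.5 km threshold are illustrative assumptions, not part of the original job:

# Hypothetical usage sketch (not from the original post): group (lat, lng)
# points per customer and run dbscan_latlng inside the executors.
points = sqlContext.read.parquet('/content/airflow/data/points.parquet')
clusters = (points.rdd
            .map(lambda r: (r['cust_id'], (r['lat'], r['lng'])))
            .groupByKey()
            .mapValues(lambda pts: dbscan_latlng(list(pts), 0.5, min_points=10))
            .filter(lambda kv: kv[1] is not None))
print(clusters.take(5))

It is this kind of closure that ships dbscan_latlng, and with it the sklearn import, to the worker processes, which is where the error below is raised.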

However, after I submit the job with spark-submit --master local[4] FindOutliers.py, I get the following Python error saying that sklearn/__check_build is not a directory. Can anyone help me with this? Many thanks!

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 166, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 55, in read_command
    command = serializer._read_with_length(file)
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/root/.virtualenvs/jacob/local/lib/python2.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 454, in loads
    return pickle.loads(obj)
  File "/tmp/pip-build-OLOGnWWw/scikit-learn/sklearn/__init__.py", line 133, in <module>
  File "/tmp/pip-build-0qnWWw/scikit-learn/sklearn/__check_build/__init__.py", line 46, in <module>
  File "/tmp/pip-build-schenWWw/scikit-learn/sklearn/__check_build/__init__.py", line 26, in raise_build_error
OSError: [Errno 20] Not a directory: '/tmp/spark-beb8777f-b7d5-40be-a72b-c16e10264a50/userFiles-3762d9c0-6674-467a-949b-33968420bae1/deps.zip/sklearn/__check_build'
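
As a side note, the same import failure can apparently be reproduced outside Spark by putting the archive on the import path and importing sklearn directly; this is only a minimal check, assuming deps.zip is the file added via addPyFile above:

import sys
# Mimic what addPyFile does on the workers: make the zip importable.
sys.path.insert(0, '/content/airflow/dags/deps.zip')
import sklearn  # in this setup the import reportedly fails inside sklearn/__check_build, as in the traceback above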

1 Answer


Try:

    import pyspark as ps
    from pyspark.sql import SQLContext

    sc = ps.SparkContext()
    sc.addPyFile('/content/airflow/dags/deps.zip')
    # SQLContext must be instantiated with the SparkContext, not referenced as a class
    sqlContext = SQLContext(sc)
    
