Py4JError: An error occurred while calling o90.fit

Asked: 2018-12-21 08:31:18

Tags: apache-spark pyspark apache-spark-ml

I want to apply a random forest algorithm to a DataFrame made up of three columns: the journal ID, its indexed form (obtained with Spark's StringIndexer), and a feature vector. The following code reads the DataFrame from a Parquet file and applies a StringIndexer to the journalid column to convert it into a categorical label.

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors
from pyspark.ml.linalg import VectorUDT

df=spark.read.parquet('JouID-UBTFIDFVectors-server22.parquet')
labelIndexer = StringIndexer(inputCol="journalid", outputCol="indexedLabel")

labelIndexerModel = labelIndexer.fit(df)
df1 = labelIndexerModel.transform(df)

# Convert the sparse vectors in the raw `features` column to dense vectors (VectorUDT).
to_dense = udf(lambda v: Vectors.dense(v), VectorUDT())

df2 = df1.withColumn("featuresNew", to_dense(df1["features"])).drop("features")

The schema of the new DataFrame (df2), as printed by df2.printSchema(), is:

root
 |-- journalid: string (nullable = true)
 |-- indexedLabel: double (nullable = false)
 |-- featuresNew: vector (nullable = true)

I then split df2 into training and test sets and create a RandomForestClassifier instance as follows:

(trainingData, testData) = df2.randomSplit([0.8, 0.2])
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="featuresNew", numTrees=2)

Finally, I call the fit() method on the trainingData obtained above.

rfModel=rf.fit(trainingData)

With this, I am able to train the model on 100 rows of the input DataFrame. On the full training data, however, this line raises the following error.
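
For reference, a minimal sketch of the smoke test that succeeds (the limit(100) call is my illustration; the original post does not show how the 100-row subset was taken):

small = df2.limit(100)                                  # 100-row subset of the full DataFrame
(smallTrain, smallTest) = small.randomSplit([0.8, 0.2])
rfModel = rf.fit(smallTrain)                            # completes without error on this subset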

----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 53652)
Traceback (most recent call last):
  File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 317, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 348, in process_request
    self.finish_request(request, client_address)
  File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 361, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/data/sntps/code/conda3/lib/python3.6/socketserver.py", line 696, in __init__
    self.handle()
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/accumulators.py", line 235, in handle
    num_updates = read_int(self.rfile)
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/serializers.py", line 685, in read_int
    raise EOFError
EOFError
----------------------------------------
ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:41060)
Traceback (most recent call last):
  File "/data/sntps/code/conda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-46d7488961c7>", line 1, in <module>
    rfModel=rf.fit(trainingData)
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/base.py", line 132, in fit
    return self._fit(dataset)
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 288, in _fit
    java_model = self._fit_java(dataset)
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 285, in _fit_java
    return self._java_obj.fit(dataset._jdf)
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
    format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o90.fit

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/sntps/code/conda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1828, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'Py4JError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/sp/spark-2.3.1-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 929, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:
... (remaining traceback omitted due to space)
Py4JError: An error occurred while calling o90.fit

This error is not very descriptive, so I am finding it hard to pinpoint where I am going wrong. Any help would be greatly appreciated.

Input description: the input DataFrame contains 2,696,512 rows, and each row's feature vector has length 262,144.
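
(My own back-of-envelope on those numbers, not from the original post: at this scale the sparse-to-dense conversion above is very expensive, assuming 8-byte doubles once densified.)

rows = 2696512                  # rows in the input DataFrame
dim = 262144                    # length of each feature vector
dense_bytes = rows * dim * 8    # 8 bytes per double after densifying
print(dense_bytes / 2**40)      # ~5.1 TiB if every vector is stored dense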

1 Answer:

Answer 0 (score: 2):

After going through many related questions on Stack Overflow, I think this may be caused by running the code inside a Jupyter notebook. I later ran it from the command line with the spark-submit script and no longer hit this error. I do not know why the error pops up when the same code runs in a Jupyter notebook.
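
For reference, a minimal sketch of such a spark-submit invocation (the script name and resource flags are illustrative assumptions, not from the original answer; the script would also need to create its own SparkSession rather than rely on the notebook's `spark` variable):

spark-submit \
  --master local[*] \
  --driver-memory 8g \
  train_rf.py   # hypothetical script containing the code above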