PySpark Py4JJavaError and OutOfMemoryError: Java heap space

Date: 2019-02-23 18:47:12

Tags: python-2.7 pyspark

When I work in PySpark, I get a Java heap space error whenever I use any ML algorithm. My data is only 200 MB and the machine I am using has 32 GB of RAM. I wonder what the problem could be; the code is below. Can you help me?

My data is text-based and I want to run computations on it. It has 200,000 rows. I can process 25,000 rows, but as soon as I try to process more than 25,000 rows I get the Java heap space error.

    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, VectorIndexer, IndexToString
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    mySchema = StructType([StructField("column1", IntegerType(), True),
                           StructField("column2", StringType(), True),
                           StructField("column3", IntegerType(), True),
                           StructField("column4", StringType(), True),
                           StructField("column5", StringType(), True),
                           StructField("column6", StringType(), True),
                           StructField("column7", IntegerType(), True),
                           StructField("column8", StringType(), True),
                           StructField("column9", StringType(), True),
                           StructField("column10", StringType(), True),
                           StructField("column11", StringType(), True),
                           StructField("column12", IntegerType(), True),
                           StructField("column13", StringType(), True),
                           StructField("column14", StringType(), True),
                           StructField("column15", StringType(), True)])

    data_CSV = pd.read_csv("C:/data.csv", usecols=[7, 8, 9, 10, 12, 18, 28, 29, 35, 36, 58, 81, 82, 83, 84], low_memory=False)

    catcols = ['column2', 'column4', 'column5', 'column6']
    num_cols = ['column1', 'column3', 'column7', 'column12']
    labelCol = 'column11'

    # Reuse the shell's SparkContext if one already exists, otherwise create one.
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)
    spark_df = sqlContext.createDataFrame(data_CSV, schema=mySchema)


    def get_dummy(df, categoricalCols, continuousCols, labelCol):

        from pyspark.ml import Pipeline
        from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
        from pyspark.sql.functions import col

        indexers = [StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
                    for c in categoricalCols]

        # default setting: dropLast=True
        encoders = [OneHotEncoder(inputCol=indexer.getOutputCol(),
                                  outputCol="{0}_encoded".format(indexer.getOutputCol()))
                    for indexer in indexers]

        assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders]
                                    + continuousCols, outputCol="features")

        pipeline = Pipeline(stages=indexers + encoders + [assembler])

        model = pipeline.fit(df)
        data = model.transform(df)
        data = data.withColumn('label', col(labelCol))

        return data.select('features', 'label')

    data_f = get_dummy(spark_df, catcols, num_cols, labelCol)
    data_f.show(5)
    labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel').fit(data_f)
    labelIndexer.transform(data_f).show(5, True)
    featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data_f)
    featureIndexer.transform(data_f).show(5, True)

    (trainingData, testData) = data_f.randomSplit([0.7, 0.3], seed=100)
    labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

    print("Logistic Regression")
    logr = LogisticRegression(featuresCol='indexedFeatures', labelCol='indexedLabel', maxIter=20, regParam=0.3)
    pipeline = Pipeline(stages=[labelIndexer, featureIndexer, logr, labelConverter])
    model = pipeline.fit(trainingData)
    predictions = model.transform(testData)
    predictions.select("features", "label", "predictedLabel", "probability").show(5)
    evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)
    print("Accuracy = %g" % (accuracy))
    print("Test Error = %g" % (1.0 - accuracy))

1 Answer:

Answer 0 (score: 0)

Increase the default configuration of your Spark session. Essentially, you need to increase the driver memory, for example like this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").config('spark.executor.heartbeatInterval', '500s') \
            .config('spark.driver.memory', '12g').config("spark.driver.bindAddress", "localhost") \
            .config('spark.executor.memory', '12g').config("spark.network.timeout", "2000s").getOrCreate()
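
With this session in place, the DataFrame from the question can be built directly from the pandas DataFrame; a minimal sketch, assuming `data_CSV` and `mySchema` are defined as in the question:

    # Uses the SparkSession created above instead of SQLContext(sc).
    spark_df = spark.createDataFrame(data_CSV, schema=mySchema)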

Which version of Spark are you using? If you are working with a SparkContext and SQLContext, you can pass a SparkConf object carrying these settings to the SparkContext.
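
For example, a minimal sketch of that approach, assuming Spark 2.x where SQLContext is still used as in the question (the app name and memory values are placeholders, not recommendations):

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Build the configuration before the SparkContext is created,
    # since memory settings only take effect when the JVM starts.
    conf = (SparkConf()
            .setMaster("local[*]")
            .setAppName("ml-pipeline")
            .set("spark.driver.memory", "12g")
            .set("spark.executor.memory", "12g"))

    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)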

You can also edit spark-defaults.conf in the $SPARK_HOME/conf/ folder.
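
For instance, the corresponding entries in $SPARK_HOME/conf/spark-defaults.conf would look something like this (example values only, adjust to your machine):

    spark.driver.memory      12g
    spark.executor.memory    12g
    spark.network.timeout    2000s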