While working in PySpark, I get a Java heap space error whenever I use any ML algorithm. My data is 200 MB and the machine I am using has 32 GB of RAM. I would like to know what the problem might be. Can you help me?
My data is text-based and I want to run the computation on it. It has 200,000 rows. I can process 25,000 rows, but as soon as I try more than 25,000 rows I get the Java heap space error.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
import pandas as pd

mySchema = StructType([
    StructField("column1", IntegerType(), True),
    StructField("column2", StringType(), True),
    StructField("column3", IntegerType(), True),
    StructField("column4", StringType(), True),
    StructField("column5", StringType(), True),
    StructField("column6", StringType(), True),
    StructField("column7", IntegerType(), True),
    StructField("column8", StringType(), True),
    StructField("column9", StringType(), True),
    StructField("column10", StringType(), True),
    StructField("column11", StringType(), True),
    StructField("column12", IntegerType(), True),
    StructField("column13", StringType(), True),
    StructField("column14", StringType(), True),
    StructField("column15", StringType(), True)])
data_CSV=pd.read_csv("C:/data.csv", usecols=[7, 8, 9, 10, 12, 18, 28, 29, 35, 36, 58, 81, 82, 83, 84],low_memory=False)
catcols = ['column2','column4','column5','column6']
num_cols = ['column1', 'column3','column7','column12']
labelCol = 'column11'
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)  # sc is the existing SparkContext (e.g. from the PySpark shell)
spark_df = sqlContext.createDataFrame(data_CSV, schema=mySchema)
def get_dummy(df, categoricalCols, continuousCols, labelCol):
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
    from pyspark.sql.functions import col

    indexers = [StringIndexer(inputCol=c, outputCol="{0}_indexed".format(c))
                for c in categoricalCols]
    # default setting: dropLast=True
    encoders = [OneHotEncoder(inputCol=indexer.getOutputCol(),
                              outputCol="{0}_encoded".format(indexer.getOutputCol()))
                for indexer in indexers]
    assembler = VectorAssembler(inputCols=[encoder.getOutputCol() for encoder in encoders]
                                + continuousCols, outputCol="features")
    pipeline = Pipeline(stages=indexers + encoders + [assembler])

    model = pipeline.fit(df)
    data = model.transform(df)
    data = data.withColumn('label', col(labelCol))
    return data.select('features', 'label')
data_f = get_dummy(spark_df,catcols,num_cols,labelCol)
data_f.show(5)
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorIndexer, IndexToString
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel').fit(data_f)
labelIndexer.transform(data_f).show(5, True)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data_f)
featureIndexer.transform(data_f).show(5, True)
(trainingData, testData) = data_f.randomSplit([0.7, 0.3], seed=100)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel",labels=labelIndexer.labels)
print("Logistic Regression")
logr = LogisticRegression(featuresCol='indexedFeatures', labelCol='indexedLabel',maxIter=20, regParam=0.3)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, logr, labelConverter])
model = pipeline.fit(trainingData)
predictions = model.transform(testData)
predictions.select("features", "label", "predictedLabel", "probability").show(5)
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("True = %g" % (accuracy))
print("Test Error = %g" % (1.0 - accuracy))
Answer 0 (score: 0)
Increase the default configuration of your Spark session. Basically, you need to increase the driver memory, for example like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").config('spark.executor.heartbeatInterval', '500s') \
    .config('spark.driver.memory', '12g').config("spark.driver.bindAddress", "localhost") \
    .config('spark.executor.memory', '12g').config("spark.network.timeout", "2000s").getOrCreate()
Which version of Spark are you using? If you are working with SparkContext and SQLContext instead of SparkSession, you can pass a SparkConf object carrying these settings to SparkContext, along the lines of the sketch below.
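A minimal sketch of that approach, assuming local mode; the memory values are placeholders to adapt to your machine:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Illustrative values only; use the limits your machine actually allows
conf = SparkConf().setMaster("local[*]") \
    .set("spark.driver.memory", "12g") \
    .set("spark.executor.memory", "12g") \
    .set("spark.network.timeout", "2000s")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

(Driver memory settings only take effect if no Spark JVM is already running in the session.)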
You can also edit spark-defaults.conf in the $SPARK_HOME/conf/ folder.
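For instance, entries like these in $SPARK_HOME/conf/spark-defaults.conf (values are illustrative, not a recommendation) apply to every application launched from that installation:

spark.driver.memory      12g
spark.executor.memory    12g
spark.network.timeout    2000s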