I am trying to run K-means with Spark on a sample document that is only 22MB, and I am getting a Java heap space error. Any ideas? It fails at the clustering line.
The sample data and code are on my github.
# run in ipython spark shell, IPYTHON=1 pyspark
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
import json
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
sample = sqlContext.read.json("/home/ubuntu/yelp_project/sample.json")
sample.registerTempTable("sample")
# for each record, take the name and join all of its review texts into one string
reviews = sample.map(lambda x: Row(name=x[1], reviews=' '.join(a[3] for a in x[0])))
hashingTF = HashingTF()  # default numFeatures is 2^20 = 1,048,576
tf = hashingTF.transform(reviews.map(lambda x: x.reviews))
# the next line is where the Java heap space error is thrown
clusters = KMeans.train(tf, 2, maxIterations=10, runs=10, initializationMode="random")
Answer 0 (score: 0):
The problem was that my documents were very large, and the number of features was too great to fit in the memory allocated to the Spark process. To fix this, I initialized my HashingTF with a maximum number of features:
hashingTF = HashingTF(5000)
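For context: HashingTF in MLlib defaults to 2^20 (1,048,576) features, and K-means keeps its cluster centers as dense vectors, so with k=2 and runs=10 the centers alone come to roughly 2 x 10 x 1,048,576 x 8 bytes, about 160 MB, before any per-partition aggregation buffers. Capping the feature space shrinks that by orders of magnitude. A minimal sketch of the corrected pipeline, assuming the same sample.json layout as above:

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.clustering import KMeans
from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)  # sc is provided by the pyspark shell
sample = sqlContext.read.json("/home/ubuntu/yelp_project/sample.json")
reviews = sample.map(lambda x: Row(name=x[1], reviews=' '.join(a[3] for a in x[0])))

hashingTF = HashingTF(5000)  # 5,000-dim hashed vectors instead of the 2^20 default
tf = hashingTF.transform(reviews.map(lambda x: x.reviews))

# centers are now 2 clusters x 10 runs x 5,000 doubles (~0.8 MB), well within the heap
clusters = KMeans.train(tf, 2, maxIterations=10, runs=10, initializationMode="random")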