I'm running into a problem with StreamingKMeans. The minimal snippet below loads batches of vectors into a StreamingContext and trains a StreamingKMeans model on them.
The model is then used to predict cluster membership for the current and the previous micro-batch, giving a lineage of cluster assignments that can be used to track how the centroids evolve over time.
```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.clustering import StreamingKMeans
from operator import add
import numpy as np

n = 1000000  # num examples
m = 500      # batch size
d = 30       # dimensionality
k = 10       # as in k-means

sc = SparkContext()
ssc = StreamingContext(sc, 1)

def get_batches():
    data = [DenseVector(v) for v in np.random.rand(n, d).tolist()]
    for i in range(0, n, m):
        yield data[i:i+m]

microbatches = ssc.queueStream(list(get_batches()))
window = microbatches.window(windowDuration=2, slideDuration=1)

model = StreamingKMeans(k=k, decayFactor=0.9).setRandomCenters(d, 1.0, 0)
model.trainOn(microbatches)
results = model.predictOn(window)

# Arbitrary action to force results to be computed
results.map(lambda x: 1).reduce(add).pprint()

ssc.start()
ssc.awaitTerminationOrTimeout(84000)
```
With the `predictOn()` call commented out, the model runs fine with a Java footprint of about 500 MB. With it enabled, however, the memory footprint keeps growing; it had reached 6 GB before I ran out of heap space.
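For scale, here is a back-of-envelope estimate (my own arithmetic, not a measurement from Spark) of how much raw data the queue holds: since `get_batches()` materializes all `n` vectors up front before `queueStream` is called, the driver keeps `n * d` doubles in memory before streaming even starts, on top of whatever the streaming job itself retains.

```python
# Back-of-envelope estimate of the raw data materialized by get_batches()
# (assumes 8 bytes per float64; ignores Python/JVM object overhead,
# which in practice multiplies this figure several times over).
n = 1000000  # num examples, as in the snippet above
d = 30       # dimensionality
raw_mb = n * d * 8 / 1e6
print(f"{raw_mb:.0f} MB of raw doubles")  # 240 MB
```

So the baseline footprint of the queued data alone is on the order of a few hundred MB, which makes the jump from 500 MB to 6 GB look like unbounded retention rather than just the input data.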
Does anyone know what is going on here?