Question

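I am very new to PySpark. I wrote some code to impute 7 predictor variables. I use the TF-IDF-transformed description column to train a Naive Bayes (NB) model and predict the NaN values.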

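from pyspark.sql.functions import col
from pyspark.ml.feature import StringIndexer, IndexToString
from pyspark.ml.classification import NaiveBayes

to_predict_cols = ['col1', 'col2', ... 'col7']
current_x = ['features_tfidf']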
for iter_no, to_predict_col in enumerate(to_predict_cols):
    temp_data_imputed_level1 = data_imputed_level1.select(["_c0"] + current_x + [to_predict_col])

    # create test
    test = temp_data_imputed_level1.where(col(to_predict_col).isNull()).select(["_c0"] + current_x)

    # create train
    train = temp_data_imputed_level1.dropna()

    # create label indexing of the response
    label_stringIndexer = StringIndexer(inputCol=to_predict_col, outputCol="label")
    stringIndexer_model = label_stringIndexer.fit(train)
    train = stringIndexer_model.transform(train)
    print("label encoding.. [OK]")

    # fit the NB classifier on the TF-IDF column (featuresCol defaults to "features")
    nb = NaiveBayes(featuresCol=current_x[0], smoothing=1.0, modelType="multinomial")
    model = nb.fit(train)
    print("model building.. [OK]")

    # predict output
    predictions = model.transform(test)

    # inverse label encoding of the predictions
    labelReverse = IndexToString(inputCol='prediction', outputCol='labeled_prediction', labels=stringIndexer_model.labels)
    predictions = labelReverse.transform(predictions)
    print("model predictions.. [OK]")

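I use some joins to save the predictions, which I have not included here. But by the 3rd or 4th iteration of the loop, each iteration is already too slow from its very start, i.e. the joins are not what affects the speed. Also, the data is cached before the loop is executed.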
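For context, one thing that might be worth checking is whether the DataFrame that receives the joined predictions carries a query plan that grows with every iteration. This is only an assumption, since the join code is not shown. A minimal sketch of cutting that lineage with `DataFrame.localCheckpoint` is below; the helper name `join_and_truncate`, the left join on `_c0`, and the toy data in `run_demo` are hypothetical and purely for illustration.

```python
def join_and_truncate(base_df, predictions_df, key="_c0", eager=True):
    """Join predictions onto base_df, then cut the accumulated lineage.

    Hypothetical helper: the join shape is an assumption, because the
    original join code is not part of the question.
    """
    joined = base_df.join(predictions_df, on=key, how="left")
    # localCheckpoint materializes the result and replaces the logical
    # plan, so later iterations do not re-plan every earlier join.
    return joined.localCheckpoint(eager=eager)


def run_demo():
    """Toy loop; needs a working PySpark installation to run."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("lineage-demo").getOrCreate()
    base = spark.createDataFrame([(1, "a"), (2, "b")], ["_c0", "text"])
    for i in range(3):
        # Stand-in for the per-column predictions produced in the loop.
        preds = spark.createDataFrame([(1, "x"), (2, "y")], ["_c0", f"pred_{i}"])
        base = join_and_truncate(base, preds)
    base.show()
    spark.stop()
```

Calling `run_demo()` in an environment with PySpark runs the toy loop. Comparing `base.explain()` with and without the checkpoint call is one way to see whether plan growth is actually involved.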

I tried sc._jvm.System.gc() and deleted all the variables I had created manually, but still no luck.

I am on a server with 256 GB of RAM and 48 cores. Currently I am not connected to any cluster. The data size is 1.5 GB.
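Since no cluster is involved, Spark would be running in local mode here, where the work happens inside the driver JVM and `spark.driver.memory` defaults to 1g unless it is set. The sketch below shows how such a session might be configured. The 64 GB figure, the app name, and both helper names are illustrative assumptions, not values taken from the setup above.

```python
def local_session_settings(cores=48, driver_memory_gb=64):
    """Return (master_url, config_dict) for a single-machine session.

    The default of 64 GB is an illustrative assumption only.
    """
    master = f"local[{cores}]"
    conf = {
        # In local mode the tasks run inside the driver JVM, so the
        # driver memory setting is the limit that matters.
        "spark.driver.memory": f"{driver_memory_gb}g",
    }
    return master, conf


def build_local_session(app_name="nb-imputation"):
    """Create the session; needs a working PySpark installation."""
    # Imported lazily so the settings helper works without Spark.
    from pyspark.sql import SparkSession

    master, conf = local_session_settings()
    builder = SparkSession.builder.master(master).appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Whether the memory setting takes effect depends on it being applied before the driver JVM starts. If a session already exists, it may need to be passed via `--driver-memory` or `spark-defaults.conf` instead.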

So, any ideas or suggestions about what I am missing or how to speed up the process? Any help is appreciated. Thanks in advance.

PySpark execution gets slower with each iteration

0 answers: