How to properly parallelize a PySpark job across multiple nodes and avoid memory issues?

Date: 2017-08-25 13:45:12

Tags: apache-spark pyspark apache-spark-ml google-cloud-dataproc

I am currently working on a PySpark job (Spark 2.2.0) that trains a Latent Dirichlet Allocation (LDA) model on a set of documents. The input documents are provided as a CSV file stored on Google Cloud Storage.

The following code ran successfully on a single-node Google Cloud Dataproc cluster (4 vCPUs / 15 GB of memory) with a small subset of documents (~6,500), a small number of topics (10), and a small number of iterations (100). However, other attempts with more documents or with higher values for the number of topics or iterations quickly led to memory issues and job failures.

Also, when submitting this job to a 4-node cluster, I could see that only one worker node was actually doing any work (~30% CPU usage), which makes me think the code is not properly optimized for parallel processing.

Code

import pyspark
from pyspark.sql import SparkSession, functions
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.clustering import LDA

conf = pyspark.SparkConf().setAppName("lda-training")
spark = SparkSession.builder.config(conf=conf).getOrCreate()

# CSV schema declaration
csv_schema = StructType([StructField("doc_id", StringType(), True),  # id of the document
                         StructField("cleaned_content", StringType(), True)])  # cleaned text content (used for LDA)

# Step 1: Load CSV
doc_df = spark.read.csv(path="gs://...", encoding="UTF-8", schema=csv_schema)

print("{} document(s) loaded".format(doc_df.count()))
# This prints "25000 document(s) loaded"

print("{} partitions".format(doc_df.rdd.getNumPartitions()))
# This prints "1"

# Step 2: Extracting words
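# NOTE: split_row is a user-defined helper (not shown in the question) that is
# assumed to tokenize the cleaned text content into a list of words.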
extract_words = functions.udf(lambda row: split_row(row), ArrayType(StringType()))
doc_df = doc_df.withColumn("words", extract_words(doc_df["cleaned_content"]))

# Step 3: Generate count vectors (BOW) for each document
count_vectorizer = CountVectorizer(inputCol="words", outputCol="features")
vectorizer_model = count_vectorizer.fit(doc_df)
dataset = vectorizer_model.transform(doc_df)
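# NOTE (suggestion, not part of the original job): capping the vocabulary, e.g.
# CountVectorizer(inputCol="words", outputCol="features", vocabSize=10000, minDF=2),
# shrinks the feature vectors and can reduce the memory footprint of LDA training.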

# Instantiate LDA model
lda = LDA(k=100,  # number of topics
          optimizer="online", # 'online' or 'em'
          maxIter=100,
          featuresCol="features",
          topicConcentration=0.01,  # beta
          optimizeDocConcentration=True,  # alpha
          learningOffset=2.0,
          learningDecay=0.8,
          checkpointInterval=10,
          keepLastCheckpoint=True)
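# NOTE (assumption): checkpointInterval and keepLastCheckpoint only take effect with
# the 'em' optimizer and a configured checkpoint directory (sc.setCheckpointDir());
# the 'online' optimizer does not use checkpointing.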

# Step 4: Train LDA model on corpus (this is the long part of the job)
lda_model = lda.fit(dataset)

# Save LDA model to Cloud Storage
lda_model.write().overwrite().save("gs://...")

Below is a sample of the warnings and errors encountered:

WARN org.apache.spark.scheduler.TaskSetManager: Stage 7 contains a task of very large size (3791 KB). The maximum recommended task size is 100 KB.
WARN org.apache.spark.scheduler.TaskSetManager: Stage 612 contains a task of very large size (142292 KB). The maximum recommended task size is 100 KB.
WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 6.1 GB of 6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 303.0 (TID 302, cluster-lda-w-1.c.cognitive-search-engine-dev.internal, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 6.1 GB of 6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
ERROR org.apache.spark.scheduler.cluster.YarnScheduler: Lost executor 3 on cluster-lda-w-1.c.cognitive-search-engine-dev.internal: Container killed by YARN for exceeding memory limits. 6.1 GB of 6 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
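For reference, the property named in those warnings can be raised when building the session. A minimal sketch, assuming YARN mode; the values below are placeholders that would need tuning for the actual machine types:

conf = (pyspark.SparkConf()
        .setAppName("lda-training")
        # Extra off-heap room per executor container, in MB (placeholder value).
        .set("spark.yarn.executor.memoryOverhead", "2048")
        # Executor heap; must leave room for the overhead within the node's memory.
        .set("spark.executor.memory", "5g"))
spark = SparkSession.builder.config(conf=conf).getOrCreate()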

Questions

  • Is there any optimization that can be applied to the code itself to ensure it scales?
  • How can we get Spark to distribute the work across all worker nodes and, hopefully, avoid memory issues?

1 answer:

Answer 0 (score: 0)

If your input data is small in size, then size-based partitioning will produce too few partitions for scalability, even if your pipeline ends up performing intensive computation on that small data. Since your getNumPartitions() prints 1, Spark will use at most 1 executor core to process that data, which is why you only see one worker node doing any work.

You can try changing your initial spark.read.csv line by adding a repartition at the end:

doc_df = spark.read.csv(path="gs://...", ...).repartition(32)

Then you can verify that it did what you expected by seeing getNumPartitions() print 32 on the later line.
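If you would rather not hard-code 32, a common rule of thumb is a small multiple of the total executor cores available. A minimal sketch, assuming spark.sparkContext.defaultParallelism reflects the cluster's cores:

# Aim for roughly 2-3 partitions per available executor core (rule of thumb).
num_partitions = spark.sparkContext.defaultParallelism * 3
doc_df = spark.read.csv(path="gs://...", encoding="UTF-8", schema=csv_schema) \
               .repartition(num_partitions)
print("{} partitions".format(doc_df.rdd.getNumPartitions()))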