Google Cloud Dataproc job fails at the last stage of processing

Date: 2017-07-12 12:10:31

Tags: apache-spark pyspark google-cloud-dataproc

I am running a Spark cluster on Dataproc, and my job fails at the end of processing.

My data source is text log files in CSV format on Google Cloud Storage (3.5 TB in total, 5000 files).

The processing logic is as follows:

  • Read the files into a DataFrame (schema ["timestamp", "message"]);
  • Group all messages into 1-second windows;
  • Apply a pipeline [Tokenizer -> HashingTF] to each group of messages to extract words and their frequencies and build feature vectors;
  • Save the feature vectors, together with their time windows, to GCS.

The problem I am having is that processing works fine on a small subset of the data (e.g. 10 files), but when I run it on all the files it eventually fails with an error like "Container killed by YARN for exceeding memory limits. 25.0 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead."

My cluster has 25 workers on n1-highmem-8 machines. So I googled this error and increased the "spark.yarn.executor.memoryOverhead" parameter to 6500 MB.
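
A minimal sketch of one way such a property can be applied, assuming it is set on the SparkConf before the SparkContext is created (it can equally be passed at submit time as a Dataproc job property; 6500 is just the value mentioned above):

import pyspark

# Sketch: executor memory overhead in MB, applied before the context starts.
conf = pyspark.SparkConf().set("spark.yarn.executor.memoryOverhead", "6500")
sc = pyspark.SparkContext(conf=conf)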

Now my Spark job still fails, but with the error "Job aborted due to stage failure: Total size of serialized results of 4293 tasks (1920.0 MB) is bigger than spark.driver.maxResultSize (1920.0 MB)"

I'm new to Spark and I believe I'm doing something wrong, either at the configuration level or in my code. It would be great if you could help me clear this up!

Here is my Spark job code:

import logging
import string
from datetime import datetime

import pyspark
import re
from pyspark.sql import SparkSession

from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType, TimestampType, ArrayType
from pyspark.sql import functions as F

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constants
NOW = datetime.now().strftime("%Y%m%d%H%M%S")
START_DATE = '2016-01-01'
END_DATE = '2016-03-01'

sc = pyspark.SparkContext()
spark = SparkSession\
        .builder\
        .appName("LogsVectorizer")\
        .getOrCreate()
spark.conf.set('spark.sql.shuffle.partitions', 10000)

logger.info("Start log processing at {}...".format(NOW))

# Filenames to read/write locations
logs_fn = 'gs://databucket/csv/*'  
vectors_fn = 'gs://databucket/vectors_out_{}'.format(NOW)  
pipeline_fn = 'gs://databucket/pipeline_vectors_out_{}'.format(NOW)
model_fn = 'gs://databucket/model_vectors_out_{}'.format(NOW)


# CSV data schema to build DataFrame
schema = StructType([
    StructField("timestamp", StringType()),
    StructField("message", StringType())])

# Helpers to clean strings in log fields
def cleaning_string(s):
    try:
        # Remove ids (like: app[2352] -> app)
        s = re.sub(r'\[.*\]', 'IDTAG', s)
        if s == '':
            s = 'EMPTY'
    except Exception as e:
        print("Skip string with exception {}".format(e))
    return s

def normalize_string(s):
    try:
        # Remove punctuation
        s = re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', s)
        # Remove digits
        s = re.sub(r'\d*', '', s)
        # Remove extra spaces
        s = ' '.join(s.split())
    except Exception as e:
        print("Skip string with exception {}".format(e)) 
    return s

def line_splitter(line):
    line = line.split(',')
    timestamp = line[0]
    full_message = ' '.join(line[1:])
    full_message = normalize_string(cleaning_string(full_message))
    return [timestamp, full_message]

# Read line from csv, split to date|message
# Read CSV to DataFrame and clean its fields
logger.info("Read CSV to DF...")
logs_csv = sc.textFile(logs_fn)
logs_csv = logs_csv.map(lambda line: line_splitter(line)).toDF(schema)

# Keep only lines for our date interval
logger.info("Filter by dates...")
logs_csv = logs_csv.filter((logs_csv.timestamp>START_DATE) & (logs_csv.timestamp<END_DATE))
logs_csv = logs_csv.withColumn("timestamp", logs_csv.timestamp.cast("timestamp"))

# Helpers to join messages into window and convert sparse to dense
join_ = F.udf(lambda x: "| ".join(x), StringType())
asDense = F.udf(lambda v: v.toArray().tolist())
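# Note on the asDense UDF above: F.udf defaults to StringType when no return
# type is given, so the densified column comes back as a string; declaring it
# with ArrayType(DoubleType()) is presumably the intent here.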

# Agg by time window
logger.info("Group log messages by time window...")
logs_csv = logs_csv.groupBy(F.window("timestamp", "1 second"))\
                       .agg(join_(F.collect_list("message")).alias("messages"))

# Turn message to hashTF
tokenizer = Tokenizer(inputCol="messages", outputCol="message_tokens")
hashingTF = HashingTF(inputCol="message_tokens", outputCol="tokens_counts", numFeatures=1000)

pipeline_tf = Pipeline(stages=[tokenizer, hashingTF])

logger.info("Fit-Transform ML Pipeline...")
model_tf = pipeline_tf.fit(logs_csv)
logs_csv = model_tf.transform(logs_csv)

logger.info("Spase vectors to Dense list...")
logs_csv = logs_csv.sort("window.start").select(["window.start", "tokens_counts"])\
                   .withColumn("tokens_counts", asDense(logs_csv.tokens_counts))

# Save to disk
# Save Pipeline and Model
logger.info("Save models...")
pipeline_tf.save(pipeline_fn)
model_tf.save(model_fn)

# Save to GCS
logger.info("Save results to GCS...")
logs_csv.write.parquet(vectors_fn)

1 Answer:

Answer (score: 1)

spark.driver.maxResultSize is an issue with the size of the driver, which on Dataproc runs on the master node.

By default, 1/4 of the master machine's memory is given to the driver, and 1/2 of that is set as spark.driver.maxResultSize (the maximum amount of RDD data Spark will let you .collect()).
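
As a rough sanity check against the numbers in the question: a maxResultSize of 1920 MB implies a driver heap of about 3840 MB, i.e. a quarter of roughly 15 GB of master memory. (The master's machine type isn't stated in the question; this is only an illustration of the 1/4 and 1/2 defaults.)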

I'm guessing Tokenizer and HashingTF are moving "metadata" through the driver that is the size of your keyspace. To increase the allowed size you can raise spark.driver.maxResultSize, but you may also want to increase spark.driver.memory and/or use a larger master. Spark's configuration guide has more information.
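
A minimal sketch of how the result-size cap could be raised from the job itself, with "8g" as a placeholder value rather than a recommendation (spark.driver.memory, by contrast, normally has to be supplied before the driver JVM starts, e.g. as a submit-time property of the Dataproc job):

import pyspark

# Sketch: raise the cap on serialized results collected back to the driver.
# "8g" is a placeholder; it must still fit inside the driver's heap.
conf = pyspark.SparkConf().set("spark.driver.maxResultSize", "8g")
sc = pyspark.SparkContext(conf=conf)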