ConnectTimeout error: AWS Glue ETL job with Amazon Comprehend

Date: 2019-06-02 21:42:55

Tags: python pyspark aws-glue

I am trying to use a Python template provided by AWS, which I have modified so that Glue loads the Yelp review dataset (a LARGE JSON file) stored in an S3 bucket and applies the Comprehend API embedded in the Python script.

I keep getting this error message:


ConnectTimeout: HTTPSConnectionPool(host='comprehend.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(, 'Connection to comprehend.us-east-1.amazonaws.com timed out. (connect timeout=60)'))
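
The connect timeout=60 in the traceback is botocore's default HTTP connect timeout. For reference, this is roughly how the Comprehend client could be built with explicit timeout and retry settings (a minimal sketch with illustrative values, not what the script below currently does):

import boto3
from botocore.config import Config

# Illustrative values only -- the script below uses boto3's defaults.
comprehend_config = Config(
    connect_timeout=120,              # seconds to wait when opening the connection
    read_timeout=120,                 # seconds to wait for a response
    retries={'max_attempts': 10}      # botocore-level retries on failure
)
client = boto3.client('comprehend', region_name='us-east-1', config=comprehend_config)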

I have updated the original template to use JSON instead of Parquet, and I have changed the number of batches in the original file from 10 to 1000 (NUMBER_OF_BATCHES = 1000 in the script below). What else can I optimize in my code so that I stop getting this error? Here is the existing code, with the S3 file path pointing to the very large JSON file I am trying to run the Comprehend API over:

import os
import sys
import boto3

from awsglue.job import Job
from awsglue.transforms import *
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions

import pyspark.sql.functions as F
from pyspark.sql import Row, Window, SparkSession
from pyspark.sql.types import *
from pyspark.conf import SparkConf
from pyspark.context import SparkContext


args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = SparkSession.builder.config("spark.sql.broadcastTimeout", "6000").getOrCreate()
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

spark._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
spark._jsc.hadoopConfiguration().set("json.enable.summary-metadata", "false")

AWS_REGION = 'us-east-1'
MIN_SENTENCE_LENGTH_IN_CHARS = 10
MAX_SENTENCE_LENGTH_IN_CHARS = 4500
COMPREHEND_BATCH_SIZE = 25  ## This is the max batch size for comprehend
NUMBER_OF_BATCHES = 1000

SentimentRow = Row("review_id", "sentiment")
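# Runs once per glommed partition: input_list is a list of (review_id, text) tuples,
# sent to Comprehend in slices of COMPREHEND_BATCH_SIZE.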
def getBatchSentiment(input_list):
  arr = []
  bodies = [i[1] for i in input_list]
  client = boto3.client('comprehend',region_name = AWS_REGION)

  def callApi(text_list):
    response = client.batch_detect_sentiment(TextList = text_list, LanguageCode = 'en')
    return response

  for i in range(NUMBER_OF_BATCHES-1):
    text_list = bodies[COMPREHEND_BATCH_SIZE * i : COMPREHEND_BATCH_SIZE * (i+1)]
    if not text_list:  # partition exhausted; avoid calling the API with an empty batch
      break
    response = callApi(text_list)
    for r in response['ResultList']:
      idx = COMPREHEND_BATCH_SIZE * i + r['Index']
      arr.append(SentimentRow(input_list[idx][0], r['Sentiment']))

  return arr

### Main function to process data through comprehend

## Read yelp academic dataset reviews
reviews = spark.read.json("s3://schmldemobucket/yelp_academic_dataset_review.json").distinct()

df = reviews \
  .withColumn('body_len', F.length('text')) \
  .filter(F.col('body_len') > MIN_SENTENCE_LENGTH_IN_CHARS) \
  .filter(F.col('body_len') < MAX_SENTENCE_LENGTH_IN_CHARS) 

record_count = df.count()

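# Aim for roughly NUMBER_OF_BATCHES * COMPREHEND_BATCH_SIZE reviews per partition;
# glom() below turns each partition into a single list for getBatchSentiment.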
df2 = df \
  .repartition(record_count // (NUMBER_OF_BATCHES * COMPREHEND_BATCH_SIZE)) \
  .sortWithinPartitions(['review_id'], ascending=True)

group_rdd = df2.rdd.map(lambda l: (l.review_id, l.text)).glom()
sentiment = group_rdd.coalesce(10) \
  .map(lambda l: getBatchSentiment(l)) \
  .flatMap(lambda x: x) \
  .toDF() \
  .repartition('review_id') \
  .cache()

## Join sentiment results with the yelp review dataset
joined = reviews \
  .drop('text') \
  .join(sentiment, sentiment.review_id == reviews.review_id) \
  .drop(sentiment.review_id)

## Write out result set to S3 in JSON format
joined.write.partitionBy('business_id').mode('overwrite').json('s3://schmldemobucket/')

job.commit()
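
For reference, a standalone call against the same endpoint (run outside Glue, from a machine or notebook with the same network setup as the job) can show whether Comprehend is reachable at all; a minimal sketch with placeholder review text:

import boto3

# Minimal reachability check against the same Comprehend endpoint.
client = boto3.client('comprehend', region_name='us-east-1')
response = client.batch_detect_sentiment(
    TextList=['The food was great and the staff were friendly.'],  # placeholder text
    LanguageCode='en'
)
print(response['ResultList'][0]['Sentiment'])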

I expect to get a successful Glue ETL job status, but I am not sure how to optimize further to achieve a successful run. Please help!!!!

0 Answers:

There are no answers.