I am trying to use a Python template provided by AWS, which I have modified, to load the Yelp reviews dataset (a large JSON file) into an S3 bucket with AWS Glue and apply the Comprehend API embedded in the Python script.
I keep getting this error:
ConnectTimeout: HTTPSConnectionPool(host='comprehend.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(, 'Connection to comprehend.us-east-1.amazonaws.com timed out. (connect timeout=60)'))
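One thing I have not tried yet is tuning the client's own timeout/retry behaviour. My understanding is that this is done by passing a botocore Config to the Comprehend client, roughly like the sketch below (the specific timeout and retry values are guesses on my part, not tested):

import boto3
from botocore.config import Config

## Sketch only: timeout and retry numbers are placeholders, not tested values
comprehend_config = Config(
    region_name='us-east-1',
    connect_timeout=120,          ## the default 60s is what the error above reports
    read_timeout=120,
    retries={'max_attempts': 10}  ## let botocore retry transient connection failures
)
client = boto3.client('comprehend', config=comprehend_config)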
I have already updated the original template to use JSON instead of Parquet, and I have increased the number of batches from 10 to 1000 (NUMBER_OF_BATCHES = 1000 in the script below). What else can I do to optimize my code so that I stop getting this error? Here is the existing code, with the S3 path pointing to the very large JSON file I am trying to run through the Comprehend API:
import os
import sys
import boto3
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
import pyspark.sql.functions as F
from pyspark.sql import Row, Window, SparkSession
from pyspark.sql.types import *
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = SparkSession.builder.config("spark.sql.broadcastTimeout", "6000").getOrCreate()
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
spark._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
spark._jsc.hadoopConfiguration().set("json.enable.summary-metadata", "false")
AWS_REGION = 'us-east-1'
MIN_SENTENCE_LENGTH_IN_CHARS = 10
MAX_SENTENCE_LENGTH_IN_CHARS = 4500
COMPREHEND_BATCH_SIZE = 25 ## This is the max batch size for comprehend
NUMBER_OF_BATCHES = 1000
SentimentRow = Row("review_id", "sentiment")
def getBatchSentiment(input_list):
    ## input_list is one glom'd partition: a list of (review_id, text) tuples
    arr = []
    bodies = [i[1] for i in input_list]
    client = boto3.client('comprehend', region_name=AWS_REGION)
    def callApi(text_list):
        response = client.batch_detect_sentiment(TextList=text_list, LanguageCode='en')
        return response
    for i in range(NUMBER_OF_BATCHES - 1):
        text_list = bodies[COMPREHEND_BATCH_SIZE * i : COMPREHEND_BATCH_SIZE * (i + 1)]
        response = callApi(text_list)
        for r in response['ResultList']:
            idx = COMPREHEND_BATCH_SIZE * i + r['Index']
            arr.append(SentimentRow(input_list[idx][0], r['Sentiment']))
    return arr
### Main function to process data through comprehend
## Read yelp academic dataset reviews
reviews = spark.read.json("s3://schmldemobucket/yelp_academic_dataset_review.json").distinct()
df = reviews \
    .withColumn('body_len', F.length('text')) \
    .filter(F.col('body_len') > MIN_SENTENCE_LENGTH_IN_CHARS) \
    .filter(F.col('body_len') < MAX_SENTENCE_LENGTH_IN_CHARS)
record_count = df.count()
## repartition needs an integer; a bare / would produce a float under Python 3
df2 = df \
    .repartition(int(record_count / (NUMBER_OF_BATCHES * COMPREHEND_BATCH_SIZE))) \
    .sortWithinPartitions(['review_id'], ascending=True)
group_rdd = df2.rdd.map(lambda l: (l.review_id, l.text)).glom()
sentiment = group_rdd \
    .coalesce(10) \
    .map(lambda l: getBatchSentiment(l)) \
    .flatMap(lambda x: x) \
    .toDF() \
    .repartition('review_id') \
    .cache()
## Join sentiment results with the yelp review dataset
joined = reviews \
    .drop('text') \
    .join(sentiment, sentiment.review_id == reviews.review_id) \
    .drop(sentiment.review_id)
## Write out result set to S3 in JSON format
joined.write.partitionBy('business_id').mode('overwrite').json('s3://schmldemobucket/')
job.commit()
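For context, this is my rough understanding of how the sizing constants interact (the record count below is just an illustrative number, not the real size of my file):

## Back-of-the-envelope sizing check; numbers are illustrative only
COMPREHEND_BATCH_SIZE = 25                                      ## Comprehend's batch limit
NUMBER_OF_BATCHES = 1000
rows_per_partition = NUMBER_OF_BATCHES * COMPREHEND_BATCH_SIZE  ## 25,000 rows handed to each getBatchSentiment call
record_count = 6000000                                          ## placeholder; the real value comes from df.count()
num_partitions = record_count // rows_per_partition             ## ~240 partitions before coalesce(10)
api_calls_per_partition = NUMBER_OF_BATCHES - 1                 ## the loop inside getBatchSentiment
print(num_partitions, api_calls_per_partition)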
I expect the Glue ETL job to finish with a successful status, but I am not sure what else to optimize to get there. Please help!!!!
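In case it helps to see what I mean by optimizing: one restructuring I have been considering (untested sketch; the chunk size of 25 matches the Comprehend batch limit, but the throttling sleep and the mapPartitions wiring are my own guesses) is to let each partition chunk itself into batches of 25 instead of hard-coding NUMBER_OF_BATCHES:

import time
import boto3

def callBatch(client, pairs):
    ## pairs is a list of (review_id, text) tuples; yields (review_id, sentiment)
    response = client.batch_detect_sentiment(TextList=[t for _, t in pairs], LanguageCode='en')
    for r in response['ResultList']:
        yield (pairs[r['Index']][0], r['Sentiment'])

def detectSentimentForPartition(rows, region='us-east-1', chunk_size=25):
    ## rows is the iterator of (review_id, text) pairs that mapPartitions passes in
    client = boto3.client('comprehend', region_name=region)
    buffer = []
    for review_id, text in rows:
        buffer.append((review_id, text))
        if len(buffer) == chunk_size:
            for result in callBatch(client, buffer):
                yield result
            buffer = []
            time.sleep(0.1)  ## crude throttle; the right pause depends on the account's Comprehend rate limit
    if buffer:
        for result in callBatch(client, buffer):
            yield result

## Hypothetical usage, replacing the glom/getBatchSentiment step above:
## sentiment = df2.rdd.map(lambda l: (l.review_id, l.text)) \
##     .mapPartitions(detectSentimentForPartition) \
##     .toDF(['review_id', 'sentiment'])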