AWS Glue job runs indefinitely

Time: 2018-08-21 17:10:46

Tags: aws-glue

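We are quite new to AWS Glue and are building an ETL process that ingests data from different sources into a Redshift instance. One of the sources is mixpanel.com; for it we wrote a Lambda function that queries raw_data from Mixpanel in JSON format and stores it in date-formatted folders in an S3 bucket.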
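To make the folder layout concrete, here is a minimal sketch of how a date-formatted S3 key could be built. The prefix `mixpanel`, the file name `raw_data.json`, and the `YYYY/MM/DD` layout are placeholders for illustration only, not necessarily the exact values used in our Lambda.

```python
from datetime import date


def build_s3_key(prefix, day, filename="raw_data.json"):
    """Build a date-partitioned S3 object key.

    All names here are hypothetical: the real prefix, file name and
    date layout used by the Lambda may differ.
    """
    # strftime gives zero-padded year/month/day path segments.
    return "{}/{}/{}".format(prefix.rstrip("/"), day.strftime("%Y/%m/%d"), filename)


if __name__ == "__main__":
    print(build_s3_key("mixpanel", date(2018, 8, 21)))
```

A crawler pointed at the prefix would then see one folder per day.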

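We then used an AWS Glue crawler to crawl the folder and create a table structure for the JSON files, which completed successfully. We also defined the database and the connection to Redshift, and the job was created successfully as well.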

When the job is executed, no errors are logged. The job keeps running until it times out and shows the following info log:

--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000 --conf spark.hadoop.fs.defaultFS=hdfs://ip-10-241-110-211.ec2.internal:8020 --conf spark.hadoop.yarn.resourcemanager.address=ip-10-241-110-211.ec2.internal:8032 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=98 --conf spark.executor.memory=5g --conf spark.executor.cores=4 --JOB_ID j_22da3607e1f4f9ca9bb4021f83ce82bd704a98865369ca1fc0693b3bd8f687ec --JOB_RUN_ID jr_a289c915a25344c5c45742150029bdab7e48be1766821463921b9350de54d609 --scriptLocation s3://aws-glue-scripts-xxxxxxxx-us-east-1/folder_a/mixpanel-to-redshift2 --job-bookmark-option job-bookmark-disable --job-language python --TempDir s3://aws-glue-temporary-xxxxxxxx-us-east-1/folder_a --JOB_NAME mixpanel-to-redshift2
YARN_RM_DNS=ip-10-241-110-211.ec2.internal
Detected region us-east-1
JOB_NAME = mixpanel-to-redshift2
Specifying us-east-1 while copying script.
Completed 10.0 KiB/10.0 KiB (90.4 KiB/s) with 1 file(s) remaining
download: s3://aws-glue-scripts-xxxxxxxx-us-east-1/folder_a/mixpanel-to-redshift2 to ./script_2018-08-21-16-37-20.py
SCRIPT_URL = /tmp/g-cc03d49ea3dd3ed71e42af3c1f611dee359f977b-5918932981993359503/script_2018-08-21-16-37-20.py
/usr/lib/spark/bin/spark-submit --conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000 --conf spark.hadoop.fs.defaultFS=hdfs://ip-10-241-110-211.ec2.internal:8020 --conf spark.hadoop.yarn.resourcemanager.address=ip-10-241-110-211.ec2.internal:8032 --conf spark.dynamicAllocation.enabled=true --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.minExecutors=1 --conf spark.dynamicAllocation.maxExecutors=98 --conf spark.executor.memory=5g --conf spark.executor.cores=4 --name tape --master yarn --deploy-mode cluster --jars /opt/amazon/superjar/glue-assembly.jar --files /tmp/glue-default.conf,/tmp/glue-override.conf,/opt/amazon/certs/InternalAndExternalAndAWSTrustStore.jks,/opt/amazon/certs/rds-combined-ca-bundle.pem,/opt/amazon/certs/redshift-ssl-ca-cert.pem,/opt/amazon/certs/RDSTrustStore.jks,/tmp/image-creation-time,,/tmp/g-cc03d49ea3dd3ed71e42af3c1f611dee359f977b-5918932981993359503/script_2018-08-21-16-37-20.py --py-files /tmp/PyGlue.zip /tmp/runscript.py script_2018-08-21-16-37-20.py --JOB_NAME mixpanel-to-redshift2 --JOB_ID j_22da3607e1f4f9ca9bb4021f83ce82bd704a98865369ca1fc0693b3bd8f687ec --JOB_RUN_ID jr_a289c915a25344c5c45742150029bdab7e48be1766821463921b9350de54d609 --job-bookmark-option job-bookmark-disable --TempDir s3://aws-glue-temporary-xxxxxxxx-us-east-1/folder_a
18/08/21 16:37:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/08/21 16:37:23 INFO RMProxy: Connecting to ResourceManager at ip-10-241-110-211.ec2.internal/10.241.110.211:8032
18/08/21 16:37:24 INFO Client: Requesting a new application from cluster with 49 NodeManagers
18/08/21 16:37:24 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (12288 MB per container)
18/08/21 16:37:24 INFO Client: Will allocate AM container, with 5632 MB memory including 512 MB overhead
18/08/21 16:37:24 INFO Client: Setting up container launch context for our AM
18/08/21 16:37:24 INFO Client: Setting up the launch environment for our AM container
18/08/21 16:37:24 INFO Client: Preparing resources for our AM container
18/08/21 16:37:25 DEBUG Client: 
18/08/21 16:37:25 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
18/08/21 16:37:28 INFO Client: Uploading resource file:/tmp/spark-b2b371a1-29b9-4911-8b6d-d184ca2928c8/__spark_libs__224704182240168405.zip -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/__spark_libs__224704182240168405.zip
18/08/21 16:37:31 INFO Client: Uploading resource file:/opt/amazon/superjar/glue-assembly.jar -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/glue-assembly.jar
18/08/21 16:37:36 INFO Client: Uploading resource file:/tmp/glue-default.conf -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/glue-default.conf
18/08/21 16:37:36 INFO Client: Uploading resource file:/tmp/glue-override.conf -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/glue-override.conf
18/08/21 16:37:36 INFO Client: Uploading resource file:/opt/amazon/certs/InternalAndExternalAndAWSTrustStore.jks -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/InternalAndExternalAndAWSTrustStore.jks
18/08/21 16:37:37 INFO Client: Uploading resource file:/opt/amazon/certs/rds-combined-ca-bundle.pem -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/rds-combined-ca-bundle.pem
18/08/21 16:37:37 INFO Client: Uploading resource file:/opt/amazon/certs/redshift-ssl-ca-cert.pem -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/redshift-ssl-ca-cert.pem
18/08/21 16:37:37 INFO Client: Uploading resource file:/opt/amazon/certs/RDSTrustStore.jks -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/RDSTrustStore.jks
18/08/21 16:37:37 INFO Client: Uploading resource file:/tmp/image-creation-time -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/image-creation-time
18/08/21 16:37:37 INFO Client: Uploading resource file:/tmp/g-cc03d49ea3dd3ed71e42af3c1f611dee359f977b-5918932981993359503/script_2018-08-21-16-37-20.py -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/script_2018-08-21-16-37-20.py
18/08/21 16:37:38 INFO Client: Uploading resource file:/tmp/runscript.py -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/runscript.py
18/08/21 16:37:38 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/pyspark.zip
18/08/21 16:37:38 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/py4j-0.10.4-src.zip
18/08/21 16:37:38 INFO Client: Uploading resource file:/tmp/PyGlue.zip -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/PyGlue.zip
18/08/21 16:37:38 INFO Client: Uploading resource file:/tmp/spark-b2b371a1-29b9-4911-8b6d-d184ca2928c8/__spark_conf__8624620488004095183.zip -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001/__spark_conf__.zip
18/08/21 16:37:38 DEBUG Client: ===============================================================================
18/08/21 16:37:38 DEBUG Client: YARN AM launch context:
18/08/21 16:37:38 DEBUG Client: user class: org.apache.spark.deploy.PythonRunner
18/08/21 16:37:38 DEBUG Client: env:
18/08/21 16:37:38 DEBUG Client: CLASSPATH -> ./*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*<CPS>{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/*<CPS>$HADOOP_COMMON_HOME/share/hadoop/common/lib/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/*<CPS>$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/*<CPS>$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
18/08/21 16:37:38 DEBUG Client: SPARK_YARN_STAGING_DIR -> hdfs://ip-10-241-110-211.ec2.internal:8020/user/root/.sparkStaging/application_1534869144604_0001
18/08/21 16:37:38 DEBUG Client: SPARK_USER -> root
18/08/21 16:37:38 DEBUG Client: SPARK_YARN_MODE -> true
18/08/21 16:37:38 DEBUG Client: PYTHONHASHSEED -> 0
18/08/21 16:37:38 DEBUG Client: PYTHONPATH -> {{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.4-src.zip<CPS>{{PWD}}/PyGlue.zip
18/08/21 16:37:38 DEBUG Client: resources:
18/08/21 16:37:38 DEBUG Client: image-creation-time -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/image-creation-time" } size: 11 timestamp: 1534869457929 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: py4j-0.10.4-src.zip -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/py4j-0.10.4-src.zip" } size: 74096 timestamp: 1534869458532 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: glue-assembly.jar -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/glue-assembly.jar" } size: 403225879 timestamp: 1534869456636 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: pyspark.zip -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/pyspark.zip" } size: 482687 timestamp: 1534869458409 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: __spark_libs__ -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/__spark_libs__224704182240168405.zip" } size: 218269777 timestamp: 1534869451273 type: ARCHIVE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: redshift-ssl-ca-cert.pem -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/redshift-ssl-ca-cert.pem" } size: 8621 timestamp: 1534869457772 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: rds-combined-ca-bundle.pem -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/rds-combined-ca-bundle.pem" } size: 26016 timestamp: 1534869457593 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: glue-default.conf -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/glue-default.conf" } size: 382 timestamp: 1534869456746 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: runscript.py -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/runscript.py" } size: 3549 timestamp: 1534869458289 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: glue-override.conf -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/glue-override.conf" } size: 276 timestamp: 1534869456853 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: InternalAndExternalAndAWSTrustStore.jks -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/InternalAndExternalAndAWSTrustStore.jks" } size: 120188 timestamp: 1534869457478 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: PyGlue.zip -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/PyGlue.zip" } size: 101883 timestamp: 1534869458663 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: __spark_conf__ -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/__spark_conf__.zip" } size: 8005 timestamp: 1534869458698 type: ARCHIVE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: RDSTrustStore.jks -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/RDSTrustStore.jks" } size: 19135 timestamp: 1534869457813 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: script_2018-08-21-16-37-20.py -> resource { scheme: "hdfs" host: "ip-10-241-110-211.ec2.internal" port: 8020 file: "/user/root/.sparkStaging/application_1534869144604_0001/script_2018-08-21-16-37-20.py" } size: 10272 timestamp: 1534869458118 type: FILE visibility: PRIVATE
18/08/21 16:37:38 DEBUG Client: command:
18/08/21 16:37:38 DEBUG Client: LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH" {{JAVA_HOME}}/bin/java -server -Xmx5120m -Djava.io.tmpdir={{PWD}}/tmp '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' '-Djavax.net.ssl.trustStore=InternalAndExternalAndAWSTrustStore.jks' '-Djavax.net.ssl.trustStoreType=JKS' '-Djavax.net.ssl.trustStorePassword=amazon' '-DRDS_ROOT_CERT_PATH=rds-combined-ca-bundle.pem' '-DREDSHIFT_ROOT_CERT_PATH=redshift-ssl-ca-cert.pem' '-DRDS_TRUSTSTORE_URL=file:RDSTrustStore.jks' -Dspark.yarn.app.container.log.dir=<LOG_DIR> org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.deploy.PythonRunner' --primary-py-file runscript.py --arg 'script_2018-08-21-16-37-20.py' --arg '--JOB_NAME' --arg 'mixpanel-to-redshift2' --arg '--JOB_ID' --arg 'j_22da3607e1f4f9ca9bb4021f83ce82bd704a98865369ca1fc0693b3bd8f687ec' --arg '--JOB_RUN_ID' --arg 'jr_a289c915a25344c5c45742150029bdab7e48be1766821463921b9350de54d609' --arg '--job-bookmark-option' --arg 'job-bookmark-disable' --arg '--TempDir' --arg 's3://aws-glue-temporary-xxxxxxxx-us-east-1/folder_a' --properties-file {{PWD}}/__spark_conf__/__spark_conf__.properties 1> <LOG_DIR>/stdout 2> <LOG_DIR>/stderr
18/08/21 16:37:38 DEBUG Client: ===============================================================================
18/08/21 16:37:38 INFO SecurityManager: Changing view acls to: root
18/08/21 16:37:38 INFO SecurityManager: Changing modify acls to: root
18/08/21 16:37:38 INFO SecurityManager: Changing view acls groups to: 
18/08/21 16:37:38 INFO SecurityManager: Changing modify acls groups to: 
18/08/21 16:37:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
18/08/21 16:37:38 INFO Client: Submitting application application_1534869144604_0001 to ResourceManager
18/08/21 16:37:39 INFO YarnClientImpl: Submitted application application_1534869144604_0001
18/08/21 16:37:40 INFO Client: Application report for application_1534869144604_0001 (state: ACCEPTED)
applicationid is application_1534869144604_0001, yarnRMDNS is ip-10-241-110-211.ec2.internal
Application info reporting is enabled.
----------Recording application Id and Yarn RM DNS for cancellation-----------------
10.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:49 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:49 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:50 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:50 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:51 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:51 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:52 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:52 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:53 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:53 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:54 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:54 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:55 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:55 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:56 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:56 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:57 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:57 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:58 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:58 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:37:59 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:37:59 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:38:00 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:38:00 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root
18/08/21 16:38:01 INFO Client: Application report for application_1534869144604_0001 (state: RUNNING)
18/08/21 16:38:01 DEBUG Client: 
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.241.110.165
ApplicationMaster RPC port: 0
queue: default
start time: 1534869458945
final status: UNDEFINED
tracking URL: http://ip-10-241-110-211.ec2.internal:20888/proxy/application_1534869144604_0001/
user: root

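We tried increasing the DPUs and the timeout, but I doubt the job needs more than 20 minutes, since the JSON files are no larger than 50 MB. At this point we are not sure how to debug the problem further and find the root cause.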

We would greatly appreciate any advice.

Update

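After further investigation, it turned out that the JSON files are actually in JSONL format, with one valid JSON payload per line. AWS Glue does not seem to know how to handle the files, and a custom classifier may be needed.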
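For anyone who wants to check their own files the same way, here is a minimal sketch of the JSON vs. JSONL distinction using only the Python standard library. The function name `detect_json_layout` is made up for this example. It only inspects the file contents and says nothing about how Glue itself classifies the file.

```python
import json


def detect_json_layout(text):
    """Return 'json', 'jsonl', or 'unknown' for the given file contents.

    Hypothetical helper for illustration; not part of any AWS API.
    """
    stripped = text.strip()
    if not stripped:
        return "unknown"

    # Case 1: the whole text is a single JSON document.
    try:
        json.loads(stripped)
        return "json"
    except ValueError:
        pass

    # Case 2: every non-empty line is a valid JSON document on its own.
    lines = [line for line in stripped.splitlines() if line.strip()]
    try:
        for line in lines:
            json.loads(line)
    except ValueError:
        return "unknown"
    return "jsonl"


if __name__ == "__main__":
    sample = '{"event": "signup"}\n{"event": "login"}\n'
    print(detect_json_layout(sample))
```

Note that a file with only one record parses as a single JSON document, so it is reported as 'json' even if it was written as JSONL.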

0 answers:

No answers