ERROR JobScheduler: Error generating jobs for time

Time: 2019-09-14 07:09:31

Tags: pyspark spark-streaming

I am monitoring the HDFS directory /apps/spark, where all Spark application logs in the cluster are written, in order to capture the names of those log files for further analysis. I am trying to do this with the following code:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def f(rdd):
    # Parse this batch RDD's debug string to recover the HDFS paths of
    # the files that fed it, then print just the file names.
    debug = rdd.toDebugString()
    lines = debug.split("\n")[2:]

    for l in lines:
        file = l.split()[1].split("/")[-1]
        print('File => {}'.format(file))

# Create a local StreamingContext with batch interval of 1 second
sc = SparkContext("local[*]", "Basic")
ssc = StreamingContext(sc, 1)

textFile = ssc.textFileStream("hdfs:///apps/spark/")
textFile.foreachRDD(f)

ssc.start()             # Start
ssc.awaitTermination()  # Wait
ssc.stop()              # Stop

(screenshot: contents of /apps/spark)

/apps/mapr/spark is an NFS mount of the /apps/spark HDFS directory.

A property of these application logs is that, while an application is running, its log file name ends with .inprogress; once the application finishes, the .inprogress suffix is removed from the file name.
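To make that renaming concrete, a tiny helper like the hypothetical normalized_name below (illustration only, not part of my job) maps both forms of a log name to the same base name:

# Hypothetical helper (illustration only): while an application is running
# the log is named "application_..._NNNN.inprogress"; on completion it is
# renamed to "application_..._NNNN". Stripping the suffix gives one base name.
def normalized_name(name):
    suffix = ".inprogress"
    return name[:-len(suffix)] if name.endswith(suffix) else name

print(normalized_name("application_1561982966645_259789.inprogress"))
# prints: application_1561982966645_259789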

So while the code above is running, I can get the file names with the ".inprogress" suffix, like this:

File => application_1561982966645_259789.inprogress

However, as soon as the application completes and the file is renamed, my "spark-submit /users/hdpgis/code/stream.py" job fails with the following error:

19/09/13 23:57:37 ERROR JobScheduler: Error generating jobs for time 1568444257000 ms
java.lang.NullPointerException
.
.
.
File "/users/hdpgis/code/stream.py", line 69, in <module>
ssc.awaitTermination()  # Wait for the computation to terminate

File "/opt/mapr/spark/spark-2.2.1/python/lib/pyspark.zip/pyspark/streaming/context.py", line 206, in awaitTermination

File "/opt/mapr/spark/spark-2.2.1/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__

File "/opt/mapr/spark/spark-2.2.1/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o55.awaitTermination.

: java.lang.NullPointerException
.
.
.
2019-09-13 23:57:37,1621 ERROR Client fs/client/fileclient/cc/client.cc:7104 Thread: 12506 Getfidmap failed for file /apps/spark/application_1561982966645_259789, rpc error Permission denied(13) for fid 2049.11286016.1850438190

2019-09-13 23:57:37,1621 ERROR Client fs/client/fileclient/cc/client.cc:5205 Thread: 12506 GetFidMap failed for file /apps/spark/application_1561982966645_259789, Could not get fid for offset 65536, err 13for fid 2049.11286016.1850438190

2019-09-13 23:57:37,1621 ERROR JniCommon fs/client/fileclient/cc/jni_MapRClient.cc:3396 Thread: 12506 getBlockInfo failed for file /apps/spark/application_1561982966645_259789, could not get fidmap

I know this is definitely due to the way the file name changes once the application completes successfully. Is there a way to get around this error and simply keep watching the directory for new log files?

Note: application_1561982966645_259789.inprogress and application_1561982966645_259789 both have the same permissions, 770.

I simply do not understand why Spark Streaming can read the .inprogress file but not the file it becomes afterwards.
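For completeness, here is a minimal, untested sketch of the kind of guard I could put around my parsing (tolerating debug-string lines that do not contain a path and normalizing away the .inprogress suffix), although I suspect the NullPointerException is raised while generating the job, before f is even called:

def f(rdd):
    # Untested sketch: same parsing as above, but skip debug-string lines
    # that do not contain a path and strip the ".inprogress" suffix.
    try:
        lines = rdd.toDebugString().split("\n")[2:]
    except Exception as e:
        print('Could not inspect RDD: {}'.format(e))
        return

    for l in lines:
        parts = l.split()
        if len(parts) < 2 or "/" not in parts[1]:
            continue
        name = parts[1].split("/")[-1]
        if name.endswith(".inprogress"):
            name = name[:-len(".inprogress")]
        print('File => {}'.format(name))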

0 Answers:

There are no answers.