Spark JobEnd Listner从hdfs路径移动源文件导致找不到文件异常

时间:2019-09-24 08:17:34

标签: apache-spark hadoop spark-streaming

火花版本:2.3

  • 火花流应用程序正在从hdfs路径流式传输

      Dataset<Row> lines = spark
      .readStream()
      .format("text")
      .load("path");
    
  • 在对Data进行一些转换之后,对于一个文件,该作业应该处于成功状态。

  • 已添加作业列表器以结束作业,并在触发文件时移动文件。

    @Override
    public void onJobENd(SparkListenerJobEnd jobEnd) {
    // Move source file to some other location which is finished processing. 
    }
    
  • 文件已成功移动到另一个位置。但与此同时(精确的时间戳记),火花会在未找到文件后引发异常,这是随机发生的,无法复制。但是经常发生

  • 即使特定作业已结束,spark仍以某种方式引用该文件。

  • 如何确保作业结束后文件不会被spark引用,并避免找不到该文件的问题

  • 我可以在:here

  • 上找到它
  

SparkListenerJobEnd

DAGScheduler does cleanUpAfterSchedulerStop, handleTaskCompletion, failJobAndIndependentStages, and markMapStageJobAsFinished.

例外:

    java.io.FileNotFoundException: File does not exist: <filename>
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1932)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1853)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1825)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:559)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getBlockLocations(AuthorizationProviderProxyClientProtocol.java:87)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:363)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteExc