Get size of file while writer is still open on HDFS

Date: 2018-08-22 13:47:16

Tags: java hdfs avro apache-kafka-connect confluent

I'm trying to poll the size of a temporary Avro file that is being written to on HDFS from a Kafka topic, but org.apache.hadoop.fs.FileStatus keeps returning 0 bytes from getLen() while the writer is still open and writing.
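Roughly, the check boils down to something like this (the temp-file path below is illustrative, not the actual path the connector uses):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OpenFileSizePoller {
        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());

            // Illustrative temp-file path; the real one is managed by the HDFS connector.
            Path tempFile = new Path("/topics/+tmp/my-topic/partition=0/some-file.avro.tmp");

            // While the writer still holds the lease, the NameNode generally only reports
            // the length of completed blocks, so this comes back as 0 for an open file.
            FileStatus status = fs.getFileStatus(tempFile);
            System.out.println("getLen() reports: " + status.getLen() + " bytes");
        }
    }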

I could keep a counter of the written length at the writer end, but deep down the data is converted into a binary format (Avro) whose length differs from that of the original records. The size could be approximated, but I'm looking for a more precise solution.

Is there a way to get the size of a still-open HDFS file, either from the HDFS side (io.confluent.connect.hdfs.storage.HdfsStorage) or from the file-writer side (io.confluent.connect.storage.format.RecordWriter)?

1 Answer:

Answer 0 (score: 0):

In the end I extended the AvroRecordWriterProvider used by the RecordWriter and added a wrapper around the FSDataOutputStream so that the current size can be queried from the TopicPartitionWriter.
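A minimal sketch of the wrapper idea (the class below is illustrative, not the code from the fork; the hookup into AvroRecordWriterProvider and TopicPartitionWriter is omitted). It assumes FSDataOutputStream.getPos() as the source of the running size:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;

    /**
     * Sketch of the wrapper: hold on to the FSDataOutputStream the Avro writer
     * flushes into and expose its position as the bytes written so far. The
     * position advances as data is handed to HDFS, even though getFileStatus()
     * still reports 0 until the file is closed.
     */
    public class SizeTrackingStream {

        private final FSDataOutputStream delegate;

        public SizeTrackingStream(FSDataOutputStream delegate) {
            this.delegate = delegate;
        }

        /** Approximate current size of the still-open file (modulo Avro's own buffering). */
        public long currentSize() throws IOException {
            return delegate.getPos();
        }

        public FSDataOutputStream stream() {
            return delegate;
        }
    }

The TopicPartitionWriter can then ask the wrapper for currentSize() whenever it needs the running size of the open temp file.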

Once it clears legal review, I'll push the code to a fork and post the link for anyone who's interested.