Question

您好我使用flume将文件从假脱机目录复制到HDFS，使用文件作为频道。

#Component names
a1.sources = src
a1.channels = c1
a1.sinks = k1

#Source details
a1.sources.src.type = spooldir
a1.sources.src.channels = c1
a1.sources.src.spoolDir = /home/cloudera/onetrail
a1.sources.src.fileHeader = false
a1.sources.src.basenameHeader = true
# a1.sources.src.basenameHeaderKey = basename
a1.sources.src.fileSuffix = .COMPLETED
a1.sources.src.threads = 4
a1.sources.src.interceptors = newint
a1.sources.src.interceptors.newint.type = timestamp

#Sink details
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs:///data/contentProviders/cnet/%Y%m%d/
# a1.sinks.k1.hdfs.round = false
# a1.sinks.k1.hdfs.roundValue = 1
# a1.sinks.k1.hdfs.roundUnit = second
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
#a1.sinks.k1.hdfs.file.Type = DataStream
a1.sinks.k1.hdfs.filePrefix = %{basename}
# a1.sinks.k1.hdfs.fileSuffix = .xml
a1.sinks.k1.threadsPoolSize = 4

# use a single file at a time
a1.sinks.k1.hdfs.maxOpenFiles = 1

# rollover file based on maximum size of 10 MB
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.batchSize = 12

# Channel details
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /tmp/flume/checkpoint/
a1.channels.c1.dataDirs = /tmp/flume/data/

# Bind the source and sink to the channel
a1.sources.src.channels = c1
a1.sinks.k1.channels = c1

使用上述配置，它可以将文件复制到hdfs，但我遇到的问题是一个文件是保持为.tmp而不是复制完整的文件内容。

有人可以帮我解决可能出现的问题。

Answer 1

一旦被Flume“滚动”，.tmp文件就会重命名为它的最终名称。

您的所有翻转设置均为0，表示“保持流永久打开”。

将一个或多个值设置为非零值确定Flume何时认为文件已完成，以便它可以关闭它并打开下一个文件。

有关详细信息，请参阅documentation here on Flume Sinks。

flume仍保留.tmp文件，而不是将文件完全复制到HDFS

1 个答案: