How to incrementally copy files from an FTP server to Hadoop HDFS

Date: 2018-02-12 14:13:37

Tags: hadoop ftp hdfs

We have an FTP server to which many files are uploaded every day, and I need to copy all of those files into HDFS.

Each run should download only the new files. For example, if the first run downloads 10 files and 5 new files are then uploaded to the FTP server, the next job iteration should download only those 5 new files into HDFS.

We are not using NiFi or Kafka Connect.

Is there a good solution to accomplish this task?

1 Answer:

Answer 0 (score: 1)

You can achieve this with a touch file in your LFTP job. My explanation and code are below; see the comments at each step.

#!/bin/bash
#SomeConfigs
TOUCHFILE='/somepath/inYourLocal/someFilename.touch'
RemoteSFTPserverPath='/Remote/Server/path/toTheFiles'
LocalPath='/Local/Path/toReceiveTheFiles'
FTP_Server_UserName='someUser'
FTP_Server_Password='SomePassword'
ServerIP='127.12.11.35'
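
#First-run bootstrap (an assumption, not part of the original answer): if
#the touch file does not exist yet, create it with an old timestamp so the
#first run mirrors everything already on the server
if [ ! -f "${TOUCHFILE}" ]; then
  touch -d '1970-01-01 00:00:00' "${TOUCHFILE}"
fi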

#Transfer files from the FTP server #This is the main command
ftp_command="lftp -e 'mirror --only-missing --newer-than=${TOUCHFILE} --older-than=now-2minutes --parallel=4 --no-recursion --include-glob \"SomeFileName*.csv\" ${RemoteSFTPserverPath}/ ${LocalPath}/; exit' -u ${FTP_Server_UserName},${FTP_Server_Password} sftp://${ServerIP}"

#Execute the job
eval "${ftp_command}"
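
#Optional guard (an assumption, not part of the original answer): if the
#lftp mirror failed, stop here so the touch file below is not advanced
#past files that were never downloaded
if [ $? -ne 0 ]; then
  echo "lftp mirror failed; touch file not updated" >&2
  exit 1
fi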

#After the lftp job finishes, you have to update the touch file for the next run
#This sets it to the current timestamp
touch "${TOUCHFILE}"

#If you want to update it with the time of the last file received instead
TchDate=$(stat -c %y "${LocalPath}/$(ls -1t ${LocalPath} | head -n1)")
touch -d "${TchDate}" "${TOUCHFILE}"

#Alternatively, you can stat the latest file on the remote server over ssh
TchDate=$(ssh -o StrictHostKeyChecking=no ${FTP_Server_UserName}@${ServerIP} "stat -c %y \"${RemoteSFTPserverPath}/\$(ls -1t ${RemoteSFTPserverPath}/ | head -n1)\"")
touch -d "${TchDate}" "${TOUCHFILE}"

#Once you have the files in your local path you can copy them to HDFS

hdfs dfs -put -f ${LocalPath}/*.csv /HDFS/PATH
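
#Optional safety check (an assumption, not part of the original answer):
#stop before the cleanup below if the HDFS copy failed, so no files are lost
if [ $? -ne 0 ]; then
  echo "hdfs dfs -put failed; keeping local files" >&2
  exit 1
fi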

#Remove the local files to make room for the upcoming files
rm -r -f ${LocalPath}/*.csv

In LFTP jobs you have many options; man lftp will be your best resource.
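
To run the job incrementally on a schedule, a cron entry is one straightforward option. A minimal sketch, assuming the script above is saved as /somepath/inYourLocal/ftp_to_hdfs.sh (the script name, interval, and log path are hypothetical, not part of the original answer):

#Hypothetical crontab entry: run the sync script every 10 minutes,
#appending its output to a log file
*/10 * * * * /somepath/inYourLocal/ftp_to_hdfs.sh >> /somepath/inYourLocal/ftp_to_hdfs.log 2>&1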