Question

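Is there a faster or more efficient way to copy files within HDFS than distcp? I tried both the regular hadoop fs -cp and distcp, and both seem to give the same transfer rate, around 50 MBPS.

I have 5 TB of data split into smaller files of 500 GB each, which I have to copy to a new location on HDFS. Any thoughts?

Edit: The original distcp was spawning only 1 mapper, so I added the -m100 option to increase the number of mappers: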

hadoop distcp -D mapred.job.name="Gigafiles distcp" -pb -i -m100 "/user/abc/file1" "/xyz/aaa/file1"

But it is still spawning just 1 mapper instead of 100. Am I missing something here?

Answer 1

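I came up with this for copying a subset of files from one HDFS folder to another. It may not be as efficient as distcp, but it does the job and gives you more freedom if you want to do other operations along the way. It also checks whether each file already exists at the destination: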

from multiprocessing import Process
from subprocess import Popen, PIPE

import pandas as pd

hdfs_path_1 = '/path/to/the/origin/'
hdfs_path_2 = '/path/to/the/destination/'

def copyy(f):
    # Copy one file to the destination; hdfs reports failures on stderr.
    process = Popen(f'hdfs dfs -cp {hdfs_path_1}{f} {hdfs_path_2}', shell=True, stdout=PIPE, stderr=PIPE)
    std_out, std_err = process.communicate()
    if process.returncode != 0:
        print(std_err.decode())

if __name__ == '__main__':
    # List the destination and skip the "Found N items" header line.
    process = Popen(f'hdfs dfs -ls -h {hdfs_path_2}', shell=True, stdout=PIPE, stderr=PIPE)
    std_out, std_err = process.communicate()
    already_processed = [fn.split()[-1].split('/')[-1] for fn in std_out.decode().splitlines()[1:] if fn.strip()]
    print(f'Total number of ALREADY PROCESSED tar files = {len(already_processed)}')

    df = pd.read_csv("list_of_files.csv")  # or any other list that you have
    to_do_list = set(df.tar) - set(already_processed)
    print(f'To go: {len(to_do_list)}')

    # Start one process per file, then wait for all of them to finish.
    ps = []
    for f in to_do_list:
        p = Process(target=copyy, args=(f,))
        p.start()
        ps.append(p)
    for p in ps:
        p.join()
    print('done')
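
The loop above starts one process per file, which can be a lot of simultaneous hdfs clients when the list is long. A minimal sketch of the same idea with a cap on concurrency follows. The names copy_one and copy_all, the max_workers default of 8, and the injectable run parameter are all hypothetical and not part of the original answer. Threads are enough here because the actual copying happens in the hdfs child processes.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def copy_one(src, dst_dir, run=subprocess.run):
    """Copy one HDFS path with `hdfs dfs -cp`; return (src, ok, stderr)."""
    # Passing the command as a list avoids shell quoting problems.
    result = run(['hdfs', 'dfs', '-cp', src, dst_dir], capture_output=True, text=True)
    return src, result.returncode == 0, result.stderr


def copy_all(sources, dst_dir, max_workers=8, run=subprocess.run):
    """Copy every source with at most max_workers copies in flight.

    Returns a list of (src, stderr) pairs for the copies that failed.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(lambda s: copy_one(s, dst_dir, run), sources))
    return [(src, err) for src, ok, err in results if not ok]


# Usage (requires the hdfs CLI on PATH), with the variables from the script above:
# failures = copy_all([hdfs_path_1 + f for f in to_do_list], hdfs_path_2)
# for src, err in failures:
#     print(f'FAILED {src}: {err}')
```

Returning the failures instead of printing them makes it easy to retry only the files that did not copy.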

If you want a list of all the files in a directory, use this:

from subprocess import Popen, PIPE

hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
# Skip the "Found N items" header; the path is the last whitespace-separated field.
lines = [fn for fn in std_out.decode().splitlines()[1:] if fn.strip()]
list_of_file_names = [fn.split()[-1].split('/')[-1] for fn in lines]
list_of_file_names_with_full_address = [fn.split()[-1] for fn in lines]
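
Taking the last whitespace-separated field, as the listing snippets above do, truncates any path that contains a space. A hedged sketch of a slightly more defensive parser is below. It assumes the usual eight-column layout of plain `hdfs dfs -ls` output (permissions, replication, owner, group, size, date, time, path) and that -h is not used, since human-readable sizes may add an extra column. The function name parse_ls_output and the sample listing are made up for illustration.

```python
import posixpath


def parse_ls_output(text):
    """Return the full paths listed in the output of `hdfs dfs -ls` (without -h)."""
    paths = []
    for line in text.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        parts = line.split(None, 7)
        if len(parts) < 8:
            continue  # the "Found N items" header or a blank line
        paths.append(parts[7])
    return paths


# Hypothetical sample listing, used only to show the parsing.
sample = (
    "Found 2 items\n"
    "-rw-r--r--   3 abc supergroup  1048576 2020-01-01 12:00 /xyz/aaa/file1.tar\n"
    "-rw-r--r--   3 abc supergroup  2097152 2020-01-02 08:30 /xyz/aaa/my file.tar\n"
)
full_paths = parse_ls_output(sample)
file_names = [posixpath.basename(p) for p in full_paths]
print(full_paths)
print(file_names)
```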

Answer 2

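I was able to solve this by using a Pig script to read the data from path A, convert it to Parquet (which is the desired storage format anyway), and write it to path B. The process took close to 20 minutes on average per 500 GB file. Thank you for the suggestions.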
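
For a rough sense of scale, the figures quoted in this thread can be turned into throughput numbers. This is back-of-the-envelope arithmetic only: it assumes "MBPS" in the question means megabytes per second, uses decimal units, and treats "close to 20 minutes" as exactly 20.

```python
GB = 10**9  # decimal gigabyte (assumption)
MB = 10**6  # decimal megabyte (assumption)


def throughput_mb_per_s(size_bytes, seconds):
    """Average throughput in MB/s for size_bytes moved in the given time."""
    return size_bytes / seconds / MB


pig_rate = throughput_mb_per_s(500 * GB, 20 * 60)  # one 500 GB file in 20 minutes
speedup = pig_rate / 50                            # versus the 50 MB/s seen with distcp
print(f'{pig_rate:.0f} MB/s, about {speedup:.1f}x the distcp rate')
# prints "417 MB/s, about 8.3x the distcp rate"
```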

Efficient copy method in Hadoop
