Generating files directly in HDFS

Date: 2016-01-11 13:12:29

Tags: python hadoop hdfs

Is there a way to generate a file directly on HDFS? I want to avoid generating a local file and then copying it to HDFS via the hdfs command line, e.g.: hdfs dfs -put - "file_name.csv".

Or is there a Python library for this?

4 answers:

Answer 0 (score: 0)

Have you tried HdfsCli?

Quoting the section Reading and writing files:

# Loading a file in memory.
with client.read('features') as reader:
  features = reader.read()

# Directly deserializing a JSON object.
with client.read('model.json', encoding='utf-8') as reader:
  from json import load
  model = load(reader)
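
The quoted section only shows reading; since the question is about generating files, a minimal writing sketch with the same library may help (the target path 'output.csv' and the sample rows are placeholders, not from the docs):

# Writing text straight into a new HDFS file with HdfsCli.
# 'output.csv' is a hypothetical path; `client` is the same client as above.
with client.write('output.csv', encoding='utf-8', overwrite=True) as writer:
    writer.write('id,value\n')
    writer.write('1,0.5\n')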

Answer 1 (score: 0)

When I use the hdfscli write method, it is extremely slow. Is there a way to speed it up?

import csv
from tqdm import tqdm

with client.write(conf.hdfs_location + '/' + conf.filename, encoding='utf-8', buffersize=10000000) as f:
    writer = csv.writer(f, delimiter=conf.separator)
    for i in tqdm(range(10000000000)):
        row = [column.get_value() for column in conf.columns]
        writer.writerow(row)

Thanks a lot.
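
No reply appears in the thread, but one common mitigation (a hedged sketch, not something suggested in this thread) is to batch rows into an in-memory buffer and flush large chunks, so each write to the HDFS stream carries many rows instead of one:

import csv
import io

# Sketch: buffer rows locally and flush in large chunks.
# `client` and `conf` are assumed to be the same objects as in the snippet above.
CHUNK_ROWS = 100000  # flush every 100k rows; tune to taste

with client.write(conf.hdfs_location + '/' + conf.filename, encoding='utf-8') as f:
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=conf.separator)
    for i in range(10000000000):
        writer.writerow([column.get_value() for column in conf.columns])
        if (i + 1) % CHUNK_ROWS == 0:
            f.write(buf.getvalue())  # one large write instead of many tiny ones
            buf.seek(0)
            buf.truncate(0)
    f.write(buf.getvalue())  # flush any remaining rows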

Answer 2 (score: 0)

hdfs dfs -put does not require you to create the file locally. There is also no need to create a zero-byte file on hdfs (touchz) and append to it (appendToFile). You can write a file directly on hdfs:

hadoop fs -put - /user/myuser/testfile

Hit Enter. At the command prompt, type the text you want to put in the file. When you are done, press Ctrl+D.
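
The same stdin trick can be driven from Python via subprocess, which avoids a local file entirely. A hedged sketch; the target path and generated lines are placeholders:

from subprocess import PIPE, Popen

# Stream generated lines straight into `hadoop fs -put -` (stdin goes to HDFS).
put = Popen(["hadoop", "fs", "-put", "-", "/user/myuser/testfile"], stdin=PIPE)
for i in range(100):
    put.stdin.write("line {}\n".format(i).encode("utf-8"))
put.stdin.close()
put.wait()  # exit code 0 means the file was written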

Answer 3 (score: 0)

Two ways to write a local file to hdfs using Python:

One way is to use the hdfs Python package:

Code snippet:

from hdfs import InsecureClient

# Connect to the WebHDFS endpoint as user 'madhuc'.
hdfsclient = InsecureClient('http://localhost:50070', user='madhuc')
hdfspath = "/user/madhuc/hdfswritedata/"
localpath = "/home/madhuc/sample.csv"

# Upload the local file into the HDFS directory.
hdfsclient.upload(hdfspath, localpath)

Output location: '/user/madhuc/hdfswritedata/sample.csv'

The other way is to use the subprocess Python package with PIPE:

Code snippet:

from subprocess import PIPE, Popen

# Put the local file into hdfs via the hadoop CLI.
put = Popen(["hadoop", "fs", "-put", localpath, hdfspath], stdin=PIPE, bufsize=-1)
put.communicate()
print("File Saved Successfully")