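Using subprocess to output to a file in HDFS

Time: 2014-03-12 11:13:39

Tags: python subprocess hdfs

I have a script that reads text line by line, modifies each line slightly, and then writes the line to a file. Reading the text works fine; the problem is that I cannot write it out. Here is my code.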

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"], stdout=subprocess.PIPE)
for line in cat.stdout:
    line = line+"Blah";
    subprocess.Popen(["hadoop", "fs", "-put", "/user/test/moddedfile.txt"], stdin=line)

This is the error I get.

AttributeError: 'str' object has no attribute 'fileno'
cat: Unable to write to output stream.

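2 answers:

Answer 0: (score: 5)

The stdin argument does not accept a string. It should be PIPE, None, or an existing file (something with a valid .fileno(), or an integer file descriptor).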

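from subprocess import Popen, PIPE

# Read the source file from HDFS through a pipe.
cat = Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"],
            stdout=PIPE, bufsize=-1)
# "-" tells "hadoop fs -put" to read the data from stdin.
put = Popen(["hadoop", "fs", "-put", "-", "/user/test/moddedfile.txt"],
            stdin=PIPE, bufsize=-1)
for line in cat.stdout:
    line += b"Blah"  # pipes carry bytes; b"..." works on Python 2.6+ and 3
    put.stdin.write(line)

cat.stdout.close()
cat.wait()
put.stdin.close()  # signals EOF so that put can finish
put.wait()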
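
The same pattern can be wrapped in a small helper. This is only a sketch under a few assumptions: `pipe_with_suffix` is a hypothetical name, the commands are passed in as argument lists, and the suffix is placed before the line ending (the code above appends it after the newline, which may not be what was intended).

```python
import subprocess


def pipe_with_suffix(read_cmd, write_cmd, suffix=b"Blah"):
    """Stream read_cmd's stdout into write_cmd's stdin, adding suffix to each line."""
    reader = subprocess.Popen(read_cmd, stdout=subprocess.PIPE)
    writer = subprocess.Popen(write_cmd, stdin=subprocess.PIPE)
    for line in reader.stdout:
        # Strip the line ending first so the suffix stays on the same line.
        writer.stdin.write(line.rstrip(b"\r\n") + suffix + b"\n")
    reader.stdout.close()
    writer.stdin.close()  # EOF lets the writing process finish
    # Return both exit codes so the caller can notice a failed command.
    return reader.wait(), writer.wait()
```

With the commands from this answer, the call would be `pipe_with_suffix(["hadoop", "fs", "-cat", "/user/test/myfile.txt"], ["hadoop", "fs", "-put", "-", "/user/test/moddedfile.txt"])`, and a result other than `(0, 0)` would indicate that one of the hadoop commands failed.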

Answer 1: (score: 2)

A quick and dirty way to get your code working:

import subprocess
from tempfile import NamedTemporaryFile

cat = subprocess.Popen(["hadoop", "fs", "-cat", "/user/test/myfile.txt"],
                       stdout=subprocess.PIPE)

with NamedTemporaryFile() as f:
    for line in cat.stdout:
        f.write(line + b'Blah')  # the file is opened in binary mode by default

    f.flush()  # make sure the data is on disk before hadoop reads the file
    cat.wait()

    # The file is passed by name, so no stdin redirection is needed.
    put = subprocess.Popen(["hadoop", "fs", "-put", f.name, "/user/test/moddedfile.txt"])
    put.wait()

But I would suggest looking at the hdfs / webhdfs Python libraries.

For example, pywebhdfs.
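
A rough sketch of what that could look like follows. The helper name `add_suffix_to_hdfs_file` is hypothetical, and it assumes a client object that exposes `read_file(path)` returning the file content as bytes and `create_file(path, data)`, which is how pywebhdfs's `PyWebHdfsClient` is understood to behave; check the library documentation before relying on it. It also loads the whole file into memory, so it only suits files of modest size.

```python
def add_suffix_to_hdfs_file(client, src_path, dst_path, suffix=b"Blah"):
    """Read src_path through client, add suffix to every line, write dst_path."""
    data = client.read_file(src_path)  # assumed to return bytes
    lines = data.splitlines()
    modified = b"".join(line + suffix + b"\n" for line in lines)
    client.create_file(dst_path, modified)
    return len(lines)  # number of lines processed


# Hypothetical usage; host, port and user name are placeholders:
# from pywebhdfs.webhdfs import PyWebHdfsClient
# client = PyWebHdfsClient(host="namenode.example.com", port="50070",
#                          user_name="test")
# add_suffix_to_hdfs_file(client, "user/test/myfile.txt",
#                         "user/test/moddedfile.txt")
```

Taking the client as a parameter keeps the helper independent of how the connection is configured and makes it easy to try out with a stand-in object.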