Reading/writing a file in HDFS from Python with subprocess, PIPE, and Popen gives errors

Date: 2015-01-25 17:40:53

Tags: python hadoop hdfs popen hadoop-streaming

I am trying to read (open) and write files in HDFS from a Python script, but I am getting errors. Can someone tell me what is wrong here?

Code (complete): sample.py

#!/usr/bin/python

from subprocess import Popen, PIPE

print "Before Loop"

cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)

print "After Loop 1"
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=PIPE)

print "After Loop 2"
for line in cat.stdout:
    line += "Blah"
    print line
    print "Inside Loop"
    put.stdin.write(line)

cat.stdout.close()
cat.wait()
put.stdin.close()
put.wait()

When I execute it with:

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar -file ./sample.py -mapper './sample.py' -input sample.txt -output fileRead

It runs without errors, but I cannot find the modifiedfile.txt that should have been created in HDFS.

When I run:

 hadoop fs -getmerge ./fileRead/ file.txt

I get the following in file.txt:

Before Loop 
Before Loop 
After Loop 1    
After Loop 1    
After Loop 2    
After Loop 2

Can someone tell me what I am doing wrong here? I don't think it is reading from sample.txt at all.
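A side note on the output above: in Hadoop Streaming, everything a mapper writes to stdout is collected as job output, which is why the bare print statements ("Before Loop", "After Loop 1", …) themselves end up in fileRead. Diagnostics belong on stderr. A minimal local sketch of that split (the inline mapper string is hypothetical, for illustration only):

```python
import sys
from subprocess import run, PIPE

# A tiny stand-in mapper: it logs to stderr and emits data on stdout
mapper = (
    "import sys\n"
    "sys.stderr.write('Before Loop\\n')   # log line, not data\n"
    "sys.stdout.write('data line\\n')     # actual mapper output\n"
)
result = run([sys.executable, "-c", mapper],
             stdout=PIPE, stderr=PIPE, universal_newlines=True)
print(result.stdout)  # only 'data line' would reach the job's output
```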

2 Answers:

Answer 0 (score: 1)

Try changing your put subprocess so that it reads directly from cat's stdout. Change this:
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=PIPE)

into this:

put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)

Complete script:

#!/usr/bin/python

from subprocess import Popen, PIPE

print "Before Loop"

cat = Popen(["hadoop", "fs", "-cat", "./sample.txt"],
            stdout=PIPE)

print "After Loop 1"
put = Popen(["hadoop", "fs", "-put", "-", "./modifiedfile.txt"],
            stdin=cat.stdout)
put.communicate()
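The same pipe-chaining pattern can be tried locally with ordinary shell commands standing in for hadoop fs (printf and tr here are just placeholder producer/consumer commands):

```python
from subprocess import Popen, PIPE

# "printf" stands in for "hadoop fs -cat": it produces two lines
cat = Popen(["printf", "line1\nline2\n"], stdout=PIPE)
# Connect the producer's stdout straight to the consumer's stdin,
# the same way the answer wires cat.stdout into put
upper = Popen(["tr", "a-z", "A-Z"], stdin=cat.stdout, stdout=PIPE)
cat.stdout.close()  # let the producer see SIGPIPE if the consumer exits early
out, _ = upper.communicate()
print(out.decode())
```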

Answer 1 (score: 0)

> Can someone tell me what I am doing wrong here?

Your sample.py is probably not a suitable mapper. A mapper should read its input from stdin and write its results to stdout, e.g., blah.py:

#!/usr/bin/env python
import sys

for line in sys.stdin: # print("Blah\n".join(sys.stdin) + "Blah\n")
    line += "Blah"
    print(line)

Usage:

$ hadoop ... -file ./blah.py -mapper './blah.py' -input sample.txt -output fileRead
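The mapper can also be tried out without a cluster by piping sample input through it on the command line (`printf 'a\nb\n' | ./blah.py`). The equivalent check from Python, with an inline copy of blah.py's logic so the sketch is self-contained:

```python
import sys
from subprocess import run, PIPE

# Inline copy of blah.py's loop body
mapper_code = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    line += 'Blah'\n"
    "    print(line)\n"
)
# Feed two sample lines on stdin, the way Hadoop Streaming would
result = run([sys.executable, "-c", mapper_code],
             input="a\nb\n", stdout=PIPE, universal_newlines=True)
print(repr(result.stdout))  # each input line is followed by a 'Blah' line
```

Note that because `line` still carries its trailing newline when `"Blah"` is appended, each "Blah" lands on its own line, as the comment in blah.py indicates.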