Question

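I have an HDFS folder containing 20 GB of .xml files. I want to run them through a Python script (script.py) that takes an .xml file, makes changes to it, and outputs a new file. For example, on a local .xml file: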

python script.py file1.xml > file2.xml

I have the following Pig script:

input_data = LOAD 'hdfs/path/to/input/dir';
DEFINE mycommand `python script.py` ship('/path/to/my/script.py/');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO 'hdfs/path/to/output/dir';

When I run the Pig script (./pig /path/to/pig/script.pig), I get this error:

Backend error message
ERROR 2055: Received Error while processing the map plan: 'python script.py 
(stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)'
failed with exit status: 1

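This leads me to believe there is a problem with 'mycommand'. Do we need to specify that script.py takes an .xml file as input (as we do when running it locally)? If so, what is the syntax?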

DEFINE mycommand `python script.py input_data > updated_data(?)`

Or is there some other problem that I'm not seeing?
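
For context on the question above: Pig's STREAM operator writes each input record to the command's standard input and reads results back from its standard output, so the command is not handed a filename the way it is in the local invocation. A script that expects a file argument may therefore exit with a non-zero status, although exit status 1 can have other causes too. Below is a minimal sketch of a stdin-based script.py, assuming the change can be applied one line at a time. The names transform_line and process, and the old_tag/new_tag rewrite, are hypothetical placeholders and not the original script's logic.

```python
import sys


def transform_line(line):
    # Hypothetical stand-in for the real change: rename one tag.
    return line.replace("old_tag", "new_tag")


def process(stream_in, stream_out):
    # Read records one per line and write each transformed line back out.
    for line in stream_in:
        stream_out.write(transform_line(line))


if __name__ == "__main__":
    # Under STREAM ... THROUGH, records arrive on stdin and leave on stdout.
    process(sys.stdin, sys.stdout)
```

A script written this way can still be checked locally with python script.py < file1.xml > file2.xml. Note that the default loader splits input by line, so an XML document that spans several lines would reach the script as several separate records.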

Running files on HDFS through a Python script via Pig

0 Answers: