How to read multiple HDFS or S3 files with mrjob?

Date: 2015-12-07 03:29:26

Tags: hadoop mrjob

I have a large amount of data stored in HDFS (or, alternatively, in Amazon S3).

I want to process it with mrjob.

Unfortunately, when I run mrjob and give it an HDFS file name, or the name of the directory containing the files, I get an error.

For example, my data is stored in the directory hdfs://user/hadoop/in1/. For testing, my single input file is hdfs://user/hadoop/in1/BCES_FY2014_clean.csv, but in production I will want to process multiple files.

The file exists:

$ hdfs dfs -ls /user/hadoop/in1/
Found 1 items
-rw-r--r--   1 hadoop hadoop    1771685 2015-12-07 03:05 /user/hadoop/in1/BCES_FY2014_clean.csv
$ 

But when I try to run mrjob on it, I get this error:

$ python mrjob_salary_max.py -r hadoop hdfs://user/hadoop/in1/BCES_FY2014_clean.csv 
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
STDERR: -ls: java.net.UnknownHostException: user
STDERR: Usage: hadoop fs [generic options] -ls [-d] [-h] [-R] [<path> ...]
Traceback (most recent call last):
  File "mrjob_salary_max.py", line 26, in <module>
    salarymax.run()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/job.py", line 461, in run
    mr_job.execute()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/job.py", line 479, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/launch.py", line 153, in execute
    self.run_job()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/launch.py", line 216, in run_job
    runner.run()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/runner.py", line 470, in run
    self._run()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/hadoop.py", line 233, in _run
    self._check_input_exists()
  File "/usr/local/lib/python2.6/site-packages/mrjob-0.4.6-py2.6.egg/mrjob/hadoop.py", line 249, in _check_input_exists
    'Input path %s does not exist!' % (path,))
AssertionError: Input path hdfs://user/hadoop/in1/BCES_FY2014_clean.csv does not exist!
$ 

mrjob works when it reads from the local filesystem, but that does not scale.

1 Answer:

Answer 0 (score: 0)

It turns out that a correct HDFS URL has three slashes after the colon. With only two slashes, as in hdfs://user/hadoop/in1/, the first path segment ("user") is parsed as the NameNode hostname, which is exactly what the STDERR line `java.net.UnknownHostException: user` is complaining about. An empty authority (hdfs:///) makes Hadoop fall back to the default NameNode from its configuration:

python mrjob_salary_max.py -r hadoop hdfs:///user/hadoop/in1/BCES_FY2014_clean.csv 

If you supply only a directory, it will read all of the files in that directory:

python mrjob_salary_max.py -r hadoop hdfs:///user/hadoop/in1/
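The two-slash vs. three-slash distinction can be sketched with a small, hypothetical helper (`normalize_hdfs_uri` is my own name, not part of mrjob or Hadoop) that rewrites a two-slash `hdfs://` path into the three-slash form. It assumes the URI never carries a real NameNode authority; a genuine `hdfs://namenode:8020/path` URI would be (wrongly) folded into the path by this sketch:

```python
from urllib.parse import urlparse


def normalize_hdfs_uri(uri):
    """Rewrite hdfs://a/b/c as hdfs:///a/b/c.

    In hdfs://user/hadoop/in1, urlparse (like Hadoop) treats 'user'
    as the authority, i.e. the NameNode hostname -- hence the
    UnknownHostException. With an empty authority (hdfs:///...),
    Hadoop uses the default NameNode from its configuration.
    Assumes the URI was never meant to name a real NameNode host.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "hdfs":
        return uri  # leave file://, s3://, etc. untouched
    if parsed.netloc:
        # Re-attach the would-be hostname as the first path segment.
        path = "/" + parsed.netloc + parsed.path
    else:
        path = parsed.path
    return "hdfs://" + path  # path starts with "/", giving hdfs:///
```

For instance, `normalize_hdfs_uri("hdfs://user/hadoop/in1/BCES_FY2014_clean.csv")` yields the working three-slash form used in the answer above, while already-correct URIs pass through unchanged.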