mrjob cannot find the input file

Asked: 2015-10-08 02:03:30

Tags: python python-2.7 hadoop mapreduce

I am using the Cloudera quickstart VM. This is my HDFS file structure:

[cloudera@quickstart pydoop]$ hdfs dfs -ls -R /input
drwxr-xr-x   - cloudera supergroup          0 2015-10-02 15:00 /input/test1
-rw-r--r--   1 cloudera supergroup         62 2015-10-02 15:00 /input/test1/file1.txt
drwxr-xr-x   - cloudera supergroup          0 2015-10-02 14:59 /input/test2
-rw-r--r--   1 cloudera supergroup    1428841 2015-10-02 14:59 /input/test2/5000-8.txt
-rw-r--r--   1 cloudera supergroup     674570 2015-10-02 14:59 /input/test2/pg20417.txt
-rw-r--r--   1 cloudera supergroup    1573151 2015-10-02 14:59 /input/test2/pg4300.txt

This is the command that runs the wordcount example:

python /home/cloudera/MapReduceCode/mrjob/wordcount1.py -r hadoop hdfs://input/test1/file1.txt

It crashes with the traceback below. It looks like it cannot find the file.

[cloudera@quickstart hadoop]$ python /home/cloudera/MapReduceCode/mrjob/wordcount1.py -r hadoop hdfs://input/test1/file1.txt
no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
Traceback (most recent call last):
  File "/home/cloudera/MapReduceCode/mrjob/wordcount1.py", line 13, in <module>
    MRWordCount.run()
  File "/usr/local/lib/python2.7/site-packages/mrjob/job.py", line 461, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/site-packages/mrjob/job.py", line 479, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.7/site-packages/mrjob/launch.py", line 153, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/site-packages/mrjob/launch.py", line 216, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/site-packages/mrjob/runner.py", line 470, in run
    self._run()
  File "/usr/local/lib/python2.7/site-packages/mrjob/hadoop.py", line 233, in _run
    self._check_input_exists()
  File "/usr/local/lib/python2.7/site-packages/mrjob/hadoop.py", line 247, in _check_input_exists
    if not self.path_exists(path):
  File "/usr/local/lib/python2.7/site-packages/mrjob/fs/composite.py", line 78, in path_exists
    return self._do_action('path_exists', path_glob)
  File "/usr/local/lib/python2.7/site-packages/mrjob/fs/composite.py", line 54, in _do_action
    return getattr(fs, action)(path, *args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/mrjob/fs/hadoop.py", line 212, in path_exists
    ok_stderr=[_HADOOP_LS_NO_SUCH_FILE])
  File "/usr/local/lib/python2.7/site-packages/mrjob/fs/hadoop.py", line 86, in invoke_hadoop
    proc = Popen(args, stdout=PIPE, stderr=PIPE)
  File "/usr/local/lib/python2.7/subprocess.py", line 709, in __init__
    errread, errwrite)
  File "/usr/local/lib/python2.7/subprocess.py", line 1326, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
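Note that the OSError at the bottom of this traceback is raised by subprocess.Popen itself, before any HDFS lookup happens: mrjob is trying to launch the hadoop client binary and cannot find it, so the "No such file or directory" refers to the executable, not the input path. A minimal reproduction of that failure mode:

```python
# Popen raises OSError (errno 2, "No such file or directory") when the
# executable itself cannot be found -- the hypothetical binary name below
# stands in for a 'hadoop' command that is missing from PATH.
import errno
import subprocess

try:
    subprocess.Popen(["hadoop-binary-that-does-not-exist", "fs", "-ls", "/"])
except OSError as exc:  # FileNotFoundError on Python 3, plain OSError on 2.7
    assert exc.errno == errno.ENOENT
    print("OSError: [Errno 2] No such file or directory")
```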

1 Answer:

Answer 0 (score: 4):

Follow these steps on the Cloudera Quickstart VM.

  1. Make sure HADOOP_HOME is set:

    export HADOOP_HOME=/usr/lib/hadoop

  2. Create a symlink to hadoop-streaming.jar:

    sudo ln -s /usr/lib/hadoop-mapreduce/hadoop-streaming.jar /usr/lib/hadoop

  3. Use hdfs:/// instead of hdfs:// (with hdfs://, the first path component is parsed as a hostname, not a directory):

    python /home/cloudera/MapReduceCode/mrjob/wordcount1.py -r hadoop hdfs:///input/test1/file1.txt

  4. Below is the complete mrjob run from my Cloudera quickstart VM.

    Note: the locations of wordcount1.py & file1.txt differ from yours, but that doesn't matter.

    [cloudera@quickstart ~]$ python wordcount1.py -r hadoop hdfs:///user/cloudera/file1.txt
    no configs found; falling back on auto-configuration
    no configs found; falling back on auto-configuration
    creating tmp directory /tmp/wordcount1.cloudera.20151011.115958.773999
    writing wrapper script to /tmp/wordcount1.cloudera.20151011.115958.773999/setup-wrapper.sh
    Using Hadoop version 2.6.0
    Copying local files into hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999/files/
    
    PLEASE NOTE: Starting in mrjob v0.5.0, protocols will be strict by default. It's recommended you run your job with --strict-protocols or set up mrjob.conf as described at https://pythonhosted.org/mrjob/whats-new.html#ready-for-strict-protocols
    
    HADOOP: packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.4.2.jar] /tmp/streamjob3860196653022444549.jar tmpDir=null
    HADOOP: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
    HADOOP: Connecting to ResourceManager at quickstart.cloudera/127.0.0.1:8032
    HADOOP: Total input paths to process : 1
    HADOOP: number of splits:2
    HADOOP: Submitting tokens for job: job_1444564543695_0003
    HADOOP: Submitted application application_1444564543695_0003
    HADOOP: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1444564543695_0003/
    HADOOP: Running job: job_1444564543695_0003
    HADOOP: Job job_1444564543695_0003 running in uber mode : false
    HADOOP:  map 0% reduce 0%
    HADOOP:  map 100% reduce 0%
    HADOOP:  map 100% reduce 100%
    HADOOP: Job job_1444564543695_0003 completed successfully
    HADOOP: Counters: 49
    HADOOP:     File System Counters
    HADOOP:         FILE: Number of bytes read=105
    HADOOP:         FILE: Number of bytes written=356914
    HADOOP:         FILE: Number of read operations=0
    HADOOP:         FILE: Number of large read operations=0
    HADOOP:         FILE: Number of write operations=0
    HADOOP:         HDFS: Number of bytes read=322
    HADOOP:         HDFS: Number of bytes written=32
    HADOOP:         HDFS: Number of read operations=9
    HADOOP:         HDFS: Number of large read operations=0
    HADOOP:         HDFS: Number of write operations=2
    HADOOP:     Job Counters 
    HADOOP:         Launched map tasks=2
    HADOOP:         Launched reduce tasks=1
    HADOOP:         Data-local map tasks=2
    HADOOP:         Total time spent by all maps in occupied slots (ms)=1164160
    HADOOP:         Total time spent by all reduces in occupied slots (ms)=350080
    HADOOP:         Total time spent by all map tasks (ms)=9095
    HADOOP:         Total time spent by all reduce tasks (ms)=2735
    HADOOP:         Total vcore-seconds taken by all map tasks=9095
    HADOOP:         Total vcore-seconds taken by all reduce tasks=2735
    HADOOP:         Total megabyte-seconds taken by all map tasks=1164160
    HADOOP:         Total megabyte-seconds taken by all reduce tasks=350080
    HADOOP:     Map-Reduce Framework
    HADOOP:         Map input records=5
    HADOOP:         Map output records=15
    HADOOP:         Map output bytes=153
    HADOOP:         Map output materialized bytes=152
    HADOOP:         Input split bytes=214
    HADOOP:         Combine input records=0
    HADOOP:         Combine output records=0
    HADOOP:         Reduce input groups=3
    HADOOP:         Reduce shuffle bytes=152
    HADOOP:         Reduce input records=15
    HADOOP:         Reduce output records=3
    HADOOP:         Spilled Records=30
    HADOOP:         Shuffled Maps =2
    HADOOP:         Failed Shuffles=0
    HADOOP:         Merged Map outputs=2
    HADOOP:         GC time elapsed (ms)=148
    HADOOP:         CPU time spent (ms)=1470
    HADOOP:         Physical memory (bytes) snapshot=428871680
    HADOOP:         Virtual memory (bytes) snapshot=2197188608
    HADOOP:         Total committed heap usage (bytes)=144179200
    HADOOP:     Shuffle Errors
    HADOOP:         BAD_ID=0
    HADOOP:         CONNECTION=0
    HADOOP:         IO_ERROR=0
    HADOOP:         WRONG_LENGTH=0
    HADOOP:         WRONG_MAP=0
    HADOOP:         WRONG_REDUCE=0
    HADOOP:     File Input Format Counters 
    HADOOP:         Bytes Read=108
    HADOOP:     File Output Format Counters 
    HADOOP:         Bytes Written=32
    HADOOP: Output directory: hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999/output
    Counters from step 1:
      (no counters found)
    Streaming final output from hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999/output
    "chars" 67
    "lines" 5
    "words" 16
    removing tmp directory /tmp/wordcount1.cloudera.20151011.115958.773999
    deleting hdfs:///user/cloudera/tmp/mrjob/wordcount1.cloudera.20151011.115958.773999 from HDFS
    [cloudera@quickstart ~]$
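The hdfs:/// vs hdfs:// distinction in step 3 is ordinary URI parsing: in hdfs://input/test1/file1.txt, "input" occupies the authority (host) slot, so the path becomes /test1/file1.txt on a nonexistent namenode named "input", while the empty authority in hdfs:/// keeps the full path. A quick check with Python's stdlib URL parser illustrates the split (assumption: mrjob's path handling follows the same URI convention):

```python
# Show how a URI parser splits the two forms of the input argument.
# With hdfs://, "input" lands in the host/authority position; with
# hdfs:///, the authority is empty and the whole path is preserved.
from urllib.parse import urlparse  # urlparse.urlparse on Python 2.7

wrong = urlparse("hdfs://input/test1/file1.txt")
print(wrong.netloc, wrong.path)   # input /test1/file1.txt

right = urlparse("hdfs:///input/test1/file1.txt")
print(right.netloc, right.path)   # (empty netloc) /input/test1/file1.txt
```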