Hadoop返回的结果少于预期

时间:2016-04-27 20:16:45

标签: python hadoop amazon-web-services emr

我有两个python脚本mapper和reducer(此时基本上是reducer只打印其他内容)而本地我得到4个结果 - 字符串 在hadoop我得到3.这是如何工作的?

我使用Hadoop使用Amazon Elastic Map Reduce

mapper.py

#!/usr/bin/env python

import sys
import re
import os
# Constants declaration

WINDOW = 10
OVERLAP = 4
START_POSITION = 0
END_POSITION = 0

# regular expressions

pattern = re.compile("[a-z]*", re.IGNORECASE)

a_to_f_pattern = re.compile("[a-f]", re.IGNORECASE)
g_to_l_pattern = re.compile("[g-l]", re.IGNORECASE)
m_to_r_pattern = re.compile("[m-r]", re.IGNORECASE)
s_to_z_pattern = re.compile("[s-z]", re.IGNORECASE)

# variables initialization

converted_word = ""
next_word = ""
new_character = ""
filename = ""
prev_filename = ""
i = 0



# Read pairs as lines of input from STDIN
for line in sys.stdin:

    line.strip()

    filename = os.environ['mapreduce_map_input_file']
    filename = filename.replace("s3://source123/input/","")


    # check if its a new file, and reset start position
    if filename != prev_filename:

        START_POSITION = 0
        next_word = ""
        converted_word = ""
        prev_filename = filename

    # loop through every word that matches the pattern
    for word in pattern.findall(line):


                new_character = convert(word)
                converted_word = converted_word + new_character

                if len(converted_word) > (WINDOW - OVERLAP):
                    next_word = next_word + new_character

                # print "word= ", word
                # print "converted_word= ", converted_word
            else:

                END_POSITION = START_POSITION + (len(converted_word) - 1)

                print converted_word + "," + str(filename) + "," + str(START_POSITION) + "," + str(END_POSITION)

                START_POSITION = START_POSITION + (WINDOW - OVERLAP)
                new_character = convert(word)
                converted_word = next_word + new_character

日志

2016-04-27 19:58:41,293 INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem (main): Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2016-04-27 19:58:41,512 INFO amazon.emr.metrics.MetricsSaver (main): MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: true maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1461784308237 
2016-04-27 19:58:41,512 INFO amazon.emr.metrics.MetricsSaver (main): Created MetricsSaver j-KCDMFZJGYO89:i-995f5a41:RunJar:16480 period:60 /mnt/var/em/raw/i-995f5a41_20160427_RunJar_16480_raw.bin
2016-04-27 19:58:43,477 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-38-52.us-west-2.compute.internal/172.31.38.52:8032
2016-04-27 19:58:43,673 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-38-52.us-west-2.compute.internal/172.31.38.52:8032
2016-04-27 19:58:44,156 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://source123/mapper.py' for reading
2016-04-27 19:58:44,267 INFO amazon.emr.metrics.MetricsSaver (main): Thread 1 created MetricsLockFreeSaver 1
2016-04-27 19:58:44,439 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://source123/source_reducer.py' for reading
2016-04-27 19:58:44,628 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2016-04-27 19:58:44,630 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev 426d94a07125cf9447bb0c2b336cf10b4c254375]
2016-04-27 19:58:45,046 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): listStatus s3://source123/input with recursive false
2016-04-27 19:58:45,265 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process : 1
2016-04-27 19:58:45,336 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:9
2016-04-27 19:58:45,565 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1461784297295_0004
2016-04-27 19:58:45,710 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1461784297295_0004
2016-04-27 19:58:45,743 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-172-31-38-52.us-west-2.compute.internal:20888/proxy/application_1461784297295_0004/
2016-04-27 19:58:45,744 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1461784297295_0004
2016-04-27 19:58:53,876 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1461784297295_0004 running in uber mode : false
2016-04-27 19:58:53,877 INFO org.apache.hadoop.mapreduce.Job (main):  map 0% reduce 0%
2016-04-27 19:59:11,063 INFO org.apache.hadoop.mapreduce.Job (main):  map 11% reduce 0%
2016-04-27 19:59:14,081 INFO org.apache.hadoop.mapreduce.Job (main):  map 22% reduce 0%
2016-04-27 19:59:16,094 INFO org.apache.hadoop.mapreduce.Job (main):  map 33% reduce 0%
2016-04-27 19:59:18,106 INFO org.apache.hadoop.mapreduce.Job (main):  map 56% reduce 0%
2016-04-27 19:59:19,114 INFO org.apache.hadoop.mapreduce.Job (main):  map 67% reduce 0%
2016-04-27 19:59:26,159 INFO org.apache.hadoop.mapreduce.Job (main):  map 78% reduce 0%
2016-04-27 19:59:29,178 INFO org.apache.hadoop.mapreduce.Job (main):  map 89% reduce 0%
2016-04-27 19:59:30,184 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 0%
2016-04-27 19:59:32,196 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 33%
2016-04-27 19:59:34,207 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 67%
2016-04-27 19:59:38,228 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 100%
2016-04-27 19:59:40,246 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1461784297295_0004 completed successfully
2016-04-27 19:59:40,409 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 55
    File System Counters
        FILE: Number of bytes read=190
        FILE: Number of bytes written=1541379
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=873
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
        S3: Number of bytes read=864
        S3: Number of bytes written=130
        S3: Number of read operations=0
        S3: Number of large read operations=0
        S3: Number of write operations=0
    Job Counters 
        Killed map tasks=1
        Launched map tasks=9
        Launched reduce tasks=3
        Data-local map tasks=9
        Total time spent by all maps in occupied slots (ms)=6351210
        Total time spent by all reduces in occupied slots (ms)=2449170
        Total time spent by all map tasks (ms)=141138
        Total time spent by all reduce tasks (ms)=27213
        Total vcore-milliseconds taken by all map tasks=141138
        Total vcore-milliseconds taken by all reduce tasks=27213
        Total megabyte-milliseconds taken by all map tasks=203238720
        Total megabyte-milliseconds taken by all reduce tasks=78373440
    Map-Reduce Framework
        Map input records=5
        Map output records=3
        Map output bytes=124
        Map output materialized bytes=562
        Input split bytes=873
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=562
        Reduce input records=3
        Reduce output records=6
        Spilled Records=6
        Shuffled Maps =27
        Failed Shuffles=0
        Merged Map outputs=27
        GC time elapsed (ms)=2785
        CPU time spent (ms)=11670
        Physical memory (bytes) snapshot=5282500608
        Virtual memory (bytes) snapshot=28472725504
        Total committed heap usage (bytes)=5977407488
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=864
    File Output Format Counters 
        Bytes Written=130
2016-04-27 19:59:40,409 INFO org.apache.hadoop.streaming.StreamJob (main): Output directory: s3://source123/output/

1 个答案:

答案 0 :(得分:2)

映射器任务将其输入转换为行并将行提供给进程的标准输入。

在这种情况下,您有多个输入文件,并且您假设来自不同文件的所有行都按顺序(即逐个文件)提供,但是可能在 parallel 中进行处理,因此映射器(获取几个输入文件)可能会重置其计数器,而不是顺序分发。