Amazon EMR cluster step succeeds but output is empty?

Time: 2016-04-27 18:50:41

Tags: python amazon-web-services amazon-s3 emr

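I have two scripts written for an Amazon EMR cluster. On my own computer they work and create a file, but after the step completes, my S3 bucket contains 3 part files and a success file, with 0 bytes written.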

This is mapper.py

#!/usr/bin/env python

import sys
import re

# Constants declaration

WINDOW = 10
OVERLAP = 4
START_POSITION = 0
END_POSITION = 0

# regular expressions

pattern = re.compile("[a-z]*", re.IGNORECASE)

a_to_f_pattern = re.compile("[a-f]", re.IGNORECASE)
g_to_l_pattern = re.compile("[g-l]", re.IGNORECASE)
m_to_r_pattern = re.compile("[m-r]", re.IGNORECASE)
s_to_z_pattern = re.compile("[s-z]", re.IGNORECASE)

# variables initialization

converted_word = ""
next_word = ""
new_character = ""
filename = ""
prev_filename = ""
i = 0

def convert(word):
    if a_to_f_pattern.match(word[0]):
        # print "found match!first pattern!", word[i], filename
        new_character = "A"
    elif g_to_l_pattern.match(word[0]):
        # print "found match!second pattern!", word[i], filename
        new_character = "C"
    elif m_to_r_pattern.match(word[0]):
        # print "found match!third pattern!", word[i], filename
        new_character = "G"
    elif s_to_z_pattern.match(word[0]):
        # print "found match!fourth pattern!", word[i], filename
        new_character = "T"
    return new_character

# Read pairs as lines of input from STDIN
for line in sys.stdin:

    # str.strip() returns a new string, so the result must be assigned
    line = line.strip()

    if ":" in line:
        filename, line = line.split(':',1)
        filename = filename.replace("source_text//", "")
        filename = filename.replace("suspicious_text//", "")

    # initialize prev_filename
    if prev_filename == "":
        prev_filename = filename

    # check if its a new file, and reset start position
    if filename != prev_filename:

        START_POSITION = 0
        next_word = ""
        converted_word = ""
        prev_filename = filename

    # loop through every word that matches the pattern
    for word in pattern.findall(line):
        while i < 1 and len(word) > 0:
            if len(converted_word) != WINDOW:

                new_character = convert(word)
                converted_word = converted_word + new_character

                if len(converted_word) > (WINDOW - OVERLAP):
                    next_word = next_word + new_character

                # print "word= ", word
                # print "converted_word= ", converted_word
            else:

                END_POSITION = START_POSITION + (len(converted_word) - 1)

                print converted_word + "," + str(filename) + "," + str(START_POSITION) + "," + str(END_POSITION)

                START_POSITION = START_POSITION + (WINDOW - OVERLAP)
                new_character = convert(word)
                converted_word = next_word + new_character
                # print "word= ", word
                # print "converted_word= ", converted_word
                next_word = ""
            # increment
            i = i + 1
        # reset value in order to be used for next word
        i = 0

This is reducer.py

#!/usr/bin/env python
import sys

for line in sys.stdin:
    # line = line.strip()
    with open('index.txt', 'a') as index_file:
        index_file.write(line)
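
One thing worth checking in the reducer above: it appends every record to a local file named index.txt, whereas Hadoop Streaming collects a reducer's output from its standard output. A file opened with a relative path is created on whichever node runs the task, so it would most likely never reach the S3 output directory. The following is a minimal sketch of a pass-through reducer that writes to stdout instead, assuming the goal is simply to have every mapper record land in the job output. The helper name reduce_stream is hypothetical and not part of the original script.

```python
#!/usr/bin/env python
import sys


def reduce_stream(lines, out):
    """Copy every incoming record to `out` and return how many were written.

    `reduce_stream` is a hypothetical helper introduced for this sketch.
    """
    count = 0
    for line in lines:
        # Hadoop Streaming reads the reducer's stdout, so records are
        # written there rather than to a local file such as index.txt.
        out.write(line)
        count += 1
    return count


if __name__ == "__main__":
    reduce_stream(sys.stdin, sys.stdout)
```

With a reducer of this shape, the records should appear in the part files under the step's output directory rather than in a file on the task node.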

This is the log of the EMR step

2016-04-27 18:42:19,450 INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem (main): Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2016-04-27 18:42:19,694 INFO amazon.emr.metrics.MetricsSaver (main): MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: true maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1461781964683 
2016-04-27 18:42:19,694 INFO amazon.emr.metrics.MetricsSaver (main): Created MetricsSaver j-2IJ5EQ9RVEI01:i-364d48ee:RunJar:08615 period:60 /mnt/var/em/raw/i-364d48ee_20160427_RunJar_08615_raw.bin
2016-04-27 18:42:21,683 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-35-145.us-west-2.compute.internal/172.31.35.145:8032
2016-04-27 18:42:21,871 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-35-145.us-west-2.compute.internal/172.31.35.145:8032
2016-04-27 18:42:22,434 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://source123/mapper.py' for reading
2016-04-27 18:42:22,553 INFO amazon.emr.metrics.MetricsSaver (main): Thread 1 created MetricsLockFreeSaver 1
2016-04-27 18:42:22,740 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://source123/source_reducer.py' for reading
2016-04-27 18:42:22,936 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2016-04-27 18:42:22,939 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev 426d94a07125cf9447bb0c2b336cf10b4c254375]
2016-04-27 18:42:23,285 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): listStatus s3://source123/input with recursive false
2016-04-27 18:42:23,505 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process : 1
2016-04-27 18:42:23,601 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:9
2016-04-27 18:42:25,047 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1461781956121_0002
2016-04-27 18:42:25,361 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1461781956121_0002
2016-04-27 18:42:25,407 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-172-31-35-145.us-west-2.compute.internal:20888/proxy/application_1461781956121_0002/
2016-04-27 18:42:25,408 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1461781956121_0002
2016-04-27 18:42:34,512 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1461781956121_0002 running in uber mode : false
2016-04-27 18:42:34,514 INFO org.apache.hadoop.mapreduce.Job (main):  map 0% reduce 0%
2016-04-27 18:42:51,654 INFO org.apache.hadoop.mapreduce.Job (main):  map 11% reduce 0%
2016-04-27 18:42:54,675 INFO org.apache.hadoop.mapreduce.Job (main):  map 22% reduce 0%
2016-04-27 18:42:56,689 INFO org.apache.hadoop.mapreduce.Job (main):  map 33% reduce 0%
2016-04-27 18:42:59,708 INFO org.apache.hadoop.mapreduce.Job (main):  map 56% reduce 0%
2016-04-27 18:43:00,714 INFO org.apache.hadoop.mapreduce.Job (main):  map 67% reduce 0%
2016-04-27 18:43:08,760 INFO org.apache.hadoop.mapreduce.Job (main):  map 89% reduce 0%
2016-04-27 18:43:10,771 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 0%
2016-04-27 18:43:12,783 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 33%
2016-04-27 18:43:16,805 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 67%
2016-04-27 18:43:19,820 INFO org.apache.hadoop.mapreduce.Job (main):  map 100% reduce 100%
2016-04-27 18:43:20,833 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1461781956121_0002 completed successfully
2016-04-27 18:43:20,982 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 55
    File System Counters
        FILE: Number of bytes read=114
        FILE: Number of bytes written=1541394
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=873
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
        S3: Number of bytes read=864
        S3: Number of bytes written=0
        S3: Number of read operations=0
        S3: Number of large read operations=0
        S3: Number of write operations=0
    Job Counters 
        Killed map tasks=1
        Launched map tasks=9
        Launched reduce tasks=3
        Data-local map tasks=9
        Total time spent by all maps in occupied slots (ms)=6477030
        Total time spent by all reduces in occupied slots (ms)=2225520
        Total time spent by all map tasks (ms)=143934
        Total time spent by all reduce tasks (ms)=24728
        Total vcore-milliseconds taken by all map tasks=143934
        Total vcore-milliseconds taken by all reduce tasks=24728
        Total megabyte-milliseconds taken by all map tasks=207264960
        Total megabyte-milliseconds taken by all reduce tasks=71216640
    Map-Reduce Framework
        Map input records=5
        Map output records=3
        Map output bytes=52
        Map output materialized bytes=489
        Input split bytes=873
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=489
        Reduce input records=3
        Reduce output records=0
        Spilled Records=6
        Shuffled Maps =27
        Failed Shuffles=0
        Merged Map outputs=27
        GC time elapsed (ms)=2100
        CPU time spent (ms)=11850
        Physical memory (bytes) snapshot=5267861504
        Virtual memory (bytes) snapshot=28408479744
        Total committed heap usage (bytes)=6102188032
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=864
    File Output Format Counters 
        Bytes Written=0
2016-04-27 18:43:20,983 INFO org.apache.hadoop.streaming.StreamJob (main): Output directory: s3://source123/output/

What am I doing wrong here?
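
The counters in the step log above may help narrow down where the records disappear: the mapper emitted 3 records and the reducers received 3, but "Reduce output records" is 0. The sketch below pulls those values out of a log excerpt so they can be compared side by side. The excerpt is copied from the log above, while the function name parse_counters and the regular expression are assumptions made for illustration.

```python
import re

# Counter lines copied from the step log above.
LOG_EXCERPT = """\
        Map input records=5
        Map output records=3
        Reduce input records=3
        Reduce output records=0
"""


def parse_counters(text):
    """Return a dict mapping counter names to integer values.

    Only lines of the form 'Name=123' are picked up; other log lines
    are ignored. `parse_counters` is a hypothetical helper.
    """
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)=(\d+)\s*$", line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


counters = parse_counters(LOG_EXCERPT)

# Records reached the reducers but none were emitted by them.
lost_in_reduce = (counters["Reduce input records"] > 0
                  and counters["Reduce output records"] == 0)
```

If lost_in_reduce is true, as it is for this log, the map and shuffle phases delivered data and the place to look is what the reducer does with each record.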

0 answers:

No answers yet