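I have two scripts written for an Amazon EMR cluster. On my computer they work and create one file, but after the step completes my S3 bucket contains 3 part files and a success file, all with 0 bytes written.
Here is mapper.py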
#!/usr/bin/env python
import sys
import re
# Constants declaration
WINDOW = 10
OVERLAP = 4
START_POSITION = 0
END_POSITION = 0
# regular expressions
pattern = re.compile("[a-z]*", re.IGNORECASE)
a_to_f_pattern = re.compile("[a-f]", re.IGNORECASE)
g_to_l_pattern = re.compile("[g-l]", re.IGNORECASE)
m_to_r_pattern = re.compile("[m-r]", re.IGNORECASE)
s_to_z_pattern = re.compile("[s-z]", re.IGNORECASE)
# variables initialization
converted_word = ""
next_word = ""
new_character = ""
filename = ""
prev_filename = ""
i = 0
def convert(word):
    if a_to_f_pattern.match(word[0]):
        # print "found match!first pattern!", word[i], filename
        new_character = "A"
    elif g_to_l_pattern.match(word[0]):
        # print "found match!second pattern!", word[i], filename
        new_character = "C"
    elif m_to_r_pattern.match(word[0]):
        # print "found match!third pattern!", word[i], filename
        new_character = "G"
    elif s_to_z_pattern.match(word[0]):
        # print "found match!fourth pattern!", word[i], filename
        new_character = "T"
    return new_character
# Read pairs as lines of input from STDIN
for line in sys.stdin:
    # str.strip() returns a new string, so the result has to be assigned
    line = line.strip()
    if ":" in line:
        filename, line = line.split(':',1)
        filename = filename.replace("source_text//", "")
        filename = filename.replace("suspicious_text//", "")
        # initialize prev_filename
        if prev_filename == "":
            prev_filename = filename
        # check if its a new file, and reset start position
        if filename != prev_filename:
            START_POSITION = 0
            next_word = ""
            converted_word = ""
            prev_filename = filename
        # loop through every word that matches the pattern
        for word in pattern.findall(line):
            while i < 1 and len(word) > 0:
                if len(converted_word) != WINDOW:
                    new_character = convert(word)
                    converted_word = converted_word + new_character
                    if len(converted_word) > (WINDOW - OVERLAP):
                        next_word = next_word + new_character
                    # print "word= ", word
                    # print "converted_word= ", converted_word
                else:
                    END_POSITION = START_POSITION + (len(converted_word) - 1)
                    print converted_word + "," + str(filename) + "," + str(START_POSITION) + "," + str(END_POSITION)
                    START_POSITION = START_POSITION + (WINDOW - OVERLAP)
                    new_character = convert(word)
                    converted_word = next_word + new_character
                    # print "word= ", word
                    # print "converted_word= ", converted_word
                    next_word = ""
                # increment
                i = i + 1
            # reset value in order to be used for next word
            i = 0
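For anyone who wants to check the encoding logic separately from Hadoop, here is a minimal Python 3 sketch of what the mapper appears to do: map the first letter of each word to A, C, G or T, then cut the resulting string into windows of 10 characters that overlap by 4. The names `letter_to_base`, `encode_words` and `windows` are made up for this illustration and do not appear in mapper.py. One simplification to be aware of: the mapper only prints a full window once a further word arrives, whereas this sketch yields every full window.

```python
def letter_to_base(word):
    """Map the first letter of `word` to A, C, G or T (same ranges as convert())."""
    first = word[0].lower()
    if "a" <= first <= "f":
        return "A"
    if "g" <= first <= "l":
        return "C"
    if "m" <= first <= "r":
        return "G"
    if "s" <= first <= "z":
        return "T"
    raise ValueError("not an ASCII letter: %r" % first)


def encode_words(words):
    """Encode a list of words, skipping empty strings like the mapper does."""
    return "".join(letter_to_base(w) for w in words if w)


def windows(encoded, window=10, overlap=4):
    """Yield (chunk, start, end) for every full window; the step is window - overlap."""
    step = window - overlap
    for start in range(0, len(encoded) - window + 1, step):
        yield encoded[start:start + window], start, start + window - 1


if __name__ == "__main__":
    encoded = encode_words(["the", "quick", "brown", "fox"])
    print(encoded)
    for chunk, start, end in windows("ACGTACGTACGTACGTAC"):
        print("%s,%d,%d" % (chunk, start, end))
```

Running a few known words through `encode_words` locally makes it easier to tell whether an empty result on the cluster comes from the encoding or from how the job collects its output.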
Here is reducer.py
#!/usr/bin/env python
import sys
for line in sys.stdin:
    # line = line.strip()
    with open('index.txt', 'a') as index_file:
        index_file.write(line)
Here is the log from the EMR step
2016-04-27 18:42:19,450 INFO com.amazon.ws.emr.hadoop.fs.EmrFileSystem (main): Consistency disabled, using com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem as filesystem implementation
2016-04-27 18:42:19,694 INFO amazon.emr.metrics.MetricsSaver (main): MetricsConfigRecord disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec: 60 disableClusterEngine: true maxMemoryMb: 3072 maxInstanceCount: 500 lastModified: 1461781964683
2016-04-27 18:42:19,694 INFO amazon.emr.metrics.MetricsSaver (main): Created MetricsSaver j-2IJ5EQ9RVEI01:i-364d48ee:RunJar:08615 period:60 /mnt/var/em/raw/i-364d48ee_20160427_RunJar_08615_raw.bin
2016-04-27 18:42:21,683 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-35-145.us-west-2.compute.internal/172.31.35.145:8032
2016-04-27 18:42:21,871 INFO org.apache.hadoop.yarn.client.RMProxy (main): Connecting to ResourceManager at ip-172-31-35-145.us-west-2.compute.internal/172.31.35.145:8032
2016-04-27 18:42:22,434 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://source123/mapper.py' for reading
2016-04-27 18:42:22,553 INFO amazon.emr.metrics.MetricsSaver (main): Thread 1 created MetricsLockFreeSaver 1
2016-04-27 18:42:22,740 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): Opening 's3://source123/source_reducer.py' for reading
2016-04-27 18:42:22,936 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2016-04-27 18:42:22,939 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev 426d94a07125cf9447bb0c2b336cf10b4c254375]
2016-04-27 18:42:23,285 INFO com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem (main): listStatus s3://source123/input with recursive false
2016-04-27 18:42:23,505 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process : 1
2016-04-27 18:42:23,601 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): number of splits:9
2016-04-27 18:42:25,047 INFO org.apache.hadoop.mapreduce.JobSubmitter (main): Submitting tokens for job: job_1461781956121_0002
2016-04-27 18:42:25,361 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl (main): Submitted application application_1461781956121_0002
2016-04-27 18:42:25,407 INFO org.apache.hadoop.mapreduce.Job (main): The url to track the job: http://ip-172-31-35-145.us-west-2.compute.internal:20888/proxy/application_1461781956121_0002/
2016-04-27 18:42:25,408 INFO org.apache.hadoop.mapreduce.Job (main): Running job: job_1461781956121_0002
2016-04-27 18:42:34,512 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1461781956121_0002 running in uber mode : false
2016-04-27 18:42:34,514 INFO org.apache.hadoop.mapreduce.Job (main): map 0% reduce 0%
2016-04-27 18:42:51,654 INFO org.apache.hadoop.mapreduce.Job (main): map 11% reduce 0%
2016-04-27 18:42:54,675 INFO org.apache.hadoop.mapreduce.Job (main): map 22% reduce 0%
2016-04-27 18:42:56,689 INFO org.apache.hadoop.mapreduce.Job (main): map 33% reduce 0%
2016-04-27 18:42:59,708 INFO org.apache.hadoop.mapreduce.Job (main): map 56% reduce 0%
2016-04-27 18:43:00,714 INFO org.apache.hadoop.mapreduce.Job (main): map 67% reduce 0%
2016-04-27 18:43:08,760 INFO org.apache.hadoop.mapreduce.Job (main): map 89% reduce 0%
2016-04-27 18:43:10,771 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 0%
2016-04-27 18:43:12,783 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 33%
2016-04-27 18:43:16,805 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 67%
2016-04-27 18:43:19,820 INFO org.apache.hadoop.mapreduce.Job (main): map 100% reduce 100%
2016-04-27 18:43:20,833 INFO org.apache.hadoop.mapreduce.Job (main): Job job_1461781956121_0002 completed successfully
2016-04-27 18:43:20,982 INFO org.apache.hadoop.mapreduce.Job (main): Counters: 55
    File System Counters
        FILE: Number of bytes read=114
        FILE: Number of bytes written=1541394
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=873
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=9
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
        S3: Number of bytes read=864
        S3: Number of bytes written=0
        S3: Number of read operations=0
        S3: Number of large read operations=0
        S3: Number of write operations=0
    Job Counters
        Killed map tasks=1
        Launched map tasks=9
        Launched reduce tasks=3
        Data-local map tasks=9
        Total time spent by all maps in occupied slots (ms)=6477030
        Total time spent by all reduces in occupied slots (ms)=2225520
        Total time spent by all map tasks (ms)=143934
        Total time spent by all reduce tasks (ms)=24728
        Total vcore-milliseconds taken by all map tasks=143934
        Total vcore-milliseconds taken by all reduce tasks=24728
        Total megabyte-milliseconds taken by all map tasks=207264960
        Total megabyte-milliseconds taken by all reduce tasks=71216640
    Map-Reduce Framework
        Map input records=5
        Map output records=3
        Map output bytes=52
        Map output materialized bytes=489
        Input split bytes=873
        Combine input records=0
        Combine output records=0
        Reduce input groups=3
        Reduce shuffle bytes=489
        Reduce input records=3
        Reduce output records=0
        Spilled Records=6
        Shuffled Maps =27
        Failed Shuffles=0
        Merged Map outputs=27
        GC time elapsed (ms)=2100
        CPU time spent (ms)=11850
        Physical memory (bytes) snapshot=5267861504
        Virtual memory (bytes) snapshot=28408479744
        Total committed heap usage (bytes)=6102188032
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=864
    File Output Format Counters
        Bytes Written=0
2016-04-27 18:43:20,983 INFO org.apache.hadoop.streaming.StreamJob (main): Output directory: s3://source123/output/
What am I doing wrong here?
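One thing worth checking, going only by the counters in the log: Hadoop Streaming builds the job output from whatever the reducer writes to standard output, and the log reports "Reduce input records=3" but "Reduce output records=0" and "Bytes Written=0". That would be consistent with reducer.py appending to a local index.txt on each reducer node instead of printing anything. A minimal sketch of a reducer that echoes its input to stdout, assuming that is the intended behavior (the function name `run_reducer` is illustrative, not part of the original script):

```python
#!/usr/bin/env python
import sys


def run_reducer(lines, out):
    """Copy every input line to `out` and return how many lines were written."""
    count = 0
    for line in lines:
        # Writing to stdout (not a local file) is what lets Hadoop Streaming
        # collect the records into the part files of the output directory.
        out.write(line)
        count += 1
    return count


if __name__ == "__main__":
    run_reducer(sys.stdin, sys.stdout)
```

The three part files match "Launched reduce tasks=3", since each reduce task writes its own part file. If a single output file is needed, running the job with one reducer (for example by setting the `mapreduce.job.reduces` property to 1) is one option to try.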