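I am currently testing a Python script on an EMR instance using the Boto package.
The script reads each line of a file named fileC, compares it with the lines in fileAC, and writes the lines that pass the filter to a separate file. Because I am comparing two large files, I added an intermediate filter based on a third file, named fileA, to save time.
The problem is this: when I test on my local machine, the script works fine and filters more than 200 lines out of fileC. But as soon as I run it on AWS with the Boto package, the script does not filter fileC at all (p = 0, "found" is never displayed, and the output has the same number of lines as fileC). It looks as if the two files fileA and fileAC are not being read. With boto, I used the "cache_files=" option of "StreamingStep(" to distribute the two files (fileA and fileAC) to every node of the cluster. This worked for other scripts, but not here. Any ideas?
Here is the script: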
import fileinput
import os
import sys

sys.path.append(os.path.dirname(__file__))

def main(argv):
    filenameAC = 'activities.log'
    filenameA = 'activitiesCookieCountry.log'
    # fileC comes from the files named on the command line, or from stdin if none are given
    fileC = fileinput.FileInput(sys.argv[1:])
    fileA = open(filenameA, 'r')
    fileAC = open(filenameAC, 'r')
    # Pre-filter: keys listed in fileA, held in a set for fast lookup
    fileA = [line.rstrip('\n') for line in fileA]
    Alines = set(fileA)
    for lineC in fileC:
        fieldC = lineC.split('#')
        fieldComp = fieldC[0] + '#' + fieldC[2]
        p = 0
        if fieldComp in Alines:
            # Rescan fileAC from the start for every candidate line
            fileAC.seek(0)
            for lineAC in fileAC:
                fieldAC = lineAC.split('#')
                if (fieldAC[0] == fieldC[0]) and (fieldAC[2] == fieldC[2]) and (fieldAC[1] < fieldC[1]):
                    p = 1
                    print('found')
        # Keep the line only if no matching earlier entry was found
        if p == 0:
            sys.stdout.write(lineC)

if __name__ == "__main__":
    main(sys.argv)
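To make the intended filtering rule easier to check without any files, here is a minimal in-memory sketch of it. It is not the script above: `filter_clicks` is a hypothetical helper name, it skips the fileA pre-filter (assuming that pre-filter only exists for speed), and it keeps the earliest field 1 per key instead of rescanning fileAC. Field 1 is still compared as a string, as in the script.

```python
def filter_clicks(click_lines, activity_lines, sep='#'):
    """Return the click lines that have no earlier activity with the same key.

    The key is (field 0, field 2). Field 1 is compared as a string,
    like the `<` comparison in the original script.
    """
    # Smallest field-1 value seen for each (field 0, field 2) key.
    earliest = {}
    for line in activity_lines:
        fields = line.rstrip('\n').split(sep)
        key = (fields[0], fields[2])
        if key not in earliest or fields[1] < earliest[key]:
            earliest[key] = fields[1]

    kept = []
    for line in click_lines:
        fields = line.rstrip('\n').split(sep)
        key = (fields[0], fields[2])
        # Drop the click only when some activity with the same key precedes it.
        if key in earliest and earliest[key] < fields[1]:
            continue
        kept.append(line)
    return kept
```

If this returns the expected lines on a small sample, the comparison logic itself is probably fine and the problem is more likely in how the files reach the task.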
And here is the script that launches it on EMR:
from boto.emr.step import StreamingStep

# conn is assumed to be an already opened boto EMR connection (not shown here)

# Files to distribute to the tasks, in the form <S3 source>#<local name>
utils = [
    's3n://blablabla/activities.log#activities.log',
    's3n://blablabla/Activities/activitiesCookieCountry.log#activitiesCookieCountry.log',
]
sargs = [
    '-jobconf', 'mapred.output.compress=true',
    '-jobconf', 'mapred.output.compression.type=block',
    '-jobconf', 'mapred.compress.map.output=true',
    '-jobconf', 'stream.map.output.field.separator="#"',
    '-jobconf', 'mapred.reduce.tasks="1"',
]
cACstep = StreamingStep(
    name='ClickActivityCheck',
    mapper='s3n://blablabla/click-formatting-ACheck-S3.py',
    reducer=None,
    input='s3n://blablabla/ClickCleanedFeb14/*.gz',
    output='s3n://blablabla/ClickCleaned2Feb14',
    cache_files=utils,
    step_args=sargs
)
jobid = conn.run_jobflow(
    name='AWS_Flow_Test',
    log_uri='s3n://blablabla/Logging/jobflow_logs',
    ec2_keyname='xxxx',
    availability_zone=None,
    master_instance_type='m1.small',
    slave_instance_type='m1.small',
    num_instances=4,
    action_on_failure=None,
    keep_alive=False,
    enable_debugging=True,
    hadoop_version='1.0.3',
    steps=[cACstep],
    bootstrap_actions=[],
    instance_groups=None,
    additional_info=None,
    ami_version='2.4.1',
    api_params=None,
    visible_to_all_users=True)
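As far as I understand, each cache_files entry has the form source#name, and the part after the "#" is the name under which Hadoop's distributed cache makes the file visible in the task's working directory. The sketch below only checks that the names in the launcher match the names the mapper opens. `split_cache_uri` is a hypothetical helper, not part of boto.

```python
def split_cache_uri(uri):
    """Split 's3n://bucket/key#name' into (source, local_name)."""
    source, sep, local_name = uri.partition('#')
    if not sep or not local_name:
        raise ValueError('cache file URI has no #fragment: %r' % uri)
    return source, local_name

utils = [
    's3n://blablabla/activities.log#activities.log',
    's3n://blablabla/Activities/activitiesCookieCountry.log#activitiesCookieCountry.log',
]
# Names the mapper should be able to open relative to its working directory.
local_names = [split_cache_uri(u)[1] for u in utils]
```

Here the local names are the same two file names the mapper script opens, so a mismatch between launcher and mapper names does not seem to be the cause.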
I suspect that the two files should not be referenced like this:
filenameAC = 'activities.log'
filenameA = 'activitiesCookieCountry.log'
But I really don't know what is wrong...
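One way I could check this suspicion is to have the mapper report whether the two files are visible from its working directory. In Hadoop streaming the mapper's stdout is the map output, so diagnostics (including the 'found' message) have to go to stderr to show up in the task logs instead of the result files. The sketch below assumes that; `cache_file_report` and `log_cache_files` are hypothetical helper names.

```python
import os
import sys

def cache_file_report(names, directory='.'):
    """Return one human-readable line per expected file, plus the directory."""
    report = ['cwd=%s' % os.path.abspath(directory)]
    for name in names:
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # follows symlinks
            report.append('%s: found, %d bytes' % (name, os.path.getsize(path)))
        else:
            report.append('%s: MISSING' % name)
    return report

def log_cache_files(names, stream=sys.stderr):
    """Write the report to stderr so it does not end up in the map output."""
    for line in cache_file_report(names):
        stream.write(line + '\n')
```

Calling `log_cache_files(['activities.log', 'activitiesCookieCountry.log'])` at the top of `main` should show in the task attempt's stderr log whether the files are present and non-empty.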