Concatenating S3 files to read in EMR

Posted: 2013-05-02 23:06:36

Tags: hadoop amazon-web-services amazon-s3 elastic-map-reduce emr

I have an S3 bucket with log files that I want to concatenate and then use as the input to an EMR job. The log files live at paths like: bucket-name/[date]/product/out/[hour]/[minute-based-file]. I want to take all of the minute logs in all of the hour directories across all of the date directories and concatenate them into a single file, then use that file as the input to the EMR job. The original log files need to be preserved; the new combined log file can be written to a different S3 bucket.

I tried SSHing to the EMR master node and using hadoop fs -getmerge, but I got this error:

This file system object (file:///) does not support access to the request path 's3://target-bucket-name/merged.log'
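
As far as I can tell, getmerge writes its output to the local filesystem of the node it runs on, so the destination has to be a local path rather than an s3:// URI; the workaround seems to be to merge into a local file first and then push that file back to S3 (bucket names and paths below are purely illustrative):

hadoop fs -getmerge s3n://bucket-name/2012-09-01/product/out/00/ /mnt/tmp/00.log
hadoop fs -put /mnt/tmp/00.log s3://target-bucket-name/merged/00.log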

The source S3 bucket also contains some other files, so I don't want to include everything. The wildcard match I want looks like this: s3n://bucket-name/*/product/out/*/log.*
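
For what it's worth, hadoop fs generally expands globs like this itself, so a quick way to sanity-check which objects the pattern picks up (same hypothetical bucket and layout as above) is:

hadoop fs -ls 's3n://bucket-name/*/product/out/*/log.*'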

The goal is to avoid feeding EMR tens/hundreds of thousands of small (10 KB-3 MB) input files, and instead give it one large file that it can split more efficiently.

1 Answer:

Answer 0 (score: 3):

I ended up just writing a script that wraps a few Hadoop filesystem commands to do this.

#!/usr/bin/env ruby

require 'date'

# Merge minute-based log files into daily log files
# Usage: Run on EMR master (e.g. SSH to master then `ruby ~/merge-historical-logs.rb [FROM [TO]]`)

SOURCE_BUCKET_NAME      = 's3-logs-bucket'
DESTINATION_BUCKET_NAME = 's3-merged-logs-bucket'

# Optional date inputs
min_date = if ARGV[0]
  min_date_args = ARGV[0].split('-').map {|item| item.to_i}
  Date.new(*min_date_args)
else
  Date.new(2012, 9, 1)
end

max_date = if ARGV[1]
  max_date_args = ARGV[1].split('-').map {|item| item.to_i}
  Date.new(*max_date_args)
else
  Date.today
end

# Setup directories
hdfs_logs_dir = '/mnt/tmp/logs'
local_tmp_dir = './_tmp_merges'

puts "Cleaning up filesystem"
system "hadoop fs -rmr #{hdfs_logs_dir}"
system "rm -rf #{local_tmp_dir}*"

puts "Making HDFS directories"
system "hadoop fs -mkdir #{hdfs_logs_dir}"

# We will progress backwards, from max to min
date = max_date
while date >= min_date
  # Format date pieces
  year  = date.year
  month = "%02d" % date.month
  day   = "%02d" % date.day

  # Make a directory in HDFS to store this day's hourly logs
  today_hours_dir = "#{hdfs_logs_dir}/#{year}-#{month}-#{day}"
  puts "Making today's hourly directory"
  system "hadoop fs -mkdir #{today_hours_dir}"

  # Break the day's hours into a few chunks
  # This seems to avoid some problems when we run lots of getmerge commands in parallel
  (0..23).each_slice(8) do |hour_chunk|
    hour_chunk.each do |_hour|
      hour = "%02d" % _hour

      # Setup args to merge minute logs into hour logs
      source_file = "s3://#{SOURCE_BUCKET_NAME}/#{year}-#{month}-#{day}/product/out/#{hour}/"
      output_file = "#{local_tmp_dir}/#{hour}.log"

      # Launch each hour's getmerge in the background
      full_command = "hadoop fs -getmerge #{source_file} #{output_file}"
      puts "Forking: #{full_command}"
      fork { system full_command }
    end

    # Wait for this batch of getmerges to finish
    Process.waitall
  end

  # Delete the local temp files Hadoop created
  puts "Removing temp files"
  system "rm #{local_tmp_dir}/.*.crc"

  # Move local hourly logs to hdfs to free up local space
  puts "Moving local logs to HDFS"
  system "hadoop fs -put #{local_tmp_dir}/* #{today_hours_dir}"

  puts "Removing local logs"
  system "rm -rf #{local_tmp_dir}"

  # Merge the day's hourly logs into a single daily log file
  daily_log_file_name = "#{year}-#{month}-#{day}.log"
  daily_log_file_path = "#{local_tmp_dir}_day/#{daily_log_file_name}"
  puts "Merging hourly logs into daily log"
  system "hadoop fs -getmerge #{today_hours_dir}/ #{daily_log_file_path}"

  # Write the daily log file to another s3 bucket
  puts "Writing daily log to s3"
  system "hadoop fs -put #{daily_log_file_path} s3://#{DESTINATION_BUCKET_DIR}/daily-merged-logs/#{daily_log_file_name}"

  # Remove daily log locally
  puts "Removing local daily logs"
  system "rm -rf #{local_tmp_dir}_day"

  # Remove the hourly logs from HDFS
  puts "Removing HDFS hourly logs"
  system "hadoop fs -rmr #{today_hours_dir}"

  # Go back in time
  date -= 1
end
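
To merge a specific range, SSH to the EMR master and pass optional FROM and TO dates in YYYY-MM-DD form (the dates below are just examples):

ruby ~/merge-historical-logs.rb 2012-09-01 2013-05-01

With no arguments the script walks backwards one day at a time from today to the default start date of 2012-09-01.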