How do I map/reduce with a sum and a max date?

Asked: 2015-01-21 22:02:20

Tags: python ruby hadoop mapreduce hadoop-streaming

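I have a file that I need to map/reduce, and the output needs both a sum and the maximum date. I have the sum part working, but I don't know how to include the max date in the reducer output.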

The input data looks like this:

ID1,  ID2, date,                count
3000, 001, 2014-12-30 18:00:00, 2
3000, 001, 2015-01-01 10:00:00, 1
3000, 002, 2014-11-18 12:53:00, 5
3000, 002, 2014-12-20 20:14:00, 3

My mapper concatenates ID1 + ID2 so that the records are grouped by that pair. Its output looks like this:

key (ID1|ID2), value (count)
3000|001,      2
3000|001,      1
3000|002,      5
3000|002,      3

The reducer output looks like this:

key (ID1|ID2), value (sum)
3000|001,      3
3000|002,      8

What I actually need is output like this:

key (ID1|ID2), value (sum), date (max)
3000|001,      3,           2015-01-01 10:00:00
3000|002,      8,           2014-12-20 20:14:00

The mapper and reducer are written in Ruby, but I'll happily take a working example written in Python (I'll translate it to Ruby).

Here is the mapper code:

require 'csv'

pattern = File.join(File.expand_path('data', File.dirname(__FILE__)), '*.txt')

Dir.glob(pattern).each do |file|
  # Options are passed as keyword arguments so this also works on Ruby 3+
  CSV.foreach(file, col_sep: "\t", headers: false) do |row|
    puts [
           "#{row[6]}|#{row[3].rjust(8, '0')}", # key = ID1 | ID2
           row[7] # value = count
         ].join("\t")
  end
end

Reducer:

prev_key  = nil
key_total = 0

ARGF.each do |line|
  line = line.chomp
  next if line.empty? # skip blank lines (chomp never returns nil)

  (key, value) = line.split("\t")

  # check for new key
  if prev_key && key != prev_key # flush even when the previous total is 0

    # output total for previous key
    puts [prev_key, key_total].join("\t")

    # reset key total for new key
    prev_key  = key
    key_total = 0

  elsif !prev_key
    prev_key = key

  end

  # add to count for this current key
  key_total += value.to_i

end

# this is to catch the final counts after all records have been received
puts [prev_key, key_total].join("\t") if prev_key # nothing to print on empty input

Update

Here are the new mapper and reducer, based on the suggestion in the accepted answer:

Mapper:

require 'csv'

pattern = File.join(File.expand_path('data', File.dirname(__FILE__)), '*.txt')

Dir.glob(pattern).each do |file|
  CSV.foreach(file, col_sep: "\t", headers: false) do |row| # keyword options, also valid on Ruby 3+
    date_time = "#{row[0]} #{row[1]}:00:00#{row[2]}" # %Y-%m-%d %H:%M:%S%z
    puts [
             "#{row[6]}|#{row[3].rjust(8, '0')}", # key = ID1 | ID2
             "#{row[7]}|#{date_time}", # value = count | date_time
         ].join("\t")
  end
end

Reducer:

require 'date'

prev_key  = nil
key_total = 0
dates = []

ARGF.each do |line|
  line = line.chomp
  next if line.empty? # skip blank lines (chomp never returns nil)

  (key, values) = line.split("\t")
  (value, date_time) = values.split('|')

  # check for new key
  if prev_key && key != prev_key # flush even when the previous total is 0

    # output total for previous key
    puts [prev_key.split('|'), key_total, dates.max].join("\t")

    # reset key total for new key
    prev_key  = key
    key_total = 0

    # reset dates array for new key
    dates.clear

  elsif !prev_key
    prev_key = key

  end

  # add date to array for this current key
  dates << DateTime.strptime(date_time, '%Y-%m-%d %H:%M:%S%z')

  # add to count for this current key
  key_total += value.to_i

end

# this is to catch the final counts after all records have been received
puts [prev_key.split('|'), key_total, dates.max].join("\t") if prev_key

1 Answer:

Answer 0 (score: 0)

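You just need to put the date and the count into a pair <date, count> and emit that pair as the value in the mapper. Then, in the reducer, extract the date and the count from the pair. Compute the sum the way you already do, and keep track of the maximum date among the incoming values for each key.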
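As a rough illustration of that idea, here is a minimal reducer sketch in Ruby. It is not the asker's exact code: the method name reduce_sum_and_max_date and the "count|date_time" value layout are assumptions made for the example. It also assumes the input arrives sorted by key (as Hadoop streaming delivers it) and that every timestamp uses the same fixed-width "%Y-%m-%d %H:%M:%S" layout and time zone, so a plain string comparison matches chronological order and only a running maximum has to be kept per key.

```ruby
require 'stringio'

# Reduce sorted "key<TAB>count|date_time" lines to "key<TAB>sum<TAB>max_date".
# Hypothetical helper for illustration; reads from any IO-like object.
def reduce_sum_and_max_date(input, output)
  prev_key = nil
  total    = 0
  max_date = nil

  input.each_line do |line|
    line = line.chomp
    next if line.empty?

    # The key itself contains '|', so split on the tab first.
    key, pair = line.split("\t", 2)
    count, date_time = pair.split('|', 2)

    # Key changed: emit the finished group and reset the accumulators.
    if prev_key && key != prev_key
      output.puts [prev_key, total, max_date].join("\t")
      total    = 0
      max_date = nil
    end
    prev_key = key

    total += count.to_i
    # Keep only the running maximum (string comparison, see assumptions above).
    max_date = date_time if max_date.nil? || date_time > max_date
  end

  # Emit the last group, if any input was seen.
  output.puts [prev_key, total, max_date].join("\t") if prev_key
end

# Sample run using the data from the question.
sample = [
  "3000|001\t2|2014-12-30 18:00:00",
  "3000|001\t1|2015-01-01 10:00:00",
  "3000|002\t5|2014-11-18 12:53:00",
  "3000|002\t3|2014-12-20 20:14:00"
].join("\n")

out = StringIO.new
reduce_sum_and_max_date(StringIO.new(sample), out)
puts out.string
```

In a real streaming job the method would be called with $stdin and $stdout instead of StringIO objects. If the timestamps can carry different zone offsets, parsing them (for example with DateTime.strptime) before comparing would be the safer choice.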