Grouping a large array of hashes by date range

Asked: 2014-02-28 18:36:06

Tags: ruby arrays hash

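I have a large array (~5MB) of hashes that I need to group by rolling date ranges.

Here is the Ruby method that turns the array into the rolling data set I'm looking for: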

def rolling(options = {})
  rolling_items = []

  options[:date_range].each do |day|
    start_date = rolling_start_date(day)
    end_date = day

    range = start_date..end_date

    # Each element of options[:data] is a single hash, so the block takes one argument.
    new_items = options[:data].select { |item| range.cover?(Date.parse(item[:created].to_s)) }.uniq { |h| h[:customer] }

    amount = new_items.count


    rolling_items.push({created: day, amount: amount})
  end

  rolling_items
end

It calls a rolling_start_date method, which takes a day and returns the start date of that day's rolling range:

def rolling_start_date(end_date)
  old = Time.utc(end_date.year, end_date.month, end_date.day)
  previous = old - 1.month

  if old.day > previous.day
     start_date = previous + 1.day
  else
     start_date = old - 1.month + 1.day
  end

  start_date.to_date
end
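
For readers without ActiveSupport, the same one-month window can be sketched in plain Ruby. This is only an illustration: rolling_start_date_plain is a hypothetical name that is not part of the code above, and it assumes the argument is a Date. Date#<< steps back one calendar month and clamps to the last valid day of that month, which appears to match what subtracting 1.month does here.

```ruby
require 'date'

# Hypothetical plain-Ruby counterpart of rolling_start_date (no ActiveSupport).
def rolling_start_date_plain(end_date)
  # Step back one calendar month (clamped to that month's last day),
  # then move forward one day so the window covers exactly one month.
  (end_date << 1) + 1
end

puts rolling_start_date_plain(Date.new(2014, 2, 28)) # 2014-01-29
puts rolling_start_date_plain(Date.new(2014, 3, 31)) # 2014-03-01
```

Note that both branches of the if in rolling_start_date compute the same value, so a single expression is enough.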

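I call the rolling method like this: rolling(date_range: Date.current.beginning_of_day-1.year..Date.current.end_of_day, data: customers)

Here is a gist of the huge array of customers, which is what gets passed as data in the call above.

The rolling method then iterates over every day in date_range and finds its rolling_start_date. For each day it selects the hashes whose created date falls inside that range, counts the unique customers, and pushes the result onto a new rolling_items array, so I end up with an array like this: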

[
   {:created=>Fri, 21 Feb 2014, :amount=>2711}, 
   {:created=>Sat, 22 Feb 2014, :amount=>2716}, 
   {:created=>Sun, 23 Feb 2014, :amount=>2720}, 
   {:created=>Mon, 24 Feb 2014, :amount=>2731}, 
   {:created=>Tue, 25 Feb 2014, :amount=>2746}, 
   {:created=>Wed, 26 Feb 2014, :amount=>2761}, 
   {:created=>Thu, 27 Feb 2014, :amount=>2765}, 
   {:created=>Fri, 28 Feb 2014, :amount=>2754}, 
   ...
]

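...where each hash holds the total number of unique customers within that day's date range.

So I need to figure out how to do this and still get the unique customer count for each rolling date range, without looping over the 5MB array 365 times.

1 Answer:

Answer 0: (score: 0)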

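Maybe I'm not understanding the purpose, but couldn't you iterate over the customers array just once and work out the range of days each customer counts toward? If I understand correctly, that range is always one month, so I could simply say that customer X, who created a plan on February 1, 2013, adds 1 unique customer to every day from February 1 through February 28, right? That is, as long as we haven't already counted him (unique customers), each customer simply "generates" a +1 for all of those days. Again, maybe I'm not getting you right, but if what I said holds, you could do something like this: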

rolling_items = {}

customers.each do |customer|
  start_date = Date.parse(customer[:created])
  end_date = start_date + 30
  (start_date..end_date).each do |date|
    # Add empty Hash with default value 0 if date was not yet in Hash.
    # Add 1 for the customer, so we can see duplicates if we want
    (rolling_items[date] ||= Hash.new(0))[customer[:customer]] += 1
  end
end

rolling_items.each do |date, customers|
  uniq_customers = customers.keys.size # Hash keys are already unique, just count
  puts "\n%s => %s unique customers" % [date.strftime, uniq_customers]
  puts "-" * 20
  customers.each do |customer, times|
    puts "%s => %d" % [customer, times]
  end
end

# 2013-02-28 => 7 unique customers
# --------------------
# cus_05eOKvdnc3MkJO => 2
# cus_0e7LBxIfqSyLAP => 2
# cus_05HVTILpv7CuVS => 2
# cus_1CD4BnX3jDcA3g => 2
# cus_0G9GwU25yAT0ih => 1
# cus_1BqrfANA13SoNc => 3
# cus_0S12vFMb8r6ef1 => 2

# 2013-03-01 ... etc
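
If the result should have the same shape as the question's rolling output, the same single-pass idea can be sketched with one Set per day. This is a rough sketch rather than a tested drop-in: rolling_unique_counts and the sample customer ids are hypothetical, date_range is assumed to be a Range of Date objects, and the window rule is assumed to be "one calendar month back plus one day", as in rolling_start_date.

```ruby
require 'date'
require 'set'

# Hypothetical helper: one pass over the data, one Set of customer ids per day.
def rolling_unique_counts(data, date_range)
  seen = Hash.new { |hash, day| hash[day] = Set.new }

  data.each do |item|
    created = Date.parse(item[:created].to_s)
    # A day counts this item when its window start, (day << 1) + 1,
    # is on or before the created date. Scanning 31 days ahead is
    # enough to reach every such day; each one is checked explicitly.
    (created..(created + 31)).each do |day|
      next unless date_range.cover?(day)
      seen[day] << item[:customer] if (day << 1) + 1 <= created
    end
  end

  # fetch avoids creating empty sets for days nobody contributed to.
  date_range.map { |day| { created: day, amount: seen.fetch(day, []).size } }
end

# Made-up sample data, only for illustration.
sample = [
  { customer: 'cus_a', created: '2014-02-01' },
  { customer: 'cus_a', created: '2014-02-01' }, # duplicate entry, counted once
  { customer: 'cus_b', created: '2014-02-15' }
]

result = rolling_unique_counts(sample, Date.new(2014, 2, 27)..Date.new(2014, 3, 2))
result.each { |row| puts "#{row[:created]} => #{row[:amount]}" }
# 2014-02-27 => 2
# 2014-02-28 => 2
# 2014-03-01 => 1
# 2014-03-02 => 1
```

Each item touches at most 32 days, so the work grows with the size of the array instead of with the array size times the number of days in the range.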

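By the way, there are a lot of duplicate customer entries in there with identical dates, and I'm not sure whether that is intentional. I took the first 14 items of the giant array.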