Question

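I'm busy working on an ETL pipeline, but for this particular problem I need to take a table of data and turn each column into a set - that is, an array of unique values.

I'm struggling to wrap my head around how I would accomplish this within the Kiba framework.

Here's the essence of what I'm trying to achieve:

Source: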

[
  { dairy: "Milk",   protein: "Steak",   carb: "Potatoes" },
  { dairy: "Milk",   protein: "Eggs",    carb: "Potatoes" },
  { dairy: "Cheese", protein: "Steak",   carb: "Potatoes" },
  { dairy: "Cream",  protein: "Chicken", carb: "Potatoes" },
  { dairy: "Milk",   protein: "Chicken", carb: "Pasta" },
]

Target:

{
  dairy:   ["Milk", "Cheese", "Cream"],
  protein: ["Steak", "Eggs", "Chicken"],
  carb:    ["Potatoes", "Pasta"],
}
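
For reference, the conversion itself can be sketched in plain Ruby, independent of Kiba. This is only an illustration of the target shape, using the sample rows above; the variable names `rows`, `column_sets`, and `result` are made up for the example.

```ruby
require "set"

# Sample rows copied from the source table above.
rows = [
  { dairy: "Milk",   protein: "Steak",   carb: "Potatoes" },
  { dairy: "Milk",   protein: "Eggs",    carb: "Potatoes" },
  { dairy: "Cheese", protein: "Steak",   carb: "Potatoes" },
  { dairy: "Cream",  protein: "Chicken", carb: "Potatoes" },
  { dairy: "Milk",   protein: "Chicken", carb: "Pasta" },
]

# Collect each column's values into its own Set; a Set drops duplicates
# and keeps values in first-seen order.
column_sets = rows.each_with_object(Hash.new { |hash, col| hash[col] = Set.new }) do |row, sets|
  row.each { |col, value| sets[col] << value }
end

# Convert the sets to arrays to match the target shape.
result = column_sets.transform_values(&:to_a)
```

The block-form default (`Hash.new { ... }`) gives every column a separate `Set`, which avoids accidentally sharing one default object across keys.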

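Is something like this a) doable in Kiba, and b) even advisable to do in Kiba?

Any help would be greatly appreciated.

Update - partially solved.

I've found a partial solution. This transformer class will turn a table of rows into a hash of sets, but I'm stuck on how to get that data out using an ETL destination. I suspect I'm using Kiba in a way it isn't intended to be used.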

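require "set"

class ColumnSetTransformer
  def initialize
    # The default value is a single empty Set; process never mutates it in place.
    @col_set = Hash.new(Set.new)
  end

  def process(row)
    row.each do |col, col_val|
      # Set#+ returns a new Set, so each column ends up with its own set.
      @col_set[col] = @col_set[col] + [col_val]
    end

    @col_set
  end
end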

Answer 1

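Your solution will work just fine, and indeed the reason for this design in Kiba (mostly "Plain Old Ruby Objects") is to make it easy to call the components yourself, should you need to! (This is very useful for testing.)

That said, here are a few extra possibilities.

What you are doing is a form of aggregation, which can be implemented in various ways.

Buffering destination

Here the buffer would actually be a single row. Use code such as: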

require "set"

class MyBufferingDestination
  attr_reader :single_output_row

  def initialize(config:)
    # One Set per column, created the first time the column is seen.
    @single_output_row = Hash.new { |hash, col| hash[col] = Set.new }
  end

  def write(row)
    row.each do |col, col_val|
      single_output_row[col] << col_val
    end
  end

  def close # will be called by Kiba at the end of the run
    # here you'd write your output
  end
end
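
To make the `close` step concrete, here is a minimal sketch of a buffering destination that emits one JSON object at the end. The class name `ColumnSetJSONDestination` and its `io:` keyword are hypothetical, chosen for this example; the destination is driven by hand here, calling `write` for each row and `close` once at the end.

```ruby
require "json"
require "set"
require "stringio"

# Hypothetical destination: buffers one Set per column, then writes
# a single JSON object when close is called.
class ColumnSetJSONDestination
  def initialize(io:)
    @io = io
    @column_sets = Hash.new { |hash, col| hash[col] = Set.new }
  end

  # Called once per row: fold the row's values into the buffer.
  def write(row)
    row.each { |col, value| @column_sets[col] << value }
  end

  # Called once at the end: flush the single aggregated row.
  def close
    @io.write(JSON.generate(@column_sets.transform_values(&:to_a)))
  end
end

# Drive the destination by hand: write per row, then close once.
io = StringIO.new
destination = ColumnSetJSONDestination.new(io: io)
[
  { dairy: "Milk",   carb: "Potatoes" },
  { dairy: "Cheese", carb: "Potatoes" },
].each { |row| destination.write(row) }
destination.close

json_output = io.string
```

Inside a Kiba job the same class could presumably be declared with `destination ColumnSetJSONDestination, io: some_io`; taking an IO object rather than a file name is just a choice made here to keep the sketch easy to test.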

Using an instance variable to aggregate + post_process block

pre_process do
  @output_row = {}
end

transform do |row|
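  row.each do |col, col_val|
    # snipped in the original; one possible way to accumulate unique values per column
    @output_row[col] = (@output_row[col] || []) | [col_val]
  end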
  row
end

post_process do
  # convert @output_row to something
  # you can invoke a destination manually, or do something else
end

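Soon possible: using a buffering transform

As stated here, it will soon be possible to create buffering transforms, to better decouple the aggregating mechanism from the destination itself.

It will go like this: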

class MyAggregatingTransform
  def process(row)
    @aggregate += xxx
    nil # remove the row from the pipeline
  end

  def close
    # not yet possible, but soon
    yield @aggregate
  end
end

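This will be the best design, because you will then be able to reuse existing destinations without modifying them to support buffering, so they will become more generic and reusable: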

transform MyAggregatingTransform

destination MyJSONDestination, file: "some.json"

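It will even be possible to have multiple rows in the destination, by detecting boundaries in the input dataset and yielding accordingly.

I will update the SO answer once this is possible.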
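
The decoupling described above can be illustrated without Kiba at all. In the sketch below, the loop at the bottom is a hand-rolled stand-in for a runner (an assumption made for illustration, not Kiba's implementation), and `ColumnSetAggregatingTransform` and `ArrayDestination` are hypothetical names. It only shows the idea that the transform swallows rows in `process` and yields one aggregated row from `close`, so the destination needs no buffering logic.

```ruby
require "set"

# Hypothetical aggregating transform following the shape sketched above.
class ColumnSetAggregatingTransform
  def initialize
    @aggregate = Hash.new { |hash, col| hash[col] = Set.new }
  end

  def process(row)
    row.each { |col, value| @aggregate[col] << value }
    nil # remove the row from the pipeline
  end

  # Yields the single aggregated row once all input has been seen.
  def close
    yield @aggregate.transform_values(&:to_a)
  end
end

# Hypothetical generic destination that knows nothing about buffering.
class ArrayDestination
  attr_reader :rows

  def initialize
    @rows = []
  end

  def write(row)
    @rows << row
  end
end

transform = ColumnSetAggregatingTransform.new
destination = ArrayDestination.new

input_rows = [
  { dairy: "Milk",  carb: "Potatoes" },
  { dairy: "Cream", carb: "Potatoes" },
  { dairy: "Milk",  carb: "Pasta" },
]

# Stand-in runner: forward non-nil rows, then flush the transform at the end.
input_rows.each do |row|
  out = transform.process(row)
  destination.write(out) if out
end
transform.close { |row| destination.write(row) }
```

After the loop, `destination.rows` holds exactly one row, the aggregated hash, even though three rows went in.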

Answer 2

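OK - so, using Kiba within a job context doesn't seem to be how the tool was intended to be used. I wanted to use Kiba because I've already implemented a lot of related E, T, and L code for this project, and the reuse would be huge.

So, if I have the code to reuse but can't use it within the Kiba framework, I can just call it as if it were normal code. This is all thanks to Thibaut's exceedingly simple design!

Here's how I solved the problem: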

source  = CSVOrXLSXSource.new("data.xlsx", document_config: { some: :settings })
xformer = ColumnSetTransformer.new

source.each do |row|
  xformer.process(row)
end

p xformer.col_set # col_set must be attr_reader on this class.
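
A self-contained variant of the same approach, for anyone who wants to try it without the spreadsheet source: `InMemorySource` is a hypothetical stand-in for `CSVOrXLSXSource`, and the transformer below is a lightly adjusted version of the one in the question, with the `attr_reader` added and a block-form hash default (both are choices made for this sketch).

```ruby
require "set"

# The question's transformer with col_set exposed through attr_reader.
class ColumnSetTransformer
  attr_reader :col_set

  def initialize
    @col_set = Hash.new { |hash, col| hash[col] = Set.new }
  end

  def process(row)
    row.each { |col, col_val| @col_set[col] << col_val }
    row # hand the row back unchanged, so it could still flow through a pipeline
  end
end

# Hypothetical in-memory source exposing the same each interface.
class InMemorySource
  def initialize(rows)
    @rows = rows
  end

  def each(&block)
    @rows.each(&block)
  end
end

source = InMemorySource.new([
  { dairy: "Milk",   protein: "Steak" },
  { dairy: "Cheese", protein: "Steak" },
  { dairy: "Milk",   protein: "Eggs" },
])
xformer = ColumnSetTransformer.new

source.each { |row| xformer.process(row) }

p xformer.col_set
```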

And now I have my data easily transformed :)

Converting a table to a hash of sets using Kiba-ETL
