I am trying to implement a BaseStep in neuraxle (0.5.2) that filters the data_input (and filters the expected_output accordingly).
from neuraxle.base import BaseStep, NonFittableMixin
from neuraxle.steps.output_handlers import InputAndOutputTransformerMixin


class DataFrameQuery(NonFittableMixin, InputAndOutputTransformerMixin, BaseStep):
    def __init__(self, query):
        super().__init__()
        self.query = query

    def transform(self, data_input):
        data_input, expected_output = data_input
        # verify that input and output are either pd.DataFrame or pd.Series
        # ... [redacted] ...
        new_data_input = data_input.query(self.query)
        if all(output is None for output in expected_output):
            new_expected_output = [None] * len(new_data_input)
        else:
            new_expected_output = expected_output.loc[new_data_input.index]
        return new_data_input, new_expected_output
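Stripped of the Neuraxle machinery, the intended filtering behaviour can be checked with plain pandas (a minimal sketch using the same toy data as the pipeline example in this question):

```python
import pandas as pd

# Same toy data as in the pipeline example below.
data_input = pd.DataFrame([{"A": 1, "B": 1}, {"A": 2, "B": 2}], index=[1, 2])
expected_output = pd.Series([1, 2], index=[1, 2])

# Filter the inputs by the query, then align the outputs on the surviving index.
new_data_input = data_input.query("A == 1")
new_expected_output = expected_output.loc[new_data_input.index]
```

Only the row with A == 1 survives, and the expected output is aligned to the same (shrunken) index, which is exactly why the lengths no longer match the original current ids.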
This will naturally (in most cases) change the len(data_inputs) (and of expected_outputs). With the latest version of neuraxle, I get an AssertionError:
data_input = pd.DataFrame([{"A": 1, "B": 1}, {"A": 2, "B": 2}], index=[1, 2])
expected_output = pd.Series([1, 2], index=[1, 2])

pipeline = Pipeline([
    DataFrameQuery("A == 1")
])

pipeline.fit_transform(data_input, expected_output)
AssertionError: InputAndOutputTransformerMixin:
Caching broken because there is a different len of current ids, and data inputs.
Please use InputAndOutputTransformerWrapper if you plan to change the len of the data inputs.
As far as I understand, this is where Neuraxle's handler methods come into play. However, so far I have not found one that would allow me to update the current_ids of both inputs and outputs after the transform (I would guess it should be _did_transform, but that one does not seem to be called).
In general:

- What is the correct way to update the current_ids?
- What should be considered when applying side effects to the data_container?
- Are the identifiers used to split the data for SIMD parallelism? Should new identifiers generally be a sequence of integers?

Edit: I have also tried setting the savers and using an InputAndOutputTransformerWrapper as described here. I still get the following error (probably because I am not sure where to call handle_transform):
AssertionError: InputAndOutputTransformerWrapper:
Caching broken because there is a different len of current ids, and data inputs.
Please resample the current ids using handler methods, or create new ones by setting the wrapped step saver to HashlibMd5ValueHasher using the BaseStep.set_savers method.
Edit: For now, I have solved the problem as follows:
class OutputShapeChangingStep(NonFittableMixin, InputAndOutputTransformerMixin, BaseStep):
    def __init__(self, idx):
        super().__init__()
        self.idx = idx

    def _update_data_container_shape(self, data_container):
        assert len(data_container.expected_outputs) == len(data_container.data_inputs)
        data_container.set_current_ids(range(len(data_container.data_inputs)))
        data_container = self.hash_data_container(data_container)
        return data_container

    def _set_data_inputs_and_expected_outputs(self, data_container, new_inputs, new_expected_outputs) -> DataContainer:
        data_container.set_data_inputs(new_inputs)
        data_container.set_expected_outputs(new_expected_outputs)
        data_container = self._update_data_container_shape(data_container)
        return data_container

    def transform(self, data_inputs):
        data_inputs, expected_outputs = data_inputs
        return data_inputs[self.idx], expected_outputs[self.idx]
In this case, I am most likely overriding _set_data_inputs_and_expected_outputs of InputAndOutputTransformerMixin "incorrectly" (would _transform_data_container be the better choice?), but updating the current_ids like this (and rehashing the container) seems to work. Still, I would be interested in how to do this in a way that is more in line with what Neuraxle's API expects.
Answer 0 (score: 1)
Personally, my favourite way is to use handler methods only. I think it is much cleaner. Example usage of handler methods:
import numpy as np

from neuraxle.base import BaseTransformer, ExecutionContext, ForceHandleMixin
from neuraxle.data_container import DataContainer


class WindowTimeSeries(ForceHandleMixin, BaseTransformer):
    def __init__(self):
        BaseTransformer.__init__(self)
        ForceHandleMixin.__init__(self)

    def _transform_data_container(self, data_container: DataContainer, context: ExecutionContext) -> DataContainer:
        di = data_container.data_inputs
        new_di, new_eo = np.array_split(np.array(di), 2)
        return DataContainer(
            summary_id=data_container.summary_id,
            data_inputs=new_di,
            expected_outputs=new_eo
        )
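For clarity, the np.array_split call above simply halves the array: the first half becomes the new data inputs and the second half the expected outputs. The splitting behaviour, illustrated standalone:

```python
import numpy as np

di = np.array([0, 1, 2, 3, 4, 5])

# Split into two equal halves: inputs and expected outputs.
new_di, new_eo = np.array_split(di, 2)

print(new_di.tolist())  # first half
print(new_eo.tolist())  # second half
```

Since the returned DataContainer carries equally many data inputs and expected outputs (and the original summary_id), the length assertion from the question no longer trips.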
This way, the current ids are recreated and hashed with the default behaviour. Note: the summary id is the most important one. It is created at the very beginning and rehashed with the hyperparameters. If needed, you can also generate new current ids with a custom hasher such as HashlibMd5ValueHasher.
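Conceptually, a value hasher like HashlibMd5ValueHasher derives one deterministic id per data point from the data point's value. A rough illustration of that idea with plain hashlib (this is a hypothetical sketch of the concept, not Neuraxle's actual implementation):

```python
import hashlib

def md5_value_ids(values):
    # Hypothetical helper: one deterministic hex id per value (illustrative only).
    return [hashlib.md5(str(v).encode("utf-8")).hexdigest() for v in values]

ids = md5_value_ids([1, 2, 3])
```

Because each id is a function of the value alone, the ids stay consistent across runs and their count always matches the (possibly resized) data.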
Edit: there was indeed a bug. It has been fixed here: https://github.com/Neuraxio/Neuraxle/pull/379
Example usage:

step = InputAndOutputTransformerWrapper(WindowTimeSeriesForOutputTransformerWrapper()) \
    .set_hashers([HashlibMd5ValueHasher()])

step = StepThatInheritsFromInputAndOutputTransformerMixin() \
    .set_hashers([HashlibMd5ValueHasher()])