How do I log the moment a transformation is actually invoked on a DataFrame?

Asked: 2016-10-20 14:18:04

Tags: logging apache-spark spark-dataframe rdd apache-spark-mllib

I'm building an ML pipeline that extracts features from a DataFrame, and I would like it to behave as follows:

  • Log "Extracting feature 1"
  • Extract feature 1
  • Log "Extracting feature 2"
  • Extract feature 2
  • ...
  • Log "Extracting feature n"
  • Extract feature n

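The thing is, transformations are lazy, so all the log calls run while the plan is being built and the extraction only happens later, when an action is triggered. I end up with the following instead: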

  • Log "Extracting feature 1"
  • Log "Extracting feature 2"
  • Log "Extracting feature n"
  • Extract feature 1
  • Extract feature 2
  • ...
  • Extract feature n
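
To make the ordering concrete, here is a minimal sketch that reproduces it. It assumes Spark 2.x with a local `SparkSession`; the object `LazyLoggingDemo`, the UDF `extractFeature1`, and the column names are hypothetical stand-ins for my real extractors, not part of the actual pipeline.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical demo object; names are illustrative only.
object LazyLoggingDemo {

  def extractedValues(): Seq[Long] = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("lazy-logging-demo")
      .getOrCreate()

    try {
      // Stand-in feature extractor. The println inside the UDF shows
      // when Spark really evaluates it.
      val extractFeature1 = udf { (x: Long) =>
        println(s"executing: extracting feature 1 for id=$x")
        x * 2
      }

      // A single partition keeps the row order predictable.
      val df = spark.range(0, 3, 1, 1).toDF("id")

      // This runs immediately on the driver.
      println("log: extracting feature 1")

      // This only adds a step to the query plan; the UDF is not called here.
      val withFeature = df.withColumn("feature1", extractFeature1(col("id")))
      println("log: transformation declared, nothing extracted yet")

      // The action triggers execution, and only now does the UDF run.
      withFeature.collect().map(_.getLong(1)).toSeq
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(extractedValues())
}
```

In local mode, both "log:" lines are printed before any "executing:" line, which is the same inversion as in the list above.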

My transform method looks something like this:

override def transform(dataset: DataFrame): DataFrame = {
   require(featuresToExtract.nonEmpty, "You must provide at least one feature to extract to use this FeatureExtractorTransformer")

   // Start from the first feature, then outer-join every remaining feature onto it
   var joinedDataFrame = extract(dataset, featuresToExtract.head)

   for (featureToExtract <- featuresToExtract.tail) {
     // LOGGING HERE THAT I WANT CALLED JUST BEFORE THE CORRESPONDING ACTION
     joinedDataFrame = joinedDataFrame.join(extract(dataset, featureToExtract), joinOn, "outer")
   }
   joinedDataFrame
}

So, any ideas on how to go about this?

Thanks

0 answers:

No answers yet