How to make FeatureUnion return a DataFrame

Time: 2016-04-15 16:18:14

Tags: python-2.7 machine-learning scikit-learn pipeline feature-extraction

I currently have a pipeline with a lot of custom transformers:

from sklearn.pipeline import Pipeline

# TimeTransformer, ZipTransformer and GroupByTransformer are my own classes (not shown).
p = Pipeline([
    ("GetTimeFromDate", TimeTransformer("Date")),  # custom transformer that adds a ["time"] column
    ("GetZipFromAddress", ZipTransformer("Address")),  # custom transformer that adds a ["zip"] column
    ("GroupByTimeandZip", GroupByTransformer(["time", "zip"])),  # custom transformer that adds one-hot columns
])

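Each transformer takes a pandas DataFrame and returns the same DataFrame with one or more new columns. This actually works quite well, but how can I run the "GetTimeFromDate" and "GetZipFromAddress" steps in parallel?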
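For readers who want something concrete, here is a minimal sketch of what one of these column-adding transformers might look like. The question does not show the real `TimeTransformer`, so the body below (extracting the hour of day) and the sample data are assumptions made purely for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class TimeTransformer(BaseEstimator, TransformerMixin):
    """Assumed implementation: adds a "time" column holding the hour of day."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        out = X.copy()  # leave the caller's DataFrame untouched
        out["time"] = pd.to_datetime(out[self.column]).dt.hour
        return out


# Made-up sample data, only to exercise the sketch.
df = pd.DataFrame({"Date": ["2016-04-15 16:18:14", "2016-04-16 09:05:00"]})
p = Pipeline([("GetTimeFromDate", TimeTransformer("Date"))])
result = p.fit_transform(df)  # still a DataFrame, now with a "time" column
```

Because every step both accepts and returns a DataFrame, the steps can be chained in a Pipeline without any conversion in between.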

I would like to use FeatureUnion:

from sklearn.pipeline import FeatureUnion

f = FeatureUnion([
    ("GetTimeFromDate", TimeTransformer("Date")),  # custom transformer that adds a ["time"] column
    ("GetZipFromAddress", ZipTransformer("Address")),  # custom transformer that adds a ["zip"] column
])

p = Pipeline([
    ("FeatureUnionStep", f),
    ("GroupByTimeandZip", GroupByTransformer(["time", "zip"])),  # custom transformer that adds one-hot columns
])

The problem is that FeatureUnion returns a numpy.ndarray, while the "GroupByTimeandZip" step needs a DataFrame.
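The behaviour can be reproduced with a small, self-contained sketch. The two `FunctionTransformer` steps and the toy columns `a` and `b` are stand-ins invented for this example, not the transformers from the question. Each step returns a DataFrame, yet with scikit-learn's default output settings the union stacks the results into a plain array and the column names are lost.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Toy data; the column names are illustrative only.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

union = FeatureUnion([
    # Each step receives the full DataFrame and returns a one-column DataFrame.
    ("double_a", FunctionTransformer(lambda X: X[["a"]] * 2)),
    ("negate_b", FunctionTransformer(lambda X: -X[["b"]])),
])

out = union.fit_transform(df)
print(type(out))  # a NumPy array rather than a DataFrame
```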

Is there a way to make FeatureUnion return a pandas DataFrame?

1 Answer:

Answer 0 (score: 1)

To make FeatureUnion output a DataFrame, you can use the PandasFeatureUnion from this blog post. See also the gist.
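The linked code is not reproduced here, so the following is only a rough sketch of the general idea and not the PandasFeatureUnion from the blog post. The class name `DataFrameUnion`, the `FunctionTransformer` stand-ins, and the sample data are all hypothetical. The sketch runs every transformer on the same input and joins the results with `pd.concat`, dropping the repeated input columns that each transformer carries along. It runs the steps sequentially and fits the transformers in place, so it does not provide the parallelism or cloning that FeatureUnion offers.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer


class DataFrameUnion(BaseEstimator, TransformerMixin):
    """Hypothetical helper: apply each transformer to the same DataFrame
    and join the results column-wise, keeping the output a DataFrame."""

    def __init__(self, transformer_list):
        self.transformer_list = transformer_list

    def fit(self, X, y=None):
        for _, transformer in self.transformer_list:
            transformer.fit(X, y)
        return self

    def transform(self, X):
        frames = [t.transform(X) for _, t in self.transformer_list]
        combined = pd.concat(frames, axis=1)
        # Every transformer returns the input columns plus its own,
        # so keep only the first occurrence of each column name.
        return combined.loc[:, ~combined.columns.duplicated()]


# Made-up data and stand-in transformers, for illustration only.
df = pd.DataFrame({
    "Date": ["2016-04-15 16:18:14", "2016-04-16 09:05:00"],
    "Address": ["1 Main St, 94105", "2 Oak Ave, 10001"],
})
union = DataFrameUnion([
    ("GetTimeFromDate",
     FunctionTransformer(lambda X: X.assign(time=pd.to_datetime(X["Date"]).dt.hour))),
    ("GetZipFromAddress",
     FunctionTransformer(lambda X: X.assign(zip=X["Address"].str[-5:]))),
])
result = union.fit(df).transform(df)
```

Since the result keeps its column names, a downstream step that selects `["time", "zip"]` can follow it in a Pipeline. On scikit-learn 1.2 and later, the built-in `set_output(transform="pandas")` API may be an alternative worth checking before writing a custom class.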