How to create "stacked" features in Featuretools using DFS

Date: 2019-11-14 22:54:03

Tags: python featuretools feature-engineering

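According to the documentation, increasing max_depth should produce complex "stacked" features.

I'm finding that raising max_depth beyond 2 makes no difference to the features produced.

What am I doing wrong?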

max_depth = 1: the original features

feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity='fish',
                                  max_depth=1)

features

>>>[<Feature: sex>,
 <Feature: length>,
 <Feature: diameter>,
 <Feature: height>,
 <Feature: whole_weight>,
 <Feature: shucked_weight>,
 <Feature: viscera_weight>,
 <Feature: shell_weight>]

max_depth = 2: basic primitives

feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity='fish',
                                  max_depth=2)

features

>>>[<Feature: sex>,
 <Feature: length>,
 <Feature: diameter>,
 <Feature: height>,
 <Feature: whole_weight>,
 <Feature: shucked_weight>,
 <Feature: viscera_weight>,
 <Feature: shell_weight>,
 <Feature: sex_adult.SUM(fish.shell_weight)>,
 <Feature: sex_adult.SUM(fish.viscera_weight)>,
 <Feature: sex_adult.SUM(fish.shucked_weight)>,
 <Feature: sex_adult.SUM(fish.length)>,
 <Feature: sex_adult.SUM(fish.diameter)>,
 <Feature: sex_adult.SUM(fish.whole_weight)>,
 <Feature: sex_adult.SUM(fish.height)>,
 <Feature: sex_adult.STD(fish.shell_weight)>,
 <Feature: sex_adult.STD(fish.viscera_weight)>,
 <Feature: sex_adult.STD(fish.shucked_weight)>,
 <Feature: sex_adult.STD(fish.length)>,
 <Feature: sex_adult.STD(fish.diameter)>,
 <Feature: sex_adult.STD(fish.whole_weight)>,
 <Feature: sex_adult.STD(fish.height)>,
 <Feature: sex_adult.MAX(fish.shell_weight)>,
 <Feature: sex_adult.MAX(fish.viscera_weight)>,
 <Feature: sex_adult.MAX(fish.shucked_weight)>,
 <Feature: sex_adult.MAX(fish.length)>,
 <Feature: sex_adult.MAX(fish.diameter)>,
 <Feature: sex_adult.MAX(fish.whole_weight)>,
 <Feature: sex_adult.MAX(fish.height)>,
 <Feature: sex_adult.SKEW(fish.shell_weight)>,
 <Feature: sex_adult.SKEW(fish.viscera_weight)>,
 <Feature: sex_adult.SKEW(fish.shucked_weight)>,
 <Feature: sex_adult.SKEW(fish.length)>,
 <Feature: sex_adult.SKEW(fish.diameter)>,
 <Feature: sex_adult.SKEW(fish.whole_weight)>,
 <Feature: sex_adult.SKEW(fish.height)>,
 <Feature: sex_adult.MIN(fish.shell_weight)>,
 <Feature: sex_adult.MIN(fish.viscera_weight)>,
 <Feature: sex_adult.MIN(fish.shucked_weight)>,
 <Feature: sex_adult.MIN(fish.length)>,
 <Feature: sex_adult.MIN(fish.diameter)>,
 <Feature: sex_adult.MIN(fish.whole_weight)>,
 <Feature: sex_adult.MIN(fish.height)>,
 <Feature: sex_adult.MEAN(fish.shell_weight)>,
 <Feature: sex_adult.MEAN(fish.viscera_weight)>,
 <Feature: sex_adult.MEAN(fish.shucked_weight)>,
 <Feature: sex_adult.MEAN(fish.length)>,
 <Feature: sex_adult.MEAN(fish.diameter)>,
 <Feature: sex_adult.MEAN(fish.whole_weight)>,
 <Feature: sex_adult.MEAN(fish.height)>,
 <Feature: sex_adult.COUNT(fish)>]

max_depth = 3: the same features as max_depth = 2

feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity='fish',
                                  max_depth=3)

features

>>>[<Feature: sex>,
 <Feature: length>,
 <Feature: diameter>,
 <Feature: height>,
 <Feature: whole_weight>,
 <Feature: shucked_weight>,
 <Feature: viscera_weight>,
 <Feature: shell_weight>,
 <Feature: sex_adult.SUM(fish.shell_weight)>,
 <Feature: sex_adult.SUM(fish.viscera_weight)>,
 <Feature: sex_adult.SUM(fish.shucked_weight)>,
 <Feature: sex_adult.SUM(fish.length)>,
 <Feature: sex_adult.SUM(fish.diameter)>,
 <Feature: sex_adult.SUM(fish.whole_weight)>,
 <Feature: sex_adult.SUM(fish.height)>,
 <Feature: sex_adult.STD(fish.shell_weight)>,
 <Feature: sex_adult.STD(fish.viscera_weight)>,
 <Feature: sex_adult.STD(fish.shucked_weight)>,
 <Feature: sex_adult.STD(fish.length)>,
 <Feature: sex_adult.STD(fish.diameter)>,
 <Feature: sex_adult.STD(fish.whole_weight)>,
 <Feature: sex_adult.STD(fish.height)>,
 <Feature: sex_adult.MAX(fish.shell_weight)>,
 <Feature: sex_adult.MAX(fish.viscera_weight)>,
 <Feature: sex_adult.MAX(fish.shucked_weight)>,
 <Feature: sex_adult.MAX(fish.length)>,
 <Feature: sex_adult.MAX(fish.diameter)>,
 <Feature: sex_adult.MAX(fish.whole_weight)>,
 <Feature: sex_adult.MAX(fish.height)>,
 <Feature: sex_adult.SKEW(fish.shell_weight)>,
 <Feature: sex_adult.SKEW(fish.viscera_weight)>,
 <Feature: sex_adult.SKEW(fish.shucked_weight)>,
 <Feature: sex_adult.SKEW(fish.length)>,
 <Feature: sex_adult.SKEW(fish.diameter)>,
 <Feature: sex_adult.SKEW(fish.whole_weight)>,
 <Feature: sex_adult.SKEW(fish.height)>,
 <Feature: sex_adult.MIN(fish.shell_weight)>,
 <Feature: sex_adult.MIN(fish.viscera_weight)>,
 <Feature: sex_adult.MIN(fish.shucked_weight)>,
 <Feature: sex_adult.MIN(fish.length)>,
 <Feature: sex_adult.MIN(fish.diameter)>,
 <Feature: sex_adult.MIN(fish.whole_weight)>,
 <Feature: sex_adult.MIN(fish.height)>,
 <Feature: sex_adult.MEAN(fish.shell_weight)>,
 <Feature: sex_adult.MEAN(fish.viscera_weight)>,
 <Feature: sex_adult.MEAN(fish.shucked_weight)>,
 <Feature: sex_adult.MEAN(fish.length)>,
 <Feature: sex_adult.MEAN(fish.diameter)>,
 <Feature: sex_adult.MEAN(fish.whole_weight)>,
 <Feature: sex_adult.MEAN(fish.height)>,
 <Feature: sex_adult.COUNT(fish)>]

1 answer:

Answer 0 (score: 2)

Why doesn't increasing max_depth increase the number of features created?

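One thing that stands out in the feature lists is that every new feature comes from an aggregation primitive (max, mean, etc.). No new features were created with transform primitives.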

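Without seeing the schema of the entityset I can only guess, but all the variables on the fish entity appear to be either numeric (length, diameter, height, weight, etc.) or categorical (sex). The dfs call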

feature_matrix, features = ft.dfs(entityset=es,
                                  target_entity='fish',
                                  max_depth=2)

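does not use the trans_primitives option, so DFS falls back to the default set of transform primitives when it tries to create new features. That default set contains no primitives that can be applied to numeric or categorical variables, so no new transform features are created.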
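
The type matching described here can be illustrated with a small, library-free sketch. It does not call Featuretools. The primitive names and input types in DEFAULT_TRANS_PRIMITIVES are my assumption about the defaults in the 0.x releases of that period, the column types are guessed from the question, and applicable is a hypothetical helper, so check the ft.dfs documentation for your installed version.

```python
# Toy model of primitive/type matching; this does not call Featuretools.
# ASSUMPTION: these names and input types mirror the default transform
# primitives of Featuretools 0.x; verify against your installed version.
DEFAULT_TRANS_PRIMITIVES = {
    "day": "datetime",
    "year": "datetime",
    "month": "datetime",
    "weekday": "datetime",
    "haversine": "latlong",
    "num_words": "text",
    "num_characters": "text",
}

# ASSUMPTION: variable types guessed for part of the fish entity.
FISH_COLUMNS = {
    "sex": "categorical",
    "length": "numeric",
    "diameter": "numeric",
    "height": "numeric",
    "whole_weight": "numeric",
}


def applicable(primitives, columns):
    """Return (primitive, column) pairs whose types match (hypothetical helper)."""
    return [
        (prim, col)
        for prim, input_type in primitives.items()
        for col, col_type in columns.items()
        if input_type == col_type
    ]


# With the assumed defaults, no primitive accepts a numeric or categorical input.
no_match = applicable(DEFAULT_TRANS_PRIMITIVES, FISH_COLUMNS)

# Adding a primitive that takes numeric input yields one pair per numeric column.
with_percentile = applicable(
    {**DEFAULT_TRANS_PRIMITIVES, "percentile": "numeric"}, FISH_COLUMNS
)

print(no_match)
print(with_percentile)
```

In this toy model the first list is empty, which is the situation in the question: with nothing to apply at the transform step, a larger max_depth has nothing further to stack.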

I created a mock entityset to try to reproduce the situation:

import featuretools as ft
import numpy as np
import pandas as pd

fish = pd.DataFrame({
    "sex": np.random.choice(['F', 'M'], size=10),
    "length": np.random.sample(size=10),
    "weight": np.random.sample(size=10)
})

es = ft.EntitySet("fish")
es.entity_from_dataframe(entity_id="fish",
                         make_index=True,
                         index="id",
                         dataframe=fish)
es.normalize_entity(base_entity_id='fish',
                    new_entity_id='sex_adult',
                    index='sex')

I also got new features created only with aggregation primitives.

ft.dfs(entityset=es,
       target_entity='fish',
       max_depth=2,
       features_only=True)

>>>[<Feature: sex>,
 <Feature: length>,
 <Feature: weight>,
 <Feature: sex_adult.SUM(fish.length)>,
 <Feature: sex_adult.SUM(fish.weight)>,
 <Feature: sex_adult.STD(fish.length)>,
 <Feature: sex_adult.STD(fish.weight)>,
 <Feature: sex_adult.MAX(fish.length)>,
 <Feature: sex_adult.MAX(fish.weight)>,
 <Feature: sex_adult.SKEW(fish.length)>,
 <Feature: sex_adult.SKEW(fish.weight)>,
 <Feature: sex_adult.MIN(fish.length)>,
 <Feature: sex_adult.MIN(fish.weight)>,
 <Feature: sex_adult.MEAN(fish.length)>,
 <Feature: sex_adult.MEAN(fish.weight)>,
 <Feature: sex_adult.COUNT(fish)>]

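Increasing max_depth to 3 or more does not create any more features. However, once I add the Percentile transform primitive (which can be applied to numeric values) through the trans_primitives option, I get a different result.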

ft.dfs(entityset=es,
       target_entity='fish',
       max_depth=2,
       trans_primitives=[ft.primitives.Percentile],
       features_only=True)

>>>[<Feature: sex>,
 <Feature: length>,
 <Feature: weight>,
 <Feature: PERCENTILE(length)>,
 <Feature: PERCENTILE(weight)>,
 <Feature: sex_adult.SUM(fish.length)>,
 <Feature: sex_adult.SUM(fish.weight)>,
 <Feature: sex_adult.STD(fish.length)>,
 <Feature: sex_adult.STD(fish.weight)>,
 <Feature: sex_adult.MAX(fish.length)>,
 <Feature: sex_adult.MAX(fish.weight)>,
 <Feature: sex_adult.SKEW(fish.length)>,
 <Feature: sex_adult.SKEW(fish.weight)>,
 <Feature: sex_adult.MIN(fish.length)>,
 <Feature: sex_adult.MIN(fish.weight)>,
 <Feature: sex_adult.MEAN(fish.length)>,
 <Feature: sex_adult.MEAN(fish.weight)>,
 <Feature: sex_adult.COUNT(fish)>]

That adds two new features, PERCENTILE(length) and PERCENTILE(weight). Increasing max_depth to 3 adds even more.

ft.dfs(entityset=es,
       target_entity='fish',
       max_depth=3,
       trans_primitives=[ft.primitives.Percentile],
       features_only=True)

>>>[<Feature: sex>,
 <Feature: length>,
 <Feature: weight>,
 <Feature: PERCENTILE(length)>,
 <Feature: PERCENTILE(weight)>,
 <Feature: sex_adult.SUM(fish.length)>,
 <Feature: sex_adult.SUM(fish.weight)>,
 <Feature: sex_adult.STD(fish.length)>,
 <Feature: sex_adult.STD(fish.weight)>,
 <Feature: sex_adult.MAX(fish.length)>,
 <Feature: sex_adult.MAX(fish.weight)>,
 <Feature: sex_adult.SKEW(fish.length)>,
 <Feature: sex_adult.SKEW(fish.weight)>,
 <Feature: sex_adult.MIN(fish.length)>,
 <Feature: sex_adult.MIN(fish.weight)>,
 <Feature: sex_adult.MEAN(fish.length)>,
 <Feature: sex_adult.MEAN(fish.weight)>,
 <Feature: sex_adult.COUNT(fish)>,
 <Feature: sex_adult.SUM(fish.PERCENTILE(length))>,
 <Feature: sex_adult.SUM(fish.PERCENTILE(weight))>,
 <Feature: sex_adult.STD(fish.PERCENTILE(length))>,
 <Feature: sex_adult.STD(fish.PERCENTILE(weight))>,
 <Feature: sex_adult.MAX(fish.PERCENTILE(length))>,
 <Feature: sex_adult.MAX(fish.PERCENTILE(weight))>,
 <Feature: sex_adult.SKEW(fish.PERCENTILE(length))>,
 <Feature: sex_adult.SKEW(fish.PERCENTILE(weight))>,
 <Feature: sex_adult.MIN(fish.PERCENTILE(length))>,
 <Feature: sex_adult.MIN(fish.PERCENTILE(weight))>,
 <Feature: sex_adult.MEAN(fish.PERCENTILE(length))>,
 <Feature: sex_adult.MEAN(fish.PERCENTILE(weight))>,
 <Feature: sex_adult.PERCENTILE(MAX(fish.length))>,
 <Feature: sex_adult.PERCENTILE(SUM(fish.length))>,
 <Feature: sex_adult.PERCENTILE(MAX(fish.weight))>,
 <Feature: sex_adult.PERCENTILE(SKEW(fish.length))>,
 <Feature: sex_adult.PERCENTILE(MIN(fish.length))>,
 <Feature: sex_adult.PERCENTILE(MIN(fish.weight))>,
 <Feature: sex_adult.PERCENTILE(MEAN(fish.weight))>,
 <Feature: sex_adult.PERCENTILE(STD(fish.weight))>,
 <Feature: sex_adult.PERCENTILE(COUNT(fish))>,
 <Feature: sex_adult.PERCENTILE(STD(fish.length))>,
 <Feature: sex_adult.PERCENTILE(SUM(fish.weight))>,
 <Feature: sex_adult.PERCENTILE(SKEW(fish.weight))>,
 <Feature: sex_adult.PERCENTILE(MEAN(fish.length))>]
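
To make one of these stacked names concrete, here is a hand-rolled sketch of what sex_adult.MEAN(fish.PERCENTILE(length)) represents. It is not Featuretools code: the four rows of data and the percentile_rank helper are made up for illustration, and the rank uses a simple "fraction of values less than or equal" definition that may differ from the library's handling of ties.

```python
# Hand-rolled version of one stacked feature; not Featuretools code.
# The data and helper names below are hypothetical.
from collections import defaultdict
from statistics import mean

fish = [
    {"sex": "F", "length": 0.2},
    {"sex": "M", "length": 0.4},
    {"sex": "F", "length": 0.6},
    {"sex": "M", "length": 0.8},
]


def percentile_rank(values):
    """Fraction of values less than or equal to each value (no tie averaging)."""
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]


# Step 1 (transform): one output value per fish row.
ranks = percentile_rank([row["length"] for row in fish])

# Step 2 (aggregation): one output value per sex_adult row.
by_sex = defaultdict(list)
for row, rank in zip(fish, ranks):
    by_sex[row["sex"]].append(rank)
mean_rank_by_sex = {sex: mean(vals) for sex, vals in by_sex.items()}

# Step 3 (direct feature): copy the parent's value back onto each fish.
stacked = [mean_rank_by_sex[row["sex"]] for row in fish]
print(stacked)
```

Three steps are chained here, which is consistent with this feature first appearing in the max_depth=3 listing rather than at max_depth=2.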

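However, increasing max_depth to 4 or higher creates no additional features: under the rules DFS follows, there is nothing left to build. In general, though, adding more primitives, entities, and data types allows more combinations, which can produce more of these "stacked" features.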
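
As a rough check on how the counts grow, the sizes of the feature lists above can be tallied by hand. The arithmetic below only summarizes those listings (six aggregation primitives plus COUNT, and the numbers of raw and numeric columns visible in them); first_level_aggs is a hypothetical helper, and none of this is a formula that Featuretools guarantees.

```python
# Tally of the feature lists shown above; this only summarizes those
# listings and says nothing about how Featuretools counts features internally.
AGG_PRIMITIVES = ["sum", "std", "max", "skew", "min", "mean"]  # COUNT handled separately


def first_level_aggs(n_numeric):
    """Aggregations of raw numeric columns, plus one COUNT of child rows."""
    return n_numeric * len(AGG_PRIMITIVES) + 1


# Mock entityset: sex, length, weight -> 3 raw columns, 2 of them numeric.
raw, numeric = 3, 2
depth2_default = raw + first_level_aggs(numeric)
depth2_percentile = depth2_default + numeric  # PERCENTILE(length), PERCENTILE(weight)
depth3_percentile = (
    depth2_percentile
    + numeric * len(AGG_PRIMITIVES)  # aggregations of the PERCENTILE features
    + first_level_aggs(numeric)      # PERCENTILE of each aggregation and of COUNT
)

# Entityset from the question: 8 raw columns, 7 of them numeric.
question_depth2 = 8 + first_level_aggs(7)

print(depth2_default, depth2_percentile, depth3_percentile, question_depth2)
```

These totals match the lengths of the four listings (16, 18, 43, and 51 features). Each extra numeric column or primitive multiplies into the later terms, which is why richer entitysets give DFS more to stack.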