{featuretools}

Time: 2018-10-30 14:54:30

Tags: python machine-learning feature-engineering featuretools

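I ran into a problem when creating relationships between the entities of an entity set (using my own data). There is no error, but featuretools simply does not create features for one of my entities (the "prods" entity), even though everything should be connected correctly.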

I can't share the data, but I created a minimal example with some mock data:

import pandas as pd
import featuretools as ft

Create the mock data

cust = pd.DataFrame([[1,50],[2,60]], 
                    columns=['CUST_ID','AGE'])

orders = pd.DataFrame([[1,1,50,33.0],[2,1,60,20],[3,2,66,999.9]], 
                      columns=['ORD_ID','CUST_ID','QTY','PRICE'])

order_items = pd.DataFrame([[1,1,1,2,3.0],[2,2,2,8,5.0],[3,2,1,2,3.0],[4,3,3,2,3.0]], 
                           columns=['ORD_ITM_ID','ORD_ID','PROD_ID','QTY','PRICE'])

prods = pd.DataFrame([[1,3.0],[2,5.0],[3,3.0]], 
                     columns=['PROD_ID','PRICE'])

Define the entity set

es = ft.EntitySet('test')

## Adding Customers Entity

es.entity_from_dataframe(dataframe=cust,
                         entity_id='cust',
                         index='CUST_ID')

## Adding Orders Entity
es.entity_from_dataframe(dataframe=orders,
                         entity_id='orders',
                         index='ORD_ID')

## Adding Order Items Entity
es.entity_from_dataframe(dataframe=order_items,
                         entity_id='order_items',
                         index='ORD_ITM_ID')

## Adding Products Entity
es.entity_from_dataframe(dataframe=prods,
                         entity_id='prods',
                         index='PROD_ID')

Create the relationships

customer_relationship = ft.Relationship(es["cust"]["CUST_ID"],
                                        es["orders"]["CUST_ID"])


orderitems_relationship = ft.Relationship(es["orders"]["ORD_ID"], 
                                          es["order_items"]["ORD_ID"])


products_relationship = ft.Relationship(es["prods"]["PROD_ID"],
                                        es["order_items"]["PROD_ID"])

### Add Relationships
es = es.add_relationship(customer_relationship)
es = es.add_relationship(orderitems_relationship)
es = es.add_relationship(products_relationship)

Generate the features

feature_defs = ft.dfs(entityset=es,
                      target_entity="cust",
                      agg_primitives=["count", "sum"],
                      verbose=True,
                      features_only=True)
## Show features
feature_defs

Output:

Built 7 features
[<Feature: AGE>,
 <Feature: COUNT(order_items)>,
 <Feature: SUM(orders.QTY)>,
 <Feature: SUM(orders.PRICE)>,
 <Feature: SUM(order_items.QTY)>,
 <Feature: COUNT(orders)>,
 <Feature: SUM(order_items.PRICE)>]

This should also show features built from the product variables, but it does not.

In particular, I would expect a SUM feature that adds up the product prices per customer. Instead, there is nothing.
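For reference, the number being asked for can be worked out by hand with plain pandas joins, independent of featuretools. The sketch below rebuilds the relevant columns of the mock frames; the column name PROD_PRICE and the variable names items and prod_price_per_cust are illustrative choices, not anything featuretools defines.

```python
import pandas as pd

# Mock data copied from the question (only the columns needed here).
orders = pd.DataFrame([[1, 1], [2, 1], [3, 2]], columns=['ORD_ID', 'CUST_ID'])
order_items = pd.DataFrame([[1, 1, 1], [2, 2, 2], [3, 2, 1], [4, 3, 3]],
                           columns=['ORD_ITM_ID', 'ORD_ID', 'PROD_ID'])
prods = pd.DataFrame([[1, 3.0], [2, 5.0], [3, 3.0]], columns=['PROD_ID', 'PRICE'])

# Attach each item's product price, then the customer who placed the order.
# PROD_PRICE is an illustrative name that avoids clashing with other PRICE columns.
items = (order_items
         .merge(prods.rename(columns={'PRICE': 'PROD_PRICE'}), on='PROD_ID')
         .merge(orders, on='ORD_ID'))

# Sum of product prices per customer, i.e. the aggregate the question expects.
prod_price_per_cust = items.groupby('CUST_ID')['PROD_PRICE'].sum()
print(prod_price_per_cust)
```

With this mock data, customer 1 has three order items (products 1, 2 and 1) and customer 2 has one (product 3), so the sums are 11.0 and 3.0.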

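Ultimately, I want to create features for interesting values. But since the product variables do not show up, adding interesting values does not work either.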

## Get All Product IDs
interesting_products = es["prods"].df.PROD_ID.unique()

es["prods"]["PROD_ID"].interesting_values=interesting_products


feature_defs = ft.dfs(entityset=es,
                      target_entity="cust",
                      agg_primitives=["count", "sum"],
                      where_primitives=["count", "sum"],
                      verbose=True,
                      features_only=True)
## Show features
feature_defs

Output:

Built 7 features
[<Feature: AGE>,
 <Feature: COUNT(order_items)>,
 <Feature: SUM(orders.QTY)>,
 <Feature: SUM(orders.PRICE)>,
 <Feature: SUM(order_items.QTY)>,
 <Feature: COUNT(orders)>,
 <Feature: SUM(order_items.PRICE)>]

I hope somebody can help :)

1 Answer:

Answer 0 (score: 1)

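The reason the products are not showing up is that any feature created from prods would be of depth 3. You can control the depth with the max_depth parameter of ft.dfs, like this: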

feature_defs = ft.dfs(entityset=es,
                      target_entity="cust",
                      agg_primitives=["count", "sum"],
                      verbose = True, 
                      max_depth=3, # add max_depth
                      features_only = True)

The features returned are now

[<Feature: AGE>,
 <Feature: SUM(order_items.QTY)>,
 <Feature: SUM(order_items.PRICE)>,
 <Feature: SUM(orders.PRICE)>,
 <Feature: SUM(orders.QTY)>,
 <Feature: COUNT(order_items)>,
 <Feature: COUNT(orders)>,
 <Feature: SUM(order_items.prods.PRICE)>]

The last entry, SUM(order_items.prods.PRICE), is the feature that uses the product price.
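One rough way to see why prods sits further from cust than the other entities is to count relationship hops between them. The sketch below is only an intuition aid in plain Python: the helper hops is hypothetical, and hop counting is an assumption about how to picture depth, not featuretools' actual depth computation.

```python
from collections import deque

# Relationships from the question, written as (parent entity, child entity).
relationships = [("cust", "orders"),
                 ("orders", "order_items"),
                 ("prods", "order_items")]

def hops(start, goal):
    """Count relationship hops between two entities, ignoring direction."""
    neighbours = {}
    for parent, child in relationships:
        neighbours.setdefault(parent, set()).add(child)
        neighbours.setdefault(child, set()).add(parent)
    queue, seen = deque([(start, 0)]), {start}
    while queue:  # breadth-first search, so the first hit is the shortest path
        entity, dist = queue.popleft()
        if entity == goal:
            return dist
        for nxt in neighbours.get(entity, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two entities are not connected

for name in ("orders", "order_items", "prods"):
    print(name, hops("cust", name))
```

Here orders is one hop from cust, order_items two, and prods three, which lines up with the depth of 3 mentioned above.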

To make the where clauses work, add the interesting values to the PROD_ID variable of the order_items entity rather than to prods.

interesting_products = es["prods"].df.PROD_ID.unique()
es["order_items"]["PROD_ID"].interesting_values=interesting_products
feature_defs = ft.dfs(entityset=es,
                      target_entity="cust",
                      agg_primitives=["count", "sum"],
                      where_primitives=["count", "sum"],
                      verbose=True, 
                      max_depth=3, 
                      features_only=True)

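This creates 20 features (the 8 from before plus 4 WHERE variants for each of the 3 product IDs), which you can see below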

[<Feature: AGE>,
 <Feature: SUM(order_items.QTY)>,
 <Feature: SUM(order_items.PRICE)>,
 <Feature: SUM(orders.PRICE)>,
 <Feature: SUM(orders.QTY)>,
 <Feature: COUNT(order_items)>,
 <Feature: COUNT(orders)>,
 <Feature: SUM(order_items.prods.PRICE WHERE PROD_ID = 2)>,
 <Feature: SUM(order_items.QTY WHERE PROD_ID = 2)>,
 <Feature: SUM(order_items.QTY WHERE PROD_ID = 3)>,
 <Feature: SUM(order_items.prods.PRICE)>,
 <Feature: COUNT(order_items WHERE PROD_ID = 2)>,
 <Feature: SUM(order_items.prods.PRICE WHERE PROD_ID = 1)>,
 <Feature: SUM(order_items.PRICE WHERE PROD_ID = 3)>,
 <Feature: COUNT(order_items WHERE PROD_ID = 1)>,
 <Feature: COUNT(order_items WHERE PROD_ID = 3)>,
 <Feature: SUM(order_items.prods.PRICE WHERE PROD_ID = 3)>,
 <Feature: SUM(order_items.QTY WHERE PROD_ID = 1)>,
 <Feature: SUM(order_items.PRICE WHERE PROD_ID = 2)>,
 <Feature: SUM(order_items.PRICE WHERE PROD_ID = 1)>]
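To get a feel for what the WHERE features measure, the same kind of conditional aggregate can be sketched directly in pandas. This is only an illustration on the mock data, not how featuretools computes its features; the names items, count_where and qty_where are made up for the example, and filling missing combinations with 0 is a choice made here.

```python
import pandas as pd

# Mock data copied from the question (only the columns needed here).
orders = pd.DataFrame([[1, 1], [2, 1], [3, 2]], columns=['ORD_ID', 'CUST_ID'])
order_items = pd.DataFrame([[1, 1, 1, 2], [2, 2, 2, 8], [3, 2, 1, 2], [4, 3, 3, 2]],
                           columns=['ORD_ITM_ID', 'ORD_ID', 'PROD_ID', 'QTY'])

# Attach the customer to every order item.
items = order_items.merge(orders, on='ORD_ID')

# Rows are customers, columns are product IDs.
# Analogue of COUNT(order_items WHERE PROD_ID = k):
count_where = pd.crosstab(items['CUST_ID'], items['PROD_ID'])
# Analogue of SUM(order_items.QTY WHERE PROD_ID = k):
qty_where = items.pivot_table(index='CUST_ID', columns='PROD_ID',
                              values='QTY', aggfunc='sum', fill_value=0)

print(count_where)
print(qty_where)
```

Each cell corresponds to one customer and one interesting value of PROD_ID, so the three product IDs yield three columns per aggregate.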