trans_primitives do not generate features for a datetime column

Time: 2019-06-25 11:22:45

Tags: python featuretools

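I am creating a Featuretools feature matrix from 5 dataframe entities and a cutoff_time table. When calling ft.dfs(), I pass both agg_primitives and trans_primitives, but none of the primitives in trans_primitives that relate to datetime columns generate any features.

The entity containing the datetime column is called 'events'. The column is named 'event-timestamp'.

My trans_primitives list includes other primitives that do generate features ('IS_NULL' works), so I believe the problem is not how I use trans_primitives in general but something specific to the time-related ones.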

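Some things that may help:

  1. I checked the dtype of the 'event-timestamp' column in 'events'; it is datetime64[ns]. The same is true of the 'cutoff_time' column in the cutoff table.

  2. Another detail: agg_primitives did generate some new features from 'event-timestamp' (for example, 'MIN(matcher.devices.TIME_SINCE_LAST(events.event-timestamp))'), so I take that to mean the column itself is fine.

  3. I ran some experiments with es.entity_from_dataframe for 'events':

    • passed the argument time_index='event-timestamp'
    • passed the argument variable_types={'event-timestamp': vtypes.Datetime}
    • passed both of the above together

Below is the function I am using: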

import featuretools as ft


def generate_feature_matrix(events, grns, contracts, om_table, matcher, customers):
    """
    Take a set of tables, create Featuretools entities and relationships,
    and return the resulting feature definitions (features_only=True).
    """


    ## Make empty entityset
    es = ft.EntitySet(id = 'contracts_customers')


    ## Create entities
    # events
    es.entity_from_dataframe(entity_id='events', dataframe=events, index='index', make_index=True,
                             time_index='event-timestamp') # tried also variable_types={'event-timestamp': vtypes.DatetimeTimeIndex} 
    # contracts
    es.entity_from_dataframe(entity_id='contracts', dataframe=contracts, index='contract')
    # Matcher
    es.entity_from_dataframe(entity_id='matcher', dataframe=matcher, index = 'contract', 
                             make_index=False)
    # om_table
    es.entity_from_dataframe(entity_id='om_table', dataframe=om_table, index='index', 
                             make_index=True)
    # customers
    es.entity_from_dataframe(entity_id='customers', dataframe=customers, index='customer')




    # Relationships (parent, child)
    r_devices_matcher = ft.Relationship(es['contracts']['contract'], es['matcher']['contract'])
    r_devices_events = ft.Relationship(es['contracts']['contract'], es['events']['contract'])
    r_devices_os = ft.Relationship(es['contracts']['contract'], es['om_table']['contract'])
    r_users_matcher = ft.Relationship(es['customers']['customer'], es['matcher']['customer'])

    es.add_relationships([r_devices_matcher, r_devices_events, r_users_matcher, r_devices_os])

    # Primitives
    agg_primitives=["num_unique", "skew", "mean", "count", "median", "sum",
                    "time_since_last", "mode", "min"] 

    trans_primitives=['month', 'weekday','hour', "time_since", "time_since_previous",
                      'is_null']

    # Generate the features
    feature_defs = ft.dfs(entityset=es, target_entity='customers', 
                                          cutoff_time = grns, 
                                          agg_primitives = agg_primitives,
                                          trans_primitives = trans_primitives,
                                          max_depth = 3, features_only = True,
                                          chunk_size = len(grns),  
                                          )



    return feature_defs

The entity set and its relationships are as follows:

os
Out[392]: 
Entityset: contracts_customers
  Entities:
    events [Rows: 22, Columns: 3]
    contracts [Rows: 35, Columns: 2]
    matcher [Rows: 2663, Columns: 2]
    om_table [Rows: 965, Columns: 4]
    customers [Rows: 76, Columns: 2]
  Relationships:
    matcher.contract -> contracts.contract
    events.contract -> contracts.contract
    matcher.customer -> customers.customer
    om_table.contract -> contracts.contract

And the list of generated features:

new_features
Out[393]: 
[<Feature: n_contracts>,
 <Feature: NUM_UNIQUE(matcher.contract)>,
 <Feature: MODE(matcher.contract)>,
 <Feature: IS_NULL(customer)>,
 <Feature: IS_NULL(n_contracts)>,
 <Feature: SKEW(matcher.contracts.n_event)>,
 <Feature: MEAN(matcher.contracts.n_event)>,
 <Feature: MEDIAN(matcher.contracts.n_event)>,
 <Feature: SUM(matcher.contracts.n_event)>,
 <Feature: MIN(matcher.contracts.n_event)>,
 <Feature: IS_NULL(NUM_UNIQUE(matcher.contract))>,
 <Feature: IS_NULL(MODE(matcher.contract))>,
 <Feature: NUM_UNIQUE(matcher.contracts.MODE(matcher.customer))>,
 <Feature: NUM_UNIQUE(matcher.contracts.MODE(om_table.om_family))>,
 <Feature: SKEW(matcher.contracts.COUNT(events))>,
 <Feature: SKEW(matcher.contracts.TIME_SINCE_LAST(events.event-timestamp))>,
 <Feature: SKEW(matcher.contracts.NUM_UNIQUE(matcher.customer))>,
 <Feature: SKEW(matcher.contracts.NUM_UNIQUE(om_table.om_family))>,
 <Feature: SKEW(matcher.contracts.SKEW(om_table.n_events))>,
 <Feature: SKEW(matcher.contracts.MEAN(om_table.n_events))>,
 <Feature: SKEW(matcher.contracts.COUNT(om_table))>,
 <Feature: SKEW(matcher.contracts.MEDIAN(om_table.n_events))>,
 <Feature: SKEW(matcher.contracts.SUM(om_table.n_events))>,
 <Feature: SKEW(matcher.contracts.MIN(om_table.n_events))>,
 <Feature: MEAN(matcher.contracts.COUNT(events))>,
 <Feature: MEAN(matcher.contracts.TIME_SINCE_LAST(events.event-timestamp))>,
 <Feature: MEAN(matcher.contracts.NUM_UNIQUE(matcher.customer))>,
 <Feature: MEAN(matcher.contracts.NUM_UNIQUE(om_table.om_family))>,
 <Feature: MEAN(matcher.contracts.SKEW(om_table.n_events))>,
 <Feature: MEAN(matcher.contracts.MEAN(om_table.n_events))>,
 <Feature: MEAN(matcher.contracts.COUNT(om_table))>,
 <Feature: MEAN(matcher.contracts.MEDIAN(om_table.n_events))>,
 <Feature: MEAN(matcher.contracts.SUM(om_table.n_events))>,
 <Feature: MEAN(matcher.contracts.MIN(om_table.n_events))>,
 <Feature: MEDIAN(matcher.contracts.COUNT(events))>,
 <Feature: MEDIAN(matcher.contracts.TIME_SINCE_LAST(events.event-timestamp))>,
 <Feature: MEDIAN(matcher.contracts.NUM_UNIQUE(matcher.customer))>,
 <Feature: MEDIAN(matcher.contracts.NUM_UNIQUE(om_table.om_family))>,
 <Feature: MEDIAN(matcher.contracts.SKEW(om_table.n_events))>,
 <Feature: MEDIAN(matcher.contracts.MEAN(om_table.n_events))>,
 <Feature: MEDIAN(matcher.contracts.COUNT(om_table))>,
 <Feature: MEDIAN(matcher.contracts.MEDIAN(om_table.n_events))>,
 <Feature: MEDIAN(matcher.contracts.SUM(om_table.n_events))>,
 <Feature: MEDIAN(matcher.contracts.MIN(om_table.n_events))>,
 <Feature: SUM(matcher.contracts.COUNT(events))>,
 <Feature: SUM(matcher.contracts.TIME_SINCE_LAST(events.event-timestamp))>,
 <Feature: SUM(matcher.contracts.NUM_UNIQUE(matcher.customer))>,
 <Feature: SUM(matcher.contracts.NUM_UNIQUE(om_table.om_family))>,
 <Feature: SUM(matcher.contracts.SKEW(om_table.n_events))>,
 <Feature: SUM(matcher.contracts.MEAN(om_table.n_events))>,
 <Feature: SUM(matcher.contracts.COUNT(om_table))>,
 <Feature: SUM(matcher.contracts.MEDIAN(om_table.n_events))>,
 <Feature: SUM(matcher.contracts.SUM(om_table.n_events))>,
 <Feature: SUM(matcher.contracts.MIN(om_table.n_events))>,
 <Feature: MODE(matcher.contracts.MODE(matcher.customer))>,
 <Feature: MODE(matcher.contracts.MODE(om_table.om_family))>,
 <Feature: MIN(matcher.contracts.COUNT(events))>,
 <Feature: MIN(matcher.contracts.TIME_SINCE_LAST(events.event-timestamp))>,
 <Feature: MIN(matcher.contracts.NUM_UNIQUE(matcher.customer))>,
 <Feature: MIN(matcher.contracts.NUM_UNIQUE(om_table.om_family))>,
 <Feature: MIN(matcher.contracts.SKEW(om_table.n_events))>,
 <Feature: MIN(matcher.contracts.MEAN(om_table.n_events))>,
 <Feature: MIN(matcher.contracts.COUNT(om_table))>,
 <Feature: MIN(matcher.contracts.MEDIAN(om_table.n_events))>,
 <Feature: MIN(matcher.contracts.SUM(om_table.n_events))>,
 <Feature: MIN(matcher.contracts.MIN(om_table.n_events))>,
 <Feature: IS_NULL(SKEW(matcher.contracts.n_event))>,
 <Feature: IS_NULL(MEAN(matcher.contracts.n_event))>,
 <Feature: IS_NULL(MEDIAN(matcher.contracts.n_event))>,
 <Feature: IS_NULL(SUM(matcher.contracts.n_event))>,
 <Feature: IS_NULL(MIN(matcher.contracts.n_event))>]

I expected to get new features from all of the trans_primitives in the list above.
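For reference, a minimal sketch of the row-level values that the datetime primitives in the list ('month', 'weekday', 'hour', 'time_since') would be expected to produce for a single 'event-timestamp'. It uses only Python's standard datetime module, not Featuretools itself. The function name and the sample timestamps are hypothetical, and treating time_since as the elapsed seconds up to the cutoff time is an assumption.

```python
from datetime import datetime


def expected_datetime_features(timestamp, cutoff_time):
    """Values that month/weekday/hour/time_since would be expected to yield for one row."""
    return {
        "MONTH": timestamp.month,        # 1 to 12
        "WEEKDAY": timestamp.weekday(),  # Monday is 0
        "HOUR": timestamp.hour,          # 0 to 23
        # Assumption: time_since is the elapsed seconds up to the cutoff time.
        "TIME_SINCE": (cutoff_time - timestamp).total_seconds(),
    }


# Hypothetical event and cutoff times, chosen only for illustration.
event = datetime(2019, 6, 25, 11, 22, 45)
cutoff = datetime(2019, 6, 26, 11, 22, 45)
print(expected_datetime_features(event, cutoff))
```

None of these values appear in the generated feature list, either directly on 'events' or aggregated up to 'customers'.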

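1 answer:

Answer 0: (score: 0)

What does es.plot() say about the variable type of the 'event-timestamp' column? Based on what you said about 'time_since_last', I doubt that is the problem.

Also, does the problem persist when you change the target entity from 'customers' to 'events'? It is hard to tell exactly without seeing the schema, but my guess is that 'events' and 'customers' are not related in the EntitySet in a way that lets the primitives compute the features you want. Try changing the target entity and look at the features that are created. If there are still no datetime trans_primitives features, then it is a different problem from the one I have in mind.

Edit: reproducing similar behavior: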

import featuretools as ft
from featuretools.tests.testing_utils import make_ecommerce_entityset

es = make_ecommerce_entityset()
es.plot()

features = ft.dfs(entityset=es,
                  target_entity="stores",
                  features_only=True,
                  max_depth=3)

features

The features related to 'cohorts' are:

<Feature: régions.MODE(customers.cohorts.cohort_name)>
<Feature: régions.NUM_UNIQUE(customers.cohorts.cohort_name)>,

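Note that here, too, the primitives are not applied to the cohorts values to generate new features.

What I think is happening is that the connection between events and customers is too indirect. customers and contracts share the child matcher, while events is a child of contracts. In the example above, when this happens, no new features are computed for those entities.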
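To make "too indirect" concrete, here is a rough sketch that counts the relationship hops between customers and events, using only the relationships printed in the question. It is plain Python, not Featuretools, and it does not model how DFS actually stacks primitives; the helper name relationship_path is hypothetical.

```python
from collections import deque

# Relationships as printed in the question (child -> parent).
RELATIONSHIPS = [
    ("matcher", "contracts"),
    ("events", "contracts"),
    ("matcher", "customers"),
    ("om_table", "contracts"),
]


def relationship_path(start, goal, relationships=RELATIONSHIPS):
    """Return the shortest chain of entities linking start to goal, or None."""
    # Treat each relationship as an undirected edge, because a feature can
    # follow it either way (aggregate children or look up the parent).
    neighbors = {}
    for child, parent in relationships:
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)

    # Breadth-first search, so the first path that reaches goal is shortest.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


path = relationship_path("customers", "events")
print(" -> ".join(path))
print(len(path) - 1, "hops")
```

The path runs customers, matcher, contracts, events, which is three hops. Those three hops already account for the max_depth = 3 budget in the question, which would be consistent with there being no room left to stack a transform primitive on 'event-timestamp'.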

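I think the defined behavior is that primitives are applied to the target entity and to directly related entities. Here, because the relationship between the entities is too indirect (in the example above, sessions, like cohorts, is not computed on), the primitives are not applied to their values until you increase max_depth.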