Avoid duplicating the date column for a child entity

Time: 2019-02-25 18:29:33

Tags: featuretools

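I have a simple entity set, parent1 <- child -> parent2, and I need to use a cutoff dataframe. My target entity is parent1, and its data is available at any prediction time. I want to specify a date column only on parent2, so that this time information can be joined onto child. It does not work this way: I get data leakage in the first-level features built from the parent1-child relationship. The only thing I can do is duplicate the date column in child as well. Is it possible to keep child normalized, without the date column?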

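Example. Suppose we have three entities: boxer information (parent1, with a "name"), match information (parent2, with a "country"), and their combination (child, with the "n_hits" recorded for a boxer in one particular match):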

import featuretools as ft
import pandas as pd

players = pd.DataFrame({"player_id": [1, 2, 3], "player_name": ["Oleg", "Kirill", "Max"]})
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3], 
    "match_id":        [11, 11, 12, 12],     "n_hits":    [20, 30, 40, 50]})
matches = pd.DataFrame({
    "match_id": [11, 12], "match_date": pd.to_datetime(['2014-1-10', '2014-1-20']),
    "country": ["Russia", "Germany"]})

es = ft.EntitySet()
es = es.entity_from_dataframe(
    entity_id="players", dataframe=players,
    index="player_id",
    variable_types={"player_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="matches", dataframe=matches,
    index="match_id",
    time_index="match_date",
    variable_types={"match_id": ft.variable_types.Categorical})

es = es.add_relationship(ft.Relationship(es["players"]["player_id"], 
                                         es["player_stats"]["player_id"]))
es = es.add_relationship(ft.Relationship(es["matches"]["match_id"], 
                                         es["player_stats"]["match_id"]))

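Here I want to use all the information available as of January 15. So only the information from the first match is legitimate, not the information from the second.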

cutoff_df = pd.DataFrame({
  "player_id":[1, 2, 3], 
  "match_date": pd.to_datetime(['2014-1-15', '2014-1-15', '2014-1-15'])})

fm, features = ft.dfs(entityset=es, target_entity='players', cutoff_time=cutoff_df, 
                      cutoff_time_in_index=True, agg_primitives = ["mean"])
fm

I get:

                     player_name  MEAN(player_stats.n_hits)
player_id time                                             
1         2014-01-15        Oleg                         30
2         2014-01-15      Kirill                         30
3         2014-01-15         Max                         50

The only way I know to give player_stats a proper match_date is to join in the information from matches:

player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3], 
    "match_id":        [11, 11, 12, 12],     "n_hits":    [20, 30, 40, 50],
    "match_date": pd.to_datetime(
       ['2014-1-10', '2014-1-10', '2014-1-20', '2014-1-20']) ## a result of join
})
...
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    time_index="match_date",  ## a change here too
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})

and I get the expected result:

                     player_name  MEAN(player_stats.n_hits)
player_id time                                             
1         2014-01-15        Oleg                       20.0
2         2014-01-15      Kirill                       30.0
3         2014-01-15         Max                        NaN
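
As a sanity check, the expected values can be reproduced with plain pandas, without Featuretools. This is only a sketch of what the cutoff should mean for this toy data: look up each row's match date, drop rows dated after the cutoff, then average per player. The names `dates`, `visible` and `expected` are made up for illustration.

```python
import pandas as pd

player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id":        [11, 11, 12, 12],     "n_hits":    [20, 30, 40, 50]})
matches = pd.DataFrame({
    "match_id": [11, 12],
    "match_date": pd.to_datetime(["2014-1-10", "2014-1-20"])})
cutoff = pd.Timestamp("2014-01-15")

# Look up the date of each row's match through the parent table.
dates = matches.set_index("match_id")["match_date"]
row_dates = player_stats["match_id"].map(dates)

# Keep only the rows that were already known at the cutoff.
visible = player_stats[row_dates <= cutoff]

# Mean hits per player; reindex so players with no visible rows show NaN.
expected = visible.groupby("player_id")["n_hits"].mean().reindex([1, 2, 3])
print(expected)
```

Only match 11 survives the filter, so player 3, who appears only in match 12, has nothing to average.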

1 answer:

Answer 0: (score: 0)

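Featuretools is very conservative about an entity's time index. If no time index is provided, we try not to infer one. You therefore have to create the duplicate column, as you suggested.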
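
A minimal sketch of building that duplicate column with a pandas merge instead of typing the dates by hand, assuming the `players`/`player_stats`/`matches` frames from the question. The name `player_stats_with_date` is made up for illustration. The `validate` argument is there only to guard against accidental row duplication.

```python
import pandas as pd

player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id":        [11, 11, 12, 12],     "n_hits":    [20, 30, 40, 50]})
matches = pd.DataFrame({
    "match_id": [11, 12],
    "match_date": pd.to_datetime(["2014-1-10", "2014-1-20"]),
    "country": ["Russia", "Germany"]})

# Copy only the key and the date from the parent table.
# validate="many_to_one" raises if match_id is not unique in matches.
player_stats_with_date = player_stats.merge(
    matches[["match_id", "match_date"]],
    on="match_id", how="left", validate="many_to_one")

# A child row whose match is missing from matches would end up with NaT.
assert player_stats_with_date["match_date"].notna().all()
print(player_stats_with_date)
```

The resulting frame can then be registered with time_index="match_date", as in the question's second entity_from_dataframe call.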