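I have a simple entity set, parent1 <- child -> parent2, and I need to use a cutoff dataframe. My target is parent1, which is available at any prediction time. I want to specify a date column only for parent2, so that this time information can be joined to child. It does not work this way: I get data leakage in the first-level features built from the parent1-child entities. The only thing I can do is duplicate the date column into child as well. Is it possible to normalize child without using a date column?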
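Example. Suppose we have 3 entities: boxer information (parent1, with "player_name"), match information (parent2, with "country"), and their combination (child, with "n_hits" for one specific match):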
import featuretools as ft
import pandas as pd

# parent1: one row per boxer
players = pd.DataFrame({"player_id": [1, 2, 3], "player_name": ["Oleg", "Kirill", "Max"]})
# child: one row per (match, boxer) pair
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50]})
# parent2: one row per match, the only entity with a date
matches = pd.DataFrame({
    "match_id": [11, 12], "match_date": pd.to_datetime(['2014-1-10', '2014-1-20']),
    "country": ["Russia", "Germany"]})

es = ft.EntitySet()
es = es.entity_from_dataframe(
    entity_id="players", dataframe=players,
    index="player_id",
    variable_types={"player_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})
es = es.entity_from_dataframe(
    entity_id="matches", dataframe=matches,
    index="match_id",
    time_index="match_date",
    variable_types={"match_id": ft.variable_types.Categorical})
es = es.add_relationship(ft.Relationship(es["players"]["player_id"],
                                         es["player_stats"]["player_id"]))
es = es.add_relationship(ft.Relationship(es["matches"]["match_id"],
                                         es["player_stats"]["match_id"]))
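Here I want to use all the information available as of January 15. So only the information from the first match is legitimate to use, not the information from the second match.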
cutoff_df = pd.DataFrame({
    "player_id": [1, 2, 3],
    "match_date": pd.to_datetime(['2014-1-15', '2014-1-15', '2014-1-15'])})
fm, features = ft.dfs(entityset=es, target_entity='players', cutoff_time=cutoff_df,
                      cutoff_time_in_index=True, agg_primitives=["mean"])
fm
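I get the following, where the January 20 match leaks in (for example, player 1's mean of 30 is the average of 20 and 40):
                     player_name  MEAN(player_stats.n_hits)
player_id time
1         2014-01-15        Oleg                         30
2         2014-01-15      Kirill                         30
3         2014-01-15         Max                         50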
The only way I know to give player_stats a proper match_date is to join in the information from matches:
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104], "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12], "n_hits": [20, 30, 40, 50],
    "match_date": pd.to_datetime(
        ['2014-1-10', '2014-1-10', '2014-1-20', '2014-1-20'])  ## a result of join
})
...
es = es.entity_from_dataframe(
    entity_id="player_stats", dataframe=player_stats,
    index="match_player_id",
    time_index="match_date",  ## a change here too
    variable_types={"match_player_id": ft.variable_types.Categorical,
                    "player_id": ft.variable_types.Categorical,
                    "match_id": ft.variable_types.Categorical})
And I get the expected result:
                     player_name  MEAN(player_stats.n_hits)
player_id time
1         2014-01-15        Oleg                       20.0
2         2014-01-15      Kirill                       30.0
3         2014-01-15         Max                        NaN
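As a cross-check outside Featuretools, the expected values can be reproduced in plain pandas. This is only a sketch: it assumes that a row counts as visible when its match_date is at or before the cutoff, and the names cutoff, visible and expected are illustrative, not part of any library.

```python
import pandas as pd

# Child table with the match date already duplicated into it.
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104],
    "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12],
    "n_hits": [20, 30, 40, 50],
    "match_date": pd.to_datetime(
        ["2014-1-10", "2014-1-10", "2014-1-20", "2014-1-20"]),
})
cutoff = pd.Timestamp("2014-1-15")

# Keep only rows dated at or before the cutoff (assumed visibility rule).
visible = player_stats[player_stats["match_date"] <= cutoff]

# Average per player; players with no visible rows become NaN after reindex.
expected = visible.groupby("player_id")["n_hits"].mean().reindex([1, 2, 3])
print(expected)
```

Player 3 only appears in the January 20 match, so no rows survive the filter and the mean is missing, which matches the NaN in the table above.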
Answer 0 (score: 0)
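Featuretools is very conservative when it comes to the time index of an entity. If no time index is provided, we try not to infer one. Therefore, you have to create the duplicate column, as you suggest.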
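The duplicate column does not have to be typed by hand. A minimal sketch, assuming the player_stats and matches frames from the question, is to pull match_date into the child table with a pandas merge on match_id before building the entity set:

```python
import pandas as pd

# Child table as in the question, without any date column.
player_stats = pd.DataFrame({
    "match_player_id": [101, 102, 103, 104],
    "player_id": [1, 2, 1, 3],
    "match_id": [11, 11, 12, 12],
    "n_hits": [20, 30, 40, 50],
})
# Parent table that owns the date.
matches = pd.DataFrame({
    "match_id": [11, 12],
    "match_date": pd.to_datetime(["2014-1-10", "2014-1-20"]),
    "country": ["Russia", "Germany"],
})

# Left-join only the date onto each child row via the shared key.
player_stats = player_stats.merge(
    matches[["match_id", "match_date"]], on="match_id", how="left")
print(player_stats)
```

The merged frame can then be passed to entity_from_dataframe with time_index="match_date", exactly as in the second version of the question's code.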