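I am trying to predict the outcome of football (soccer) matches based on earlier results. I am running Python 3.6 on Windows and using Featuretools 0.4.1.
Suppose I have the following data frame representing the history of results.
Using the data frame above, I would like to create the following data frame, which will be fed as X to a machine learning algorithm. Note that the goal averages for the home and away teams need to be computed per team, regardless of the venue of their past matches. Is there a way to create such a data frame using Featuretools?
An Excel file that simulates the transformation can be found here.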
Answer 0 (score: 2)
This is a tricky feature, but a great use of custom primitives in Featuretools.
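The first step is to load the CSV of matches into a Featuretools entityset:

import featuretools as ft
import pandas as pd

es = ft.EntitySet()
matches_df = pd.read_csv("./matches.csv")
es.entity_from_dataframe(entity_id="matches",
                         index="match_id",
                         time_index="match_date",
                         dataframe=matches_df)

Then we define a custom transform primitive that calculates the average number of goals scored over the last n games. Its parameters control how many past games are used and whether the value is computed for the home team or the away team. For information on defining custom primitives, see our documentation here and here.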
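The computation this primitive has to perform can be sketched in plain Python, independent of Featuretools. This is only an illustration: the function name avg_goals_last_n and the three sample matches are hypothetical, and the rows are assumed to be sorted by match date.

```python
def avg_goals_last_n(matches, which_team, n):
    """For each match, return the mean goals scored by the chosen team
    ("home_team" or "away_team") over its previous n matches, counting
    both home and away appearances. None if fewer than n matches exist."""
    result = []
    for i, current in enumerate(matches):
        team = current[which_team]
        goals = []
        # only matches strictly before the current one are considered
        for prev in matches[:i]:
            if prev["home_team"] == team:
                goals.append(prev["home_goals"])
            elif prev["away_team"] == team:
                goals.append(prev["away_goals"])
        if len(goals) < n:
            result.append(None)
        else:
            result.append(sum(goals[-n:]) / n)
    return result


# Hypothetical sample data, ordered by date (not taken from the question)
matches = [
    {"home_team": "A", "away_team": "B", "home_goals": 2, "away_goals": 0},
    {"home_team": "B", "away_team": "A", "home_goals": 1, "away_goals": 3},
    {"home_team": "A", "away_team": "C", "home_goals": 0, "away_goals": 1},
]

print(avg_goals_last_n(matches, "home_team", 1))  # [None, 0.0, 3.0]
print(avg_goals_last_n(matches, "home_team", 2))  # [None, None, 2.5]
print(avg_goals_last_n(matches, "away_team", 1))  # [None, 2.0, None]
```

In the third match, team A's last two games were one at home (2 goals) and one away (3 goals), so the venue of the earlier games does not matter for the average.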

from featuretools.variable_types import Numeric, Categorical
from featuretools.primitives import make_trans_primitive


def avg_goals_previous_n_games(home_team, away_team, home_goals, away_goals, which_team=None, n=1):
    # make dataframe so it's easier to work with
    df = pd.DataFrame({
        "home_team": home_team,
        "away_team": away_team,
        "home_goals": home_goals,
        "away_goals": away_goals
    }).reset_index(drop=True)  # positional index, so df.iloc[:i] below selects only earlier rows

    result = []
    for i, current_game in df.iterrows():
        # get the right team for this game
        team = current_game[which_team]

        # find all previous games that have been played
        prev_games = df.iloc[:i]

        # only get games the team participated in
        participated = prev_games[(prev_games["home_team"] == team) | (prev_games["away_team"] == team)]
        if participated.shape[0] < n:
            result.append(None)
            continue

        # get last n games
        last_n = participated.tail(n)

        # calculate goals per game
        goal_as_home = (last_n["home_team"] == team) * last_n["home_goals"]
        goal_as_away = (last_n["away_team"] == team) * last_n["away_goals"]

        # calculate mean across all home and away games
        mean = (goal_as_home + goal_as_away).mean()
        result.append(mean)

    return result


# custom function so the name of the feature prints out correctly
def make_name(self):
    return "%s_goal_last_%d" % (self.kwargs['which_team'], self.kwargs['n'])


AvgGoalPreviousNGames = make_trans_primitive(function=avg_goals_previous_n_games,
                                             input_types=[Categorical, Categorical, Numeric, Numeric],
                                             return_type=Numeric,
                                             cls_attributes={"generate_name": make_name, "uses_full_entity": True})

Now we can define features using this primitive. In this case, we have to do it manually.

input_vars = [es["matches"]["home_team"], es["matches"]["away_team"], es["matches"]["home_goals"], es["matches"]["away_goals"]]
home_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=1)
home_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=3)
home_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=5)
away_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=1)
away_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=3)
away_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=5)

features = [home_team_last1, home_team_last3, home_team_last5,
            away_team_last1, away_team_last3, away_team_last5]

Finally, we can calculate the feature matrix:

fm = ft.calculate_feature_matrix(entityset=es, features=features)

This returns:

          home_team_goal_last_1  home_team_goal_last_3  home_team_goal_last_5  away_team_goal_last_1  away_team_goal_last_3  away_team_goal_last_5
match_id
1                           NaN                    NaN                    NaN                    NaN                    NaN                    NaN
2                           2.0                    NaN                    NaN                    0.0                    NaN                    NaN
3                           1.0                    NaN                    NaN                    0.0                    NaN                    NaN
4                           3.0               1.000000                    NaN                    0.0               1.000000                    NaN
5                           1.0               1.333333                    NaN                    1.0               0.666667                    NaN
6                           2.0               2.000000                    1.2                    0.0               0.333333                    0.8
7                           1.0               0.666667                    0.6                    2.0               1.666667                    1.6
8                           2.0               1.000000                    0.8                    2.0               2.000000                    2.0
9                           0.0               1.000000                    0.8                    1.0               1.666667                    1.6
10                          3.0               2.000000                    2.0                    1.0               1.000000                    0.8
11                          3.0               2.333333                    2.2                    1.0               0.666667                    1.0
12                          2.0               2.666667                    2.2                    2.0               1.333333                    1.2

Finally, we can also use these manually defined features as input to automated feature engineering with Deep Feature Synthesis, which is explained here. When we pass our manually defined features as seed_features, ft.dfs automatically stacks further features on top of them.

fm, feature_defs = ft.dfs(entityset=es,
                          target_entity="matches",
                          seed_features=features,
                          agg_primitives=[],
                          trans_primitives=["day", "month", "year", "weekday", "percentile"])

The call returns the feature matrix fm and the list of feature definitions feature_defs, which is:
[<Feature: home_team>,
<Feature: away_team>,
<Feature: home_goals>,
<Feature: away_goals>,
<Feature: label>,
<Feature: home_team_goal_last_1>,
<Feature: home_team_goal_last_3>,
<Feature: home_team_goal_last_5>,
<Feature: away_team_goal_last_1>,
<Feature: away_team_goal_last_3>,
<Feature: away_team_goal_last_5>,
<Feature: DAY(match_date)>,
<Feature: MONTH(match_date)>,
<Feature: YEAR(match_date)>,
<Feature: WEEKDAY(match_date)>,
<Feature: PERCENTILE(home_goals)>,
<Feature: PERCENTILE(away_goals)>,
<Feature: PERCENTILE(home_team_goal_last_1)>,
<Feature: PERCENTILE(home_team_goal_last_3)>,
<Feature: PERCENTILE(home_team_goal_last_5)>,
<Feature: PERCENTILE(away_team_goal_last_1)>,
<Feature: PERCENTILE(away_team_goal_last_3)>,
<Feature: PERCENTILE(away_team_goal_last_5)>]
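
The PERCENTILE(...) entries at the end of the list show what stacking means here: the percentile primitive is applied to the values of each seed feature. As a rough sketch, assuming the primitive behaves like pandas' rank(pct=True) (average rank divided by the number of non-missing values), the values of PERCENTILE(home_team_goal_last_1) could be reproduced from the home_team_goal_last_1 column shown earlier. The helper name percentile_rank is hypothetical.

```python
import math


def percentile_rank(values):
    """Average-rank percentile of each value among the non-missing values.
    Missing values (NaN) stay NaN. Assumed to mirror pandas' rank(pct=True)."""
    valid = [v for v in values if not math.isnan(v)]
    out = []
    for v in values:
        if math.isnan(v):
            out.append(float("nan"))
            continue
        below = sum(1 for w in valid if w < v)
        equal = sum(1 for w in valid if w == v)
        # tied values share the average of the 1-based ranks they occupy
        avg_rank = below + (equal + 1) / 2
        out.append(avg_rank / len(valid))
    return out


nan = float("nan")
# home_team_goal_last_1 for match_id 1..12, copied from the feature matrix above
home_team_goal_last_1 = [nan, 2.0, 1.0, 3.0, 1.0, 2.0, 1.0, 2.0, 0.0, 3.0, 3.0, 2.0]

for match_id, pct in enumerate(percentile_rank(home_team_goal_last_1), start=1):
    print(match_id, pct)
```

The first match has no earlier games, so both the seed feature and its percentile are missing for that row.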