How do I use Featuretools to create features from multiple columns, by column value, in a single dataframe?

Time: 2018-12-02 10:43:58

Tags: python featuretools

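I am trying to predict the outcome of soccer matches based on earlier results. I am running Python 3.6 on Windows and using Featuretools 0.4.1.

Let's say I have the following dataframe representing the history of results.

Original DataFrame

Using the dataframe above, I want to create the following dataframe, which will be fed as X into a machine learning algorithm. Note that the goal averages for the home and away teams must be calculated per team, regardless of whether the team played its past matches at home or away. Is there a way to create such a dataframe using Featuretools?

Resulting DataFrame

An Excel file that can be used to simulate the transformation is available here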

1 Answer:

Answer 0 (score: 2)

This is a tricky feature, but it is a great use of a custom primitive in Featuretools.

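The first step is to load the CSV of matches into a Featuretools entity set:

import featuretools as ft
import pandas as pd

es = ft.EntitySet()
matches_df = pd.read_csv("./matches.csv")
es.entity_from_dataframe(entity_id="matches",
                         index="match_id",
                         time_index="match_date",
                         dataframe=matches_df)

Then we define a custom transform primitive that calculates the average number of goals scored over the last n games. Its parameters control how many past games are used and whether the value is calculated for the home team or the away team. For information on defining custom primitives, see our documentation here and here.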

from featuretools.variable_types import Numeric, Categorical
from featuretools.primitives import make_trans_primitive

def avg_goals_previous_n_games(home_team, away_team, home_goals, away_goals, which_team=None, n=1):
    # make dataframe so it's easier to work with;
    # reset to a 0-based index so that df.iloc[:i] below selects only earlier rows
    df = pd.DataFrame({
        "home_team": home_team,
        "away_team": away_team,
        "home_goals": home_goals,
        "away_goals": away_goals
        }).reset_index(drop=True)

    result = []
    for i, current_game in df.iterrows():
        # get the right team for this game ("home_team" or "away_team" column)
        team = current_game[which_team]

        # find all previous games that have been played
        prev_games = df.iloc[:i]

        # only get games the team participated in
        participated = prev_games[(prev_games["home_team"] == team) | (prev_games["away_team"] == team)]
        if participated.shape[0] < n:
            # not enough history yet, so the feature is missing for this game
            result.append(None)
            continue

        # get last n games
        last_n = participated.tail(n)

        # calculate goals per game, counting only the goals scored by this team
        goal_as_home = (last_n["home_team"] == team) * last_n["home_goals"]
        goal_as_away = (last_n["away_team"] == team) * last_n["away_goals"]

        # calculate mean across all home and away games
        mean = (goal_as_home + goal_as_away).mean()

        result.append(mean)

    return result

# custom function so the name of the feature prints out correctly
def make_name(self):
    return "%s_goal_last_%d" % (self.kwargs['which_team'], self.kwargs['n'])


AvgGoalPreviousNGames = make_trans_primitive(function=avg_goals_previous_n_games,
                                             input_types=[Categorical, Categorical, Numeric, Numeric],
                                             return_type=Numeric,
                                             cls_attributes={"generate_name": make_name, "uses_full_entity": True})

Now we can define features using this primitive. In this case, we have to do it manually.

input_vars = [es["matches"]["home_team"], es["matches"]["away_team"], es["matches"]["home_goals"], es["matches"]["away_goals"]]
home_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=1)
home_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=3)
home_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="home_team", n=5)
away_team_last1 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=1)
away_team_last3 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=3)
away_team_last5 = AvgGoalPreviousNGames(*input_vars, which_team="away_team", n=5)

features = [home_team_last1, home_team_last3, home_team_last5,
            away_team_last1, away_team_last3, away_team_last5]

We can then calculate the feature matrix:

fm = ft.calculate_feature_matrix(entityset=es, features=features)

This returns

          home_team_goal_last_1  home_team_goal_last_3  home_team_goal_last_5  away_team_goal_last_1  away_team_goal_last_3  away_team_goal_last_5
match_id
1                           NaN                    NaN                    NaN                    NaN                    NaN                    NaN
2                           2.0                    NaN                    NaN                    0.0                    NaN                    NaN
3                           1.0                    NaN                    NaN                    0.0                    NaN                    NaN
4                           3.0               1.000000                    NaN                    0.0               1.000000                    NaN
5                           1.0               1.333333                    NaN                    1.0               0.666667                    NaN
6                           2.0               2.000000                    1.2                    0.0               0.333333                    0.8
7                           1.0               0.666667                    0.6                    2.0               1.666667                    1.6
8                           2.0               1.000000                    0.8                    2.0               2.000000                    2.0
9                           0.0               1.000000                    0.8                    1.0               1.666667                    1.6
10                          3.0               2.000000                    2.0                    1.0               1.000000                    0.8
11                          3.0               2.333333                    2.2                    1.0               0.666667                    1.0
12                          2.0               2.666667                    2.2                    2.0               1.333333                    1.2

Finally, we can also use these manually defined features as input to automated feature engineering with Deep Feature Synthesis, which is explained here. When we pass our manually defined features as seed_features, ft.dfs automatically stacks new features on top of them.

fm, feature_defs = ft.dfs(entityset=es, target_entity="matches", seed_features=features, agg_primitives=[], trans_primitives=["day", "month", "year", "weekday", "percentile"])

feature_defs is listed below; the feature matrix fm has one column for each of these features:

[<Feature: home_team>,
 <Feature: away_team>,
 <Feature: home_goals>,
 <Feature: away_goals>,
 <Feature: label>,
 <Feature: home_team_goal_last_1>,
 <Feature: home_team_goal_last_3>,
 <Feature: home_team_goal_last_5>,
 <Feature: away_team_goal_last_1>,
 <Feature: away_team_goal_last_3>,
 <Feature: away_team_goal_last_5>,
 <Feature: DAY(match_date)>,
 <Feature: MONTH(match_date)>,
 <Feature: YEAR(match_date)>,
 <Feature: WEEKDAY(match_date)>,
 <Feature: PERCENTILE(home_goals)>,
 <Feature: PERCENTILE(away_goals)>,
 <Feature: PERCENTILE(home_team_goal_last_1)>,
 <Feature: PERCENTILE(home_team_goal_last_3)>,
 <Feature: PERCENTILE(home_team_goal_last_5)>,
 <Feature: PERCENTILE(away_team_goal_last_1)>,
 <Feature: PERCENTILE(away_team_goal_last_3)>,
 <Feature: PERCENTILE(away_team_goal_last_5)>]
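As a way to sanity-check the seed features outside Featuretools, the same "last n games, regardless of venue" average can be sketched in plain pandas. This is only an illustrative cross-check, not part of the Featuretools API: the function name last_n_goal_averages and the toy data are hypothetical, and the sketch assumes the matches are already sorted chronologically and use the column names from the answer above.

```python
import pandas as pd

def last_n_goal_averages(matches, n):
    """Mean goals per team over its previous n games, regardless of venue.

    Assumes `matches` is sorted chronologically and uses the answer's column
    names: home_team, away_team, home_goals, away_goals.
    """
    matches = matches.reset_index(drop=True)
    # Long format: one row per (game, team), remembering which side the team was on.
    sides = []
    for side, goals in (("home_team", "home_goals"), ("away_team", "away_goals")):
        sides.append(pd.DataFrame({"game": matches.index,
                                   "side": side,
                                   "team": matches[side],
                                   "goals": matches[goals]}))
    per_team = pd.concat(sides, ignore_index=True).sort_values("game")
    # shift(1) drops the current game; rolling(n) yields NaN until n earlier games exist.
    per_team["avg"] = per_team.groupby("team")["goals"].transform(
        lambda s: s.shift(1).rolling(n).mean())
    # Back to one row per game, with one column per side.
    wide = per_team.pivot(index="game", columns="side", values="avg")
    return wide.rename(columns=lambda side: "%s_goal_last_%d" % (side, n))

# Hypothetical toy data, not the matches.csv from the question.
toy = pd.DataFrame({"home_team": ["A", "B", "A", "C"],
                    "away_team": ["B", "C", "C", "A"],
                    "home_goals": [2, 1, 0, 3],
                    "away_goals": [1, 1, 2, 0]})
last1 = last_n_goal_averages(toy, 1)
last2 = last_n_goal_averages(toy, 2)
```

The column names follow the same pattern as make_name in the answer, so the output can be compared column by column with fm. As in the custom primitive, a game gets NaN when the team has fewer than n earlier games.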