Question

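After preprocessing, my data is a single table with several feature columns and one label column. I'd like to use featuretools.dfs to help me predict the label. Can I run it directly, or do I need to split my single table into multiple tables?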

Answer 1

You can run DFS on a single table. For example, if you have a pandas DataFrame df with an index column 'index', you would write:

import featuretools as ft
es = ft.EntitySet('Transactions')

es.entity_from_dataframe(dataframe=df,
                         entity_id='log',
                         index='index')

fm, features = ft.dfs(entityset=es, 
                      target_entity='log',
                      trans_primitives=['day', 'weekday', 'month'])

The generated feature matrix would look like:

In [1]: fm
Out[1]: 
             location  pies sold  WEEKDAY(date)  MONTH(date)  DAY(date)
index                                                                  
1         main street          3              4           12         29
2         main street          4              5           12         30
3         main street          5              6           12         31
4      arlington ave.         18              0            1          1
5      arlington ave.          1              1            1          2

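This applies “transform” primitives to your data. You will usually want to add more entities to the EntitySet you pass to ft.dfs, so that aggregation primitives can be used as well. You can read about the difference in our documentation.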
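To make the idea of a transform primitive concrete, here is a minimal pandas-only sketch of what the three primitives above compute. It does not use Featuretools; the toy data is an assumption chosen to match the feature matrix shown earlier, and the column names simply mimic Featuretools' naming.

```python
import pandas as pd

# Toy data assumed to match the feature matrix shown above.
df = pd.DataFrame({
    'index': [1, 2, 3, 4, 5],
    'location': ['main street'] * 3 + ['arlington ave.'] * 2,
    'pies sold': [3, 4, 5, 18, 1],
})
df['date'] = pd.date_range('2017-12-29', periods=5, freq='D')

# Transform-style features: each value depends only on its own row.
transformed = df.set_index('index')
transformed['WEEKDAY(date)'] = transformed['date'].dt.weekday  # Monday == 0
transformed['MONTH(date)'] = transformed['date'].dt.month
transformed['DAY(date)'] = transformed['date'].dt.day

print(transformed.drop(columns='date'))
```

Because every new column is derived from a single row, no second table is needed. Aggregation primitives, by contrast, need a parent entity to group by.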

The standard workflow is to normalize your single entity by an interesting categorical variable. If your df is the single table

| index | location       | pies sold | date       |
|-------+----------------+-----------+------------|
|     1 | main street    |         3 | 2017-12-29 |
|     2 | main street    |         4 | 2017-12-30 |
|     3 | main street    |         5 | 2017-12-31 |
|     4 | arlington ave. |        18 | 2018-01-01 |
|     5 | arlington ave. |         1 | 2018-01-02 |

you might be interested in normalizing by location:

es.normalize_entity(base_entity_id='log',
                    new_entity_id='locations',
                    index='location')

Your new entity locations would have the table

| location       | first_log_time |
|----------------+----------------|
| main street    |     2017-12-29 |
| arlington ave. |     2018-01-01 |
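As a rough illustration of what normalizing does, the sketch below builds an equivalent locations table with plain pandas. It is not the Featuretools implementation; it assumes the date column serves as the time index of log, so that first_log_time is the earliest date recorded for each location.

```python
import pandas as pd

# Toy copy of the single 'log' table (assumed data, same as the table above).
log = pd.DataFrame({
    'index': [1, 2, 3, 4, 5],
    'location': ['main street'] * 3 + ['arlington ave.'] * 2,
    'pies sold': [3, 4, 5, 18, 1],
})
log['date'] = pd.date_range('2017-12-29', periods=5, freq='D')

# One row per distinct location; first_log_time is the earliest
# date at which that location appears in the log.
locations = (
    log.groupby('location', sort=False)['date']
       .min()
       .rename('first_log_time')
       .reset_index()
)

print(locations)
```

The location column stays in log and acts as the foreign key that links each log row to its parent row in locations.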

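This makes features like locations.SUM(log.pies sold) or locations.MEAN(log.pies sold) possible, which sum or average all values by location. You can see these features created in the example below: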

In [1]: import pandas as pd
   ...: import featuretools as ft
   ...: df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
   ...:                    'location': ['main street',
   ...:                                 'main street',
   ...:                                 'main street',
   ...:                                 'arlington ave.',
   ...:                                 'arlington ave.'],
   ...:                    'pies sold': [3, 4, 5, 18, 1]})
   ...: df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')
   ...: df
   ...: 

Out[1]: 
   index        location  pies sold       date
0      1     main street          3 2017-12-29
1      2     main street          4 2017-12-30
2      3     main street          5 2017-12-31
3      4  arlington ave.         18 2018-01-01
4      5  arlington ave.          1 2018-01-02

In [2]: es = ft.EntitySet('Transactions')
   ...: es.entity_from_dataframe(dataframe=df, entity_id='log', index='index', time_index='date')
   ...: es.normalize_entity(base_entity_id='log', new_entity_id='locations', index='location')
   ...: 
Out[2]: 
Entityset: Transactions
  Entities:
    log [Rows: 5, Columns: 4]
    locations [Rows: 2, Columns: 2]
  Relationships:
    log.location -> locations.location

In [3]: fm, features = ft.dfs(entityset=es,
   ...:                       target_entity='log',
   ...:                       agg_primitives=['sum', 'mean'],
   ...:                       trans_primitives=['day'])
   ...: fm
   ...: 
Out[3]: 
             location  pies sold  DAY(date)  locations.DAY(first_log_time)  locations.MEAN(log.pies sold)  locations.SUM(log.pies sold)
index                                                                                                                                  
1         main street          3         29                             29                            4.0                            12
2         main street          4         30                             29                            4.0                            12
3         main street          5         31                             29                            4.0                            12
4      arlington ave.         18          1                              1                            9.5                            19
5      arlington ave.          1          2                              1                            9.5                            19
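For intuition, the two aggregation columns in the output above can be reproduced with a plain pandas groupby. This is only a sketch of the arithmetic, not how Featuretools computes it internally; the column names are copied from the feature matrix for readability.

```python
import pandas as pd

# Same toy data as in the session above.
log = pd.DataFrame({
    'index': [1, 2, 3, 4, 5],
    'location': ['main street'] * 3 + ['arlington ave.'] * 2,
    'pies sold': [3, 4, 5, 18, 1],
}).set_index('index')

# Aggregate over all log rows sharing a location, then broadcast the
# result back onto each row of the target table.
per_location = log.groupby('location')['pies sold']
fm_sketch = log[['location', 'pies sold']].copy()
fm_sketch['locations.MEAN(log.pies sold)'] = per_location.transform('mean')
fm_sketch['locations.SUM(log.pies sold)'] = per_location.transform('sum')

print(fm_sketch)
```

Each row receives the statistic of its whole location group, which is why the values repeat within a location.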
