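After processing, my data is a single table with several feature columns and one label column. I want to use featuretools.dfs to help me predict the label. Can I run it directly on this table, or do I need to split the single table into multiple tables?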
Answer 0 (score: 9)
You can run DFS on a single table. For example, if you have a pandas DataFrame df with an index column 'index', you would write:
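import featuretools as ft

# Wrap the single DataFrame in an EntitySet with one entity named 'log'.
es = ft.EntitySet('Transactions')
es.entity_from_dataframe(dataframe=df,
                         entity_id='log',
                         index='index')

# With a single entity, only transform primitives can be applied.
fm, features = ft.dfs(entityset=es,
                      target_entity='log',
                      trans_primitives=['day', 'weekday', 'month'])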
The resulting feature matrix looks like
In [1]: fm
Out[1]:
location pies sold WEEKDAY(date) MONTH(date) DAY(date)
index
1 main street 3 4 12 29
2 main street 4 5 12 30
3 main street 5 6 12 31
4 arlington ave. 18 0 1 1
5 arlington ave. 1 1 1 2
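This applies the "transform" primitives to your data. You will usually want to give ft.dfs more entities so that it can also use aggregation primitives. You can read about the difference in our documentation.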
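To make concrete what the transform primitives compute, here is a rough pandas-only sketch that rebuilds the three date columns by hand. It does not call Featuretools; the sample data is the five-row table used in this answer, the column names are copied from the output above for comparison, and the variable name `manual` is made up for illustration.

```python
import pandas as pd

# Sample data matching the five-row table used in this answer.
df = pd.DataFrame({
    'index': [1, 2, 3, 4, 5],
    'location': ['main street'] * 3 + ['arlington ave.'] * 2,
    'pies sold': [3, 4, 5, 18, 1],
    'date': pd.date_range('2017-12-29', periods=5, freq='D'),
}).set_index('index')

# Hand-written stand-ins for the 'weekday', 'month' and 'day' transform
# primitives: each value depends only on the row it belongs to.
manual = df.drop(columns='date')
manual['WEEKDAY(date)'] = df['date'].dt.weekday  # Monday = 0
manual['MONTH(date)'] = df['date'].dt.month
manual['DAY(date)'] = df['date'].dt.day
print(manual)
```

Because every value is derived from a single row, no second table is needed for this kind of feature.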
A standard workflow is to normalize your single entity by an interesting categorical variable. If your df is the single table
| index | location       | pies sold | date       |
|-------+----------------+-----------+------------|
| 1     | main street    | 3         | 2017-12-29 |
| 2     | main street    | 4         | 2017-12-30 |
| 3     | main street    | 5         | 2017-12-31 |
| 4     | arlington ave. | 18        | 2018-01-01 |
| 5     | arlington ave. | 1         | 2018-01-02 |
you might be interested in normalizing by location:
es.normalize_entity(base_entity_id='log',
                    new_entity_id='locations',
                    index='location')
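Your new entity locations would then have the table (the first_log_time column appears when the log entity has a time index, as in the full example further down)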
| location       | first_log_time |
|----------------+----------------|
| main street    | 2017-12-29     |
| arlington ave. | 2018-01-01     |
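This makes features such as locations.SUM(log.pies sold) or locations.MEAN(log.pies sold) possible, which sum or average all values by location. You can see these features created in the example below: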
In [1]: import pandas as pd
...: import featuretools as ft
...: df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
...:                    'location': ['main street',
...:                                 'main street',
...:                                 'main street',
...:                                 'arlington ave.',
...:                                 'arlington ave.'],
...:                    'pies sold': [3, 4, 5, 18, 1]})
...: df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')
...: df
...:
Out[1]:
index location pies sold date
0 1 main street 3 2017-12-29
1 2 main street 4 2017-12-30
2 3 main street 5 2017-12-31
3 4 arlington ave. 18 2018-01-01
4 5 arlington ave. 1 2018-01-02
In [2]: es = ft.EntitySet('Transactions')
...: es.entity_from_dataframe(dataframe=df, entity_id='log', index='index',
...:                          time_index='date')
...: es.normalize_entity(base_entity_id='log', new_entity_id='locations',
...:                     index='location')
...:
Out[2]:
Entityset: Transactions
Entities:
log [Rows: 5, Columns: 4]
locations [Rows: 2, Columns: 2]
Relationships:
log.location -> locations.location
In [3]: fm, features = ft.dfs(entityset=es,
...:                       target_entity='log',
...:                       agg_primitives=['sum', 'mean'],
...:                       trans_primitives=['day'])
...: fm
...:
Out[3]:
location pies sold DAY(date) locations.DAY(first_log_time) locations.MEAN(log.pies sold) locations.SUM(log.pies sold)
index
1 main street 3 29 29 4.0 12
2 main street 4 30 29 4.0 12
3 main street 5 31 29 4.0 12
4 arlington ave. 18 1 1 9.5 19
5 arlington ave. 1 2 1 9.5 19
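As a sanity check, the aggregation columns in Out[3] can be reproduced with a plain pandas groupby. This is only a sketch of what the 'sum' and 'mean' aggregation primitives amount to for this data, not a description of how Featuretools computes them internally. The names `per_location`, `check`, `sum_pies` and `mean_pies` are made up for illustration.

```python
import pandas as pd

# Same five-row log table as in the session above.
df = pd.DataFrame({
    'index': [1, 2, 3, 4, 5],
    'location': ['main street'] * 3 + ['arlington ave.'] * 2,
    'pies sold': [3, 4, 5, 18, 1],
})
df['date'] = pd.date_range('2017-12-29', periods=5, freq='D')

# One row per location, similar to the 'locations' entity: the earliest
# date plus the sum and mean of 'pies sold' over that location's rows.
per_location = df.groupby('location').agg(
    first_log_time=('date', 'min'),
    sum_pies=('pies sold', 'sum'),
    mean_pies=('pies sold', 'mean'),
)

# Join the per-location values back onto every log row, which is what the
# locations.SUM(...) and locations.MEAN(...) columns represent.
check = df.merge(per_location, left_on='location', right_index=True)
check = check.set_index('index').sort_index()
print(check[['location', 'pies sold', 'sum_pies', 'mean_pies']])
```

Each log row receives the value computed over all rows sharing its location, which is why the three "main street" rows share the same sum and mean.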