I have the following dataframe:
import pandas as pd
import dask.dataframe as dd

dataframe = pd.DataFrame({'_id': ['id1', 'id2', 'id3', 'id4'],
                          'http_user': ['user1', 'user1', 'user2', 'user2'],
                          'dst': ['1.1.1.1', '1.1.1.1', '2.2.2.2', '2.2.2.2'],
                          'dst_port': [80, 80, 80, 80],
                          'score': [40, 50, 10, 80]})
alerts = dd.from_pandas(dataframe, npartitions=5)
The alerts dataframe is normally created from data downloaded from Elasticsearch. I then fetch some data from two MySQL tables, both of which are empty:
alert_status_change = dd.read_sql_table('api_alerts_status_change',
                                        db_url, index_col='id',
                                        npartitions=4,
                                        columns=['status',
                                                 'alert_id',
                                                 'reason_id']).reset_index()
reasons = dd.read_sql_table('api_alerts_reasons', db_url,
                            index_col='id', npartitions=4).reset_index()
Up to this point, I can successfully run the following aggregation:
alerts.groupby(['http_user', 'dst', 'dst_port']).score.max().compute()
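What I actually want is to merge the alerts with the two dataframes from the db tables and then perform the groupby. I run the following merges: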
alerts = alerts.merge(alert_status_change, left_on='_id',
right_on='alert_id')
alerts = alerts.merge(reasons, left_on='reason_id',
right_on='id')
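To illustrate what these merges do when the right-hand tables are empty, here is a minimal pandas-only sketch. The empty frames are hypothetical stand-ins for the two MySQL tables (the `reason` column and all dtypes are assumptions, not taken from the real schema). Since `merge` defaults to an inner join, the result keeps the columns but has no rows.

```python
import pandas as pd

alerts = pd.DataFrame({
    '_id': ['id1', 'id2', 'id3', 'id4'],
    'http_user': ['user1', 'user1', 'user2', 'user2'],
    'dst': ['1.1.1.1', '1.1.1.1', '2.2.2.2', '2.2.2.2'],
    'dst_port': [80, 80, 80, 80],
    'score': [40, 50, 10, 80],
})

# Hypothetical empty stand-ins for the two MySQL tables (assumed schema).
alert_status_change = pd.DataFrame({
    'id': pd.Series([], dtype='int64'),
    'status': pd.Series([], dtype='object'),
    'alert_id': pd.Series([], dtype='object'),
    'reason_id': pd.Series([], dtype='int64'),
})
reasons = pd.DataFrame({
    'id': pd.Series([], dtype='int64'),
    'reason': pd.Series([], dtype='object'),  # assumed column name
})

# Same merges as in the question; how='inner' is the default.
merged = alerts.merge(alert_status_change, left_on='_id', right_on='alert_id')
merged = merged.merge(reasons, left_on='reason_id', right_on='id')

print(len(merged))            # number of rows left after the inner joins
print(list(merged.columns))   # the columns survive even though the rows do not
```

If the alerts should be kept even when no status change exists, `how='left'` might be closer to the intent, but that is a guess about the use case.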
But trying to run the same groupby aggregation afterwards gives me the following error:
distributed.worker - WARNING - Compute Failed
Function: execute_task
args: ((<built-in function apply>, <function _groupby_aggregate at 0x7f9d28aabb90>, [(<function _concat at 0x7f9d28af0488>, [Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []])], {'levels': [0, 1, 2], 'aggfunc': <methodcaller: max>}))
kwargs: {}
Exception: ValueError('multiple levels only valid with MultiIndex',)
Python stack trace (last lines):
/home/avlach/virtualenvs/venv/local/lib/python2.7/site-packages/pandas/core/groupby.pyc in __init__
514 level=level,
515 sort=sort,
--> 516 mutated=self.mutated)
517
518 self.obj = obj
/home/avlach/virtualenvs/venv/local/lib/python2.7/site-packages/pandas/core/groupby.pyc in _get_grouper()
2830 raise ValueError('No group keys passed!')
2831 else:
-> 2832 raise ValueError('multiple levels only valid with '
2833 'MultiIndex')
2834
ValueError: multiple levels only valid with MultiIndex
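Judging only from the log above, the failing task seems to concatenate five empty dataframes (no columns, plain index) and then group the result by levels 0, 1 and 2. The following pandas-only sketch imitates that step; it is an approximation of what the worker does, not dask's actual internals.

```python
import pandas as pd

# Five completely empty frames, mimicking the "Empty DataFrame / Columns: [] /
# Index: []" chunks shown in the worker log.
empty = pd.concat([pd.DataFrame() for _ in range(5)])

error_message = None
try:
    # Grouping by three index levels requires a MultiIndex, but the
    # concatenated empty frame only has a flat index.
    empty.groupby(level=[0, 1, 2]).max()
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If this reading is right, the error would come from the intermediate per-partition results being empty rather than from the groupby keys themselves.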
When performing the groupby and inspecting the _meta.index attribute, I get:
>> alerts.groupby(['http_user', 'dst', 'dst_port']).score.max()._meta.index
>> Index([], dtype='object')
Manually setting it to a MultiIndex does not help:
max_score = alerts.groupby(
['http_user', 'dst', 'dst_port']).score.max()
max_score._meta.index = pd.MultiIndex(levels=[[], [], []], labels=[[], [], []], names=['http_user', 'dst', 'dst_port'])
max_score._meta.index
MultiIndex(levels=[[], [], []],
labels=[[], [], []],
names=[u'http_user', u'dst', u'dst_port'])
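As a side note for anyone reproducing this on a newer pandas: the `labels` keyword of `pd.MultiIndex` was later renamed to `codes`. A minimal sketch of the same empty three-level index, assuming a pandas version where `codes` is available:

```python
import pandas as pd

# Empty three-level MultiIndex with the same level names as the groupby keys.
# `codes` replaces the older `labels` keyword used above.
empty_index = pd.MultiIndex(
    levels=[[], [], []],
    codes=[[], [], []],
    names=['http_user', 'dst', 'dst_port'],
)

print(empty_index.nlevels)
print(list(empty_index.names))
```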
The code above works when I use a dummy dataframe, but not when I get the alerts from Elasticsearch. Where does this error come from?