Dask DataFrame groupby with multi-index raises ValueError

Date: 2018-04-15 15:41:53

Tags: python dataframe dask

I have the following dataframe:

import pandas as pd
import dask.dataframe as dd
dataframe = pd.DataFrame({'_id': ['id1', 'id2', 'id3', 'id4'], 'http_user': ['user1', 'user1', 'user2', 'user2'], 'dst': ['1.1.1.1', '1.1.1.1', '2.2.2.2', '2.2.2.2'], 'dst_port': [80, 80, 80, 80], 'score': [40, 50, 10, 80]})
alerts = dd.from_pandas(dataframe, npartitions=5)

The alerts dataframe is normally created from data downloaded from Elasticsearch. I then fetch some data from two MySQL tables, which are currently empty:
# db_url is the SQLAlchemy connection string for the MySQL database
alert_status_change = dd.read_sql_table('api_alerts_status_change',
                                        db_url, index_col='id',
                                        npartitions=4,
                                        columns=['status',
                                                 'alert_id',
                                                 'reason_id']).reset_index()

reasons = dd.read_sql_table('api_alerts_reasons', db_url,
                            index_col='id', npartitions=4).reset_index()

Up to this point, I can successfully perform the following aggregation:

alerts.groupby(['http_user', 'dst', 'dst_port']).score.max().compute()
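
For the dummy frame above, the per-group maxima can be verified directly; the check below is my own addition (not part of the original post) and only assumes the usual pandas Series that .compute() returns:

# Sanity check on the dummy data (my addition): user1's maximum score is 50,
# user2's is 80, keyed by the (http_user, dst, dst_port) MultiIndex.
result = alerts.groupby(['http_user', 'dst', 'dst_port']).score.max().compute()
assert result.loc[('user1', '1.1.1.1', 80)] == 50
assert result.loc[('user2', '2.2.2.2', 80)] == 80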

Normally I want to merge the alerts with the two dataframes from the db tables and then perform the groupby. I run the following merges:

alerts = alerts.merge(alert_status_change, left_on='_id',
                      right_on='alert_id')
alerts = alerts.merge(reasons, left_on='reason_id',
                      right_on='id')
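
Before calling .compute(), the lazily-built metadata of the merged frame can be inspected to see which columns and dtypes dask expects; this is a hedged sketch of what I check (my addition, not from the original post), using the same private _meta attribute as further down:

# Hedged sketch (my addition): _meta is the empty pandas frame dask uses to
# infer the schema of the merged result without touching the database.
print(alerts._meta.columns)   # alert columns plus status/alert_id/reason_id, etc.
print(alerts._meta.dtypes)
print(alerts.npartitions)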

But attempting the same groupby aggregation now gives me the following error:

distributed.worker - WARNING -  Compute Failed
Function:  execute_task
args:      ((<built-in function apply>, <function _groupby_aggregate at 0x7f9d28aabb90>, [(<function _concat at 0x7f9d28af0488>, [Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: [], Empty DataFrame
Columns: []
Index: []])], {'levels': [0, 1, 2], 'aggfunc': <methodcaller: max>}))
kwargs:    {}
Exception: ValueError('multiple levels only valid with MultiIndex',)

Python stack trace (last lines):

/home/avlach/virtualenvs/venv/local/lib/python2.7/site-packages/pandas/core/groupby.pyc in __init__
    514                                                     level=level,
    515                                                     sort=sort,
--> 516                                                     mutated=self.mutated)
    517 
    518         self.obj = obj

/home/avlach/virtualenvs/venv/local/lib/python2.7/site-packages/pandas/core/groupby.pyc in _get_grouper()
   2830                     raise ValueError('No group keys passed!')
   2831                 else:
-> 2832                     raise ValueError('multiple levels only valid with '
   2833                                      'MultiIndex')
   2834 

ValueError: multiple levels only valid with MultiIndex
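
As far as I can tell from the log, the failing call inside the worker is a level-based groupby on empty, plain-indexed frames (the concatenated empty partitions shown above, with levels=[0, 1, 2]). The snippet below is my own reduction of that call, not code from the post, and it raises the same error:

# Minimal reproduction (my own reduction, assuming the concatenated partitions
# really are empty frames with a plain Index): grouping by multiple levels is
# only valid when the index is a MultiIndex.
import pandas as pd
pd.DataFrame().groupby(level=[0, 1, 2]).max()
# ValueError: multiple levels only valid with MultiIndex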

When running the groupby and inspecting the _meta.index attribute, I get:

>> alerts.groupby(['http_user', 'dst', 'dst_port']).score.max()._meta.index
>> Index([], dtype='object')

Setting it manually to a MultiIndex has no effect:

max_score = alerts.groupby(
            ['http_user', 'dst', 'dst_port']).score.max()
max_score._meta.index = pd.MultiIndex(levels=[[], [], []], labels=[[], [], []], names=['http_user', 'dst', 'dst_port'])

max_score._meta.index
MultiIndex(levels=[[], [], []],
       labels=[[], [], []],
       names=[u'http_user', u'dst', u'dst_port'])

The code above works when I use the dummy dataframe, but it does not when the alerts come from Elasticsearch. Where is this error coming from?

0 Answers:

No answers yet.