熊猫-ValueError:无法从重复的轴重新索引

时间:2020-02-07 18:58:13

标签: python pandas

我正在Airflow中的数据管道上工作,并且一直遇到我一直困扰着我的ValueError: cannot reindex from a duplicate axis

这里是混乱的功能:

def fill_missing_dates(df):
    df['TUNING_EVNT_START_DT'] = pd.to_datetime(df['TUNING_EVNT_START_DT'])
    dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
    masdiv = df['MASDIV'].unique()
    station = df['STATION'].unique()
    idx = pd.MultiIndex.from_product((dates, masdiv, station), names=['TUNING_EVNT_START_DT', 'MASDIV', 'STATION'])
    df = df.set_index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION']).reindex(idx, fill_value=0).reset_index()

    return df

这是AWS Cloudwatch日志中的错误输出:

16:31:40
dates = df.set_index('TUNING_EVNT_START_DT').resample('D').asfreq().index
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 821, in asfreq
16:31:40
return self._upsample("asfreq", fill_value=fill_value)
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/resample.py", line 1125, in _upsample
16:31:40
res_index, method=method, limit=limit, fill_value=fill_value
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/util/_decorators.py", line 221, in wrapper
16:31:40
return func(*args, **kwargs)
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3976, in reindex
16:31:40
return super().reindex(**kwargs)
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4514, in reindex
16:31:40
axes, level, limit, tolerance, method, fill_value, copy
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3864, in _reindex_axes
16:31:40
index, method, copy, level, fill_value, limit, tolerance
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/frame.py", line 3886, in _reindex_index
16:31:40
allow_dups=False,
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/generic.py", line 4577, in _reindex_with_indexers
16:31:40
copy=copy,
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/internals/managers.py", line 1251, in reindex_indexer
16:31:40
self.axes[axis]._can_reindex(indexer)
16:31:40
File "/usr/local/lib64/python3.7/site-packages/pandas/core/indexes/base.py", line 3362, in _can_reindex
16:31:40
raise ValueError("cannot reindex from a duplicate axis")
16:31:40
ValueError: cannot reindex from a duplicate axis
16:31:40
"""
16:31:40
The above exception was the direct cause of the following exception:
16:31:40
Traceback (most recent call last):
16:31:40
File "/tmp/scripts/anomaly_detection_model.py", line 275, in <module>
16:31:40
runner(path_prefix, model_name, execution_id, table)
16:31:40
File "/tmp/scripts/anomaly_detection_model.py", line 230, in runner
16:31:40
df = multiprocessing(PROCESSORS, df)
16:31:40
File "/tmp/scripts/anomaly_detection_model.py", line 121, in multiprocessing
16:31:40
x = pool.map(iforest, (df.loc[df['MASDIV'] == masdiv] for masdiv in args))
16:31:40
File "/usr/lib64/python3.7/multiprocessing/pool.py", line 268, in map
16:31:40
return self._map_async(func, iterable, mapstar, chunksize).get()
16:31:40
File "/usr/lib64/python3.7/multiprocessing/pool.py", line 657, in get
16:31:40
raise self._value
16:31:40
ValueError: cannot reindex from a duplicate axis

我已经运行了一些记录器,以了解关于该步骤的数据帧输出的信息,但是我没有看到问题的根源:

18:40:34
20/02/07 18:40:34 - INFO - __main__ - Where it breaks: df.index(): RangeIndex(start=0, stop=93, step=1)
18:40:34
20/02/07 18:40:34 - INFO - __main__ - Where it breaks: df.columns: Index(['TUNING_EVNT_START_DT', 'MASDIV', 'STATION', 'DOW', 'MOY',
18:40:34
'TRANSACTIONS', 'DOW_INT', 'MOY_INT', 'DT_NBR'],
18:40:34
dtype='object')

我已经尝试了这些帖子中的所有内容,但无济于事:

Pandas error: cannot reindex from a duplicate axis

What does `ValueError: cannot reindex from a duplicate axis` mean?

我也不完全知道我为什么会这样。任何建议,不胜感激。

1 个答案:

答案 0 :(得分:1)

没有示例数据,我无法重现您的错误。但是,基于该函数的名称“ fill_missing_dates”,我认为这种替代解决方案可能会完成您要实现的目标。

import pandas as pd

df = pd.DataFrame({
    'date': ["2020-01-01 00:01:00", "2020-01-01 00:02:00", "2020-01-01 01:00:00", "2020-01-01 02:00:00",
             "2020-01-01 00:04:00", "2020-01-01 00:05:00",
             "2020-01-03 00:01:00", "2020-01-03 00:02:00", "2020-01-03 01:00:00", "2020-01-03 02:00:00",
             "2020-01-03 00:04:00", "2020-01-03 00:05:00",
            ],
    'station': ["a","a","a","a","b", "b", "a", "a", "a", "a", "b", "b"],
    'data': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})

def resampler(x):    
    return x.set_index('date').resample('D').sum()

df['date'] =  pd.to_datetime(df['date'])
multipass = pd.MultiIndex.from_frame(df[["date", "station"]])
df = df.set_index(["date", "station"])
df = df.reindex(multipass)
df.reset_index(level=0).groupby(level=0).apply(resampler)

结果用0填充缺失的日期:

                        data
station  date   
a        2020-01-01     10
         2020-01-02     0
         2020-01-03     34
b        2020-01-01     11
         2020-01-02     0
         2020-01-03     23