Question

I have a DataFrame with a DatetimeIndex, one column I want to group by, and one column containing sets of integers:

import pandas as pd

df = pd.DataFrame([['2018-01-01', 1, {1, 2, 3}],
                   ['2018-01-02', 1, {3}],
                   ['2018-01-03', 1, {3, 4, 5}],
                   ['2018-01-04', 1, {5, 6}],
                   ['2018-01-01', 2, {7}],
                   ['2018-01-02', 2, {8}],
                   ['2018-01-03', 2, {9}],
                   ['2018-01-04', 2, {10}]],
                  columns=['timestamp', 'group', 'ids'])

df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

            group        ids
timestamp                   
2018-01-01      1  {1, 2, 3}
2018-01-02      1        {3}
2018-01-03      1  {3, 4, 5}
2018-01-04      1     {5, 6}
2018-01-01      2        {7}
2018-01-02      2        {8}
2018-01-03      2        {9}
2018-01-04      2       {10}

Within each group, I want to build a rolling set union over the last x days. So, assuming x = 3, the result should be:

            group              ids
timestamp                   
2018-01-01      1        {1, 2, 3}
2018-01-02      1        {1, 2, 3}
2018-01-03      1  {1, 2, 3, 4, 5}
2018-01-04      1     {3, 4, 5, 6}
2018-01-01      2              {7}
2018-01-02      2           {7, 8}
2018-01-03      2        {7, 8, 9}
2018-01-04      2       {8, 9, 10}

Based on the answer to my previous question, I have a good idea of how to do this without grouping, so this is the solution I have come up with so far:

grouped = df.groupby('group')
new_df = pd.DataFrame()
for name, group in grouped:
    group['ids'] = [
        set.union(*group['ids'].to_frame().iloc(axis=1)[max(0, i-2): i+1,0])
        for i in range(len(group.index))
    ]
    new_df = new_df.append(group)

This gives the correct result, but it looks rather clunky and also produces the following warning:

SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

However, the documentation at the linked page does not seem to fit my exact situation. (At least I can't make sense of it in this case.)

My question: how can I improve this code so that it is cleaner, performs better, and does not raise the warning?

Answer 1

As mentioned in the docs, don't call pd.DataFrame.append in a loop; doing so is expensive.

Instead, collect the pieces in a list and feed that list to pd.concat.
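
The list-then-concat pattern can be sketched as follows. This is a minimal illustration with made-up data (the `year` and `value` columns are hypothetical, not from the question). It is also worth knowing that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions this pattern is required rather than just faster.

```python
import pandas as pd

# Hypothetical per-iteration results; in the question these would be
# the per-group DataFrames produced inside the loop.
pieces = []
for year in (2018, 2019, 2020):
    piece = pd.DataFrame({"year": [year], "value": [year % 100]})
    pieces.append(piece)  # plain list.append, cheap

# Concatenate once at the end instead of growing a DataFrame row by row.
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```

Each call to `DataFrame.append` copied all the data accumulated so far, whereas a single `pd.concat` copies every piece only once.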

You can avoid the SettingWithCopyWarning by building copies of the data inside the list, that is, by using assign + iloc within a list comprehension instead of chained indexing:

L = [group.assign(ids=[set.union(*group.iloc[max(0, i-2): i+1, -1])
                       for i in range(len(group.index))])
     for _, group in df.groupby('group')]

res = pd.concat(L)

print(res)

            group              ids
timestamp                         
2018-01-01      1        {1, 2, 3}
2018-01-02      1        {1, 2, 3}
2018-01-03      1  {1, 2, 3, 4, 5}
2018-01-04      1     {3, 4, 5, 6}
2018-01-01      2              {7}
2018-01-02      2           {8, 7}
2018-01-03      2        {8, 9, 7}
2018-01-04      2       {8, 9, 10}
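
Note that the slice `max(0, i-2): i+1` above is positional: it takes the current row and the two rows before it, which matches "the last 3 days" only because each group has exactly one row per day. If the timestamps could have gaps, a window defined on the index itself may be closer to what the question asks for. The following is only a sketch under that assumption; the helper name `rolling_union_by_days` and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd


def rolling_union_by_days(frame, days=3, group_col="group", set_col="ids"):
    """Per group, union the sets of all rows whose timestamp falls in
    the half-open window (ts - days, ts] for each row timestamp ts."""
    window = pd.Timedelta(days=days)
    pieces = []
    for _, grp in frame.groupby(group_col):
        grp = grp.sort_index()
        unions = []
        for ts in grp.index:
            # Boolean mask selecting rows within the last `days` days.
            mask = (grp.index > ts - window) & (grp.index <= ts)
            unions.append(set().union(*grp.loc[mask, set_col]))
        # assign returns a new DataFrame, so no SettingWithCopyWarning.
        pieces.append(grp.assign(**{set_col: unions}))
    return pd.concat(pieces)


# Same sample data as in the question.
sample = pd.DataFrame([['2018-01-01', 1, {1, 2, 3}],
                       ['2018-01-02', 1, {3}],
                       ['2018-01-03', 1, {3, 4, 5}],
                       ['2018-01-04', 1, {5, 6}],
                       ['2018-01-01', 2, {7}],
                       ['2018-01-02', 2, {8}],
                       ['2018-01-03', 2, {9}],
                       ['2018-01-04', 2, {10}]],
                      columns=['timestamp', 'group', 'ids'])
sample['timestamp'] = pd.to_datetime(sample['timestamp'])
sample = sample.set_index('timestamp')

result = rolling_union_by_days(sample, days=3)
print(result)
```

On the sample data this yields the same sets as the positional version. The inner loop is quadratic in the group size, so for large groups a different approach may be needed.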
