Question

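I have a dataframe of customers with records of the shipments they received. Unfortunately, these records can overlap. I'm trying to reduce the rows so that I can see the dates of continuous use. Is there any way to do this other than a brute-force iterrows implementation?

Here's a sample and what I'd like to do: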

df = pd.DataFrame([['A','2011-02-07','2011-02-22',1],['A','2011-02-14','2011-03-10',2],['A','2011-03-07','2011-03-15',3],['A','2011-03-18','2011-03-25',4]], columns = ['Cust','startDate','endDate','shipNo'])
df

condensedDf = df.groupby(['Cust']).apply(reductionFunction)
condensedDf

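reductionFunction would group the first 3 records into one, because in each case the start date of the next record falls before the end date of the previous one. I'm essentially collapsing multiple overlapping records into a single record.

Any thoughts on a good "pythonic" implementation? I could write a nasty loop within each group, but I'd prefer not to...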

Answer 1

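Fundamentally, I think this is a graph connectivity problem: a fast way to solve it is with some kind of graph connectivity algorithm. Pandas doesn't include such tools, but scipy does. You can use the compressed sparse graph (csgraph) submodule in scipy to solve your problem like this: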

import pandas as pd
from scipy.sparse.csgraph import connected_components

# convert to datetime, so min() and max() work
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)

def reductionFunction(data):
    # create a 2D graph of connectivity between date ranges
    start = data.startDate.values
    end = data.endDate.values
    graph = (start <= end[:, None]) & (end >= start[:, None])

    # find connected components in this graph
    n_components, indices = connected_components(graph)

    # group the results by these connected components
    return data.groupby(indices).aggregate({'startDate': 'min',
                                            'endDate': 'max',
                                            'shipNo': 'first'})

df.groupby(['Cust']).apply(reductionFunction).reset_index('Cust')

If you want to do something different with shipNo from here, it should be pretty straightforward.
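As one illustration of handling shipNo differently, the sketch below keeps every shipment number of a merged record rather than only the first. It is not part of the original answer: the `component` array is a hand-written stand-in for the labels that connected_components would return, and aggregating with `list` is just one possible choice.

```python
import numpy as np
import pandas as pd

# Toy data matching the question's four shipments
data = pd.DataFrame({
    'startDate': pd.to_datetime(['2011-02-07', '2011-02-14', '2011-03-07', '2011-03-18']),
    'endDate': pd.to_datetime(['2011-02-22', '2011-03-10', '2011-03-15', '2011-03-25']),
    'shipNo': [1, 2, 3, 4],
})

# Assumed group labels: the first three rows overlap, the fourth stands alone
component = np.array([0, 0, 0, 1])

# Same aggregation as in reductionFunction, but collecting all shipNo values
merged = data.groupby(component).aggregate({'startDate': 'min',
                                            'endDate': 'max',
                                            'shipNo': list})
print(merged)
```

Swapping `list` for another aggregation such as 'last' or 'count' works the same way.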

Note that the connected_components() function above is not brute force; it uses a fast algorithm to find the connections.
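To make the connectivity idea more concrete, here is a minimal sketch that builds the overlap matrix for the four date ranges in the question and labels its connected components. The variable names are illustrative, and converting the boolean matrix to an integer csr_matrix with directed=False is a choice made for this sketch rather than something the answer requires.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# The four date ranges from the question
start = np.array(['2011-02-07', '2011-02-14', '2011-03-07', '2011-03-18'],
                 dtype='datetime64[D]')
end = np.array(['2011-02-22', '2011-03-10', '2011-03-15', '2011-03-25'],
               dtype='datetime64[D]')

# overlaps[i, j] is True when range i and range j share at least one day
overlaps = (start <= end[:, None]) & (end >= start[:, None])

# Each connected component is one chain of overlapping ranges
n_components, labels = connected_components(
    csr_matrix(overlaps.astype(int)), directed=False)

print(overlaps.astype(int))
print(n_components, labels)
```

Ranges 0 and 2 do not overlap directly, but both overlap range 1, so all three end up in the same component. That chaining is what a pairwise comparison alone would miss.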

Answer 2

If you're open to using an auxiliary data frame to hold the results, you can honestly just loop through all the rows:

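import pandas as pd

# Assumes a single customer; sort so overlapping ranges are adjacent.
# ISO-formatted date strings (or Timestamps) compare correctly with < and >.
df = df.sort_values('startDate')
results = [df.iloc[0].copy()]

for _, row in df.iloc[1:].iterrows():
    if row['startDate'] > results[-1]['endDate']:
        # gap before this row: start a new merged record
        results.append(row.copy())
    else:
        # overlap: extend the current merged record if this row ends later
        results[-1]['endDate'] = max(results[-1]['endDate'], row['endDate'])

print(pd.DataFrame(results).reset_index(drop=True))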
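For comparison, the same sort-and-scan idea can be written without an explicit loop. The sketch below is an alternative technique, not the method of either answer: it uses a running maximum of end dates (cummax) plus a cumulative sum to number the blocks. The names prev_max_end, new_block and block are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', '2011-02-07', '2011-02-22', 1],
     ['A', '2011-02-14', '2011-03-10', 2],
     ['A', '2011-03-07', '2011-03-15', 3],
     ['A', '2011-03-18', '2011-03-25', 4]],
    columns=['Cust', 'startDate', 'endDate', 'shipNo'])
df['startDate'] = pd.to_datetime(df['startDate'])
df['endDate'] = pd.to_datetime(df['endDate'])

df = df.sort_values(['Cust', 'startDate']).reset_index(drop=True)

# Latest end date among the earlier rows of the same customer (NaT for the first row)
prev_max_end = df.groupby('Cust')['endDate'].transform(lambda s: s.cummax().shift())

# A new block starts at a customer's first row or after a gap
new_block = prev_max_end.isna() | (df['startDate'] > prev_max_end)
block = new_block.cumsum().rename('block')

condensed = (df.groupby(['Cust', block])
               .agg(startDate=('startDate', 'min'),
                    endDate=('endDate', 'max'),
                    shipNo=('shipNo', 'first'))
               .reset_index('block', drop=True)
               .reset_index())
print(condensed)
```

Because the rows are sorted by start date, comparing each start date with the running maximum of earlier end dates is enough to detect gaps, and it handles several customers in one pass.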

Pandas: combining rows based on dates
