Question

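I'm trying to drop all rows in a DataFrame starting from the first non-consecutive 'Period' within each group of a groupby. I'd rather avoid loops if possible.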

import pandas as pd


data = {'Country': ['DE', 'DE', 'DE', 'DE', 'DE', 'US', 'US', 'US', 'US','US'],
    'Product': ['Blue', 'Blue', 'Blue', 'Blue','Blue','Green', 'Green', 'Green', 'Green','Green'],
    'Period': [1, 2, 3,5,6, 1, 2, 4, 5, 6]}

df = pd.DataFrame(data, columns= ['Country','Product', 'Period'])
print(df)

Output:

  Country Product  Period
0      DE    Blue       1
1      DE    Blue       2
2      DE    Blue       3
3      DE    Blue       5
4      DE    Blue       6
5      US   Green       1
6      US   Green       2
7      US   Green       4
8      US   Green       5
9      US   Green       6

The final output I want looks like this:

  Country Product  Period
0      DE    Blue       1
1      DE    Blue       2
2      DE    Blue       3
5      US   Green       1
6      US   Green       2

My attempt is below to give you an idea. It produces a lot of errors, but you can probably see the logic I'm going for.

df = df.groupby(['Country','Product']).apply(lambda x: x[x.Period.shift(x.Period - 1) == 1]).reset_index(drop=True)

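The tricky part is that, rather than just using .shift(1), I'm trying to pass a value into .shift(): if that row's Period is 5, I want to say .shift(5-1) so that it shifts up 4 places and checks the Period value there. If it equals 1, the sequence is still consecutive. In this case I guess it would run into NaN territory.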

Answer 1

Instead of shift(), you could use diff() and cumsum():

result = grouped['Period'].apply(
    lambda x: x.loc[(x.diff() > 1).cumsum() == 0])

import pandas as pd

data = {'Country': ['DE', 'DE', 'DE', 'DE', 'DE', 'US', 'US', 'US', 'US','US'],
    'Product': ['Blue', 'Blue', 'Blue', 'Blue','Blue','Green', 'Green', 'Green', 'Green','Green'],
    'Period': [1, 2, 3,5,6, 1, 2, 4, 5, 6]}

df = pd.DataFrame(data, columns= ['Country','Product', 'Period'])
print(df)
grouped = df.groupby(['Country','Product'])
result = grouped['Period'].apply(
    lambda x: x.loc[(x.diff() > 1).cumsum() == 0])
result.name = 'Period'
result = result.reset_index(['Country', 'Product'])
print(result)

yields

  Country Product  Period
0      DE    Blue       1
1      DE    Blue       2
2      DE    Blue       3
5      US   Green       1
6      US   Green       2
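
The same diff/cumsum idea can also be sketched without apply, by building a boolean mask over the whole DataFrame so that every column is kept. This is only one possible variant, not part of the original answer; the names keys, gap, gaps_so_far and result_rows are illustrative.

```python
import pandas as pd

data = {'Country': ['DE', 'DE', 'DE', 'DE', 'DE', 'US', 'US', 'US', 'US', 'US'],
        'Product': ['Blue', 'Blue', 'Blue', 'Blue', 'Blue',
                    'Green', 'Green', 'Green', 'Green', 'Green'],
        'Period': [1, 2, 3, 5, 6, 1, 2, 4, 5, 6]}
df = pd.DataFrame(data)
keys = ['Country', 'Product']

# 1 where Period jumps by more than 1 relative to the previous row of the same group
gap = df.groupby(keys)['Period'].diff().gt(1).astype(int)

# Running count of gaps within each group; rows before the first gap have count 0
gaps_so_far = gap.groupby([df[k] for k in keys]).cumsum()

# Keep only the rows that come before the first gap of their group
result_rows = df[gaps_so_far.eq(0)]
print(result_rows)
```

Because the mask is aligned with df's own index, the original row labels are preserved and no reset_index step is needed.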

Explanation:

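A consecutive run of numbers has adjacent differences of 1. For example, if we temporarily treat df['Period'] as if it all belonged to a single group,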

In [41]: df['Period'].diff()
Out[41]: 
0   NaN
1     1
2     1
3     2
4     1
5    -5
6     1
7     2
8     1
9     1
Name: Period, dtype: float64

In [42]: df['Period'].diff() > 1
Out[42]: 
0    False
1    False
2    False
3     True       <--- We want to cut off before here
4    False
5    False
6    False
7     True
8    False
9    False
Name: Period, dtype: bool

To find the cutoff position (the first True in df['Period'].diff() > 1), we can use cumsum() and select the rows where it equals 0:

In [43]: (df['Period'].diff() > 1).cumsum()
Out[43]: 
0    0
1    0
2    0
3    1
4    1
5    1
6    1
7    2
8    2
9    2
Name: Period, dtype: int64

In [44]: (df['Period'].diff() > 1).cumsum() == 0
Out[44]: 
0     True
1     True
2     True
3    False
4    False
5    False
6    False
7    False
8    False
9    False
Name: Period, dtype: bool

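Computing diff() and cumsum() may seem wasteful, since these operations may calculate many values that are never needed, especially if x is very large and the first consecutive run is very short.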

Despite the waste, the speed gained by calling NumPy or Pandas methods (implemented in C / Cython / C++ or Fortran) generally beats a less wasteful algorithm coded in pure Python.

However, you could replace the call to cumsum with a call to argmax:

result = grouped['Period'].apply(
    lambda x: x.loc[:(x.diff() > 1).argmax()].iloc[:-1])

For a very large x, this might be a bit faster:

x = df['Period']
x = pd.concat([x]*1000)
x = x.reset_index(drop=True)

In [68]: %timeit x.loc[:(x.diff() > 1).argmax()].iloc[:-1]
1000 loops, best of 3: 884 µs per loop

In [69]: %timeit x.loc[(x.diff() > 1).cumsum() == 0]
1000 loops, best of 3: 1.12 ms per loop

But note that argmax returns an index label, not an ordinal position. So if x.index contains duplicates, using argmax will not work. (That is why I had to set x = x.reset_index(drop=True).)

So while using argmax may be a bit faster in some cases, this alternative is not as robust.
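
One way to sidestep the label issue is to work with ordinal positions throughout. Note that in recent pandas releases Series.argmax returns an integer position rather than a label (idxmax is the label-returning method), so the label-based slice above may behave differently there. The sketch below is a hypothetical helper, leading_run, not taken from the answer: it uses NumPy's argmax together with iloc, and it also handles a group that has no gap at all.

```python
import pandas as pd


def leading_run(x):
    """Return the part of x that precedes the first jump larger than 1."""
    # Boolean NumPy array: True where the value jumps by more than 1
    gap = (x.diff() > 1).to_numpy()
    if not gap.any():
        # No gap at all: the whole Series is one consecutive run
        return x
    # NumPy's argmax gives the ordinal position of the first True,
    # so iloc works even when the index contains duplicate labels
    return x.iloc[:gap.argmax()]


# A Series whose index deliberately contains duplicate labels
x = pd.Series([1, 2, 3, 5, 6], index=[0, 0, 1, 1, 2])
print(leading_run(x))
```

A function like this could be passed to grouped['Period'].apply in place of the lambda.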

Answer 2

Sorry.. I don't know pandas.. but in general this can be implemented directly in Python.

list(zip(data['Country'], data['Product'], data['Period']))
and the result will be a list ..
[('DE', 'Blue', 1), ('DE', 'Blue', 2), ('DE', 'Blue', 3), ('DE', 'Blue', 5), 
('DE', 'Blue', 6), ('US', 'Green', 1), ('US', 'Green', 2), ('US', 'Green', 4),
('US', 'Green', 5), ('US', 'Green', 6)]

After this, the result can easily be fed into your function.
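
As a rough illustration of what such a function might look like, here is a pure-Python sketch that groups the zipped rows by (Country, Product) and keeps each group's rows up to the first gap in Period. The function name keep_leading_runs is hypothetical, and the sketch assumes that rows of the same group are adjacent, as they are in the sample data.

```python
from itertools import groupby

data = {'Country': ['DE', 'DE', 'DE', 'DE', 'DE', 'US', 'US', 'US', 'US', 'US'],
        'Product': ['Blue', 'Blue', 'Blue', 'Blue', 'Blue',
                    'Green', 'Green', 'Green', 'Green', 'Green'],
        'Period': [1, 2, 3, 5, 6, 1, 2, 4, 5, 6]}
rows = list(zip(data['Country'], data['Product'], data['Period']))


def keep_leading_runs(rows):
    """Keep, for each (Country, Product) group, the rows before the first gap."""
    kept = []
    # itertools.groupby only merges adjacent rows that share the same key
    for _, group in groupby(rows, key=lambda r: (r[0], r[1])):
        previous = None
        for row in group:
            # Stop at the first Period that jumps by more than 1
            if previous is not None and row[2] - previous > 1:
                break
            kept.append(row)
            previous = row[2]
    return kept


print(keep_leading_runs(rows))
```

Unlike the diff/cumsum approach, this loop stops scanning a group as soon as it finds the first gap, at the cost of running in pure Python.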

How to filter a DataFrame based on a condition using .shift()
