Question

I have a DataFrame that looks like this:

Publication Date        Date              Value
2018-01-01              2018-01-01        2
2018-01-01              2018-01-02        13
2018-01-01              2018-01-03        14
2018-01-01              2018-01-04        12
2018-01-02              2018-01-02        1.5
2018-01-02              2018-01-03        14
2018-01-02              2018-01-04        15
2018-01-02              2018-01-05        15.5
2018-01-03              2018-01-03        1.8
2018-01-03              2018-01-04        13
2018-01-03              2018-01-05        17
2018-01-03              2018-01-06        15
.
.

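I want to drop the first row of each block where Publication Date changes, because the value in that row is always very small. The expected output is: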

Publication Date        Date              Value
2018-01-01              2018-01-02        13
2018-01-01              2018-01-03        14
2018-01-01              2018-01-04        12
2018-01-02              2018-01-03        14
2018-01-02              2018-01-04        15
2018-01-02              2018-01-05        15.5
2018-01-03              2018-01-04        13
2018-01-03              2018-01-05        17
2018-01-03              2018-01-06        15
.
.

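The data is essentially in this format, but it includes additional columns not shown here (i.e., with each new Publication Date, the Date range starts one day later).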

What is the best way to do this?

Answer 1

You can use boolean indexing together with shift:

df[df['Publication Date'] == df['Publication Date'].shift()]


    Publication Date    Date    Value
1   2018-01-01  2018-01-02  13.0
2   2018-01-01  2018-01-03  14.0
3   2018-01-01  2018-01-04  12.0
5   2018-01-02  2018-01-03  14.0
6   2018-01-02  2018-01-04  15.0
7   2018-01-02  2018-01-05  15.5
9   2018-01-03  2018-01-04  13.0
10  2018-01-03  2018-01-05  17.0
11  2018-01-03  2018-01-06  15.0
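
For anyone who wants to try this, here is a minimal self-contained sketch. It rebuilds the sample data from the question (the name same_as_previous is only illustrative) and assumes the rows are already ordered so that each Publication Date forms one contiguous block.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Publication Date": pd.to_datetime(
        ["2018-01-01"] * 4 + ["2018-01-02"] * 4 + ["2018-01-03"] * 4
    ),
    "Date": pd.to_datetime([
        "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04",
        "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05",
        "2018-01-03", "2018-01-04", "2018-01-05", "2018-01-06",
    ]),
    "Value": [2, 13, 14, 12, 1.5, 14, 15, 15.5, 1.8, 13, 17, 15],
})

# True where a row has the same Publication Date as the row directly above it.
# The very first row is compared against NaT, which yields False, so it is
# dropped as well.
same_as_previous = df["Publication Date"] == df["Publication Date"].shift()

res = df[same_as_previous]
print(res)
```

The original index labels (1, 2, 3, 5, ...) are kept; append .reset_index(drop=True) if you want them renumbered from 0.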

Answer 2

Use duplicated:

res = df[df.duplicated(subset=['PublicationDate'])]

Or, more generally, use cumcount, or tail with groupby:

res = df[df.groupby('PublicationDate').cumcount() > 0]

res = df.groupby('PublicationDate').apply(lambda x: x.tail(len(x)-1))\
        .reset_index(drop=True)

print(res)

  PublicationDate        Date  Value
0      2018-01-01  2018-01-02   13.0
1      2018-01-01  2018-01-03   14.0
2      2018-01-01  2018-01-04   12.0
3      2018-01-02  2018-01-03   14.0
4      2018-01-02  2018-01-04   15.0
5      2018-01-02  2018-01-05   15.5
6      2018-01-03  2018-01-04   13.0
7      2018-01-03  2018-01-05   17.0
8      2018-01-03  2018-01-06   15.0
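
One thing worth checking before picking an approach: duplicated and groupby work on the values of the column, while a comparison against shift() works on consecutive rows. On the sample data they agree, but they can differ if the same PublicationDate shows up again later in a separate block. The small frame below is made up purely to illustrate that case, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical data: 2018-01-01 appears in two separate blocks.
df = pd.DataFrame({
    "PublicationDate": pd.to_datetime([
        "2018-01-01", "2018-01-01",
        "2018-01-02", "2018-01-02",
        "2018-01-01", "2018-01-01",
    ]),
    "Value": [2, 13, 1.5, 14, 1.8, 12],
})

# Drops the first row of every consecutive block.
by_run = df[df["PublicationDate"] == df["PublicationDate"].shift()]

# Drops only the first occurrence of each distinct PublicationDate value.
by_value = df[df.duplicated(subset=["PublicationDate"])]
by_count = df[df.groupby("PublicationDate").cumcount() > 0]

print(list(by_run.index))    # [1, 3, 5]
print(list(by_value.index))  # [1, 3, 4, 5]
print(list(by_count.index))  # [1, 3, 4, 5]
```

Row 4 starts a new block, so the shift() comparison removes it, whereas duplicated and cumcount keep it because 2018-01-01 has already been seen.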
