Custom sort and keep the first row per group in pandas

Time: 2020-08-04 07:00:57

Tags: python pandas

My CSV looks like this:

+-----+---------+-----------+------------+
| ID  | version | Name      | State      |
+-----+---------+-----------+------------+
| 101 | 1       | Nut       | In-Transit |
| 101 | 1       | Nut       | Cancelled  |
| 101 | 1       | Nut       | Delivered  |
| 101 | 2       | Nut 2.0   | In-Transit |
| 102 | 1       | Screw     | Shipped    |
| 102 | 1       | Screw     | In-Transit |
| 102 | 2       | Screw 2.0 | Shipped    |
| 102 | 2       | Screw 2.0 | Cancelled  |
+-----+---------+-----------+------------+

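Now, for each combination of ID and version, I want to keep only the row with the highest-priority state among all the states available for that combination (based on the priority order below).

My custom order:

  1. Delivered
  2. In-Transit
  3. Shipped
  4. Cancelled

Expected output: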

+-----+---------+-----------+------------+
| ID  | version | Name      | State      |
+-----+---------+-----------+------------+
| 101 | 1       | Nut       | Delivered  |
| 101 | 2       | Nut 2.0   | In-Transit |
| 102 | 1       | Screw     | In-Transit |
| 102 | 2       | Screw 2.0 | Shipped    |
+-----+---------+-----------+------------+

I tried the code below, but it did not work. I am new to Python and not sure how to solve this.

import pandas as pd

mydata = pd.read_csv('C:/Mypython/Newyork',encoding = "ISO-8859-1")

mydata['State'] = pd.Categorical(mydata['State'], ["Delivered","In-Transit","Shipped","Cancelled"])

mydata.sort_values('State').drop_duplicates(['ID','version'],keep='first')

2 answers:

Answer 0: (score: 1)

It works fine for me; what seems to be missing is the step that assigns the result back to a variable:

mydata['State'] = pd.Categorical(mydata['State'], 
                                ["Delivered", "In-Transit", "Shipped", "Cancelled"], 
                                 ordered=True)

# keep='first' is the default value, so it can be omitted
mydata = mydata.sort_values('State').drop_duplicates(['ID','version'])
print (mydata)
    ID  version       Name       State
2  101        1        Nut   Delivered
3  101        2    Nut 2.0  In-Transit
5  102        1      Screw  In-Transit
6  102        2  Screw 2.0     Shipped
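
To check this without the CSV file, the sample rows from the question can be built inline. This is only a minimal sketch: the inline DataFrame and the names `priority` and `best` are assumptions added for illustration and are not part of the original code.

```python
import pandas as pd

# Sample rows copied from the question, built inline instead of read from CSV.
mydata = pd.DataFrame({
    "ID": [101, 101, 101, 101, 102, 102, 102, 102],
    "version": [1, 1, 1, 2, 1, 1, 2, 2],
    "Name": ["Nut", "Nut", "Nut", "Nut 2.0",
             "Screw", "Screw", "Screw 2.0", "Screw 2.0"],
    "State": ["In-Transit", "Cancelled", "Delivered", "In-Transit",
              "Shipped", "In-Transit", "Shipped", "Cancelled"],
})

# Highest priority first; the categorical column sorts in this order.
priority = ["Delivered", "In-Transit", "Shipped", "Cancelled"]
mydata["State"] = pd.Categorical(mydata["State"], categories=priority, ordered=True)

# Sort by priority, then keep the first (best) row per (ID, version).
# Both methods return new objects, so the result has to be assigned.
best = mydata.sort_values("State").drop_duplicates(["ID", "version"])
print(best)
```

Any state that is not listed in `priority` becomes NaN in the categorical column, so it is worth confirming that the list covers every value in the real data.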

If you want the output ordered by ID and version, sort by multiple columns instead:

mydata['State'] = pd.Categorical(mydata['State'], 
                                ["Delivered", "In-Transit", "Shipped", "Cancelled"], 
                                 ordered=True)
mydata = mydata.sort_values(['ID','version','State']).drop_duplicates(['ID','version'])

Answer 1: (score: 1)

Use pd.Categorical with ordered=True to turn State into an ordered categorical column, call sort_values on that column, then group by ID and version and aggregate with first:

mydata['State'] = pd.Categorical(mydata['State'], ["Delivered", "In-Transit", "Shipped", "Cancelled"], ordered=True)
df = mydata.sort_values('State').groupby(['ID', 'version'], as_index=False).first()

Result:

    ID  version       Name       State
0  101        1        Nut   Delivered
1  101        2    Nut 2.0  In-Transit
2  102        1      Screw  In-Transit
3  102        2  Screw 2.0     Shipped
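
One difference worth knowing: GroupBy.first() returns the first non-null value of each column within a group, so if a column had missing values, an output row could combine values from different source rows. A possible alternative that always selects whole rows, and leaves the State column's dtype unchanged, is to map each state to an integer rank and use idxmin. This is a sketch under stated assumptions: the `rank` mapping and the names `state_rank`, `best_idx` and `result` are hypothetical, and the sample data is built inline rather than read from the CSV.

```python
import pandas as pd

# Sample rows copied from the question, built inline instead of read from CSV.
mydata = pd.DataFrame({
    "ID": [101, 101, 101, 101, 102, 102, 102, 102],
    "version": [1, 1, 1, 2, 1, 1, 2, 2],
    "Name": ["Nut", "Nut", "Nut", "Nut 2.0",
             "Screw", "Screw", "Screw 2.0", "Screw 2.0"],
    "State": ["In-Transit", "Cancelled", "Delivered", "In-Transit",
              "Shipped", "In-Transit", "Shipped", "Cancelled"],
})

# Hypothetical helper mapping: lower number means higher priority.
rank = {"Delivered": 0, "In-Transit": 1, "Shipped": 2, "Cancelled": 3}

# For each (ID, version), find the row label with the smallest rank,
# then select those rows whole so every column comes from the same row.
state_rank = mydata["State"].map(rank)
best_idx = state_rank.groupby([mydata["ID"], mydata["version"]]).idxmin()
result = mydata.loc[best_idx].reset_index(drop=True)
print(result)
```

States missing from `rank` map to NaN and are skipped by idxmin, so the mapping should cover every value that can appear in the data.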