Get the latest row from a CSV based on a custom condition

Time: 2020-07-30 14:12:45

Tags: python pandas

I have a table, material:

+--------+-----+-------------------+----------------+-----------+          
| ID     | REV | name              | Description    | curr      |
+--------+-----+-------------------+----------------+-----------+
| 211-32 | 001 | Screw 1.0         | Used in MAT 1  | READY     |
| 211-32 | 002 | Screw 2 plus      | can be Used-32 | WITHDRAWN |
| 212-41 | 001 | Bolt H1           | Light solid    | READY     |
| 212-41 | 002 | BOLT H2+Form      | Heavy solid    | READY     |
| 101-24 | 001 | HexHead 1-A       | NOR-1          | READY     |
| 101-24 | 002 | HexHead Spl       | NOR-22         | READY     |
| 423-98 | 001 | Nut Repair spare  | NORM1          | READY     |
| 423-98 | 002 | Nut Repair Part-C | NORM2          | WITHDRAWN |
| 423-98 | 003 | Nut SP-C          | NORM2+NORM1    | NULL      |
| 654-01 | 001 | Bar               | Specific only  | WITHDRAWN |
| 654-01 | 002 | Bar rod-S         | Designed+Spe   | WITHDRAWN |
| 654-01 | 003 | Bar OPG           | Hard spec      | NULL      |
+--------+-----+-------------------+----------------+-----------+

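Each ID can have multiple revisions. I want to take the latest revision (i.e. the highest of 001, 002, 003, etc.). However, if the latest revision has curr equal to NULL (as a string) or WITHDRAWN, I have to take the previous revision and its corresponding values. If that revision's curr is also NULL or WITHDRAWN, I have to go back one more revision, and so on. If all revisions of an ID have this problem, that ID can be ignored. So the expected output is: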

+--------+-----+------------------+---------------+-------+
| ID     | REV | name             | Description   | curr  |
+--------+-----+------------------+---------------+-------+
| 211-32 | 001 | Screw 1.0        | Used in MAT 1 | READY |
| 212-41 | 002 | BOLT H2+Form     | Heavy solid   | READY |
| 101-24 | 002 | HexHead Spl      | NOR-22        | READY |
| 423-98 | 001 | Nut Repair spare | NORM1         | READY |
+--------+-----+------------------+---------------+-------+

I am new to Python. I tried the code below, but it did not work. Any suggestions would be highly appreciated.

import pandas as pd
import numpy as np

mydata = pd.read_csv('C:/Myfolder/Python/myfile.csv')

mydata.sort_values(['ID','REV'], ascending=[True, False]).drop_duplicates('',keep=last)

3 answers:

Answer 0 (score: 2)

You can use isin to select the rows where curr is neither NULL nor WITHDRAWN, and then apply sort_values to the rows that remain.
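A minimal sketch of the isin-then-sort_values approach, assuming the column names from the question. The sample frame is a shortened, hypothetical stand-in for the CSV, and the final drop_duplicates step is an assumption about how the latest remaining revision per ID would be picked.

```python
import pandas as pd

# Shortened stand-in for the question's table (hypothetical sample data).
# REV is kept as a zero-padded string, as shown in the question.
mydata = pd.DataFrame(
    {
        "ID": ["211-32", "211-32", "423-98", "423-98", "423-98", "654-01", "654-01"],
        "REV": ["001", "002", "001", "002", "003", "001", "002"],
        "curr": ["READY", "WITHDRAWN", "READY", "WITHDRAWN", "NULL", "WITHDRAWN", "WITHDRAWN"],
    }
)

# Drop revisions whose status is the string "NULL" or "WITHDRAWN".
usable = mydata[~mydata["curr"].isin(["NULL", "WITHDRAWN"])]

# Sort by ID and REV, then keep the last (highest) remaining revision per ID.
latest = usable.sort_values(["ID", "REV"]).drop_duplicates("ID", keep="last")
print(latest)
```

Sorting REV as text only gives the right order because the values are zero-padded to the same width. An ID such as 654-01, whose revisions are all rejected, disappears in the filtering step and so never reaches the result.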

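Answer 1 (score: 2)

We can create a dummy column, take its maximum per ID, and return the index of that row.

The first step is to filter out the values we want to ignore.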

df1 = df.loc[
    df[~df["curr"].isin(["WITHDRAWN", "NULL"])]
    .assign(key=df["REV"].astype(int))
    .groupby("ID")["key"]
    .idxmax()
]


         ID  REV                 name       Description   curr
6   101-24   002   HexHead Spl          NOR-22           READY
1   211-32   001   Screw 1.0            Used in MAT 1    READY
4   212-41   002   BOLT H2+Form         Heavy solid      READY
7   423-98   001   Nut Repair spare     NORM1            READY

Answer 2 (score: 1)

I think you should first remove the NULL and WITHDRAWN rows from the table.

mydata = mydata[mydata['curr'] == 'READY']   # keep only the READY rows

Then you can sort and keep the row with the highest REV value for each ID.

mydata = mydata.sort_values(['ID','REV']).drop_duplicates('ID',keep='last')
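One thing to watch when the data really comes from a CSV: by default pandas read_csv treats the text NULL as a missing value and parses 001 as the integer 1. The sketch below assumes you want both kept as text. The CSV content is a hypothetical stand-in for myfile.csv, read from an in-memory string so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for myfile.csv.
csv_text = """ID,REV,name,Description,curr
423-98,001,Nut Repair spare,NORM1,READY
423-98,002,Nut Repair Part-C,NORM2,WITHDRAWN
423-98,003,Nut SP-C,NORM2+NORM1,NULL
654-01,001,Bar,Specific only,WITHDRAWN
654-01,003,Bar OPG,Hard spec,NULL
"""

# dtype=str keeps REV as "001" rather than 1.
# keep_default_na=False keeps "NULL" as text instead of converting it to NaN.
mydata = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

# Keep only READY rows, then take the highest remaining REV per ID.
ready = mydata[mydata["curr"] == "READY"]
latest = ready.sort_values(["ID", "REV"]).drop_duplicates("ID", keep="last")
print(latest)
```

Filtering on == 'READY' happens to work even if NULL has become NaN, but a filter based on isin(["NULL", "WITHDRAWN"]) would let those NaN rows through, so it is worth checking how the column was parsed.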