Pandas: remove a column value if a row exists based on another column's value

Time: 2020-05-11 03:55:12

Tags: python python-3.x pandas

I have a data frame like this:

+----+--------------+-----------+---------------------------------------------------+-----------+
|    | Filename     | Result    | IssueType                                         | isBad     |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 0  | E0CCG5S237-0 | Bad       | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 1  | E0CCG5S237-0 | Bad       | OCR_Text Misrecognition                           | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 2  | E0CCG5S237-1 | Good      | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 3  | E0CCG5S238-0 | Tolerable | MA_Form field elements (checkbox, line element... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 4  | E0CCG5S238-0 | Tolerable | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 5  | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 6  | E0CCG5S239-0 | Tolerable | Extra Spaces                                      | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 7  | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV                       | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 8  | E0CCG5S239-0 | Tolerable | CS_Font Incosistency                              | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 9  | E0CCG5S242-0 | Bad       | ML-OrphanContent                                  | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 10 | E0CCG5S242-0 | Bad       | Extra Spaces                                      | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+

I want to group the rows by Filename and Result, for which I wrote this query:
subj_score_df = subj_score_df.fillna('').groupby(['Filename', 'Result'])['IssueType'].apply('\n'.join).reset_index()
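To make the query above concrete, here is a minimal sketch of what it produces. The three-row `sample` frame and the name `grouped` are assumptions for illustration, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample shaped like the frame above.
sample = pd.DataFrame({
    "Filename": ["E0CCG5S237-0", "E0CCG5S237-0", "E0CCG5S237-1"],
    "Result": ["Bad", "Bad", "Good"],
    "IssueType": [np.nan, "OCR_Text Misrecognition", np.nan],
})

# Same chain as the query: blank out NaN, then join the IssueType
# strings of each (Filename, Result) group with newlines.
grouped = (
    sample.fillna("")
    .groupby(["Filename", "Result"])["IssueType"]
    .apply("\n".join)
    .reset_index()
)
print(grouped)
```

Because NaN is replaced by an empty string before joining, a group that mixes NaN and real values yields a string with a leading or doubled newline, and a group with only NaN yields an empty string.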

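However, where isBad is 'No' or 'Tolerable', I want to set the IssueType value to NaN if at least one other row with the same Filename has isBad equal to 'Yes'.

If no row for that Filename has isBad equal to 'Yes', IssueType stays unchanged.

(For example, IssueType in row #10 becomes NaN here, because row #9 has the same Filename but isBad = Yes.)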

The output data frame afterwards:

+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
|    | Filename     | Result    | IssueType                                         | isBad     |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 0  | E0CCG5S237-0 | Bad       | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 1  | E0CCG5S237-0 | Bad       | OCR_Text Misrecognition                           | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 2  | E0CCG5S237-1 | Good      | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 3  | E0CCG5S238-0 | Tolerable | NaN                                               | NaN       | #4's isBad is Yes                |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 4  | E0CCG5S238-0 | Tolerable | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 5  | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable | All are tolerable so no   change |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 6  | E0CCG5S239-0 | Tolerable | Extra Spaces                                      | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 7  | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV                       | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 8  | E0CCG5S239-0 | Tolerable | CS_Font Incosistency                              | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 9  | E0CCG5S242-0 | Bad       | ML-OrphanContent                                  | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 10 | E0CCG5S242-0 | Bad       | NaN                                               | Tolerable | #9's isBad is Yes                |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+

Is there a way to do this?

1 Answer:

Answer 0: (score: 1)

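I think you need a boolean mask. Compare isBad with Series.eq, then use GroupBy.transform with DataFrameGroupBy.any to flag every row whose Filename has at least one isBad equal to 'Yes':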

mask = df['isBad'].eq('Yes').groupby(df['Filename']).transform('any')

Alternatively, use Series.isin to test Filename against the Filename values of the rows that match the condition:

mask = df['Filename'].isin(df.loc[df['isBad'].eq('Yes'), 'Filename'])
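The two ways of building the mask should select the same rows. The sketch below checks this on a small hypothetical frame; the file names "a" and "b" and the variable names `is_yes`, `mask_transform` and `mask_isin` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame: file "a" has one 'Yes' row, file "b" has none.
df = pd.DataFrame({
    "Filename": ["a", "a", "b", "b"],
    "isBad": ["Yes", "Tolerable", "Tolerable", "Tolerable"],
})

is_yes = df["isBad"].eq("Yes")

# Variant 1: per-Filename "any", broadcast back to every row of the group.
mask_transform = is_yes.groupby(df["Filename"]).transform("any")

# Variant 2: collect the Filenames that have a 'Yes' row, then test membership.
mask_isin = df["Filename"].isin(df.loc[is_yes, "Filename"])

print(mask_transform.tolist())  # [True, True, False, False]
print((mask_transform == mask_isin).all())
```

Either variant marks every row of a file that contains at least one 'Yes', including the 'Yes' row itself, which is why a second condition is still needed before blanking IssueType.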

Finally, set the missing values with Series.mask, chaining a second condition so that only the Tolerable rows are matched:

df['IssueType'] = df['IssueType'].mask(mask & df['isBad'].eq('Tolerable'))
print(df)
       Filename     Result                    IssueType      isBad
0   E0CCG5S2370        Bad                          NaN        Yes
1   E0CCG5S2370        Bad      OCR_Text Misrecognition        Yes
2   E0CCG5S2371       Good                          NaN        Yes
3   E0CCG5S2380  Tolerable                          NaN  Tolerable
4   E0CCG5S2380  Tolerable                          NaN        Yes
5   E0CCG5S2390  Tolerable    MA_Superscript,subscript.  Tolerable
6   E0CCG5S2390  Tolerable                 Extra Spaces  Tolerable
7   E0CCG5S2390  Tolerable  MA_Link missing from the DV  Tolerable
8   E0CCG5S2390  Tolerable         CS_Font Incosistency  Tolerable
9   E0CCG5S2420        Bad              MLOrphanContent        Yes
10  E0CCG5S2420        Bad                          NaN  Tolerable
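For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the question's table and applies the same two steps. The frame is typed in by hand from the table in the question (truncated IssueType strings are kept exactly as displayed), and the names `rows` and `has_yes` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Rows copied from the question's table: (Filename, Result, IssueType, isBad).
rows = [
    ("E0CCG5S237-0", "Bad", np.nan, "Yes"),
    ("E0CCG5S237-0", "Bad", "OCR_Text Misrecognition", "Yes"),
    ("E0CCG5S237-1", "Good", np.nan, "Yes"),
    ("E0CCG5S238-0", "Tolerable",
     "MA_Form field elements (checkbox, line element...", "Tolerable"),
    ("E0CCG5S238-0", "Tolerable", np.nan, "Yes"),
    ("E0CCG5S239-0", "Tolerable",
     "MA_Superscript,subscript and dropcap identific...", "Tolerable"),
    ("E0CCG5S239-0", "Tolerable", "Extra Spaces", "Tolerable"),
    ("E0CCG5S239-0", "Tolerable", "MA_Link missing from the DV", "Tolerable"),
    ("E0CCG5S239-0", "Tolerable", "CS_Font Incosistency", "Tolerable"),
    ("E0CCG5S242-0", "Bad", "ML-OrphanContent", "Yes"),
    ("E0CCG5S242-0", "Bad", "Extra Spaces", "Tolerable"),
]
df = pd.DataFrame(rows, columns=["Filename", "Result", "IssueType", "isBad"])

# True for every row whose Filename has at least one isBad == 'Yes'.
has_yes = df["isBad"].eq("Yes").groupby(df["Filename"]).transform("any")

# Blank IssueType only on the Tolerable rows of those files.
df["IssueType"] = df["IssueType"].mask(has_yes & df["isBad"].eq("Tolerable"))

print(df[["Filename", "IssueType", "isBad"]])
```

Rows 3 and 10 lose their IssueType, while the E0CCG5S239-0 rows keep theirs because no row of that file has isBad equal to 'Yes'.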