Question

I have a DataFrame like this:

+----+--------------+-----------+---------------------------------------------------+-----------+
|    | Filename     | Result    | IssueType                                         | isBad     |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 0  | E0CCG5S237-0 | Bad       | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 1  | E0CCG5S237-0 | Bad       | OCR_Text Misrecognition                           | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 2  | E0CCG5S237-1 | Good      | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 3  | E0CCG5S238-0 | Tolerable | MA_Form field elements (checkbox, line element... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 4  | E0CCG5S238-0 | Tolerable | NaN                                               | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 5  | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 6  | E0CCG5S239-0 | Tolerable | Extra Spaces                                      | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 7  | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV                       | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 8  | E0CCG5S239-0 | Tolerable | CS_Font Incosistency                              | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 9  | E0CCG5S242-0 | Bad       | ML-OrphanContent                                  | Yes       |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 10 | E0CCG5S242-0 | Bad       | Extra Spaces                                      | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+

I want to group the rows by Filename and Result, which I do with this statement:
subj_score_df = subj_score_df.fillna('').groupby(['Filename', 'Result'])['IssueType'].apply('\n'.join).reset_index()
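To make the grouping step concrete, here is a minimal sketch on a made-up three-row frame (the shortened Filename values are placeholders, not data from the question). It suggests that `fillna('')` is what lets `'\n'.join` work at all, and that the empty strings it introduces still take part in the join.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame in the question.
df = pd.DataFrame({
    "Filename": ["A-0", "A-0", "B-0"],
    "Result": ["Bad", "Bad", "Good"],
    "IssueType": [np.nan, "OCR_Text Misrecognition", np.nan],
})

# NaN is a float, so str.join would raise on it; fillna('') swaps in empty strings.
grouped = (
    df.fillna("")
      .groupby(["Filename", "Result"])["IssueType"]
      .apply("\n".join)
      .reset_index()
)

# The empty string is still joined, so the first group keeps a leading newline.
print(repr(grouped.loc[0, "IssueType"]))
```

If the stray newlines are unwanted, joining only the non-missing values is one possible alternative.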

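However, I want to remove the IssueType value (set it to NaN) when the isBad value is one of ('No', 'Tolerable') and at least one other row with the same Filename has isBad equal to 'Yes'.

If no row for that Filename has 'Yes' in the isBad column, IssueType is left unchanged.

(For example, the IssueType of row #10 becomes NaN here, because row #9 has the same Filename but isBad = Yes.)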

The expected output DataFrame:

+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
|    | Filename     | Result    | IssueType                                         | isBad     |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 0  | E0CCG5S237-0 | Bad       | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 1  | E0CCG5S237-0 | Bad       | OCR_Text Misrecognition                           | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 2  | E0CCG5S237-1 | Good      | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 3  | E0CCG5S238-0 | Tolerable | NaN                                               | NaN       | #4's isBad is Yes                |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 4  | E0CCG5S238-0 | Tolerable | NaN                                               | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 5  | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable | All are tolerable so no   change |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 6  | E0CCG5S239-0 | Tolerable | Extra Spaces                                      | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 7  | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV                       | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+                                  |
| 8  | E0CCG5S239-0 | Tolerable | CS_Font Incosistency                              | Tolerable |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 9  | E0CCG5S242-0 | Bad       | ML-OrphanContent                                  | Yes       |                                  |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 10 | E0CCG5S242-0 | Bad       | NaN                                               | Tolerable | #9's isBad is Yes                |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+

Is there a way to do this?

Answer 1

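I think you need a mask that tells, for each row, whether any row with the same Filename has isBad equal to 'Yes'. First compare isBad with Series.eq, then use GroupBy.transform with 'any':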

mask = df['isBad'].eq('Yes').groupby(df['Filename']).transform('any')

Alternatively, use Series.isin with the Filename values of the rows that match the condition:

mask = df['Filename'].isin(df.loc[df['isBad'].eq('Yes'), 'Filename'])
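As a quick check that the two ways of building the mask agree, the sketch below runs both on a four-row sample that reuses two Filename values from the question. The variable names `mask_transform` and `mask_isin` are introduced here only for illustration.

```python
import pandas as pd

# Small sample: the first file has one 'Yes' row, the second has none.
df = pd.DataFrame({
    "Filename": ["E0CCG5S238-0", "E0CCG5S238-0", "E0CCG5S239-0", "E0CCG5S239-0"],
    "isBad": ["Tolerable", "Yes", "Tolerable", "Tolerable"],
})

# Variant 1: per-group 'any', broadcast back to every row of the group.
mask_transform = df["isBad"].eq("Yes").groupby(df["Filename"]).transform("any")

# Variant 2: membership test against the Filenames that have a 'Yes' row.
mask_isin = df["Filename"].isin(df.loc[df["isBad"].eq("Yes"), "Filename"])

print(mask_transform.tolist())
print(mask_isin.tolist())
```

Both variants mark every row of a file that contains at least one 'Yes', including the 'Yes' row itself, which is why a second condition on isBad is still needed before blanking anything.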

Finally, set the missing values with Series.mask, chaining a second condition so that only Tolerable rows are matched:

df['IssueType'] = df['IssueType'].mask(mask & df['isBad'].eq('Tolerable'))
print (df)
       Filename     Result                    IssueType      isBad
0   E0CCG5S2370        Bad                          NaN        Yes
1   E0CCG5S2370        Bad      OCR_Text Misrecognition        Yes
2   E0CCG5S2371       Good                          NaN        Yes
3   E0CCG5S2380  Tolerable                          NaN  Tolerable
4   E0CCG5S2380  Tolerable                          NaN        Yes
5   E0CCG5S2390  Tolerable    MA_Superscript,subscript.  Tolerable
6   E0CCG5S2390  Tolerable                 Extra Spaces  Tolerable
7   E0CCG5S2390  Tolerable  MA_Link missing from the DV  Tolerable
8   E0CCG5S2390  Tolerable         CS_Font Incosistency  Tolerable
9   E0CCG5S2420        Bad              MLOrphanContent        Yes
10  E0CCG5S2420        Bad                          NaN  Tolerable
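The pieces can be combined into one end-to-end sketch. Two things here are assumptions rather than part of the answer: the helper name `blank_overridden_issues` is hypothetical, and it matches both 'No' and 'Tolerable' through `isin`, since the question lists both values while the answer checks only 'Tolerable'. The final grouping also joins only the non-missing values instead of calling `fillna('')`, a small variation that avoids stray newlines.

```python
import pandas as pd

def blank_overridden_issues(df, droppable=("No", "Tolerable")):
    """Return a copy where IssueType is NaN for rows whose isBad is in
    `droppable` and whose Filename also has a row with isBad == 'Yes'."""
    has_yes = df["isBad"].eq("Yes").groupby(df["Filename"]).transform("any")
    out = df.copy()
    out["IssueType"] = out["IssueType"].mask(has_yes & out["isBad"].isin(droppable))
    return out

# Rows 6, 8, 9 and 10 of the question's table.
df = pd.DataFrame({
    "Filename": ["E0CCG5S239-0", "E0CCG5S239-0", "E0CCG5S242-0", "E0CCG5S242-0"],
    "Result": ["Tolerable", "Tolerable", "Bad", "Bad"],
    "IssueType": ["Extra Spaces", "CS_Font Incosistency", "ML-OrphanContent", "Extra Spaces"],
    "isBad": ["Tolerable", "Tolerable", "Yes", "Tolerable"],
})

cleaned = blank_overridden_issues(df)

# Join the surviving issue types per (Filename, Result), skipping NaN.
summary = (
    cleaned.groupby(["Filename", "Result"])["IssueType"]
           .apply(lambda s: "\n".join(s.dropna()))
           .reset_index()
)
print(summary)
```

Only the last row loses its IssueType, because its file also has a 'Yes' row; the all-Tolerable file is left untouched.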
