I have a dataframe like this:
+----+--------------+-----------+---------------------------------------------------+-----------+
| | Filename | Result | IssueType | isBad |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 0 | E0CCG5S237-0 | Bad | NaN | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 1 | E0CCG5S237-0 | Bad | OCR_Text Misrecognition | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 2 | E0CCG5S237-1 | Good | NaN | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 3 | E0CCG5S238-0 | Tolerable | MA_Form field elements (checkbox, line element... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 4 | E0CCG5S238-0 | Tolerable | NaN | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 5 | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 6 | E0CCG5S239-0 | Tolerable | Extra Spaces | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 7 | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 8 | E0CCG5S239-0 | Tolerable | CS_Font Incosistency | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 9 | E0CCG5S242-0 | Bad | ML-OrphanContent | Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+
| 10 | E0CCG5S242-0 | Bad | Extra Spaces | Tolerable |
+----+--------------+-----------+---------------------------------------------------+-----------+
I want to group the rows by Filename and Result, and for that I wrote this query:
subj_score_df = subj_score_df.fillna('').groupby(['Filename', 'Result'])['IssueType'].apply('\n'.join).reset_index()
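To make the behaviour of that query concrete, here is a minimal sketch on a hypothetical three-row excerpt of the data (the excerpt and the variable name `grouped` are assumptions for illustration only). It shows that NaN values, once replaced by empty strings, still take part in the join, so a group can end up with a leading newline.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row excerpt of the question's data.
subj_score_df = pd.DataFrame({
    "Filename": ["E0CCG5S237-0", "E0CCG5S237-0", "E0CCG5S237-1"],
    "Result": ["Bad", "Bad", "Good"],
    "IssueType": [np.nan, "OCR_Text Misrecognition", np.nan],
})

# Replace NaN with '' so str.join does not fail on floats,
# then concatenate the IssueType values of each (Filename, Result) group.
grouped = (
    subj_score_df.fillna("")
    .groupby(["Filename", "Result"])["IssueType"]
    .apply("\n".join)
    .reset_index()
)
print(grouped)
```

The first group joins `''` and `'OCR_Text Misrecognition'`, so its result starts with a newline; the second group contains only an empty string.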
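But I want to clear the IssueType value (set it to NaN) in any row whose isBad value is in ('No', 'Tolerable') when at least one other row with the same Filename has isBad equal to 'Yes'.
If no row for that Filename has isBad equal to 'Yes', then IssueType should stay unchanged.
(For example, the IssueType of row #10 here becomes NaN, because row #9 has the same Filename but isBad = Yes.)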
Expected output dataframe:
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| | Filename | Result | IssueType | isBad | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 0 | E0CCG5S237-0 | Bad | NaN | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 1 | E0CCG5S237-0 | Bad | OCR_Text Misrecognition | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 2 | E0CCG5S237-1 | Good | NaN | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 3 | E0CCG5S238-0 | Tolerable | NaN | NaN | #4's isBad is Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 4 | E0CCG5S238-0 | Tolerable | NaN | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 5 | E0CCG5S239-0 | Tolerable | MA_Superscript,subscript and dropcap identific... | Tolerable | All are tolerable so no change |
+----+--------------+-----------+---------------------------------------------------+-----------+ |
| 6 | E0CCG5S239-0 | Tolerable | Extra Spaces | Tolerable | |
+----+--------------+-----------+---------------------------------------------------+-----------+ |
| 7 | E0CCG5S239-0 | Tolerable | MA_Link missing from the DV | Tolerable | |
+----+--------------+-----------+---------------------------------------------------+-----------+ |
| 8 | E0CCG5S239-0 | Tolerable | CS_Font Incosistency | Tolerable | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 9 | E0CCG5S242-0 | Bad | ML-OrphanContent | Yes | |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
| 10 | E0CCG5S242-0 | Bad | NaN | Tolerable | #9's isBad is Yes |
+----+--------------+-----------+---------------------------------------------------+-----------+----------------------------------+
Is there a way to do this?
Answer 0: (score: 1)
I think you need a mask: first compare isBad with Series.eq, then use GroupBy.transform with DataFrameGroupBy.any to flag every row whose Filename has at least one 'Yes':
mask = df['isBad'].eq('Yes').groupby(df['Filename']).transform('any')
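As a small illustration of what this mask contains, the sketch below uses a hypothetical four-row frame (the file names `A` and `B` are made up). `transform('any')` broadcasts the per-group result back to every row, so all rows of a Filename that has at least one 'Yes' are marked True.

```python
import pandas as pd

# Hypothetical toy data: file A has one 'Yes' row, file B has none.
df = pd.DataFrame({
    "Filename": ["A", "A", "B", "B"],
    "isBad": ["Tolerable", "Yes", "Tolerable", "Tolerable"],
})

# Row-wise test, then a per-Filename "any" broadcast back to each row.
is_yes = df["isBad"].eq("Yes")
mask = is_yes.groupby(df["Filename"]).transform("any")
print(mask.tolist())  # [True, True, False, False]
```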
Alternatively, use Series.isin with the Filename values of the rows that match the condition:
mask = df['Filename'].isin(df.loc[df['isBad'].eq('Yes'), 'Filename'])
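The two ways of building the mask should agree. The following sketch checks that on the same kind of hypothetical toy frame (the names `mask_transform`, `bad_files` and `mask_isin` are illustrative, not from the answer).

```python
import pandas as pd

# Hypothetical toy data: file A has one 'Yes' row, file B has none.
df = pd.DataFrame({
    "Filename": ["A", "A", "B", "B"],
    "isBad": ["Tolerable", "Yes", "Tolerable", "Tolerable"],
})

# Variant 1: per-group "any" broadcast to every row.
mask_transform = df["isBad"].eq("Yes").groupby(df["Filename"]).transform("any")

# Variant 2: collect the file names that have a 'Yes' row, then test membership.
bad_files = df.loc[df["isBad"].eq("Yes"), "Filename"]
mask_isin = df["Filename"].isin(bad_files)

print(mask_isin.tolist() == mask_transform.tolist())  # True
```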
Finally, set the missing values with Series.mask, using a chained condition so that only the Tolerable rows are matched:
df['IssueType'] = df['IssueType'].mask(mask & df['isBad'].eq('Tolerable'))
print(df)
Filename Result IssueType isBad
0 E0CCG5S2370 Bad NaN Yes
1 E0CCG5S2370 Bad OCR_Text Misrecognition Yes
2 E0CCG5S2371 Good NaN Yes
3 E0CCG5S2380 Tolerable NaN Tolerable
4 E0CCG5S2380 Tolerable NaN Yes
5 E0CCG5S2390 Tolerable MA_Superscript,subscript. Tolerable
6 E0CCG5S2390 Tolerable Extra Spaces Tolerable
7 E0CCG5S2390 Tolerable MA_Link missing from the DV Tolerable
8 E0CCG5S2390 Tolerable CS_Font Incosistency Tolerable
9 E0CCG5S2420 Bad MLOrphanContent Yes
10 E0CCG5S2420 Bad NaN Tolerable
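The question also lists 'No' next to 'Tolerable' as a value that should be cleared, while the code above only tests for 'Tolerable'. A possible variant, shown here as a sketch on a hypothetical abridged copy of the data (the names `has_yes` and `not_bad` are assumptions), swaps `eq('Tolerable')` for `isin(['No', 'Tolerable'])`:

```python
import numpy as np
import pandas as pd

# Hypothetical abridged copy of the question's data (IssueType labels shortened).
df = pd.DataFrame({
    "Filename": ["E0CCG5S238-0", "E0CCG5S238-0",
                 "E0CCG5S239-0", "E0CCG5S239-0",
                 "E0CCG5S242-0", "E0CCG5S242-0"],
    "Result": ["Tolerable", "Tolerable", "Tolerable", "Tolerable", "Bad", "Bad"],
    "IssueType": ["Extra Spaces", np.nan,
                  "Extra Spaces", "CS_Font Incosistency",
                  "ML-OrphanContent", "Extra Spaces"],
    "isBad": ["Tolerable", "Yes", "Tolerable", "Tolerable", "Yes", "Tolerable"],
})

# True for every row whose Filename has at least one isBad == 'Yes' row.
has_yes = df["isBad"].eq("Yes").groupby(df["Filename"]).transform("any")

# True for rows that are themselves 'No' or 'Tolerable'.
not_bad = df["isBad"].isin(["No", "Tolerable"])

# Series.mask replaces values with NaN where the condition is True.
df["IssueType"] = df["IssueType"].mask(has_yes & not_bad)
print(df)
```

In this sketch the E0CCG5S239-0 rows keep their IssueType because no row of that file has isBad equal to 'Yes'; the 'Tolerable' rows of the other two files are cleared.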