I have a DataFrame with these values:
+--------+-------+--------------+-----+
|tag_html|tag_css|tag_javascript|count|
+--------+-------+--------------+-----+
| 0.0| 0.0| 0.0| 8655|
| 1.0| 0.0| 0.0| 141|
| 0.0| 0.0| 1.0| 782|
| 1.0| 0.0| 1.0| 107|
| 0.0| 1.0| 0.0| 96|
| 0.0| 1.0| 1.0| 20|
| 1.0| 1.0| 1.0| 46|
| 1.0| 1.0| 0.0| 153|
+--------+-------+--------------+-----+
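I want only the rows where the "1" is not repeated in any other column, i.e. rows where exactly one tag column is 1, like this: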
+--------+-------+--------------+-----+
|tag_html|tag_css|tag_javascript|count|
+--------+-------+--------------+-----+
| 1.0| 0.0| 0.0| 141|
| 0.0| 0.0| 1.0| 782|
| 0.0| 1.0| 0.0| 96|
+--------+-------+--------------+-----+
I used the where() function:
df['count'].where(((df['tag_html'] == 1) | (df['tag_css'] == 0) | (df['tag_javascript'] == 0)) &
                  ((df['tag_html'] == 0) | (df['tag_css'] == 1) | (df['tag_javascript'] == 0)) &
                  ((df['tag_html'] == 0) | (df['tag_css'] == 0) | (df['tag_javascript'] == 1)))
This is the result:
0 8655.0
1 141.0
2 782.0
3 NaN
4 96.0
5 NaN
6 46.0
7 NaN
Is there a better way to do this in pandas or PySpark?
Answer 0 (score: 0)
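Use mask with a boolean condition. Count how many of the first three columns equal 1 in each row, then replace count with NaN wherever that number is greater than 1: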
df = df.assign(count=df['count'].mask(df.iloc[:, :3].eq(1).sum(axis=1).gt(1)))
df
Out[513]:
tag_html tag_css tag_javascript count
0 0.0 0.0 0.0 8655.0
1 1.0 0.0 0.0 141.0
2 0.0 0.0 1.0 782.0
3 1.0 0.0 1.0 NaN
4 0.0 1.0 0.0 96.0
5 0.0 1.0 1.0 NaN
6 1.0 1.0 1.0 NaN
7 1.0 1.0 0.0 NaN
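Note that the mask above only blanks out count and still keeps row 0, where all three tags are 0, while the desired output in the question lists only the three rows with a single 1. If the goal is to keep just those rows, a minimal sketch could filter on "exactly one tag equals 1" instead. The names tag_cols, ones_per_row and exactly_one are illustrative and do not appear in the original post.

```python
import pandas as pd

# Rebuild the frame from the question.
df = pd.DataFrame({
    "tag_html":       [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    "tag_css":        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    "tag_javascript": [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    "count":          [8655, 141, 782, 107, 96, 20, 46, 153],
})

# Hypothetical helper names, chosen for this example only.
tag_cols = ["tag_html", "tag_css", "tag_javascript"]

# Number of tag columns equal to 1 in each row.
ones_per_row = df[tag_cols].eq(1).sum(axis=1)

# Keep only the rows where exactly one tag is set.
exactly_one = df[ones_per_row.eq(1)]
print(exactly_one)
```

Selecting the columns by name rather than with iloc[:, :3] keeps the filter working if the column order changes.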