How do I filter pandas or PySpark dataframe values in a column?

Asked: 2017-12-10 00:29:55

Tags: python pandas pyspark

I have a dataframe with these values:

+--------+-------+--------------+-----+
|tag_html|tag_css|tag_javascript|count|
+--------+-------+--------------+-----+
|     0.0|    0.0|           0.0| 8655|
|     1.0|    0.0|           0.0|  141|
|     0.0|    0.0|           1.0|  782|
|     1.0|    0.0|           1.0|  107|
|     0.0|    1.0|           0.0|   96|
|     0.0|    1.0|           1.0|   20|
|     1.0|    1.0|           1.0|   46|
|     1.0|    1.0|           0.0|  153|
+--------+-------+--------------+-----+

I want only the rows in which "1" does not repeat across the tag columns (exactly one tag column is 1), like this:

+--------+-------+--------------+-----+
|tag_html|tag_css|tag_javascript|count|
+--------+-------+--------------+-----+
|     1.0|    0.0|           0.0|  141|
|     0.0|    0.0|           1.0|  782|
|     0.0|    1.0|           0.0|   96|
+--------+-------+--------------+-----+

I did the job using the where() function:

df['count'].where(((df['tag_html'] == 1) | (df['tag_css'] == 0) | (df['tag_javascript'] == 0)) &
                  ((df['tag_html'] == 0) | (df['tag_css'] == 1) | (df['tag_javascript'] == 0)) &
                  ((df['tag_html'] == 0) | (df['tag_css'] == 0) | (df['tag_javascript'] == 1)))

This is the result:

0    8655.0
1     141.0
2     782.0
3       NaN
4      96.0
5       NaN
6      46.0
7       NaN
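Note that where() keeps every row and fills the non-matching ones with NaN; also, the three clauses above only exclude rows with exactly two 1s, which is why the all-ones row (count 46) slips through. A minimal sketch, rebuilding the example frame, that states the condition directly as "at most one tag equals 1" and drops the NaN rows with dropna():

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'tag_html':       [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    'tag_css':        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    'tag_javascript': [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    'count':          [8655, 141, 782, 107, 96, 20, 46, 153],
})

# Keep 'count' only where at most one tag column equals 1,
# then drop the NaN rows that where() leaves behind.
at_most_one = df.iloc[:, :3].eq(1).sum(axis=1) <= 1
result = df['count'].where(at_most_one).dropna()
print(result)
```

This keeps the all-zero row (count 8655) as well; tighten the condition to `== 1` if that row should also go.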

Is there a better way to do this in pandas or PySpark?

1 Answer:

Answer 0 (score: 0)

Use mask with a boolean condition: null out 'count' wherever more than one tag column equals 1.

df=df.assign(count=df['count'].mask(df.iloc[:,:3].eq(1).sum(1).gt(1)))
df
Out[513]: 
   tag_html  tag_css  tag_javascript   count
0       0.0      0.0             0.0  8655.0
1       1.0      0.0             0.0   141.0
2       0.0      0.0             1.0   782.0
3       1.0      0.0             1.0     NaN
4       0.0      1.0             0.0    96.0
5       0.0      1.0             1.0     NaN
6       1.0      1.0             1.0     NaN
7       1.0      1.0             0.0     NaN
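If the goal is the full rows shown in the desired output (exactly one tag set to 1), plain boolean indexing filters them directly instead of leaving NaN behind. A sketch using the same example data:

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    'tag_html':       [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    'tag_css':        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    'tag_javascript': [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0],
    'count':          [8655, 141, 782, 107, 96, 20, 46, 153],
})

# Keep only the rows where exactly one tag column equals 1.
out = df[df.iloc[:, :3].eq(1).sum(axis=1) == 1]
print(out)
```

In PySpark, the analogue would be `df.filter(F.col('tag_html') + F.col('tag_css') + F.col('tag_javascript') == 1)` (assuming `from pyspark.sql import functions as F`), since the tags are 0/1 valued.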