DataFrame: conditional replacement

Time: 2018-06-11 20:19:49

Tags: python pandas dataframe

I have a DataFrame of malicious URLs in which the url_type column has the static value 2:

    url_type    url_type_txt
    2           phishing/fraud
    2           trojan
    2           trojan
    2           phishing

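I need to replace 2 with 1 in the url_type column wherever url_type_txt contains the word %phish% (it can be "phishing", "phishing url", etc.). I tried using loc in a for loop, for example: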

df3.loc[df3.url_type_txt=="phish", "url_type"] = 1

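But this is not a suitable solution: the == comparison matches only rows whose text is exactly "phish", not rows that merely contain it.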

Can anyone help me? Thanks!

2 answers:

Answer 0: (score: 1)

Use str.lower (to make sure you catch phish as well as Phish) and str.contains():

df3.loc[df3.url_type_txt.str.lower().str.contains('phish'), 'url_type'] = 1
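
As a self-contained sketch of the str.lower / str.contains idea, the snippet below rebuilds the sample frame from the question. The frame name `df` and the capitalized "Phishing" in the last row are assumptions made for illustration. Passing `case=False` is used here as an alternative to calling `str.lower()` first, and `na=False` treats missing text as a non-match.

```python
import pandas as pd

# Sample data modeled on the question; the last row is capitalized
# on purpose to show that the match ignores case.
df = pd.DataFrame({
    "url_type": [2, 2, 2, 2],
    "url_type_txt": ["phishing/fraud", "trojan", "trojan", "Phishing"],
})

# Boolean mask: True where the text contains "phish", ignoring case.
# na=False makes rows with missing text count as non-matches.
mask = df["url_type_txt"].str.contains("phish", case=False, na=False)

# Overwrite url_type only in the matching rows.
df.loc[mask, "url_type"] = 1
print(df)
```

Unlike the `==` comparison in the question, which only matches rows whose text is exactly "phish", the mask here is a substring test.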

Answer 1: (score: 1)

When working with strings, a plain list comprehension can be faster than the vectorized pandas methods:

In [5]: df.loc[['phish' in u for u in df.url_type_txt], 'url_type'] = 1

In [6]: df
Out[6]:
   url_type    url_type_txt
0         1  phishing/fraud
1         2          trojan
2         2          trojan
3         1        phishing
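
One caveat with the list comprehension: `'phish' in u` raises a TypeError when `u` is a missing value, because NaN is a float rather than a string. The sketch below guards against that with `isinstance`. The NaN row and the capitalized "Phishing" are assumptions added for illustration and are not part of the question's data.

```python
import numpy as np
import pandas as pd

# Sample frame with one missing value (hypothetical; the original data has none).
df = pd.DataFrame({
    "url_type": [2, 2, 2, 2],
    "url_type_txt": ["phishing/fraud", "trojan", np.nan, "Phishing"],
})

# Skip non-string entries, then do a case-insensitive substring test.
mask = [isinstance(u, str) and "phish" in u.lower() for u in df["url_type_txt"]]

# A plain list of booleans is accepted by .loc as a row selector.
df.loc[mask, "url_type"] = 1
print(df)
```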

Timings for a 40,000-row DF:

In [7]: df = pd.concat([df] * 10**4, ignore_index=True)

In [8]: df.shape
Out[8]: (40000, 2)

In [9]: %timeit df.loc[df.url_type_txt.str.lower().str.contains('phish'), 'url_type']
103 ms ± 875 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit df.loc[['phish' in u for u in df.url_type_txt], 'url_type']
10.7 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit df.loc[['phish' in u.lower() for u in df.url_type_txt], 'url_type']
19.3 ms ± 48.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit df.loc[['phish' in u for u in df.url_type_txt.str.lower()], 'url_type']
41.1 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
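
The timings above rely on IPython's %timeit magic. Outside IPython, roughly the same comparison can be sketched with the standard-library timeit module. The function names `vectorized` and `comprehension` are made up for this example, and the absolute numbers depend on the machine and the pandas version, so no expected output is shown.

```python
import timeit

import pandas as pd

# Rebuild the 4-row sample and repeat it to get 40,000 rows.
base = pd.DataFrame({
    "url_type": [2, 2, 2, 2],
    "url_type_txt": ["phishing/fraud", "trojan", "trojan", "phishing"],
})
df = pd.concat([base] * 10**4, ignore_index=True)


def vectorized():
    # pandas string methods: lower-case, then substring test
    return df.loc[df.url_type_txt.str.lower().str.contains("phish"), "url_type"]


def comprehension():
    # plain Python loop over the column values
    return df.loc[["phish" in u.lower() for u in df.url_type_txt], "url_type"]


for fn in (vectorized, comprehension):
    loops = 5
    seconds = timeit.timeit(fn, number=loops) / loops
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per loop")
```

Both functions select the same rows, so the comparison only measures how the mask is built.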