I have a data frame of malicious URLs in which the url_type column holds the static value 2:
url_type url_type_txt
2 phishing/fraud
2 trojan
2 trojan
2 phishing
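I need to change url_type from 2 to 1 in the rows where url_type_txt contains the word %phish% (that is, any text with phish in it, such as "phishing", "phishing URL", etc.).
I tried using loc inside a for loop, for example: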
df3.loc[df3.url_type_txt=="phish", "url_type"] = 1
But this is not a suitable solution, because == only matches rows whose text is exactly "phish".
Can anyone help me? Thanks!
Answer 0 (score: 1)
Use str.lower (to make sure you catch phish as well as Phish) and str.contains():
df.loc[df.url_type_txt.str.lower().str.contains('phish'), 'url_type'] = 1
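For anyone who wants to try the case-insensitive substring match end to end, here is a minimal sketch. The sample frame is rebuilt from the question, except that the last row is capitalized as "Phishing" (my assumption, purely to show the case handling). It uses the case=False and na=False arguments of str.contains in place of a separate str.lower call, which should give the same result on this data.

```python
import pandas as pd

# Sample data modeled on the question; "Phishing" is capitalized on purpose
# (an assumption for illustration, not part of the original data).
df = pd.DataFrame({
    "url_type": [2, 2, 2, 2],
    "url_type_txt": ["phishing/fraud", "trojan", "trojan", "Phishing"],
})

# case=False ignores letter case; na=False treats missing text as "no match".
mask = df["url_type_txt"].str.contains("phish", case=False, na=False)

# Overwrite url_type only in the rows selected by the boolean mask.
df.loc[mask, "url_type"] = 1
print(df)
```

Note that str.contains treats the pattern as a regular expression by default, so a pattern with special characters would probably need regex=False.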
Answer 1 (score: 1)
When working with strings, a plain list comprehension can be faster than the vectorized Pandas methods:
In [5]: df.loc[['phish' in u for u in df.url_type_txt], 'url_type'] = 1
In [6]: df
Out[6]:
url_type url_type_txt
0 1 phishing/fraud
1 2 trojan
2 2 trojan
3 1 phishing
Timing on a 40,000-row DF:
In [7]: df = pd.concat([df] * 10**4, ignore_index=True)
In [8]: df.shape
Out[8]: (40000, 2)
In [9]: %timeit df.loc[df.url_type_txt.str.lower().str.contains('phish'), 'url_type']
103 ms ± 875 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [10]: %timeit df.loc[['phish' in u for u in df.url_type_txt], 'url_type']
10.7 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.loc[['phish' in u.lower() for u in df.url_type_txt], 'url_type']
19.3 ms ± 48.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [12]: %timeit df.loc[['phish' in u for u in df.url_type_txt.str.lower()], 'url_type']
41.1 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
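One caveat with the list comprehension: if url_type_txt can contain missing values, `'phish' in u` raises a TypeError on NaN. Here is a hedged sketch of a guarded version. The NaN row is my own addition for illustration, and the isinstance check is just one way to handle it.

```python
import numpy as np
import pandas as pd

# Same shape as the question's data, with one missing value and one
# capitalized entry added as assumptions for illustration.
df = pd.DataFrame({
    "url_type": [2, 2, 2, 2],
    "url_type_txt": ["phishing/fraud", "trojan", np.nan, "Phishing"],
})

# Non-string entries (such as NaN) are skipped instead of raising TypeError.
mask = [isinstance(u, str) and "phish" in u.lower() for u in df["url_type_txt"]]

# A plain list of booleans with one entry per row works as a .loc row selector.
df.loc[mask, "url_type"] = 1
print(df)
```

The extra check costs a little time per row, so it is worth re-running %timeit on your own data before picking a variant.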