Question

I have a DataFrame of malicious URLs in which the url_type column holds the static value 2:

    url_type    url_type_txt
    2           phishing/fraud
    2           trojan
    2           trojan
    2           phishing

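I need to replace 2 with 1 in the url_type column wherever url_type_txt contains the word %phish% (it can be "phishing", "phishing url", etc.). I tried using loc in a for loop, for example: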

df3.loc[df3.url_type_txt=="phish", "url_type"] = 1

But this is not a suitable solution.

Can anyone help me? Thanks!

Answer 1

Use str.lower (so that capitalized variants such as Phish are matched too) and str.contains():

df.loc[df.url_type_txt.str.lower().str.contains('phish'), 'url_type'] = 1
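
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the case-insensitive substring match. It is not the answerer's exact code: it swaps str.lower() for the case=False argument of str.contains(), and adds na=False so that missing values count as non-matches. The sample frame mirrors the question, except for the mixed-case row, which is an assumption added for illustration.

```python
import pandas as pd

# Sample data mirroring the question; the last (mixed-case) row is an
# illustrative assumption, not part of the original data.
df = pd.DataFrame({
    "url_type": [2, 2, 2, 2, 2],
    "url_type_txt": ["phishing/fraud", "trojan", "trojan", "phishing", "Phish URL"],
})

# case=False makes the match case-insensitive;
# na=False treats missing values as "does not contain".
mask = df["url_type_txt"].str.contains("phish", case=False, na=False)

# Overwrite url_type only on the matching rows.
df.loc[mask, "url_type"] = 1
print(df)
```

Building the mask as a separate variable also makes it easy to inspect which rows will be changed before assigning.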

Answer 2

When working with strings, a plain list comprehension may be faster than the vectorized pandas string methods:

In [5]: df.loc[['phish' in u for u in df.url_type_txt], 'url_type'] = 1

In [6]: df
Out[6]:
   url_type    url_type_txt
0         1  phishing/fraud
1         2          trojan
2         2          trojan
3         1        phishing
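
One caveat with the comprehension: the `in` test raises a TypeError if the column contains a missing value (None or NaN). The sketch below is one possible guard, assuming such values should simply count as non-matches; the sample rows with None and with a capital letter are assumptions added for illustration.

```python
import pandas as pd

# Illustrative data: one missing value and one capitalized entry (both assumed).
df = pd.DataFrame({
    "url_type": [2, 2, 2, 2],
    "url_type_txt": ["phishing/fraud", "trojan", None, "Phishing"],
})

# isinstance() skips None/NaN entries; lower() makes the test case-insensitive.
mask = [isinstance(u, str) and "phish" in u.lower() for u in df["url_type_txt"]]

# A plain list of booleans is accepted by .loc as a row selector.
df.loc[mask, "url_type"] = 1
print(df)
```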

Timing for a 40,000-row DataFrame:

In [7]: df = pd.concat([df] * 10**4, ignore_index=True)

In [8]: df.shape
Out[8]: (40000, 2)

In [9]: %timeit df.loc[df.url_type_txt.str.lower().str.contains('phish'), 'url_type']
103 ms ± 875 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [10]: %timeit df.loc[['phish' in u for u in df.url_type_txt], 'url_type']
10.7 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit df.loc[['phish' in u.lower() for u in df.url_type_txt], 'url_type']
19.3 ms ± 48.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [12]: %timeit df.loc[['phish' in u for u in df.url_type_txt.str.lower()], 'url_type']
41.1 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
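
The comparison can also be reproduced outside IPython with the standard-library timeit module. This is only a rough sketch: the function names are hypothetical, it times just the construction of the mask rather than the full df.loc selection measured above, and the numbers will vary with the machine and the pandas version.

```python
import timeit

import pandas as pd

# Rebuild the 40,000-row frame from the four sample rows in the question.
base = pd.DataFrame({
    "url_type": [2, 2, 2, 2],
    "url_type_txt": ["phishing/fraud", "trojan", "trojan", "phishing"],
})
df = pd.concat([base] * 10**4, ignore_index=True)


def vectorized():
    # pandas string methods: lowercase, then substring test
    return df["url_type_txt"].str.lower().str.contains("phish")


def comprehension():
    # plain Python loop over the column values
    return ["phish" in u.lower() for u in df["url_type_txt"]]


# Sanity check: both approaches must select exactly the same rows.
same_rows = vectorized().tolist() == comprehension()
print("same rows selected:", same_rows)

for fn in (vectorized, comprehension):
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per call")
```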

dataframe: conditional replacement