I am working with a dataset of request URLs (strings) in a pandas DataFrame that looks like this:
df
request_url count
0 https://login.microsoftonline.com 24521
1 https://dt.adsafeprotected.com 11521
2 http://209.53.113.23/ 225211
3 https://googleads.g.doubleclick.net 6252
4 https://fls-na.amazon.com 65225
5 https://v10.vortex-win.data.microsoft.com 7852222
6 https://ib.adnxs.com 12
7 http://177.41.65.207/read.txt 188
Desired output:
newdf
request_url count
0 https://login.microsoftonline.com 24521
1 https://dt.adsafeprotected.com 11521
2 https://googleads.g.doubleclick.net 6252
3 https://fls-na.amazon.com 65225
4 https://v10.vortex-win.data.microsoft.com 7852222
5 https://ib.adnxs.com 12
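I will then run the tld library on the data. I want to remove these rows because tld does not know how to handle an IP address in place of a domain. Is there a simple way to drop the rows that contain an IP address from the DataFrame?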
Answer 0 (score: 3)
Create a function that checks each row, then filter by the result:
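import re

def has_no_ip(row):
    # True when the URL does not start with an IPv4 host (http or https)
    return re.match(r"https?://\d+\.\d+\.\d+\.\d+", row["request_url"]) is None

# Keep only the rows whose URL is not IP-based
newdf = df[df.apply(has_no_ip, axis=1)]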
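As an alternative to matching digits with a regex, the host can be parsed out of each URL and validated with the standard library. The following is only a sketch: the helper name is_ip_url is hypothetical, and the sample list simply reuses a few URLs from the question.

```python
from ipaddress import ip_address
from urllib.parse import urlsplit


def is_ip_url(url: str) -> bool:
    """Return True when the host part of `url` is a literal IPv4/IPv6 address."""
    host = urlsplit(url).hostname  # None when the URL has no network location
    if host is None:
        return False
    try:
        ip_address(host)  # raises ValueError for anything that is not an IP
    except ValueError:
        return False
    return True


# A few URLs taken from the question's data
urls = [
    "https://login.microsoftonline.com",
    "http://209.53.113.23/",
    "http://177.41.65.207/read.txt",
    "https://ib.adnxs.com",
]
kept = [u for u in urls if not is_ip_url(u)]
print(kept)

# With pandas, the same predicate can drive a boolean mask (assumes `df` from the question):
# newdf = df[~df["request_url"].map(is_ip_url)]
```

Because the check looks only at the parsed host, it is independent of the scheme and ignores digits that appear in the path.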
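Answer 1 (score: 2)
You can use str.findall with the regular expression [0-9]+(?:\.[0-9]+){3} to look for an IP address in each URL. astype(bool) then converts every empty list (no match) to False and every non-empty list to True, and ~ inverts the mask so that only rows without an IP are kept: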
df[~df.request_url.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]
Out[908]:
request_url
0 https://login.microsoftonline.com
1 https://dt.adsafeprotected.com
3 https://googleads.g.doubleclick.net
4 https://fls-na.amazon.com
5 https://v10.vortex-win.data.microsoft.com
6 https://ib.adnxs.com
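One caveat with an unanchored pattern is that it matches four dot-separated numbers anywhere in the string, including the path. A possible refinement, offered only as a sketch, is to anchor the pattern to the host position. The name IP_HOST and the example.com URL below are made up for illustration, and the pattern does not check that each octet is at most 255.

```python
import re

# Unanchored pattern from the answer: matches anywhere in the string
LOOSE = re.compile(r"[0-9]+(?:\.[0-9]+){3}")

# Anchored sketch: scheme, four dot-separated groups of 1-3 digits,
# an optional port, then a path/query/fragment separator or end of string
IP_HOST = re.compile(r"^https?://\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?(?:[/?#]|$)")

ip_url = "http://177.41.65.207/read.txt"               # from the question
versioned = "https://example.com/v1.2.3.4/notes.txt"   # hypothetical URL

print(bool(LOOSE.search(ip_url)), bool(IP_HOST.match(ip_url)))
print(bool(LOOSE.search(versioned)), bool(IP_HOST.match(versioned)))

# With pandas (assumes `df` from the question):
# newdf = df[~df["request_url"].str.contains(IP_HOST.pattern, regex=True)]
```

For the data shown in the question both patterns select the same rows, so the difference only matters if other URLs carry version-like numbers in their paths.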