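I'm having some trouble checking URLs as strings. I need to check whether a URL belongs to a domain in a whitelist, but the check fails. I'd like to understand why, and whether my code is missing something.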
whitelist = []
whitelist_file = open(whitelist_file, 'r')
url = whitelist_file.readline()
for url in whitelist_file:
    whitelist = whitelist + [str(url)]
whitelist_file.close()
test_file = open(test_file, 'r')
url_to_check = test_file.readlines()
for url in url_to_check:
    for word in whitelist:
        print(str(word), str(url), word in url)
        print("-----")
Here is the printed output of the code above (so you have a sample of the strings being checked). You can see that a2a.eu fails:
a2a.eu
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
False
-----
ansa.it
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
False
-----
atlantia.it
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
False
-----
azimut-group.com
https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
False
-----
a2a.eu
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
False
-----
ansa.it
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
False
-----
atlantia.it
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
False
-----
azimut-group.com
https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
False
-----
a2a.eu
http://www.a2a.eu/en
False
-----
ansa.it
http://www.a2a.eu/en
False
-----
atlantia.it
http://www.a2a.eu/en
False
-----
azimut-group.com
http://www.a2a.eu/en
False
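Thanks
Answer 0 (score: 0)
First, judging by your output, this check should return True in some cases. I'm going purely by the printed output here. I suspect that your url or word (in the whitelist) is not the string object you think it is; try converting both to str in the print statement: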
print(str(word), str(url), str(word) in str(url))
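Also, it looks like you only want to check the domain name. Take a look at urllib (https://docs.python.org/3/library/urllib.html), which lets you parse the URL down to its domain part and check that instead: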
from urllib.parse import urlparse
print(str(word), str(url), str(word) in urlparse(str(url)).hostname)
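As a rough sketch of that idea, one could compare the parsed hostname against each whitelisted domain and accept either an exact match or a subdomain. The helper name `is_whitelisted` is hypothetical (it is not from the original post), and the sample whitelist and URLs are copied from the question's output. Unlike a plain substring test, this does not match a domain that only appears in the path or as the tail of a different host such as `nota2a.eu`:

```python
from urllib.parse import urlparse


def is_whitelisted(url, whitelist):
    """Return True if the URL's host is a whitelisted domain or a subdomain of one."""
    # strip() removes the trailing newline left over from reading a file
    hostname = urlparse(url.strip()).hostname
    if hostname is None:  # e.g. a line without a scheme such as "a2a.eu/en"
        return False
    hostname = hostname.lower()
    for domain in whitelist:
        domain = domain.strip().lower()
        if not domain:
            continue  # ignore blank whitelist lines
        if hostname == domain or hostname.endswith("." + domain):
            return True
    return False


whitelist = ["a2a.eu", "ansa.it", "atlantia.it", "azimut-group.com"]
print(is_whitelisted("https://www.a2a.eu/en/2017-financial-calendar-a2a-spa\n", whitelist))  # True
print(is_whitelisted("https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html\n", whitelist))  # False
```

Whether subdomains should count as whitelisted is a design choice; drop the `endswith` branch if only exact hosts are allowed.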
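Answer 1 (score: 0)
The url on line 5 contains a trailing newline character, because each line read from the file keeps its line ending. The test is therefore looking for "a2a.eu" followed by a newline, which never occurs inside the URL. Calling strip() should fix it: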
whitelist = []
whitelist_file = open(whitelist_file, 'r')
url = whitelist_file.readline()
for url in whitelist_file:
    # strip() removes the trailing newline from each whitelist entry
    whitelist = whitelist + [str(url.strip())]
whitelist_file.close()
test_file = open(test_file, 'r')
url_to_check = test_file.readlines()
for url in url_to_check:
    for word in whitelist:
        print(str(word), str(url), word in url)
        print("-----")
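To make the newline visible, here is a small illustration. It is only a sketch: `load_lines` is a hypothetical helper, `io.StringIO` stands in for the real whitelist file, and the sample strings are copied from the question's output.

```python
import io


def load_lines(handle):
    """Return the non-empty lines of an open text file, stripped of whitespace."""
    return [line.strip() for line in handle if line.strip()]


# io.StringIO stands in for the whitelist file (an assumption for this demo).
whitelist_file = io.StringIO("a2a.eu\nansa.it\natlantia.it\nazimut-group.com\n")
raw_entry = whitelist_file.readline()
print(repr(raw_entry))  # repr() shows the hidden newline: 'a2a.eu\n'

url = "http://www.a2a.eu/en\n"
print(raw_entry in url)          # False: the URL has "/" after "a2a.eu", not a newline
print(raw_entry.strip() in url)  # True once the newline is removed

whitelist_file.seek(0)  # rewind so that every line is read again
whitelist = load_lines(whitelist_file)
print(whitelist)
```

Printing with repr() instead of str() is a quick way to spot invisible characters like this when a string comparison fails unexpectedly.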