Question

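I'm having some trouble handling URLs as strings. I need to check whether a URL belongs to one of the domains in a whitelist, but the check fails. I'd like to understand why, and whether my code is missing something.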

whitelist = []
whitelist_file = open(whitelist_file, 'r')
url = whitelist_file.readline()
for url in whitelist_file:
    whitelist = whitelist + [str(url)]
whitelist_file.close()

test_file = open(test_file, 'r')
url_to_check = test_file.readlines()

for url in url_to_check:
    for word in whitelist:
        print(str(word), str(url), word in url)
        print("-----")

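Here is the printed output of the code above (so you have a sample of the strings being checked). You can see that the check fails for a2a.eu.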

a2a.eu
 https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
 False
-----
ansa.it
 https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
 False
-----
atlantia.it
 https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
 False
-----
azimut-group.com
 https://www.medgadget.com/2017/10/adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html
 False
-----
a2a.eu
 https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
 False
-----
ansa.it
 https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
 False
-----
atlantia.it
 https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
 False
-----
azimut-group.com
 https://www.a2a.eu/en/2017-financial-calendar-a2a-spa
 False
-----
a2a.eu
 http://www.a2a.eu/en
 False
-----
ansa.it
 http://www.a2a.eu/en
 False
-----
atlantia.it
 http://www.a2a.eu/en
 False
-----
azimut-group.com
 http://www.a2a.eu/en
 False

Thanks

Answer 1

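First, judging by some of the cases in your output, this check should have produced True. That is only a judgment based on the printed output, though. I suspect that your URLs or words (in the whitelist) are not the string objects you think they are; try converting them to str in the print statement: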

  print(str(word), str(url), str(word) in str(url))
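
One way to test this suspicion is to print the values with repr(), which makes otherwise invisible characters visible. This is only a minimal sketch: the sample strings are copied from the question's output, and the trailing "\n" on each is an assumption about what iterating over the files returned.

```python
# Sample values taken from the question's output. The trailing "\n" is an
# assumption about what reading the files line by line produced.
word = "a2a.eu\n"
url = "http://www.a2a.eu/en\n"

# repr() shows quotes and escape sequences, so hidden characters stand out.
print(repr(word), repr(url), word in url)

# With surrounding whitespace removed, the substring test succeeds.
print(repr(word.strip()), repr(url.strip()), word.strip() in url.strip())
```

If the first line shows an escape sequence such as \n inside the quotes, the values are strings but carry extra whitespace, and converting with str() alone would not change the result.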

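Also, it looks like you only want to check the domain. Take a look at urllib (https://docs.python.org/3/library/urllib.html), which lets you parse the URL down to its domain part and check against that: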

  from urllib.parse import urlparse
  print(str(word), str(url), str(word) in urlparse(str(url)).hostname)
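
Building on the urlparse idea, a domain check could be sketched roughly as follows. The function name is_whitelisted is hypothetical, and matching the hostname exactly or as a subdomain is an assumption about what "belongs to a whitelisted domain" should mean here; the sample domains and URLs come from the question's output.

```python
from urllib.parse import urlparse


def is_whitelisted(url, whitelist):
    """Return True if the URL's hostname equals a whitelisted domain
    or is a subdomain of one (hypothetical helper, not from the thread)."""
    hostname = urlparse(url.strip()).hostname
    if hostname is None:
        # No network location was parsed, e.g. the URL has no scheme.
        return False
    return any(
        hostname == domain or hostname.endswith("." + domain)
        for domain in whitelist
    )


whitelist = ["a2a.eu", "ansa.it", "atlantia.it", "azimut-group.com"]
print(is_whitelisted("https://www.a2a.eu/en/2017-financial-calendar-a2a-spa", whitelist))
print(is_whitelisted(
    "https://www.medgadget.com/2017/10/"
    "adenosine-a2a-receptor-antagonist-pipeline-insights-2017.html",
    whitelist,
))
```

Comparing against the hostname rather than the whole URL avoids accidental matches when a whitelisted domain happens to appear in the path or as the tail of a different domain name.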

Answer 2

The URLs in line 5 contain a newline character. Calling strip() should fix it:

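whitelist = []
whitelist_file = open(whitelist_file, 'r')
url = whitelist_file.readline()  # reads and discards the first line, as in the question's code
for url in whitelist_file:
  # strip() removes the trailing newline from each line
  whitelist = whitelist + [str(url.strip())]
whitelist_file.close()  # close the file only after the loop has finished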

test_file = open(test_file, 'r')
url_to_check = test_file.readlines()

for url in url_to_check:
  for word in whitelist:
    print(str(word), str(url), word in url)
    print("-----")
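
The same fix can also be written with a with block, so the files are closed automatically. This is a minimal sketch under a few assumptions: read_lines and check_urls are hypothetical helper names, blank lines are skipped, and unlike the question's code the first line of the whitelist file is not discarded.

```python
def read_lines(path):
    """Return the non-empty lines of a text file, stripped of
    surrounding whitespace (hypothetical helper)."""
    with open(path, "r") as handle:  # closed automatically on exit
        return [line.strip() for line in handle if line.strip()]


def check_urls(whitelist_path, urls_path):
    """Yield (domain, url, matched) for every whitelist/URL pair,
    using the same substring test as the question."""
    whitelist = read_lines(whitelist_path)
    for url in read_lines(urls_path):
        for domain in whitelist:
            yield domain, url, domain in url
```

It could then be used by looping over check_urls(whitelist_path, urls_path) and printing each tuple, where both paths point to your own files.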

Checking URLs (strings)
