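Before I added the 2 if statements to the 3rd code block, I was getting almost the same error: it could not concatenate str and NoneType.
However, when I uncomment the print statement inside the 3rd if statement, it prints a list of URLs with paths.
I have also tried this on other sites, so it is not only this one that fails.
Here is my traceback: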
Traceback (most recent call last):
  File "linkcrawler.py", line 24, in <module>
    newurl = "http://" + b1 + b2
TypeError: cannot concatenate 'str' and 'NoneType' objects
Traceback (most recent call last):
  File "linkcrawler.py", line 24, in <module>
    newurl = "http://" + b1 + b2
TypeError: cannot concatenate 'str' and 'NoneType' objects
I get two of these every time I run it.
import urllib
from bs4 import BeautifulSoup
import traceback
import urlparse
import mechanize

url = "http://www.dailymail.co.uk/home/index.html"
br = mechanize.Browser()
urls = [url]
visited = [url]
while len(urls)>0:
    try:
        br.open(urls[0])
        urls.pop(0)
        for link in br.links():
            newurl = urlparse.urljoin(link.base_url,link.url)
            b1 = urlparse.urlparse(newurl).hostname
            b2 = urlparse.urlparse(newurl).path
            newurl = "http://"+b1+b2
            if newurl not in visited and urlparse.urlparse(url).hostname in newurl:
                urls.append(newurl)
                visited.append(newurl)
                #print newurl
    except:
        traceback.print_exc()
        urls.pop(0)
print visited
Answer 0 (score: 0)
b1 or b2 is None. To fix this, check that b1 and b2 are neither empty nor None, and restructure the code:
b1 = urlparse.urlparse(newurl).hostname
b2 = urlparse.urlparse(newurl).path
if b1 and b2:
    newurl = "http://"+b1+b2
    if newurl not in visited and urlparse.urlparse(url).hostname in newurl:
        urls.append(newurl)
        visited.append(newurl)
        #print newurl
else:
    urls.pop(0)
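
To see which of the two values can actually be None, here is a minimal sketch that runs a few sample links through the same urljoin/urlparse steps. The helper name normalize_link and the sample hrefs are hypothetical and not part of the original script; the import falls back to the Python 2 urlparse module used in the question. It suggests that hostname is the value that comes back as None (for links such as mailto: or javascript: that have no network location), while path is always a string.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse      # Python 2, as in the question


def normalize_link(base_url, href):
    """Return "http://<hostname><path>" for a link, or None if it has no hostname.

    Hypothetical helper for illustration; not part of the original script.
    """
    parts = urlparse(urljoin(base_url, href))
    if parts.hostname is None:
        # e.g. mailto: or javascript: links, which carry no network location
        return None
    # path is always a string ('' when absent), so concatenation is safe here
    return "http://" + parts.hostname + parts.path


base = "http://www.dailymail.co.uk/home/index.html"
samples = ["/news/index.html", "mailto:someone@example.com", "javascript:void(0)"]
for href in samples:
    print("%s -> %s" % (href, normalize_link(base, href)))
```

Inside the crawl loop, a link for which this helper returns None could simply be skipped with continue. That may be preferable to the else branch's urls.pop(0), which removes a different, not-yet-visited URL from the queue. Note also that the check if b1 and b2 rejects URLs with an empty path (for example "http://example.com"), whereas testing only the hostname keeps them.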