I wrote some code to scrape only the hyperlinks ending in .ece. Here is my code:
import os
import requests
import urllib2
from bs4 import BeautifulSoup

_URL = 'http://www.thehindu.com/archive/web/2017/08/08/'
r = requests.get(_URL)
soup = BeautifulSoup(r.text)
urls = []
names = []
newpath = r'D:\fyp\data set'
os.chdir(newpath)
name = 'testecmlinks'
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    if _FULLURL.endswith('.ece'):
        urls.append(_FULLURL)
        names.append(soup.select('a')[i].attrs['href'])
names_urls = zip(names, urls)
for name, url in names_urls:
    print url
    rq = urllib2.Request(url)
    res = urllib2.urlopen(rq)
    pdf = open(name + '.txt', 'wb')
    pdf.write(res.read())
    pdf.close()
But I get the following error:
Traceback (most recent call last):
File "D:/fyp/scripts/test.py", line 18, in <module>
_FULLURL = _URL + link.get('href')
TypeError: cannot concatenate 'str' and 'NoneType' objects
Can you help me get the hyperlinks ending in .ece?
Answer 0 (score: 1)
Try this. It should get you all the hyperlinks ending in .ece from that page.
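The answer's code did not survive in this copy of the page. A minimal sketch of the idea, filtering on the .ece suffix while skipping anchors without an href; the HTML sample and link names here are invented stand-ins for the live archive page:

```python
from bs4 import BeautifulSoup
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2

_URL = 'http://www.thehindu.com/archive/web/2017/08/08/'
# Invented sample standing in for requests.get(_URL).text
html = '''
<a href="story1.ece">one</a>
<a>no href here</a>
<a href="page.html">not an ece link</a>
<a href="story2.ece">two</a>
'''

soup = BeautifulSoup(html, 'html.parser')
urls = []
for link in soup.find_all('a'):
    href = link.get('href')            # None for anchors without an href
    if href and href.endswith('.ece'):
        urls.append(urljoin(_URL, href))

print(urls)
```

Using urljoin also handles hrefs that are already absolute, which plain string concatenation does not.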
Answer 1 (score: 1)
The error means that link.get('href') returned None for some anchor. It is better to let Beautiful Soup do the link filtering in the for loop itself. Change the original code
...
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    if _FULLURL.endswith('.ece'):
        urls.append(_FULLURL)
        names.append(soup.select('a')[i].attrs['href'])
...
to this:
...
for i, link in enumerate(soup.find_all('a', href=re.compile(r'\.ece$'))):
...
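With that change, only anchors whose href matches the regex enter the loop, so None never reaches the concatenation. A self-contained sketch of the filtered loop; the HTML sample and link names are invented for illustration:

```python
import re
from bs4 import BeautifulSoup

_URL = 'http://www.thehindu.com/archive/web/2017/08/08/'
# Invented sample: one bare anchor, one non-.ece link, two .ece links
html = '<a href="a.ece">x</a><a>bare</a><a href="b.pdf">y</a><a href="c.ece">z</a>'

soup = BeautifulSoup(html, 'html.parser')
urls, names = [], []
for link in soup.find_all('a', href=re.compile(r'\.ece$')):
    href = link['href']                # guaranteed present by the filter
    names.append(href)
    urls.append(_URL + href)

print(names)  # ['a.ece', 'c.ece']
```

The href= keyword also drops anchors with no href at all, which is what triggered the original TypeError.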
Answer 2 (score: 0)
There are better solutions, but with your current code you first have to check that link.get('href') is not None before concatenating it to _URL:
for link in soup.findAll('a'):
    url = link.get('href')  # get `href` or `None`
    if url and url.endswith('.ece'):  # check `None` and `.ece`
        names_urls.append((_URL + url, url))
        # ... or directly download the file ...
        # rq = urllib2.Request(_URL + url)
        # res = urllib2.urlopen(rq)
        # pdf = open(url + '.txt', 'wb')
        # pdf.write(res.read())
        # pdf.close()
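A self-contained sketch of that check, collecting (url, name) pairs instead of downloading; the HTML sample is invented, and note that list.append takes a single argument, so the pair must be wrapped in a tuple:

```python
from bs4 import BeautifulSoup

_URL = 'http://www.thehindu.com/archive/web/2017/08/08/'
# Invented sample standing in for the fetched page
html = '<a href="one.ece">1</a><a>no href</a><a href="two.ece">2</a>'

soup = BeautifulSoup(html, 'html.parser')
names_urls = []
for link in soup.find_all('a'):
    url = link.get('href')             # None when the tag has no href
    if url and url.endswith('.ece'):   # the None check prevents the TypeError
        names_urls.append((_URL + url, url))  # one tuple argument

print(names_urls)
```

The `url and` short-circuit is what prevents the original "cannot concatenate 'str' and 'NoneType'" error: anchors without an href are skipped before any concatenation happens.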