Scrape hyperlinks ending in .ece from an HTML page using only BeautifulSoup

Date: 2018-01-08 07:33:16

Tags: python html web-scraping beautifulsoup

I wrote some code to scrape only the hyperlinks ending in .ece. Here is my code:

import os
import requests
import urllib2  # Python 2 (the code below also uses Python 2 print syntax)
from bs4 import BeautifulSoup

_URL = 'http://www.thehindu.com/archive/web/2017/08/08/'
r = requests.get(_URL)
soup = BeautifulSoup(r.text, 'html.parser')  # explicit parser avoids a bs4 warning
urls = []
names = []
newpath = r'D:\fyp\data set'
os.chdir(newpath)
name = 'testecmlinks'
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    if _FULLURL.endswith('.ece'):
        urls.append(_FULLURL)
        names.append(soup.select('a')[i].attrs['href'])

names_urls = zip(names, urls)

for name, url in names_urls:
    print url
    rq = urllib2.Request(url)
    res = urllib2.urlopen(rq)
    pdf = open(name+'.txt', 'wb')
    pdf.write(res.read())
    pdf.close()

But I get the following error:

Traceback (most recent call last):
  File "D:/fyp/scripts/test.py", line 18, in <module>
    _FULLURL = _URL + link.get('href')
TypeError: cannot concatenate 'str' and 'NoneType' objects

Can you help me get the hyperlinks ending in .ece?

3 Answers:

Answer 0 (score: 1)

Try this. Hopefully you will get all the hyperlinks ending in .ece from that page.
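A minimal sketch of that approach, using a CSS "ends with" attribute selector (the selector and the urljoin handling here are illustrative choices, not necessarily the answerer's exact code):

import requests
from bs4 import BeautifulSoup
from urlparse import urljoin  # Python 2; on Python 3: from urllib.parse import urljoin

_URL = 'http://www.thehindu.com/archive/web/2017/08/08/'
soup = BeautifulSoup(requests.get(_URL).text, 'html.parser')

# a[href$=".ece"] matches only <a> tags whose href ends with ".ece"
for link in soup.select('a[href$=".ece"]'):
    print(urljoin(_URL, link['href']))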

Answer 1 (score: 1)

The error means that link.get('href') returned None: not every <a> tag has an href attribute, so the concatenation with _URL fails.
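A tiny illustration with a made-up two-tag document (purely for demonstration):

from bs4 import BeautifulSoup

demo = BeautifulSoup('<a name="top"></a><a href="x.ece">x</a>', 'html.parser')
print([a.get('href') for a in demo.find_all('a')])  # -> [None, 'x.ece']

It is better to let Beautiful Soup do the link filtering instead of doing it inside the for loop. Change the original code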

...
for i, link in enumerate(soup.findAll('a')):
    _FULLURL = _URL + link.get('href')
    if _FULLURL.endswith('.ece'):
        urls.append(_FULLURL)
        names.append(soup.select('a')[i].attrs['href'])
...

to this:

...
import re  # needed for the regex filter below

for i, link in enumerate(soup.find_all('a', href=re.compile(r'\.ece$'))):
...
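Note that after this change i indexes the filtered list, not every <a> on the page, so the soup.select('a')[i] lookup from the original loop no longer lines up. Since the filter already guarantees each link has a matching href, the loop can read it directly; a self-contained sketch reusing the question's variable names:

import re

urls = []
names = []
# the href=re.compile(...) filter keeps only anchors whose href ends in ".ece"
for link in soup.find_all('a', href=re.compile(r'\.ece$')):
    href = link['href']  # guaranteed present by the filter
    urls.append(_URL + href)
    names.append(href)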

Answer 2 (score: 0)

You can find better solutions, but with your current code you have to check that link.get('href') is not None before concatenating it with _URL:

names_urls = []

for link in soup.findAll('a'):
    url = link.get('href')  # the `href` value, or `None` if the tag has none
    if url and url.endswith('.ece'):  # skip `None` and keep only '.ece' links
        names_urls.append((_URL + url, url))
        # ... or directly download the file ...
        # rq = urllib2.Request(_URL + url)
        # res = urllib2.urlopen(rq)
        # pdf = open(url + '.txt', 'wb')
        # pdf.write(res.read())
        # pdf.close()
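When actually writing the files, note that the href values usually contain slashes, so open(url + '.txt', 'wb') would fail. A sketch that flattens the name with os.path.basename (an illustrative choice, not part of the answer above):

import os
import urllib2  # Python 2, matching the question

for full_url, href in names_urls:
    filename = os.path.basename(href) + '.txt'  # strip path components for a valid flat filename
    res = urllib2.urlopen(full_url)
    with open(filename, 'wb') as f:
        f.write(res.read())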