Not able to get all href links with the BeautifulSoup module using Python 2.7

Asked: 2015-07-14 20:54:08

Tags: python beautifulsoup

Hello, I am using the following Python code to fetch all the URL links from a webpage:

from bs4 import BeautifulSoup
import urllib2

url = 'https://www.practo.com/delhi/dentist'
resp = urllib2.urlopen(url)
# Parse the response, using the charset declared in the HTTP headers
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset'))

# Print the href attribute of every <a> tag that has one
for link in soup.find_all('a', href=True):
    print link['href']

But the code above is not able to fetch all the links; as you can see below, only a few show up:

https://www.practo.com
/health/login
/for-doctors
javascript:void(0);
#
http://help.practo.com/practo-search/practo-relevance-algorithm
http://help.practo.com/practo-search/practo-ranking-algorithm
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara?subscription_id=416433&specialization=Dentist&show_all=true
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara?subscription_id=416433&specialization=Dentist&show_all=true
https://www.practo.com/delhi/clinic/prabhat-dental-care-shahdara#services

Can someone please help me understand why this is happening? Is there any other way to scrape all the links? Thanks in advance.

3 Answers:

Answer 0 (score: 0):

Try this:

import urllib2
import re

url = 'https://www.practo.com/delhi/dentist?page=1'
resp = urllib2.urlopen(url)
s = resp.read()
# The capture group keeps the closing quote out of the matches
regexp = r'(http[^"]*)"'
pattern = re.compile(regexp)
urls = re.findall(pattern, s)
for i in urls:
    print i
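
A regex like this only catches absolute URLs, so the relative links the original code printed (e.g. /health/login) are missed. A minimal variation, assuming the href values are double-quoted in the markup:

import urllib2
import re

url = 'https://www.practo.com/delhi/dentist?page=1'
s = urllib2.urlopen(url).read()

# Capture whatever sits inside href="...", relative or absolute
href_pattern = re.compile(r'href="([^"]*)"')
for link in sorted(set(re.findall(href_pattern, s))):  # set() removes duplicates
    print link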

Answer 1 (score: 0):

This will return all the http links from that site:

# Note: this uses the old BeautifulSoup 3 API (BeautifulSoup import, findAll)
from BeautifulSoup import BeautifulSoup
import urllib2

url = 'https://www.practo.com/delhi/dentist'
resp = urllib2.urlopen(url)
soup = BeautifulSoup(resp)
for i in soup.findAll('a', href=True):
    link = i['href']
    # Keep only absolute http/https links
    if link.startswith('http'):
        print link
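
Filtering on 'http' silently drops relative links such as /health/login. As a sketch, they can instead be resolved against the page URL with urljoin from the Python 2 standard library:

from bs4 import BeautifulSoup
from urlparse import urljoin
import urllib2

url = 'https://www.practo.com/delhi/dentist'
soup = BeautifulSoup(urllib2.urlopen(url))
for a in soup.find_all('a', href=True):
    href = a['href']
    # Skip javascript: pseudo-links and bare fragment anchors
    if href.startswith('javascript:') or href.startswith('#'):
        continue
    # urljoin turns relative paths into absolute URLs
    print urljoin(url, href)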

Answer 2 (score: 0):

Ran into the same problem, and was able to solve it by changing the parser used from lxml to html.parser:

#!/usr/bin/python3
from bs4 import BeautifulSoup
import urllib.request
import http.server

url = 'https://www.practo.com/delhi/dentist'  # the page to scrape

req = urllib.request.Request(url)
try:
    with urllib.request.urlopen(req) as response:
        html = response.read()
except urllib.error.HTTPError as e:
    errorMsg = http.server.BaseHTTPRequestHandler.responses[e.code][0]
    print("Cannot retrieve URL: {} : {}".format(str(e.code), errorMsg))
except urllib.error.URLError as e:
    print("Cannot retrieve URL: {}".format(e.reason))
except:
    print("Cannot retrieve URL: unknown error")

soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a', href=True):  # href=True skips anchors without an href
    print("Link: {}".format(link['href']))

You can read more about the different parsers in the BeautifulSoup documentation.
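
For reference, a minimal sketch of how a parser is chosen in BeautifulSoup 4; html.parser ships with Python, while lxml and html5lib must be installed separately:

from bs4 import BeautifulSoup

html = "<html><body><a href='/health/login'>Login</a></body></html>"

# Built-in parser, no extra dependencies
soup = BeautifulSoup(html, "html.parser")

# Faster C-based parser, requires the lxml package:
# soup = BeautifulSoup(html, "lxml")

# Most lenient parser, requires the html5lib package:
# soup = BeautifulSoup(html, "html5lib")

print(soup.find('a')['href'])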