HTML parsing with BeautifulSoup works on most URLs, but not on the one I want

Time: 2019-02-03 21:30:42

Tags: python python-3.x beautifulsoup

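I am trying to parse the links to academic literature returned by a search engine at the following URL: https://www.sciencedirect.com/search?qs=hydrogen&show=25&sortBy=date&years=2018

I am using BeautifulSoup (bs4) with Python 3. The code works on several test URLs such as Wikipedia, but on the URL above I get only 15 results, all from the header and footer, instead of the more than 100 links that should include the actual search results.

Here is an example of the HTML I want to extract: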

<a href="/science/article/pii/S0360319918337960" 
class="result-list-title-link u-font-serif text-s" data-rank="1" 
data-docsubtype="fla" data-hack="#"><em>Hydrogen</em> integration in power-to-gas networks</a>

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = "https://www.sciencedirect.com/search?qs=hydrogen&show=25&sortBy=date&years=2018"
html = urlopen(url, context=ctx).read().decode('utf-8')
soup = BeautifulSoup(html, "html.parser")
count = 0

for link in soup.find_all('a'):
    count += 1
    print(link.get('href'))

print(count)

Any idea why? I am starting to wonder whether the site is protected against parsers. Thanks a lot!

2 Answers:

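Answer 0 (score: 2)

As @chitown88 suggested, add a User-Agent header. I would add that you can also use what looks like an internal API: https://www.sciencedirect.com/search/api?qs=hydrogen&show=25&sortBy=date&years=2018&navigation=true

That will be much faster (provided, of course, that your goal is just to get the article URLs). You could probably do something like this: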

import requests

...
r = requests.get('https://www.sciencedirect.com/search/api?qs=hydrogen&show=25&sortBy=date&years=2018&navigation=true')
data = r.json()
for result in data['searchResults']:
    # Print the access link of each search result
    print(result['pdf']['getAccessLink'])
    ...
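
The query string in the URL above can also be assembled from a dictionary, which makes it easier to change the search term or the page size. What follows is only a minimal sketch using the standard library: the helper name `build_search_api_url` is hypothetical, and the endpoint and parameter names are copied from the URL in this answer, not from any documented API.

```python
from urllib.parse import urlencode

# Endpoint copied from the URL in this answer; it is undocumented and may change.
API_BASE = "https://www.sciencedirect.com/search/api"


def build_search_api_url(query, show=25, sort_by="date", years="2018"):
    """Assemble the search URL (hypothetical helper; parameter names come from the answer's URL)."""
    params = {
        "qs": query,
        "show": show,
        "sortBy": sort_by,
        "years": years,
        "navigation": "true",
    }
    # urlencode escapes values, e.g. spaces in the query become '+'
    return API_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_api_url("hydrogen"))
```

The resulting string can be passed to `requests.get` in place of the hard-coded URL.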

Answer 1 (score: 1)

I used requests, but the key is including a User-Agent header; with it you should get 100+ links.

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

url = "https://www.sciencedirect.com/search?qs=hydrogen&show=25&sortBy=date&years=2018"
html = requests.get(url, headers=headers)
soup = BeautifulSoup(html.text, "html.parser")

links = soup.find_all('a')

count = 0

for link in links:
    count += 1
    print(link.get('href'))

print(count)

Output:

#main_content
/
/browse/journals-and-books
/user/register?returnURL=%2Fsearch%3Fqs%3Dhydrogen%26show%3D25%26sortBy%3Ddate%26years%3D2018
/user/login?returnURL=%2Fsearch%3Fqs%3Dhydrogen%26show%3D25%26sortBy%3Ddate%26years%3D2018
https://service.elsevier.com/app/answers/detail/a_id/15904/supporthub/sciencedirect/
/browse/journals-and-books
/user/register?returnURL=%2Fsearch%3Fqs%3Dhydrogen%26show%3D25%26sortBy%3Ddate%26years%3D2018
/user/login?returnURL=%2Fsearch%3Fqs%3Dhydrogen%26show%3D25%26sortBy%3Ddate%26years%3D2018
https://service.elsevier.com/app/answers/detail/a_id/15904/supporthub/sciencedirect/
/search/advanced
/search?qs=hydrogen&show=25&sortBy=date
/search?qs=hydrogen&show=25&sortBy=date
/?qs=hydrogen&show=25&sortBy=relevance&years=2018
/search?qs=hydrogen&show=25&sortBy=relevance&years=2018
/science/article/pii/S0009250918305815
/science/journal/00092509
/science/article/pii/S0009250918305815
/science/article/pii/S0169433218321731
/science/journal/01694332
/science/article/pii/S0169433218321731
https://service.elsevier.com/app/answers/detail/a_id/27714/supporthub/sciencedirect/kw/register/
/science/article/pii/S0009250918303099
/science/journal/00092509
/science/article/pii/S0009250918303099
/science/article/pii/S0169433218322670
/science/journal/01694332
/science/article/pii/S0169433218322670
/science/article/pii/S0169433218321251
/science/journal/01694332
/science/article/pii/S0169433218321251
/science/article/pii/S1878535218302673
/science/journal/18785352
/science/article/pii/S1878535218302673/pdfft?md5=82c344dc5e6a16651e226289299ccd96&pid=1-s2.0-S1878535218302673-main.pdf
/science/article/pii/S0003267018309784
/science/journal/00032670
/science/article/pii/S0003267018309784/pdfft?md5=e84a9680b080d3521ae51a10a70e9b74&pid=1-s2.0-S0003267018309784-main.pdf
/science/article/pii/S0009250918306183
/science/journal/00092509
/science/article/pii/S0009250918306183
/science/article/pii/S0378775318311868
/science/journal/03787753
/science/article/pii/S0378775318311868
/science/article/pii/S0169433218322773
/science/journal/01694332
/science/article/pii/S0169433218322773
/science/article/pii/S0009250918305451
/science/journal/00092509
/science/article/pii/S0009250918305451
/science/article/pii/S0958694618302759
/science/journal/09586946
/science/article/pii/S0958694618302759
/science/article/pii/S0944711318306378
/science/journal/09447113
/science/article/pii/S0944711318306378
/science/article/pii/S0360319918338710
/science/journal/03603199
/science/article/pii/S0360319918338710
/science/article/pii/S109727651830981X
/science/journal/10972765
/science/article/pii/S109727651830981X
/science/article/pii/S0169433218323298
/science/journal/01694332
/science/article/pii/S0169433218323298
/science/article/pii/S0169433218322232
/science/journal/01694332
/science/article/pii/S0169433218322232
/science/article/pii/S0169433218322025
/science/journal/01694332
/science/article/pii/S0169433218322025
/science/article/pii/S0169433218335943
/science/journal/01694332
/science/article/pii/S0169433218335943
/science/article/pii/S1226086X18307378
/science/journal/1226086X
/science/article/pii/S1226086X18307378
/science/article/pii/S0169433218322372
/science/journal/01694332
/science/article/pii/S0169433218322372
/science/article/pii/S0009250918305980
/science/journal/00092509
/science/article/pii/S0009250918305980
/science/article/pii/S0169433218322955
/science/journal/01694332
/science/article/pii/S0169433218322955
/science/article/pii/S092058611831527X
/science/journal/09205861
/science/article/pii/S092058611831527X/pdfft?md5=f7c6523835be4ded224fbc28036d7218&pid=1-s2.0-S092058611831527X-main.pdf
/science/article/pii/S1878535218302661
/science/journal/18785352
/science/article/pii/S1878535218302661/pdfft?md5=2aa04be5459c3d92b5b8e7475b075146&pid=1-s2.0-S1878535218302661-main.pdf
/search?qs=hydrogen&show=50&sortBy=date&years=2018
/search?qs=hydrogen&show=100&sortBy=date&years=2018
/search?qs=hydrogen&show=25&sortBy=date&years=2018&offset=25
#
https://www.elsevier.com/
https://www.elsevier.com/solutions/sciencedirect
/customer/authenticate/manra
/science?_ob=ShoppingCartURL&_method=display&md5=3ff44acb300f01481824c54a2973d019
https://service.elsevier.com/app/contact/supporthub/sciencedirect/
https://www.elsevier.com/legal/elsevier-website-terms-and-conditions
https://www.elsevier.com/legal/privacy-policy
https://www.sciencedirect.com/legal/use-of-cookies
https://www.relx.com/
104
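
The output mixes navigation links, journal pages and article pages, and the hrefs are relative. If only the articles are of interest, they could be filtered and made absolute along the following lines. This is a sketch under assumptions: the `/science/article/pii/<identifier>` pattern is inferred from the output above, and `article_urls` is a hypothetical helper, not part of BeautifulSoup.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.sciencedirect.com"
# Pattern inferred from the output above: /science/article/pii/<identifier>
ARTICLE_RE = re.compile(r"^/science/article/pii/[A-Z0-9]+")


def article_urls(hrefs, base=BASE_URL):
    """Return unique absolute article URLs in first-seen order (hypothetical helper)."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # link.get('href') returns None for anchors without an href
        match = ARTICLE_RE.match(href)
        if match is None:
            continue  # navigation, journal and footer links are skipped
        # Keep only the matched prefix, which drops any /pdfft?... suffix
        url = urljoin(base, match.group(0))
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    sample = [
        "#main_content",
        "/science/article/pii/S0009250918305815",
        "/science/journal/00092509",
        "/science/article/pii/S0009250918305815",
        "/science/article/pii/S1878535218302673/pdfft?md5=82c344dc5e6a16651e226289299ccd96&pid=1-s2.0-S1878535218302673-main.pdf",
        None,
    ]
    for url in article_urls(sample):
        print(url)
```

In the scripts in this answer, the input would be `[link.get('href') for link in soup.find_all('a')]`.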

If you still want to use urllib, only a small modification is needed:

from bs4 import BeautifulSoup
import ssl
from urllib.request import Request, urlopen

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

url = "https://www.sciencedirect.com/search?qs=hydrogen&show=25&sortBy=date&years=2018"
req = Request(url, headers=headers)


html = urlopen(req, context=ctx).read().decode('utf-8')
soup = BeautifulSoup(html, "html.parser")
count = 0

for link in soup.find_all('a'):
    count += 1
    print(link.get('href'))

print(count)
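
To see what the header changes, one can compare the User-Agent that urllib sends by default with the one set on the `Request`. The sketch below makes no network request. It relies on the default opener identifying itself as `Python-urllib/<version>`, which the site appears to treat differently from a browser, and it uses a shortened placeholder browser string where the scripts above use the full one.

```python
from urllib.request import Request, build_opener

# Default header list of a fresh opener; urllib identifies itself as Python-urllib/<version>.
default_headers = dict(build_opener().addheaders)
print(default_headers["User-agent"])

url = "https://www.sciencedirect.com/search?qs=hydrogen&show=25&sortBy=date&years=2018"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}  # shortened placeholder
req = Request(url, headers=headers)

# Request stores header names via str.capitalize(), hence the key "User-agent".
print(req.get_header("User-agent"))
```

A header set on the `Request` takes precedence over the opener's default when the request is sent.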