Question

I am trying to build a scraper to collect the abstracts of some academic papers, together with their titles, on this page.

The problem is that my for link in bsObj.findAll('a',{'class':'search-track'}) loop does not return the links I need to build the scraper any further. In my code, the check looks like this:

for link in bsObj.findAll('a',{'class':'search-track'}):
    print(link)

The for loop above does not print anything, even though the href links should be inside the <a class="search-track" ...</a> tags.

I have already consulted this post, but changing the BeautifulSoup parser does not solve the problem in my code. I am using "html.parser" in the BeautifulSoup constructor: bsObj = bs(html.content, features="html.parser").

print(len(bsObj)) prints "3" with both "html.parser" and "lxml", and "2" with "html5lib".

In addition, I started out using urllib.request.urlopen to fetch the page and then tried requests.get(). Unfortunately, both approaches give me the same bsObj.

Here is the code I have written:

#from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup as bs
import ssl


'''
The Elsevier search is a kind of tree structure:
keyword --> a list of journals (a journal contains many articles) --> lists of articles
'''
address = input("Please type in your keyword: ")  # My keyword is "catalyst for water splitting"
# Example search URL:
# https://www.elsevier.com/en-xs/search-results?query=catalyst%20for%20water%20splitting&labels=journals&page=1
address = address.replace(" ", "%20")
address = "https://www.elsevier.com/en-xs/search-results?query=" + address + "&labels=journals&page=1"

journals = []
articles = []

def getJournals(url):
    global journals

    #html = urlopen(url)
    html = requests.get(url)
    bsObj = bs(html.content, features="html.parser")

    #print(len(bsObj))
    #testFile = open('testFile.txt', 'wb')
    #testFile.write(bsObj.text.encode(encoding='utf-8', errors='strict') +'\n'.encode(encoding='utf-8', errors='strict'))
    #testFile.close()

    for link in bsObj.findAll('a',{'class':'search-track'}):
        print(link) 
        ########does not print anything########
        '''
        if 'href' in link.attrs and link.attrs['href'] not in journals:
            newJournal = link.attrs['href']
            journals.append(newJournal)
        '''
    return None


# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

getJournals(address)
print(journals)

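Can anyone tell me what is wrong in my code that stops the for loop from printing any links? I need to store the journal links in a list and then visit each link to scrape the paper abstracts. As far as I know, the abstracts are free to read, so the website should not block my ID for this.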

Answer 1

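This page is loaded dynamically with JavaScript, so BeautifulSoup cannot handle it directly. You may be able to do it with Selenium, but in this case you can do it by tracking the requests the page itself makes (there are many write-ups of this technique online).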
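A quick way to confirm that the links are injected by JavaScript, rather than lost by the parser, is to count the matching anchors in the raw response text. The sketch below is only an illustration: it uses the standard library's HTMLParser instead of BeautifulSoup so that it runs without extra packages, and both sample strings are made up to stand in for the rendered page and for the bare shell that a plain HTTP client might receive.

```python
from html.parser import HTMLParser


class AnchorClassCounter(HTMLParser):
    """Counts <a> tags that carry a given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # The class attribute may hold several space-separated names,
        # and may be missing or valueless (None).
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_anchors(html_text, class_name):
    """Return how many <a> tags in html_text have the given class."""
    parser = AnchorClassCounter(class_name)
    parser.feed(html_text)
    parser.close()
    return parser.count


# Hypothetical samples, not taken from the real site.
rendered = (
    '<div><a class="search-track" href="/j/1">J1</a>'
    '<a class="search-track other" href="/j/2">J2</a></div>'
)
shell = '<div id="app"></div><script src="bundle.js"></script>'

print(count_anchors(rendered, "search-track"))  # 2
print(count_anchors(shell, "search-track"))     # 0
```

If the same check on requests.get(url).text returns 0 while the browser's inspector shows the anchors, no choice of parser will help, because the markup simply is not in the response.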

In your particular case, it can be done like this:

from bs4 import BeautifulSoup as bs
import requests
import json

#this is where the data is hiding:
url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"
html = requests.get(url)
soup = bs(html.content, features="html.parser")


data = json.loads(str(soup))  # the response is in JSON format, so we load it into a dictionary
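The URL above is hard-coded for one keyword. As a rough sketch, it could be built from any keyword with urllib.parse instead of replacing spaces by hand. The function name and the page and page_size arguments are my own, and treating start as a zero-based item offset is an assumption inferred from the parameter names; the answer does not confirm how the API paginates.

```python
from urllib.parse import quote, urlencode

# Endpoint copied from the URL above.
API_BASE = "https://site-search-api.prod.ecommerce.elsevier.com/search"


def build_search_url(keyword, page=0, page_size=10):
    """Build the search API URL for a keyword (hypothetical helper)."""
    params = {
        "query": keyword,
        "labels": "journals",
        # Assumption: "start" is an item offset and "limit" a page size.
        "start": page * page_size,
        "limit": page_size,
        "lang": "en-xs",
    }
    # quote_via=quote encodes spaces as %20; the default would use "+".
    return API_BASE + "?" + urlencode(params, quote_via=quote)


print(build_search_url("catalyst for water splitting"))
```

For the keyword from the question this reproduces the hard-coded URL exactly, and the result can be passed to requests.get() as before.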

Note: since the response is JSON, you can also skip BeautifulSoup entirely and load it directly, as in data = json.loads(html.content). Either way, from this point on:

hits = data['hits']['hits']  # the target URLs are nested deep inside dictionaries and lists
for hit in hits:
    print(hit['_source']['url'])

Output:

https://www.journals.elsevier.com/water-research
https://www.journals.elsevier.com/water-research-x

etc.
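Since the question asks for the journal links in a list, here is a minimal sketch of collecting them without duplicates. The payload is a made-up sample: only the nesting (hits, hits, _source, url) follows the code above, and the real response may contain other fields. The helper name extract_journal_urls is hypothetical.

```python
import json

# Hypothetical sample standing in for html.content; json.loads accepts bytes.
sample_response = b"""
{
  "hits": {
    "hits": [
      {"_source": {"url": "https://www.journals.elsevier.com/water-research"}},
      {"_source": {"url": "https://www.journals.elsevier.com/water-research-x"}},
      {"_source": {"url": "https://www.journals.elsevier.com/water-research"}},
      {"_source": {}}
    ]
  }
}
"""


def extract_journal_urls(data):
    """Collect unique URLs in order, skipping hits that lack one."""
    urls = []
    for hit in data.get("hits", {}).get("hits", []):
        url = hit.get("_source", {}).get("url")
        if url and url not in urls:
            urls.append(url)
    return urls


data = json.loads(sample_response)
journals = extract_journal_urls(data)
print(journals)
```

Using .get() with defaults means a hit without a url is skipped instead of raising KeyError, which may be useful if the API returns mixed result types.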

BeautifulSoup findAll() does not return tags
