I wrote the following code to try to scrape a Google Scholar page:
import requests as req
from bs4 import BeautifulSoup as soup
url = r'https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections'
session = req.Session()
content = session.get(url)
html2bs = soup(content.content, 'lxml')
gs_cit = html2bs.select('#gs_cit')
gs_citd = html2bs.find('div', {'id':"gs_citd"})
gs_cit1 = html2bs.find('div', {'id':"gs_cit1"})
But gs_citd only gives me this single line:

<div aria-live="assertive" id="gs_citd"></div>

and doesn't reach any level beneath it. In addition, gs_cit1 returns None.

I want to reach the highlighted class shown in the screenshot (the anchor inside the citation pop-up) so that I can grab the BibTeX citation. Please help!
Answer 0 (score: 3)
OK, so I figured it out. I used the selenium module for Python to create a virtual browser, which lets you do things like click links and get the resulting HTML output. There was another problem to solve along the way: the page has to finish loading, otherwise the pop-up div just contains "Loading...", so I used the Python time module and time.sleep(2) to wait 2 seconds, which lets the content load. Then I parsed the resulting HTML output with BeautifulSoup to find the anchor tag with the class "gs_citi". I pulled the href out of that anchor and fetched it with the "requests" Python module. Finally, I wrote the decoded response to a local file, scholar.bib.
I installed chromedriver and selenium on my Mac using these instructions: https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f

Then I code-signed the Python binary to stop the firewall prompts, following these instructions: Add Python to OS X Firewall Options?
Here is the code I used to generate the output file "scholar.bib":
import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import requests as req
# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
# Navigate in Chrome to specified page.
driver.get("https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections")
# Find "Cite" link by looking for anchors that contain "Cite" - second link selected "[1]"
link = driver.find_elements_by_xpath('//a[contains(text(), "' + "Cite" + '")]')[1]
# Click the link
link.click()
print("Waiting for page to load...")
time.sleep(2) # Sleep for 2 seconds
# After the wait, get the page source of the current page in Chrome
source = driver.page_source
# We are done with the driver so quit.
driver.quit()
# Use BeautifulSoup to parse the html source and use "html.parser" as the Parser
soupify = soup(source, 'html.parser')
# Find the first anchor with the class "gs_citi" (the BibTeX link in the pop-up)
gs_citi = soupify.find('a', {"class": "gs_citi"})
# Get the href attribute of that anchor
href = gs_citi['href']
print("Fetching: ", href)
# Instantiate a new requests session
session = req.Session()
# Get the response object of href
content = session.get(href)
# Get the content and then decode() it.
bibtex_html = content.content.decode()
# Write the decoded data to a file named scholar.bib
with open("scholar.bib", "w") as file:
    file.writelines(bibtex_html)
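As an aside: the fixed time.sleep(2) works, but an explicit wait is more robust, and newer Selenium releases (4.x) have removed the find_elements_by_* helpers used above. Here is a minimal sketch of the same click-and-wait step using the modern API; the 10-second timeout is an arbitrary choice, not something from the original answer:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # Selenium 4 can locate chromedriver on its own
driver.get("https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections")

# Click the second "Cite" link, as in the script above
link = driver.find_elements(By.XPATH, '//a[contains(text(), "Cite")]')[1]
link.click()

# Block until the BibTeX anchor actually appears, instead of sleeping for a fixed interval
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "a.gs_citi"))
)
source = driver.page_source
driver.quit()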
Hope this helps anyone looking for a solution.
The scholar.bib file:
@article{arrow2013sustainability,
title={Sustainability and the measurement of wealth: further reflections},
author={Arrow, Kenneth J and Dasgupta, Partha and Goulder, Lawrence H and Mumford, Kevin J and Oleson, Kirsten},
journal={Environment and Development Economics},
volume={18},
number={4},
pages={504--516},
year={2013},
publisher={Cambridge University Press}
}
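If you then want to read the citation back programmatically rather than as raw text, the third-party bibtexparser package can load the file into a list of dicts. This is not part of the original answer; the sketch below assumes the bibtexparser 1.x API:

import bibtexparser

# Parse the scholar.bib file produced by the script above
with open("scholar.bib") as f:
    db = bibtexparser.load(f)

# Each entry is a dict keyed by lower-cased BibTeX field names
print(db.entries[0]["title"])  # Sustainability and the measurement of wealth: further reflections
print(db.entries[0]["year"])   # 2013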