Question

I am trying to scrape the abstract section of this website:

from bs4 import BeautifulSoup
urlLink = 'https://www.cfapubs.org/doi/abs/10.2469/faj.v74.n4.2'
page_response = requests.get(page_link, timeout=5, verify=False, headers={'User-Agent': 'Mozilla/5.0'})
soup2 = BeautifulSoup(page_response.content, 'html.parser')

When I search with:

    soup2.find_all("div", {"class": "abstractSection"})

I get nothing back, and this is the part I'm interested in. Any ideas?

Answer 1

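I'm not sure where you got the page_link that you pass to requests.get; your snippet defines urlLink, not page_link. Try the following to get the content you want to parse.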

from bs4 import BeautifulSoup
import requests

urlLink = 'https://www.cfapubs.org/doi/abs/10.2469/faj.v74.n4.2'

# Send a browser-like User-Agent header along with the request.
page_response = requests.get(urlLink, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page_response.content, 'html.parser')
# First link inside the author element, then the first paragraph of the abstract.
name = soup.find(class_="hlFld-ContribAuthor").find("a").text
abstract = soup.find(class_="abstractSection").find("p").text
print(f'Name : {name}\nAbstract : {abstract}')

If you would rather use CSS selectors, try:

from bs4 import BeautifulSoup
import requests

urlLink = 'https://www.cfapubs.org/doi/abs/10.2469/faj.v74.n4.2'

# Send a browser-like User-Agent header along with the request.
page_response = requests.get(urlLink, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page_response.content, 'html.parser')
# select_one() returns the first element matching the CSS selector.
name = soup.select_one(".hlFld-ContribAuthor a").text
abstract = soup.select_one(".abstractSection p").text
print(f'Name : {name}\nAbstract : {abstract}')

Output:

Name : Charles D. Ellis, CFA
Abstract :  One of the consequences of the shift in corporate retirement plans from defined benefit           to defined contribution is widespread retirement insecurity. Although most people in the           top one-third of economic affluence will be fine, for the other two-thirds—particularly           the bottom one-third—the problem is a serious threat. We can prevent this painful           future if we act sensibly and soon by raising the alarm with our corporate and government           leaders.
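
To compare the two lookup styles without hitting the network, here is a minimal sketch that runs against a small hand-written HTML string. The markup in SAMPLE_HTML and the author name are assumptions that only mimic the class names used above; the real page may be structured differently. It also shows that both find() and select_one() return None when nothing matches, which is worth guarding against before reading .text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used above;
# it is NOT the site's actual HTML.
SAMPLE_HTML = """
<div class="hlFld-ContribAuthor"><a href="#">Jane Doe</a></div>
<div class="abstractSection"><p>First line
          second line.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Chained find() calls and a CSS selector reach the same element here.
name_via_find = soup.find(class_="hlFld-ContribAuthor").find("a").text
name_via_select = soup.select_one(".hlFld-ContribAuthor a").text

# Both APIs return None when nothing matches, so check before using .text.
missing = soup.select_one(".noSuchClass p")
missing_text = missing.text if missing is not None else ""

print(name_via_find, name_via_select, repr(missing_text))
```

If either lookup comes back as None on the live page, the class names have probably changed or the content is not present in the HTML that requests received.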

Finally, if you don't want the runs of extra spaces between words in abstract, replace the line that assigns abstract with abstract = ' '.join(soup.find(class_="abstractSection").find("p").text.split()).
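
The whitespace cleanup can be tried on its own, without Beautiful Soup. The sketch below uses a shortened sample string in place of the scraped abstract; the helper name collapse_whitespace is hypothetical and used only for illustration.

```python
def collapse_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading and trailing whitespace, so re-joining with " " normalizes the text.
    return " ".join(text.split())


# Shortened stand-in for the scraped abstract, including the stray padding.
raw = " One of the consequences of the shift           to defined contribution\n is widespread. "
clean = collapse_whitespace(raw)
print(clean)
```

Because split() treats newlines and tabs the same way as spaces, this also flattens line breaks inside the paragraph.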
