I am trying to scrape the abstract section of this page:
from bs4 import BeautifulSoup
urlLink = 'https://www.cfapubs.org/doi/abs/10.2469/faj.v74.n4.2'
page_response = requests.get(page_link, timeout=5, verify=False, headers={'User-Agent': 'Mozilla/5.0'})
soup2 = BeautifulSoup(page_response.content, 'html.parser')
When I search with:
soup2.find_all("div", {"class": "abstractSection"})
I get nothing back, and that is the part I'm interested in. Any ideas?
Answer 0 (score: 1)
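I'm not sure where page_link comes from: your code stores the URL in urlLink, so page_link is never defined, and requests is never imported. Try the following to get the content you want to parse.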
from bs4 import BeautifulSoup
import requests
urlLink = 'https://www.cfapubs.org/doi/abs/10.2469/faj.v74.n4.2'
page_response = requests.get(urlLink,headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(page_response.content, 'html.parser')
name = soup.find(class_="hlFld-ContribAuthor").find("a").text
abstract = soup.find(class_="abstractSection").find("p").text
print(f'Name : {name}\nAbstract : {abstract}')
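Chaining .find(...).find(...).text raises AttributeError as soon as one of the elements is missing, because find returns None when nothing matches. The following is a minimal sketch of a more defensive lookup. It assumes a small inline HTML sample in place of the downloaded page, and the helper name text_of is hypothetical, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page (assumption).
SAMPLE_HTML = """
<div class="abstractSection"><p>Example abstract text.</p></div>
"""


def text_of(soup, class_name, child_tag):
    """Return the text of the first child_tag inside the first element
    with class_name, or None if either element is missing."""
    container = soup.find(class_=class_name)
    if container is None:
        return None
    child = container.find(child_tag)
    return child.get_text() if child is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
abstract = text_of(soup, "abstractSection", "p")
name = text_of(soup, "hlFld-ContribAuthor", "a")  # not present in the sample
print(f"Name : {name}\nAbstract : {abstract}")
```

With this shape, a missing author or abstract shows up as None instead of stopping the script with an exception.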
If you would rather use CSS selectors, try:
from bs4 import BeautifulSoup
import requests
urlLink = 'https://www.cfapubs.org/doi/abs/10.2469/faj.v74.n4.2'
page_response = requests.get(urlLink,headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(page_response.content, 'html.parser')
name = soup.select_one(".hlFld-ContribAuthor a").text
abstract = soup.select_one(".abstractSection p").text
print(f'Name : {name}\nAbstract : {abstract}')
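To see that the two approaches pick the same element, here is a small sketch that runs both against an inline HTML sample. The sample markup and its text are assumptions made for illustration; only the class names come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup using the same class names as the real page (assumption).
SAMPLE_HTML = """
<span class="hlFld-ContribAuthor"><a href="#">Example Author</a></span>
<div class="abstractSection"><p>Example abstract text.</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() chain: first element with the class, then its first <p> descendant.
via_find = soup.find(class_="abstractSection").find("p").text

# CSS selector: first <p> that is a descendant of .abstractSection.
via_select = soup.select_one(".abstractSection p").text

# select_one returns None when the selector matches nothing.
missing = soup.select_one(".noSuchClass p")

print(via_find == via_select, missing)
```

The selector form is shorter, but like find it returns None for a non-matching selector, so calling .text on the result needs the same care.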
Output:
Name : Charles D. Ellis, CFA
Abstract : One of the consequences of the shift in corporate retirement plans from defined benefit to defined contribution is widespread retirement insecurity. Although most people in the top one-third of economic affluence will be fine, for the other two-thirds—particularly the bottom one-third—the problem is a serious threat. We can prevent this painful future if we act sensibly and soon by raising the alarm with our corporate and government leaders.
Finally, if you don't want to see the gaps between words in abstract, replace that line with abstract = ' '.join(soup.find(class_="abstractSection").find("p").text.split())
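The ' '.join(text.split()) idiom works because str.split() with no arguments splits on any run of whitespace and discards leading and trailing whitespace. A short sketch, using a made-up string in place of the scraped text and a hypothetical helper name collapse_whitespace:

```python
def collapse_whitespace(text: str) -> str:
    """Replace every run of spaces, tabs, and newlines with a single space."""
    # split() without arguments splits on whitespace runs and drops empty
    # pieces, so joining with one space normalizes the spacing.
    return " ".join(text.split())


# Hypothetical sample mimicking text extracted from a <p> element.
raw = "One of the\n      consequences   of the shift\tin corporate retirement plans  "
cleaned = collapse_whitespace(raw)
print(cleaned)
```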