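I am trying to scrape a website, and I want to extract the title of a link ("Press Briefing by Senior Administration Officials on the Fact Sheet on Strengthening U.S.-China Economic Relations"), which is the text between the HTML tags. The HTML source I am working with looks like this: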
<h3 class="field-content"><a href="https://www.whitehouse.gov/the-press-office/2013/12/05/press-briefing-senior-administration-officials-fact-sheet-strengthening-">Press Briefing by Senior Administration Officials on the Fact Sheet on Strengthening U.S.-China Economic Relations</a></h3>
My code so far is:
import requests
from bs4 import BeautifulSoup
url = 'http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
urls = []
for h in soup.find_all('h3'):
    a = h.find('a')
    urls.append(a.attrs['href'])
print(urls)
Answer 0 (score: 1)
You can use the .text attribute to get the text contained in a tag. I use str.rsplit to strip the date from the title:
import requests
from bs4 import BeautifulSoup
url = 'http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
for a in soup.select('h3 a[href]'):
    print(a.text.rsplit(',', maxsplit=1)[0])
    print(a['href'])
    print('-' * 80)
This prints:
Press Briefing by Press Secretary Jay Carney
https://www.whitehouse.gov/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013
--------------------------------------------------------------------------------
Daily Briefing by the Press Secretary
https://www.whitehouse.gov/the-press-office/2013/12/05/daily-briefing-press-secretary-1252013
--------------------------------------------------------------------------------
Press Briefing by Senior Administration Officials on the Fact Sheet on Strengthening U.S.-China Economic Relations
https://www.whitehouse.gov/the-press-office/2013/12/05/press-briefing-senior-administration-officials-fact-sheet-strengthening-
--------------------------------------------------------------------------------
Press Briefing by the Press Secretary
https://www.whitehouse.gov/the-press-office/2013/12/04/press-briefing-press-secretary-1232013
--------------------------------------------------------------------------------
Press Briefing by Press Secretary Jay Carney
https://www.whitehouse.gov/the-press-office/2013/12/02/press-briefing-press-secretary-jay-carney-1222013
--------------------------------------------------------------------------------
Press Gaggle by Principal Deputy Press Secretary Josh Earnest -- Los Angeles
https://www.whitehouse.gov/the-press-office/2013/11/26/press-gaggle-principal-deputy-press-secretary-josh-earnest-los-angeles-c
--------------------------------------------------------------------------------
Press Gaggle by Principal Deputy Press Secretary Josh Earnest Aboard Air Force One en route San Francisco
https://www.whitehouse.gov/the-press-office/2013/11/25/press-gaggle-principal-deputy-press-secretary-josh-earnest-aboard-air-fo
--------------------------------------------------------------------------------
Daily Briefing by the Press Secretary
https://www.whitehouse.gov/the-press-office/2013/11/22/daily-briefing-press-secretary-112213
--------------------------------------------------------------------------------
Briefing by Principal Deputy Press Secretary Josh Earnest
https://www.whitehouse.gov/the-press-office/2013/11/21/briefing-principal-deputy-press-secretary-josh-earnest-112113
--------------------------------------------------------------------------------
Press Briefing by Press Secretary Jay Carney
https://www.whitehouse.gov/the-press-office/2013/11/20/press-briefing-press-secretary-jay-carney-11192013
--------------------------------------------------------------------------------
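To see what the rsplit call in the answer does on its own, here is a minimal standalone sketch. The helper name strip_trailing_date and the sample strings are hypothetical; in particular, the exact date format used in the page's link text is an assumption, since the thread only says that the date follows the title.

```python
def strip_trailing_date(title: str) -> str:
    """Drop everything after the last comma, if there is one."""
    # rsplit scans from the right; maxsplit=1 allows at most one split,
    # so commas earlier in the title are left alone.
    return title.rsplit(',', maxsplit=1)[0].strip()


# Hypothetical link texts (the real page's date format is assumed).
samples = [
    "Press Briefing by Press Secretary Jay Carney, 12/6/2013",
    "Press Gaggle by Josh Earnest, Los Angeles, 11/26/2013",
    "Daily Briefing by the Press Secretary",
]

cleaned = [strip_trailing_date(s) for s in samples]
for title in cleaned:
    print(title)
```

A string without a comma comes back unchanged, because rsplit then returns a one-element list. The flip side is that a title containing a comma but no trailing date would lose its last segment, so this approach is only safe when every link text really ends in ", <date>".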