Extracting text between HTML tags using Python 3

Time: 2018-08-14 14:16:55

Tags: html for-loop beautifulsoup python-requests findall

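I am trying to scrape a website, and I want to extract the title of a link ("Press Briefing by Senior Administration Officials on the Fact Sheet on Strengthening U.S.-China Economic Relations"), which sits between the HTML tags. The HTML source I am working with is as follows: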

<h3 class="field-content"><a href="https://www.whitehouse.gov/the-press-office/2013/12/05/press-briefing-senior-administration-officials-fact-sheet-strengthening-">Press Briefing by Senior Administration Officials on the Fact Sheet on Strengthening U.S.-China Economic Relations</a></h3>

My code is as follows:

import requests
from bs4 import BeautifulSoup

url = 'http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')

urls = []
for h in soup.find_all('h3'):
    a = h.find('a')
    urls.append(a.attrs['href'])
print(urls)

1 answer:

Answer 0: (score: 1)

You can use the .text attribute to get the text contained in a tag. I use str.rsplit to remove the date from the title:

import requests
from bs4 import BeautifulSoup

url = 'http://stash.compjour.org/samples/webpages/whitehouse-press-briefings-page-50.html'
resp = requests.get(url)
soup = BeautifulSoup(resp.text, 'lxml')

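# Select every <a> with an href attribute that sits inside an <h3>
for a in soup.select('h3 a[href]'):
    # Keep only the part before the last comma (drops the trailing date)
    print(a.text.rsplit(',', maxsplit=1)[0])
    print(a['href'])
    print('-' * 80)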

This prints:

Press Briefing by Press Secretary Jay Carney
https://www.whitehouse.gov/the-press-office/2013/12/06/press-briefing-press-secretary-jay-carney-1262013
--------------------------------------------------------------------------------
Daily Briefing by the Press Secretary
https://www.whitehouse.gov/the-press-office/2013/12/05/daily-briefing-press-secretary-1252013
--------------------------------------------------------------------------------
Press Briefing by Senior Administration Officials on the Fact Sheet on Strengthening U.S.-China Economic Relations
https://www.whitehouse.gov/the-press-office/2013/12/05/press-briefing-senior-administration-officials-fact-sheet-strengthening-
--------------------------------------------------------------------------------
Press Briefing by the Press Secretary
https://www.whitehouse.gov/the-press-office/2013/12/04/press-briefing-press-secretary-1232013
--------------------------------------------------------------------------------
Press Briefing by Press Secretary Jay Carney
https://www.whitehouse.gov/the-press-office/2013/12/02/press-briefing-press-secretary-jay-carney-1222013
--------------------------------------------------------------------------------
Press Gaggle by Principal Deputy Press Secretary Josh Earnest -- Los Angeles
https://www.whitehouse.gov/the-press-office/2013/11/26/press-gaggle-principal-deputy-press-secretary-josh-earnest-los-angeles-c
--------------------------------------------------------------------------------
Press Gaggle by Principal Deputy Press Secretary Josh Earnest Aboard Air Force One en route San Francisco
https://www.whitehouse.gov/the-press-office/2013/11/25/press-gaggle-principal-deputy-press-secretary-josh-earnest-aboard-air-fo
--------------------------------------------------------------------------------
Daily Briefing by the Press Secretary
https://www.whitehouse.gov/the-press-office/2013/11/22/daily-briefing-press-secretary-112213
--------------------------------------------------------------------------------
Briefing by Principal Deputy Press Secretary Josh Earnest
https://www.whitehouse.gov/the-press-office/2013/11/21/briefing-principal-deputy-press-secretary-josh-earnest-112113
--------------------------------------------------------------------------------
Press Briefing by Press Secretary Jay Carney
https://www.whitehouse.gov/the-press-office/2013/11/20/press-briefing-press-secretary-jay-carney-11192013
--------------------------------------------------------------------------------
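
One caveat with the rsplit call: it cuts at the last comma whether or not a date follows it, so a title that contains a comma but no trailing date would lose its last part. Here is a minimal sketch of a stricter alternative. It assumes the date suffix looks like ", 12/6/2013"; the exact format on the page is an assumption, and strip_trailing_date and both sample strings are hypothetical names and data made up for illustration.

```python
import re

# Assumed suffix format: a comma, optional whitespace, then M/D/YY or
# M/D/YYYY at the very end of the string. Adjust if the page differs.
_TRAILING_DATE = re.compile(r',\s*\d{1,2}/\d{1,2}/\d{2,4}\s*$')


def strip_trailing_date(title):
    """Remove a trailing ', M/D/YYYY' suffix; leave other titles unchanged."""
    return _TRAILING_DATE.sub('', title).strip()


# Hypothetical sample titles, not copied from the page.
samples = [
    'Press Briefing by Press Secretary Jay Carney, 12/6/2013',
    'Press Gaggle by Josh Earnest, Los Angeles',  # comma, but no date
]

for s in samples:
    print('rsplit:', s.rsplit(',', maxsplit=1)[0])
    print('regex: ', strip_trailing_date(s))
```

For the second sample, rsplit drops ", Los Angeles" while the regex version leaves the title untouched. If every link on the page is known to end with a date, the plain rsplit from the answer is simpler and works just as well.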