I'm trying to extract all of the text and links into a table, organized by date. So far I can only get a single entry (and the link name is wrong, so it isn't filled in correctly). I think nextSibling might be useful here, but perhaps it isn't the right solution.
Here is the HTML:
<ul class="indented">
<br>
<strong>May 15, 2019</strong>
<ul>
Sign up for more insight into FERC with our monthly news email, The FERC insight
<a href="/media/insight.asp">Read More</a>
</ul>
<br><br>
<strong>May 15, 2019</strong>
<ul>
FERC To Convene a Technical Conference regarding Columbia Gas Transmission, LLC on July 10, 2019
<a href="/CalendarFiles/20190515104556-RP19-763-000%20TC.pdf">Notice</a> <img src="/images/icon_pdf.gif" alt="PDF"> | <a href="/EventCalendar/EventDetails.aspx?ID=13414&CalType=%20&CalendarID=116&Date=07/10/2019&View=Listview">Event Details</a>
</ul>
<br><br>
Here is my code:
import requests
from bs4 import BeautifulSoup
url1 = 'https://www.ferc.gov/media/headlines.asp'
r = requests.get(url1)
# Create a BeautifulSoup object
soup = BeautifulSoup(r.content, 'lxml')
# Pull headline text from the ul class indented
headlines = soup.find_all("ul", class_="indented")
headline = headlines[0]
date = headline.select_one('strong').text.strip()
print(date)
headline_text = headline.select_one('ul').text.strip()
print(headline_text)
headline_link = headline.select_one('ul a')["href"]
headline_link = 'https://www.ferc.gov' + headline_link
print(headline_link)
I get the first date, text, and link because I'm using select_one. I need to get all of the links, correctly named, for every date. Is find_next the right tool here, or find_next_sibling?
Answer 0 (score: 0)
I believe this is what you're looking for; it gets the dates, the announcements, and the associated links:
# [start same as your code, through the soup declaration]
import requests
from bs4 import BeautifulSoup

url1 = 'https://www.ferc.gov/media/headlines.asp'
r = requests.get(url1)
soup = BeautifulSoup(r.content, 'lxml')

# Each date is a <strong> tag; the matching announcement is the <ul>
# two siblings over (the first sibling is just a whitespace text node).
dates = soup.find_all("strong")
for date in dates:
    if '2019' in date.text:
        print(date.text)
        print(date.next_sibling.next_sibling.text)
        for ref in date.next_sibling.next_sibling.find_all('a'):
            new_link = "https://www.ferc.gov" + ref['href']
            print(new_link)
        print('=============================')
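As for find_next versus find_next_sibling: find_next walks every element that follows in document order, while find_next_sibling only considers siblings under the same parent. Since each announcement <ul> is a sibling of its <strong> date, a minimal sketch using find_next_sibling (equivalent to the chained next_sibling calls above) would be:

# Sketch: find_next_sibling('ul') skips the whitespace text nodes and
# <br> tags between <strong> and <ul>, so no chaining is needed.
for date in soup.find_all("strong"):
    if '2019' in date.text:
        announcement = date.find_next_sibling('ul')  # the <ul> for this date
        print(date.text, announcement.text.strip())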
A sample of the output:
May 15, 2019
FERC To Convene a Technical Conference regarding Columbia Gas Transmission, LLC on July 10, 2019
Notice
| Event Details
https://www.ferc.gov/CalendarFiles/20190515104556-RP19-763-000%20TC.pdf
https://www.ferc.gov/EventCalendar/EventDetails.aspx?ID=13414&CalType=%20&CalendarID=116&Date=07/10/2019&View=Listview
=============================
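Since the original goal was a table, here is a minimal sketch that collects each date, announcement, and named link into rows (assuming pandas is installed; the link_name column comes from each anchor's own text, e.g. "Notice" or "Event Details"):

import requests
from bs4 import BeautifulSoup
import pandas as pd  # assumption: pandas is available

url1 = 'https://www.ferc.gov/media/headlines.asp'
soup = BeautifulSoup(requests.get(url1).content, 'lxml')

rows = []
for date in soup.find_all("strong"):
    if '2019' in date.text:
        announcement = date.find_next_sibling('ul')
        for ref in announcement.find_all('a'):
            rows.append({
                'date': date.text.strip(),
                'headline': announcement.text.strip(),
                'link_name': ref.text.strip(),  # e.g. "Notice" or "Event Details"
                'url': 'https://www.ferc.gov' + ref['href'],
            })

df = pd.DataFrame(rows)
print(df)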