How can I use BeautifulSoup to get the time value nested inside two "div" elements?
<div>
<div>
6:00.00
</div>
</div>
I tried the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.energystorageexchange.org/projects/2")
soup = BeautifulSoup(page.content, 'lxml')
rows = soup.select("div.div")
for r in rows:
    print(r)
But it turned out not to be that simple.
Full HTML sample:
<div class='row'>
<hr class='border zeropadding zeromargin'>
<div class='col-md-6 zeropadding'>
<label class='new_font'>Duration at Rated Power (HH:MM)</label>
</div>
<div class='col-md-6 new_font'>
<div></div>
<div>
<div>
6:00.00
</div>
</div>
</div>
</hr>
</div>
<div class='row'>
<hr class='border zeropadding zeromargin'>
<div class='col-md-6 zeropadding new_font'>
<label class='new_font'>Weblink1</label>
</div>
<div class='col-md-6 new_font'>
<div>
<div class='show_value'>
<a href="http://www.gillsonions.com/node/192" target='_new' class='boldbluelink'>http://www.gillsonions.com/node/192</a>
</div>
</div>
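From https://www.energystorageexchange.org/projects/2
Thanks for your help.
Second question:
I would also like to capture the size in kW from this element:
<input id='size_in_kw' type='hidden' value='1500'>
I tried this, but it seems incomplete: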
value = soup.find('input', {'id': 'size_in_kw'}).get('value')
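Answer 0 (score: 1)
The div.div selector will not work here: it matches only div elements that have the class div, and the markup contains none.
Since, from the look of it, you want the value of the "Duration at Rated Power (HH:MM)" field, I would first locate the corresponding label and then find the next text node that matches the field's format: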
import re
label = soup.find("label", text="Duration at Rated Power (HH:MM)")
value = label.find_next(text=re.compile(r"\d+:\d+")).strip()
print(value) # prints 6:00.00
(The re module must be imported, as on the first line above.)
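For anyone who wants to try the label-then-next-text-node approach without hitting the live site, here is a minimal sketch that runs against an inline copy of the markup from the question. The HTML constant, the function name duration_after_label, and the None guards are illustrative additions, not part of the original answer. It uses the built-in html.parser so that lxml is not required, and string=, which is the newer name for the text= argument in Beautiful Soup 4.4.0 and later.

```python
import re
from bs4 import BeautifulSoup

# Inline copy of the relevant row from the question (the <hr> is omitted for brevity).
HTML = """
<div class='row'>
  <div class='col-md-6 zeropadding'>
    <label class='new_font'>Duration at Rated Power (HH:MM)</label>
  </div>
  <div class='col-md-6 new_font'>
    <div></div>
    <div>
      <div>
        6:00.00
      </div>
    </div>
  </div>
</div>
"""

def duration_after_label(html, label_text="Duration at Rated Power (HH:MM)"):
    """Return the first digits:digits text node after the label, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the label whose only string child equals label_text.
    label = soup.find("label", string=label_text)
    if label is None:
        return None
    # Walk forward in document order to the first text node that looks like a time.
    node = label.find_next(string=re.compile(r"\d+:\d+"))
    return node.strip() if node is not None else None

print(duration_after_label(HTML))
```

Returning None instead of raising keeps the helper usable on pages where the field is missing; whether that is the right behaviour depends on how the scraper is meant to handle gaps.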
Answer 1 (score: 1)
Try this to get the time you want to scrape:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.energystorageexchange.org/projects/2")
soup = BeautifulSoup(page.content, 'lxml')
for item in soup.select("label.new_font"):
    if "HH:MM" in item.text:
        itemval = item.find_parent().find_next_sibling().text.strip()
        print(itemval)
Output:
6:00.00
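The parent-then-next-sibling navigation can also be generalised to collect every label and its value in one pass. The sketch below is only an illustration under a few assumptions: the inline HTML is a trimmed copy of the two rows from the question, field_values is a made-up helper name, and html.parser is used in place of lxml to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two rows shown in the question.
HTML = """
<div class='row'>
  <div class='col-md-6 zeropadding'>
    <label class='new_font'>Duration at Rated Power (HH:MM)</label>
  </div>
  <div class='col-md-6 new_font'>
    <div></div>
    <div><div> 6:00.00 </div></div>
  </div>
</div>
<div class='row'>
  <div class='col-md-6 zeropadding new_font'>
    <label class='new_font'>Weblink1</label>
  </div>
  <div class='col-md-6 new_font'>
    <div><div class='show_value'>
      <a href="http://www.gillsonions.com/node/192">http://www.gillsonions.com/node/192</a>
    </div></div>
  </div>
</div>
"""

def field_values(html):
    """Map each label's text to the text of the column that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    values = {}
    for label in soup.select("label.new_font"):
        cell = label.find_parent()  # the column div wrapping the label
        sibling = cell.find_next_sibling() if cell is not None else None
        if sibling is not None:
            # get_text(strip=True) drops the surrounding whitespace and newlines.
            values[label.get_text(strip=True)] = sibling.get_text(strip=True)
    return values

fields = field_values(HTML)
print(fields["Duration at Rated Power (HH:MM)"])
```

A dictionary keyed by label text makes it easy to pull out other fields later without writing a new loop for each one.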
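Answer 2 (score: 0)
Regarding your second question:
# Reuses soup from the previous answer and collects the value next to each label that mentions kW
output = []
for item in soup.select("label.new_font"):
    if "kW" in item.text:
        itemval = item.find_parent().find_next_sibling().text.strip()
        output.append(itemval)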
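For the hidden input quoted in the question, a direct lookup by id may be simpler than walking the labels. This is a minimal sketch, assuming the input is present in the HTML that requests downloads (if the page adds it with JavaScript, it will not be there). The inline HTML and the helper name size_in_kw are illustrative only.

```python
from bs4 import BeautifulSoup

# The element quoted in the question, wrapped in a form for illustration.
HTML = "<form><input id='size_in_kw' type='hidden' value='1500'></form>"

def size_in_kw(html):
    """Return the hidden size_in_kw value as an int, or None if absent or non-numeric."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("input", {"id": "size_in_kw"})
    if tag is None:
        # Guard against the AttributeError that .get() on None would raise.
        return None
    raw = tag.get("value")
    return int(raw) if raw and raw.isdigit() else None

print(size_in_kw(HTML))
```

The only difference from the one-liner in the question is the None check, which avoids a crash when the element is missing from the downloaded page.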