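I'm trying to parse HTML into a dictionary.
My current code contains a lot of logic, and it smells bad.
I'm using lxml to help me parse it. Is there a recommended way to parse HTML like this, which doesn't have a well-structured DOM?
Thanks a lot.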
<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>
<p><strong>Flight duration:</strong> 3h 45m</p>
<p><strong>Operated by:</strong> NokScoot</p>
{
Departs: "5:15:00AM, Sat, Nov 28, 2015",
Arrives: "8:00:00AM, Sat, Nov 28, 2015",
Flight duration: "3h 45m"
...
}
from lxml import html

# resp (the HTTP response), has_stops() and set_trace() come from elsewhere in my code
doc_root = html.document_fromstring(resp.text)
for ele in doc_root.xpath('//ul[@class="tb_body"]'):
    if has_stops(ele.xpath('.//li[@class="tb_body_flight"]//span[@class="has_cuspopup"]')):
        continue
    set_trace()
    from_city = ele.xpath('.//li[@class="tb_body_city"]')[0]
    set_trace()
    sub_ele = ele.xpath('.//li[@class="tb_body_flight"]//span[@class="has_cuspopup"]')
    set_trace()
Answer 0 (score: 0)
I created an example for the HTML you provided. It uses the popular Beautiful Soup library.
from bs4 import BeautifulSoup
data = '<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>\
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>\
<p><strong>Flight duration:</strong> 3h 45m</p>\
<p><strong>Operated by:</strong> NokScoot</p>'
soup = BeautifulSoup(data, 'html.parser')
res = {p.contents[0].text: p.contents[1].split(' - ')[0].strip() for p in soup.find_all('p')}
print(res)
Output:
{
'Departs:': '5:15:00AM, Sat, Nov 28, 2015',
'Flight duration:': '3h 45m',
'Operated by:': 'NokScoot',
'Arrives:': '8:00:00AM, Sat, Nov 28, 2015'
}
I think that if you want to keep the code compact, you should avoid relying on attributes.
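Since the question already uses lxml, the same idea can probably be written without Beautiful Soup. The following is only a sketch under a few assumptions: every field is a `<p>` whose first child is a `<strong>` label, the value is the text right after that label, and anything after the first ` - ` (the city) can be dropped. The names `SAMPLE` and `parse_flight_info` are made up for this illustration. It selects elements by structure (`//p[strong]`) rather than by class attributes.

```python
from lxml import html

# Sample markup copied from the question.
SAMPLE = """
<p><strong>Departs:</strong> 5:15:00AM, Sat, Nov 28, 2015 - Taipei</p>
<p><strong>Arrives:</strong> 8:00:00AM, Sat, Nov 28, 2015 - Bangkok - Don Mueang</p>
<p><strong>Flight duration:</strong> 3h 45m</p>
<p><strong>Operated by:</strong> NokScoot</p>
"""


def parse_flight_info(fragment):
    """Map each <strong> label to the text that follows it (hypothetical helper)."""
    root = html.document_fromstring(fragment)
    result = {}
    # Only look at paragraphs that contain a <strong> child.
    for p in root.xpath('//p[strong]'):
        strong = p.find('strong')
        # Label text without surrounding whitespace or the trailing colon.
        key = strong.text_content().strip().rstrip(':')
        # In lxml, the text after an element's closing tag is its .tail.
        # Assumption: everything after the first ' - ' is the city and is dropped.
        value = (strong.tail or '').split(' - ')[0].strip()
        result[key] = value
    return result


print(parse_flight_info(SAMPLE))
```

Unlike the Beautiful Soup version, this one strips the trailing colon from the keys, which matches the dictionary asked for in the question. If the real page puts these paragraphs inside the popup `<span>`, the same function could presumably be given just that part of the markup.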