我只是python的新手,我很难在标记之间获取文本,这是完整表格的html。
<div id="menu">
<h4 style="display:none">Horse Photo</h4>
<ul style="margin-top:5px;border-radius:6px">
<li style="padding:0">
<img src="/images/unknown_horse.png" style="width:298px;margin-bottom:-3px;border-radius:5px;">
</li>
</ul>
<h4>Horse Profile</h4>
<ul>
<li>Age<span>3yo</span></li>
<li>Foaled<span>17/11/2014</span></li>
<li>Country<span>New Zealand</span></li>
<li>Location<span>Kembla Grange</span></li>
<li>Sex<span>Filly</span></li>
<li>Colour<span>Grey</span></li>
<li>Sire<span>Mastercraftsman</span></li>
<li>Dam<span>In Essence</span></li>
<li>Trainer
<span>
<a href="/trainer/26970-r-l-price/">R & L Price</a>
</span>
</li>
<li>Earnings<span>$19,795</span></li>
</ul>
<h4>Owners</h4>
<ul>
<li style="font:normal 12px 'Tahoma">Bell View Park Stud (Mgr: A P Mackrell)</li>
</ul>
</div>
答案 0 :(得分:1)
要解析HTML,请使用beautifulsoup
软件包。这样,您可以轻松选择html文档的元素。要打印<span>
标签内的所有文本,可以使用以下示例:
data = """
<div id="menu">
<h4 style="display:none">Horse Photo</h4>
<ul style="margin-top:5px;border-radius:6px">
<li style="padding:0">
<img src="/images/unknown_horse.png" style="width:298px;margin-bottom:-3px;border-radius:5px;">
</li>
</ul>
<h4>Horse Profile</h4>
<ul>
<li>Age<span>3yo</span></li>
<li>Foaled<span>17/11/2014</span></li>
<li>Country<span>New Zealand</span></li>
<li>Location<span>Kembla Grange</span></li>
<li>Sex<span>Filly</span></li>
<li>Colour<span>Grey</span></li>
<li>Sire<span>Mastercraftsman</span></li>
<li>Dam<span>In Essence</span></li>
<li>Trainer
<span>
<a href="/trainer/26970-r-l-price/">R & L Price</a>
</span>
</li>
<li>Earnings<span>$19,795</span></li>
</ul>
<h4>Owners</h4>
<ul>
<li style="font:normal 12px 'Tahoma">Bell View Park Stud (Mgr: A P Mackrell)</li>
</ul>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for li in soup.select('span'):
if li.text.strip() == '':
continue
print(li.text)
将打印:
3yo
17/11/2014
New Zealand
Kembla Grange
Filly
Grey
Mastercraftsman
In Essence
R & L Price
$19,795
答案 1 :(得分:0)
有很多使用HTML / XML的选项。我更喜欢parsel
软件包。您可以使用以下命令将其安装到您的环境中:
$ pip install parsel
之后,您可以像这样使用它:
from parsel import Selector
sel = Selector(html)
sel.css('ul li::text').extract()
# ['Age',
# 'Foaled',
# 'Country',
# 'Location',
# 'Sex',
# 'Colour',
# 'Sire',
# 'Dam',
# 'Trainer',
# 'Earnings',
# 'Bell View Park Stud (Mgr: A P Mackrell)']
可以找到更详细的描述here。