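This is part of an HTML page from which I need to extract the following items: the name from the <strong> tag, the classification types (Actor and Singer), and the places of birth and death.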
<li class="clearfix">
<div style="margin-top:10px;">
<div class="float-left" style="margin-bottom:10px;">
<a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
<strong>Mr. Elvis Presley</strong></a>
</div>
<div class="rating_overall fleft" style="margin:0px 0px 0px 10px;">
<div class="rating_overall voted_rating_overall" style='width:72.96px;'></div>
</div>
<span class="result-vote float-left" id="result" style="line-height:15px; color: #AAA; font-size: 0.9em; margin-top: 1px;"> (15 vots)</span>
<div class="clear"></div>
<a href="http://" title="Mr. Elvis Presley" name="Mr. Elvis Presley">
<img style="float:left;" src="http://a.jpg" alt="Mr. Elvis Presley" title="Mr. Elvis Presley" />
</a>
<br/>
<p>
<b>Classification:</b>
<a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
, <a href="" title="Singer" name="Singer" class="underline">Singer</a>
<br />
<b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
<b>Died:</b>
Memphis,
<!--<b>City:</b>-->
<a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
</p>
<div class="clk"></div>
</div>
</li>
I tried using BeautifulSoup, but I am new to Python:
data2 = soup.find_all('li', {'class': 'clearfix'})
for container in data2:
    if container.find('a', {'class': 'float-left'}):
        name = container.a.text
        print(name)
    if container.find('a', {'class': 'underline'}):
        classification = container.div.p.a.text
        print(classification)
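The script raises no errors, but I only managed to extract the name and the first classification. How can I locate the other elements I need: the second classification ("Singer") and the places of birth and death?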
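Answer 0 (score: 0)
You can use Beautiful Soup as the HTML parser. I will show the Beautiful Soup approach first, then a regular-expression approach that uses capture groups to pick out the results.
First, Beautiful Soup: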
string_1="""<li class="clearfix">
<div style="margin-top:10px;">
<div class="float-left" style="margin-bottom:10px;">
<a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
<strong>Mr. Elvis Presley</strong></a>
</div>
<div class="rating_overall fleft" style="margin:0px 0px 0px 10px;">
<div class="rating_overall voted_rating_overall" style='width:72.96px;'></div>
</div>
<span class="result-vote float-left" id="result" style="line-height:15px; color: #AAA; font-size: 0.9em; margin-top: 1px;"> (15 vots)</span>
<div class="clear"></div>
<a href="http://" title="Mr. Elvis Presley" name="Mr. Elvis Presley">
<img style="float:left;" src="http://a.jpg" alt="Mr. Elvis Presley" title="Mr. Elvis Presley" />
</a>
<br/>
<p>
<b>Classification:</b>
<a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
, <a href="" title="Singer" name="Singer" class="underline">Singer</a>
<br />
<b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
<b>Died:</b>
Memphis,
<!--<b>City:</b>-->
<a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
</p>
<div class="clk"></div>
</div>
</li>"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(string_1, "html.parser")
# Print the name attribute of every <a> tag in the fragment.
for a in soup.find_all('a'):
    print(a['name'])
Output:
Elvis Presley
Mr. Elvis Presley
Actor
Singer
Tupelo
Memphis
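The loop above prints every name attribute but does not say which value belongs to which field. A minimal sketch of one way to group the values by their <b> label follows. It assumes each label (Classification, Born, Died) is a <b> tag whose values are the <a> siblings that follow it, up to the next <b>. The function name extract_people and the shortened html fragment are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Shortened copy of the fragment from the question (rating and image markup omitted).
html = """<li class="clearfix">
<div style="margin-top:10px;">
<div class="float-left" style="margin-bottom:10px;">
<a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
<strong>Mr. Elvis Presley</strong></a>
</div>
<p>
<b>Classification:</b>
<a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
, <a href="" title="Singer" name="Singer" class="underline">Singer</a>
<br />
<b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
<b>Died:</b>
Memphis,
<!--<b>City:</b>-->
<a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
</p>
</div>
</li>"""


def extract_people(markup):
    """Return one dict per <li class="clearfix">, keyed by lower-cased label."""
    soup = BeautifulSoup(markup, "html.parser")
    people = []
    for item in soup.find_all("li", class_="clearfix"):
        strong = item.find("strong")
        record = {"name": strong.get_text(strip=True) if strong else None}
        # Commented-out labels such as <!--<b>City:</b>--> are Comment nodes,
        # so find_all("b") does not return them.
        for label in item.find_all("b"):
            key = label.get_text(strip=True).rstrip(":").lower()
            values = []
            for sibling in label.next_siblings:
                if isinstance(sibling, Tag) and sibling.name == "b":
                    break  # the next label starts here
                if isinstance(sibling, Tag) and sibling.name == "a":
                    values.append(sibling.get_text(strip=True))
            record[key] = values
        people.append(record)
    return people


print(extract_people(html))
```

Each field comes back as a list because a label may have several links, as Classification does here. The original attempt only returned the first classification because container.div.p.a always resolves to the first <a> inside the <p>.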
Second, the regular expression:
If the markup has exactly the same layout as what you showed, use this:
import re
string_1="""<li class="clearfix">
<div style="margin-top:10px;">
<div class="float-left" style="margin-bottom:10px;">
<a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
<strong>Mr. Elvis Presley</strong></a>
</div>
<div class="rating_overall fleft" style="margin:0px 0px 0px 10px;">
<div class="rating_overall voted_rating_overall" style='width:72.96px;'></div>
</div>
<span class="result-vote float-left" id="result" style="line-height:15px; color: #AAA; font-size: 0.9em; margin-top: 1px;"> (15 vots)</span>
<div class="clear"></div>
<a href="http://" title="Mr. Elvis Presley" name="Mr. Elvis Presley">
<img style="float:left;" src="http://a.jpg" alt="Mr. Elvis Presley" title="Mr. Elvis Presley" />
</a>
<br/>
<p>
<b>Classification:</b>
<a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
, <a href="" title="Singer" name="Singer" class="underline">Singer</a>
<br />
<b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
<b>Died:</b>
Memphis,
<!--<b>City:</b>-->
<a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
</p>
<div class="clk"></div>
</div>
</li>"""
# One alternative per field; each depends on the exact line layout shown above.
pattern = r'<strong>(\w.+)<\/strong>|<b>Classification:<\/b>(\s.+)(\s.+)|(Born:.+)|(Died:.+\s.+\s.+\s.+)'
# Pulls the value of a name="..." attribute out of a captured chunk.
pattern_2 = r'name=["](\w.+?)["]'
match = re.finditer(pattern, string_1, re.M)
for find in match:
    if find.group(1):
        print("Name {}".format(find.group(1)))
    if find.group(2):
        print("Classification first {}".format(re.search(pattern_2, str(find.group(2))).group(1)))
        print("Classification second {}".format(re.search(pattern_2, str(find.group(3))).group(1)))
    if find.group(4):
        print("Born {}".format(re.search(pattern_2, str(find.group(4))).group(1)))
    if find.group(5):
        print("Died {}".format(re.search(pattern_2, str(find.group(5))).group(1)))
Output:
Name Mr. Elvis Presley
Classification first Actor
Classification second Singer
Born Tupelo
Died Memphis
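The pattern above only works when each field sits on the exact lines shown, with exactly two classifications. As a rough sketch of a variant that depends less on line layout, the fragment can be split at each <b>Label:</b> marker and the link texts collected from each piece. The function name extract_fields and the shortened html fragment are hypothetical, and this is still regex over HTML, so it assumes simple, non-nested <a> tags like the ones in the question.

```python
import re

# Shortened copy of the fragment from the question.
html = """<li class="clearfix">
<a href="http://" title="Elvis Presley" name="Elvis Presley" class="float-left">
<strong>Mr. Elvis Presley</strong></a>
<p>
<b>Classification:</b>
<a href="http://" title="Actor " name="Actor " class="underline">Actor </a>
, <a href="" title="Singer" name="Singer" class="underline">Singer</a>
<br />
<b>Born:</b> <a href="http://" title="Tupelo" name="Tupelo" class="underline">Tupelo</a><br />
<b>Died:</b>
Memphis,
<!--<b>City:</b>-->
<a href="http://" title="Memphis" name="Memphis" class="underline">Memphis</a>
</p>
</li>"""


def extract_fields(markup):
    """Return a dict with the name and a list of link texts per <b> label."""
    fields = {}
    name = re.search(r"<strong>(.*?)</strong>", markup, re.S)
    fields["name"] = name.group(1).strip() if name else None
    # Remove HTML comments so the commented-out City label is not treated as a field.
    cleaned = re.sub(r"<!--.*?-->", "", markup, flags=re.S)
    # With one capture group, re.split returns
    # [prefix, label1, body1, label2, body2, ...].
    parts = re.split(r"<b>\s*([^<:]+):\s*</b>", cleaned)
    for label, body in zip(parts[1::2], parts[2::2]):
        links = re.findall(r"<a\b[^>]*>(.*?)</a>", body, re.S)
        fields[label.strip().lower()] = [text.strip() for text in links]
    return fields


print(extract_fields(html))
```

For pages whose markup may vary, a real parser such as Beautiful Soup is usually the safer choice; the regex route is mainly useful when the layout is known to be fixed.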