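I'm trying to scrape a website, and one part of it is confusing me. The organization provides an unordered list of locations, and I can't seem to parse the whole list.
Here is a sample of the HTML: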
<div id="current_tab">
<p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies Served</p>
<ul>
<li class="view_type_geoserved" id="view_field_geoserved">
<p style="font-weight: bold; border-bottom: 1px dotted #CCC; font-size: .9em;">North Carolina (NC)<span style="float: right; font-size: 0.8em;">North Carolina (NC)</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Durham (serves entire county)<span style="float: right; font-size: 0.8em;">Durham</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Franklin (serves entire county)<span style="float: right; font-size: 0.8em;">Franklin</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Granville (serves entire county)<span style="float: right; font-size: 0.8em;">Granville</span>
</p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Orange (serves entire county)<span style="float: right; font-size: 0.8em;">Orange</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Person (serves entire county)<span style="float: right; font-size: 0.8em;">Person</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Vance (serves entire county)<span style="float: right; font-size: 0.8em;">Vance</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Wake (serves entire county)<span style="float: right; font-size: 0.8em;">Wake</span></p>
</li>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Warren (serves entire county)<span style="float: right; font-size: 0.8em;">Warren</span></p>
</li>
</ul>
</div>
This is what I'm using to parse the elements:
for i in soup.find('div', {'id': 'current_tab'}).find_all('p'):
    print(i)
This is the result I get. Note that it is only the beginning of the list:
<p class="view_label_type_geoserved" id="view_label_field_geoserved">Geographies Served</p>
<p style="font-weight: bold; border-bottom: 1px dotted #CCC; font-size: .9em;">North Carolina (NC)<span style="float: right; font-size: 0.8em;">North Carolina (NC)</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Durham (serves entire county)<span style="float: right; font-size: 0.8em;">Durham</span></p>
<p style="margin: 5px 0 3px 8px; border-bottom: 1px dotted #DDD; font-size:1em">Franklin (serves entire county)<span style="float: right; font-size: 0.8em;">Franklin</span></p>
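Once I have the HTML back, I plan to use a regex to strip out the text and then join the pieces into a single string, but other suggestions would be appreciated too.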
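Answer 0 (score: 4)
The problem is that the HTML you are dealing with is malformed (most of the </li> closing tags have no matching <li>), so it needs a lenient parser.
Use lxml or html5lib: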
from bs4 import BeautifulSoup
# data holds the HTML source of the page
soup = BeautifulSoup(data, 'html5lib')  # or BeautifulSoup(data, 'lxml')
for p in soup.select('div#current_tab p'):
    print(p.text)
This works for me and prints:
Geographies Served
North Carolina (NC)North Carolina (NC)
Durham (serves entire county)Durham
Franklin (serves entire county)Franklin
Granville (serves entire county)Granville
Orange (serves entire county)Orange
Person (serves entire county)Person
Vance (serves entire county)Vance
Wake (serves entire county)Wake
Warren (serves entire county)Warren
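As a possible alternative to the regex step planned in the question, here is a minimal sketch that takes the place names straight from the parsed tree and joins them into one string. It assumes, based only on the sample above, that each span repeats the bare place name and that the first span is the state. The shortened markup, the make_soup helper, and the output format are illustrative and not part of the original post.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Shortened copy of the question's markup, keeping the stray </li> tags.
html = """
<div id="current_tab">
<p id="view_label_field_geoserved">Geographies Served</p>
<ul>
<li id="view_field_geoserved">
<p>North Carolina (NC)<span>North Carolina (NC)</span></p>
<p>Durham (serves entire county)<span>Durham</span></p>
</li>
<p>Franklin (serves entire county)<span>Franklin</span></p>
</li>
<p>Granville (serves entire county)<span>Granville</span>
</p>
</li>
</ul>
</div>
"""


def make_soup(markup):
    """Hypothetical helper: parse with whichever lenient parser is installed."""
    for parser in ("html5lib", "lxml"):
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("install html5lib or lxml")


soup = make_soup(html)

# Assumption: every entry carries its bare name in a <span>;
# the heading paragraph has no span, so it is skipped automatically.
names = [span.get_text(strip=True)
         for span in soup.select("div#current_tab p span")]

# Assumption: the first span is the state, the rest are counties.
state, counties = names[0], names[1:]
result = "{}: {}".format(state, ", ".join(counties))
print(result)
```

With the shortened sample this should print North Carolina (NC): Durham, Franklin, Granville. Reading the span text avoids the doubled output (for example "Durham (serves entire county)Durham") that p.text gives, so no regex cleanup should be needed.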