I have an HTML document, shown below, and self.soup is a BeautifulSoup object. I am trying to scrape data from the list elements. The list looks like this:
<ul class="list-group">
<li class="list-group-item">
<span class="strong">Name</span>
<span class="pull-right">Piter</span>
</li>
<li class="list-group-item">
<span class="strong">Year</span>
<span class="pull-right">2017</span>
</li>
</ul>
Python file scrape.py:
# person is a dict
need = { 'Name' : 'name',
         'Year' : 'year'
       }
First attempt:
specs = self.soup.select("ul.list-group li.list-group-item")
if len(specs) > 0 :
    for data in specs :
        text = data.get_text()
        if need.has_key( data[0].strip()) :
            if need[ data[0].strip() ] not in person or person[ need[ data[0].strip() ] ] == '':
                person[ need[ text[0].strip() ] ] = text[1].strip()
First error:
  File "scraper.py", line 68, in scrape
    if need.has_key( data[0].strip()) :
  File "build/bdist.linux-x86_64/egg/bs4/element.py", line 1011, in __getitem__
KeyError: 0
Second attempt:
specs = self.soup.select("ul.list-group li.list-group-item")
if len(specs) > 0 :
    for data in specs :
        text = [ data.contents[0].get_text(), data.contents[1].get_text() ]
        if need.has_key( data[0].strip()) :
            if need[ data[0].strip() ] not in person or person[ need[ data[0].strip() ] ] == '':
                person[ need[ text[0].strip() ] ] = text[1].strip()
Second error:
  File "site_scrapers/v12software.scraper.py", line 66, in scrape
    text = [ data.contents[0].get_text(), data.contents[1].get_text() ]
  File "build/bdist.linux-x86_64/egg/bs4/element.py", line 737, in __getattr__
AttributeError: 'NavigableString' object has no attribute 'get_text'
I am trying to put the element strings into the person dict.
I need the result to look like this:
print person['Name']
#output Piter
print person['Year']
#output 2017
Answer 0 (score: 3)
from bs4 import BeautifulSoup

html = """<ul class="list-group">
<li class="list-group-item">
<span class="strong">Name</span>
<span class="pull-right">Piter</span>
</li>
<li class="list-group-item">
<span class="strong">Year</span>
<span class="pull-right">2017</span>
</li>
</ul>"""

soup = BeautifulSoup(html, 'html.parser')

need = {}
# Walk every <ul class="list-group">, then each <li class="list-group-item"> inside it.
for ul_tag in soup.find_all('ul', {'class': 'list-group'}):
    for li_tag in ul_tag.find_all('li', {'class': 'list-group-item'}):
        # The first span holds the field label, the second holds its value.
        field = li_tag.find('span', {'class': 'strong'}).text
        value = li_tag.find('span', {'class': 'pull-right'}).text
        need[field] = value

print(need)
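As a note on the two errors in the question: indexing a Tag with data[0] looks up an HTML attribute named 0, which is why it raises KeyError: 0, and data.contents also contains the whitespace between the tags as NavigableString objects, which in the bs4 version shown in the traceback have no get_text method. The following is a minimal sketch, not the asker's actual scraper, of how the approach above could be combined with the need mapping from the question. The helper name fill_person is hypothetical, and the "skip keys that are already filled" rule is an assumption read off the condition in the question's code.

```python
from bs4 import BeautifulSoup

html = """<ul class="list-group">
<li class="list-group-item">
<span class="strong">Name</span>
<span class="pull-right">Piter</span>
</li>
<li class="list-group-item">
<span class="strong">Year</span>
<span class="pull-right">2017</span>
</li>
</ul>"""

# Mapping from the label shown on the page to the key wanted in `person`,
# copied from the question.
need = {'Name': 'name', 'Year': 'year'}


def fill_person(soup, need, person=None):
    """Hypothetical helper: copy the wanted label/value pairs into `person`."""
    person = {} if person is None else person
    for li in soup.select("ul.list-group li.list-group-item"):
        # find_all returns only Tag objects, so the whitespace text nodes
        # that show up in li.contents are skipped.
        spans = li.find_all("span")
        if len(spans) < 2:
            continue
        label = spans[0].get_text(strip=True)
        value = spans[1].get_text(strip=True)
        key = need.get(label)
        # Assumption: only fill keys that are wanted and still missing or empty.
        if key is not None and not person.get(key):
            person[key] = value
    return person


soup = BeautifulSoup(html, "html.parser")
person = fill_person(soup, need)
print(person)
```

With this sketch the values end up under the keys taken from need, so they would be read as person['name'] and person['year'] rather than person['Name'] and person['Year'].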