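I am trying to scrape some data from the site https://www.cellartracker.com/m/wines/12344. I don't understand how to get the values that are not wrapped in any tag with a class. Here is the part of the site's HTML I am interested in: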
<ul class="twin-set-list">
<li><span>Vintage</span> 2000</li>
<li><span>Type</span> Red</li>
<li><span>Producer</span> Balnaves of Coonawarra</li>
<li><span>Varietal</span> Cabernet Sauvignon</li>
<li><span>Designation</span> The Tally Reserve</li>
<li><span>Vineyard</span> n/a</li>
<li><span>Country</span> Australia</li>
<li><span>Region</span> South Australia</li>
<li><span>SubRegion</span> Limestone Coast</li>
<li><span>Appellation</span> Coonawarra</li>
</ul>
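Values such as 2000, Red, and so on have no class of their own, so how can I get at them? I tried the following Python code (only the relevant part of the HTML is included below):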
from bs4 import BeautifulSoup
html = """<ul class="twin-set-list">
<li><span>Vintage</span> 2000</li>
<li><span>Type</span> Red</li>
<li><span>Producer</span> Balnaves of Coonawarra</li>
<li><span>Varietal</span> Cabernet Sauvignon</li>
<li><span>Designation</span> The Tally Reserve</li>
<li><span>Vineyard</span> n/a</li>
<li><span>Country</span> Australia</li>
<li><span>Region</span> South Australia</li>
<li><span>SubRegion</span> Limestone Coast</li>
<li><span>Appellation</span> Coonawarra</li>
</ul>"""
soup = BeautifulSoup(html, 'html.parser')
need = {}
for li_tag in soup.find_all('ul', {'class':'twin-set-list'}):
    for span_tag in li_tag.find_all('li'):
        field = span_tag.find('span').text
        value = span_tag.find('span').text
        need[field] = value
print(need)
Can anyone suggest how to extract the data?
Answer 0 (score: 1)
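You can iterate over the contents of each bs4 tag object (iterating a tag yields the children stored in its contents attribute):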
from bs4 import BeautifulSoup as soup
d = [[getattr(c, 'text', c).strip() for c in i] for i in soup(html, 'html.parser').find_all('li')]
Output:
[['Vintage', '2000'], ['Type', 'Red'], ['Producer', 'Balnaves of Coonawarra'], ['Varietal', 'Cabernet Sauvignon'], ['Designation', 'The Tally Reserve'], ['Vineyard', 'n/a'], ['Country', 'Australia'], ['Region', 'South Australia'], ['SubRegion', 'Limestone Coast'], ['Appellation', 'Coonawarra']]
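If a dictionary like the `need` mapping from the question is wanted rather than a list of pairs, the pairs can be handed to `dict`. This is a minimal sketch that assumes every inner list has exactly two items, as in the output above; the list is shortened here for illustration.

```python
# Shortened copy of the pair list shown in the output above.
d = [['Vintage', '2000'], ['Type', 'Red'], ['Producer', 'Balnaves of Coonawarra']]

# dict() accepts any iterable of two-item sequences: first item is the key,
# second item is the value.
need = dict(d)

print(need['Vintage'])
print(need)
```

An `li` with more than one child after the span would produce a longer inner list, and `dict` would then raise a `ValueError`, so this only fits markup shaped like the sample.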
Answer 1 (score: 1)
You can replace those two lines of your code with:
field = span_tag.find('span').text
value = span_tag.text.replace(field, '').strip()  # strip() drops the space left after the label
It isn't very clean, but it works with your code.
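A variant that avoids string replacement is to read the text node that follows the span directly. This is only a sketch, assuming each `li` holds exactly one `span` followed by a bare text node, as in the sample HTML (shortened here); `next_sibling` is a standard Beautiful Soup navigation attribute.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question.
html = """<ul class="twin-set-list">
<li><span>Vintage</span> 2000</li>
<li><span>Type</span> Red</li>
<li><span>Producer</span> Balnaves of Coonawarra</li>
</ul>"""

soup = BeautifulSoup(html, 'html.parser')
need = {}
for li_tag in soup.find('ul', class_='twin-set-list').find_all('li'):
    span_tag = li_tag.find('span')
    field = span_tag.get_text(strip=True)
    # The value is the text node right after the <span>; fall back to an
    # empty string if the <span> happens to be the last child.
    value = str(span_tag.next_sibling or '').strip()
    need[field] = value

print(need)
```

Because the value is taken from its own node, it does not matter whether the label text also appears inside the value.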
Answer 2 (score: 0)
Maybe you can try the following:
for li_tag in soup.find_all('ul', {'class':'twin-set-list'}):
    for span_tag in li_tag.find_all('li'):
        field = span_tag.find('span').text
        value = span_tag.text
        # Skip the label plus the single space that follows it.
        value = value[len(field)+1:]
        need[field] = value
Just in case the value itself contains the same text as the field name, don't use replace; take a substring instead.
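The difference between the two approaches can be seen with plain strings. The text `'Type Red Type'` below is a made-up example, not something taken from the site, and the two helper functions are hypothetical names used only for this illustration. The slice uses `strip()` instead of the `+1` offset, which gives the same result for a single separating space.

```python
def value_by_replace(text, field):
    # Removes EVERY occurrence of the label, including ones inside the value.
    return text.replace(field, '').strip()


def value_by_slice(text, field):
    # Drops only the leading label, then trims the separating whitespace.
    return text[len(field):].strip()


text = 'Type Red Type'  # hypothetical <li> text where the value repeats the label
field = 'Type'

print(value_by_replace(text, field))  # Red
print(value_by_slice(text, field))    # Red Type
```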