While trying to learn web scraping with Python, I have fetched the lunch menu from http://bramatno8.kvartersmenyn.se/.
The page is structured as follows:
<div class="menu">
<strong>Monday<br></strong>
<br>
Food 1<br>
Food 2
<br><br>
<strong>Tuesday<br></strong>
<br>
Food 3<br>
Food 4
<br><br>
<strong>Wednesday<br></strong>
<br>
Food 5<br>
Food 6
<br><br>
<strong>Thursday<br></strong>
<br>
Food 7<br>
Food 8
<br><br>
<strong>Friday<br></strong>
<br>
Food 9<br>
Food 10
<br><br>
</div>
This is what I have so far:
import requests
from bs4 import BeautifulSoup
url = 'http://lunchmenu.com'
fetchlunch = requests.get(url)
soup = BeautifulSoup(fetchlunch.text, 'html.parser')
menu = soup.findAll(class_='menu')[0]
# Turn every <br> into a newline so get_text() keeps the line breaks
for br in menu.find_all('br'):
    br.replace_with('\n')
print(menu.get_text())
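This prints the whole week's menu in one block.
What I want is to get the menu for a single day only. That is, if it is Tuesday, only Tuesday's menu should be shown. So I assume I need to put the results into an array before I can pull out the current day's menu?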
Answer 0: (score: 3)
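One approach is to find the <strong> tag whose content matches the day, then use .next_siblings to iterate over the foods until you hit another <strong> or run out of siblings. I used the lxml parser, but it also works with html.parser.
Here it is on your example DOM (I adjusted the foods to make the result obvious):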
import bs4
import requests
day = "Tuesday"
dom = """
<div class="menu">
<strong>Monday</strong>
<br>
Food 1<br>
Food 2
<br><br>
<strong>Tuesday</strong>
<br>
Food 3<br>
Food 4
<br><br>
<strong>Wednesday</strong>
<br>
Food 5<br>
Food 6
<br><br>
<strong>Thursday</strong>
<br>
Food 7<br>
Food 8
<br><br>
<strong>Friday</strong>
<br>
Food 9<br>
Food 10
<br><br>
</div>
"""
soup = bs4.BeautifulSoup(dom, "lxml")
menu = soup.find(class_ = "menu")
foods = []
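# Walk the nodes after the matching <strong> until the next day's header
for elem in menu.find("strong", string=day).next_siblings:
    if elem.name == "strong":
        break
    # Keep only non-empty text nodes; <br> tags and whitespace are skipped
    if isinstance(elem, bs4.element.NavigableString) and elem.strip() != "":
        foods.append(elem.strip())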
print(foods)
Output:
['Food 3', 'Food 4']
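The question asks for today's menu rather than a hard-coded day, so the `day` variable could be derived from the current date. The following is a minimal sketch and not part of the original approach: the `day_name` helper and both name lists are hypothetical, and the Swedish labels are an assumption about how the live sites spell the weekdays (only "Måndag" is confirmed by the pages discussed here).

```python
import datetime

# Weekday labels indexed by date.weekday(), where Monday == 0.
ENGLISH_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
# Assumption: labels as they might appear on the Swedish sites.
SWEDISH_DAYS = ["Måndag", "Tisdag", "Onsdag", "Torsdag", "Fredag"]


def day_name(date, names=ENGLISH_DAYS):
    """Return the menu label for `date`, or None on weekends."""
    index = date.weekday()
    return names[index] if index < len(names) else None


day = day_name(datetime.date.today())
if day is None:
    print("No lunch menu on weekends")
else:
    print(day)
```

An explicit list is used instead of `strftime("%A")` because the latter depends on the process locale, which may not match the language of the page being scraped.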
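Here it is on the first live site, https://www.kvartersmenyn.se/rest/15494. Note the extended character encoding, and the lambda that keeps the match working in case the <b> tag contains more content than just the day name: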
# -*- coding: latin1 -*-
import bs4
import requests
day = "Måndag"
url = "https://www.kvartersmenyn.se/rest/15494"
soup = bs4.BeautifulSoup(requests.get(url).text, "lxml")
menu = soup.find(class_ = "meny")
foods = []
# Substring match on the <b> text; guard against tags whose .string is None
for elem in menu.find("b", string=lambda x: x is not None and day in x).next_siblings:
    if elem.name == "b":
        break
    if isinstance(elem, bs4.element.NavigableString):
        foods.append(elem)
print(day)
for food in foods:
    print(food)
Output:
Måndag
A: Gaeng phed**
röd curry i cocosmjölk med sötbasilika, wokade blandade grönsaker
B: Ghai phad med mauang** (biff) wok i chilipaste med cashewnötter, grönsaker
C: Phad bamme (fläsk) wokade äggnudlar i ostronsås, grönsaker
D: Satay gay currymarinerade kycklingfiléspett med jordnötssås
E: Gai chup pheng tood*
Friterad kyckling med söt chilisås och ris
F: Phad bambou* (biff) wok i ostronsås med bambu, lök, champinjoner
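Finally, here it is on your second live site, http://bramatno8.kvartersmenyn.se/. All of these sites have different and inconsistent structures, so it is not obvious whether there is a silver bullet that works for all of them. I suspect these menus are hand-coded by people who may not be familiar with document structure, so it will take some work to cope with arbitrary updates to the pages.
Here it is: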
# -*- coding: latin1 -*-
import bs4
import requests
day = "Måndag"
url = "http://bramatno8.kvartersmenyn.se/"
soup = bs4.BeautifulSoup(requests.get(url).text, "lxml")
menu = soup.find(class_ = "meny")
foods = []
# Find the text node for the day, then walk the siblings of its parent tag
for elem in menu.find(string=day).parent.next_siblings:
    if elem.name == "strong":
        break
    if isinstance(elem, bs4.element.NavigableString):
        foods.append(elem)
print(day)
for food in foods:
    print(food)
Output:
Måndag
Viltskav med rårörda lingon (eko), vaxbönor och potatispuré
Sesambakad blomkål med sojamarinerade böngroddar, salladslök, rädisa och sojabönor samt ris
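Because the three pages differ only in the container class ("menu" or "meny") and the header tag (<strong> or <b>), one way to reduce the per-site work is a single helper that tolerates both. This is a rough sketch under those assumptions, not a guaranteed fix for every page: `foods_for_day` and its parameters are hypothetical names, and it uses html.parser so that lxml is not required.

```python
import bs4


def foods_for_day(html, day, menu_classes=("menu", "meny"),
                  header_tags=("strong", "b")):
    """Return the food lines listed under `day`, or [] if nothing matches."""
    soup = bs4.BeautifulSoup(html, "html.parser")

    # Try each known container class until one is present on the page.
    menu = None
    for cls in menu_classes:
        menu = soup.find(class_=cls)
        if menu is not None:
            break
    if menu is None:
        return []

    # First text node that mentions the day (substring match).
    label = menu.find(string=lambda s: s is not None and day in s)
    if label is None:
        return []

    # Start from the enclosing header tag when the label is wrapped in one.
    start = label.parent if label.parent.name in header_tags else label

    foods = []
    for elem in start.next_siblings:
        # The next header tag marks the start of the following day.
        if isinstance(elem, bs4.element.Tag) and elem.name in header_tags:
            break
        if isinstance(elem, bs4.element.NavigableString) and elem.strip():
            foods.append(elem.strip())
    return foods


sample = """
<div class="menu">
<strong>Monday<br></strong><br>Food 1<br>Food 2<br><br>
<strong>Tuesday<br></strong><br>Food 3<br>Food 4<br><br>
</div>
"""
print(foods_for_day(sample, "Tuesday"))
```

Returning an empty list instead of raising keeps a daily script from crashing when a page is restructured, at the cost of hiding the reason for the miss; logging the failed step would be a reasonable addition.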