Grouping web scraping results

Posted: 2019-03-10 01:20:04

Tags: python web-scraping beautifulsoup

While trying to learn web scraping with Python, I've been fetching the lunch menu from http://bramatno8.kvartersmenyn.se/.

The page is structured like this:

<div class="menu">
<strong>Monday<br></strong>
<br>
Food 1<br>
Food 2
<br><br>
<strong>Tuesday<br></strong>
<br>
Food 3<br>
Food 4
<br><br>
<strong>Wednesday<br></strong>
<br>
Food 5<br>
Food 6
<br><br>
<strong>Thursday<br></strong>
<br>
Food 7<br>
Food 8
<br><br>
<strong>Friday<br></strong>
<br>
Food 9<br>
Food 10
<br><br>
</div>

So this is what I have so far:

import requests
from bs4 import BeautifulSoup

url = 'http://lunchmenu.com'

fetchlunch = requests.get(url)

soup = BeautifulSoup(fetchlunch.text, 'html.parser')

menu = soup.findAll(class_='menu')[0]

for br in menu.find_all('br'):
    br.replace_with('\n')

print(menu.get_text())

So this prints the entire week's menu in one block.

What I'd like to do is get the menu for a single day only. That is, if it's Tuesday, only Tuesday's menu should be shown. So I guess I need to put the results into an array and then pull out the current day's menu from it?
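(For reference, one standard-library way to get the current day's name, which could then be used for the lookup, is:)

```python
import datetime

# English weekday name of today, e.g. "Tuesday" -- these match
# the day headers used in the page's <strong> tags
today = datetime.date.today().strftime("%A")
print(today)
```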

1 Answer:

Answer 0 (score: 3)

One approach is to find the <strong> tag whose content matches the day, then walk the foods with .next_siblings until you hit another <strong> or run out of siblings. I used the lxml parser, but it also works with html.parser.

Here it is running against your example DOM (I adjusted it a bit so the result is obvious):

import bs4

day = "Tuesday"
dom = """
<div class="menu">
<strong>Monday</strong>
<br>
Food 1<br>
Food 2
<br><br>
<strong>Tuesday</strong>
<br>
Food 3<br>
Food 4
<br><br>
<strong>Wednesday</strong>
<br>
Food 5<br>
Food 6
<br><br>
<strong>Thursday</strong>
<br>
Food 7<br>
Food 8
<br><br>
<strong>Friday</strong>
<br>
Food 9<br>
Food 10
<br><br>
</div>
"""

soup = bs4.BeautifulSoup(dom, "lxml")
menu = soup.find(class_="menu")
foods = []

# Walk the siblings after the matching day header until the next header
for elem in menu.find("strong", text=day).next_siblings:
    if elem.name == "strong":
        break

    if isinstance(elem, bs4.element.NavigableString) and elem.strip() != "":
        foods.append(elem.strip())

print(foods)

Output:

['Food 3', 'Food 4']
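As a variation (a sketch of mine, not required for the answer), the same traversal can walk the whole div once and group every day's foods into a dict, which is close to the array the question asks about. Here it is against a shortened copy of the sample DOM:

```python
import bs4

# Shortened copy of the sample DOM above
dom = """
<div class="menu">
<strong>Monday</strong>
<br>
Food 1<br>
Food 2
<br><br>
<strong>Tuesday</strong>
<br>
Food 3<br>
Food 4
<br><br>
</div>
"""

week = {}
current = None

for elem in bs4.BeautifulSoup(dom, "html.parser").find(class_="menu").children:
    if elem.name == "strong":
        # A new day header starts a new group
        current = elem.get_text(strip=True)
        week[current] = []
    elif isinstance(elem, bs4.element.NavigableString) and elem.strip():
        week[current].append(elem.strip())

print(week)
```

From there, pulling out one day is just a dict lookup like `week["Tuesday"]`.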

Here it is on the first live site, https://www.kvartersmenyn.se/rest/15494. Note the extended character encoding, and the lambda that makes the match work in case the <b> tag contains additional content:

# -*- coding: latin1 -*-

import bs4
import requests

day = "Måndag"
url = "https://www.kvartersmenyn.se/rest/15494"

soup = bs4.BeautifulSoup(requests.get(url).text, "lxml")
menu = soup.find(class_="meny")
foods = []

for elem in menu.find("b", text=lambda x: day in x).next_siblings:
    if elem.name == "b":
        break

    if isinstance(elem, bs4.element.NavigableString):
        foods.append(elem)

print(day)

for food in foods:
    print(food)

Output:

Måndag
A: Gaeng phed**
röd curry i cocosmjölk med sötbasilika, wokade blandade grönsaker
B: Ghai phad med mauang** (biff) wok i chilipaste med cashewnötter, grönsaker
C: Phad bamme (fläsk) wokade äggnudlar i ostronsås, grönsaker
D: Satay gay currymarinerade kycklingfiléspett med jordnötssås
E: Gai chup pheng tood*
Friterad kyckling med söt chilisås och ris
F: Phad bambou* (biff) wok i ostronsås med bambu, lök, champinjoner

Finally, here it is on your second live site, http://bramatno8.kvartersmenyn.se/. All of these sites have different and inconsistent structures, so it's not obvious that there's a silver bullet that works for all of them. I suspect these menus are hand-coded by people who may not understand the document structure, so expect some work to handle arbitrary updates to the pages.

Here it is:

# -*- coding: latin1 -*-

import bs4
import requests

day = "Måndag"
url = "http://bramatno8.kvartersmenyn.se/"

soup = bs4.BeautifulSoup(requests.get(url).text, "lxml")
menu = soup.find(class_="meny")
foods = []

for elem in menu.find(text=day).parent.next_siblings:
    if elem.name == "strong":
        break

    if isinstance(elem, bs4.element.NavigableString):
        foods.append(elem)

print(day)

for food in foods:
    print(food)

Output:

Måndag
Viltskav med rårörda lingon (eko), vaxbönor och potatispuré
Sesambakad blomkål med sojamarinerade böngroddar, salladslök, rädisa och sojabönor samt ris
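Since the Swedish sites use Swedish day names, the hard-coded `day = "Måndag"` can also be derived from the current date with a small lookup table (the spellings below are my assumption, matching the "Måndag" seen in the output above):

```python
import datetime

# Swedish weekday names, indexed by date.weekday() (0 = Monday)
SWEDISH_DAYS = ["Måndag", "Tisdag", "Onsdag", "Torsdag", "Fredag", "Lördag", "Söndag"]

def swedish_day(date=None):
    """Swedish name of the given date's weekday (today if no date is given)."""
    date = date or datetime.date.today()
    return SWEDISH_DAYS[date.weekday()]

print(swedish_day(datetime.date(2019, 3, 11)))  # a Monday -> "Måndag"
```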