Question

I am having trouble scraping this link with Python 3 and BeautifulSoup 4:

http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining

I only want to get this part:

When you are in ...

Capitol City Grille
This downtown Lansing restaurant offers ...

Capitol City Grille Lounge
For a glass of wine or a ...

Room Service
If you prefer ...

I have this code:

for rest in dining_page_soup.select("div.copy_left p strong"):
    if rest.next_sibling is not None:
        if rest.next_sibling.next_sibling is not None:
            title = rest.text
            desc = rest.next_sibling.next_sibling
            print("Title:  " + title)
            print(desc)

But it gives me TypeError: 'NoneType' object is not callable on the line desc = rest.next_sibling.next_sibling, even though I have an if statement that checks whether it is None.

Answer 1

Here is a very simple solution:

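from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining")
data = r.text
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(data, "html.parser")
for found_text in soup.select('div.copy_left'):
    # print() is a function in Python 3.
    print(found_text.text)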

Update

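Following the improved question, here is a solution using a regular expression. The first paragraph, "When you ...", needs its own special handling, because it does not follow the structure of the other paragraphs.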

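import re

# Match every tag whose name starts with "strong".
for tag in soup.find_all(re.compile("^strong")):
    title = tag.text
    # Skip the node right after the title and take the one after it as the description.
    desc = tag.next_sibling.next_sibling
    print("Title:  " + title)
    print(desc)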

Output

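Title: Capitol City Grille

This downtown Lansing restaurant offers delicious, contemporary American cuisine in an upscale yet relaxed setting. You can enjoy dishes ranging from fluffy pancakes to juicy filet mignon. Breakfast and lunch buffets are available, as well as an a la carte menu.

Title: Capitol City Grille Lounge

For a glass of wine or a handcrafted cocktail and great conversation, spend an afternoon or evening at the Capitol City Grille Lounge with friends or colleagues.

Title: Room Service

If you prefer to dine in the comfort of your own room, you can order from the room service menu.

Title: Menus

Breakfast Menu

Title: Capitol City Grille Hours

Breakfast, 6:30-11 a.m.

Title: Capitol City Grille Lounge Hours
     Thursday, 11 a.m. to 11 p.m.

Title: Room Service Hours

Daily, 6:30 a.m. to 2:00 p.m. and 5-10 p.m.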
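
The output above also lists entries such as menus and opening hours, and the loop relies on every title having exactly two following siblings. A minimal sketch of a more defensive variant follows. It runs on a small inline sample instead of the live page, and the sample markup (a `<strong>` title, a `<br/>`, then the description as bare text) is an assumption about how the page is structured, as is the helper name `title_description_pairs`.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, assumed to resemble the page: a <strong> title,
# a <br/>, then the description as a bare text node.
SAMPLE_HTML = """
<div class="copy_left">
  <p>When you are in ...</p>
  <p><strong>Capitol City Grille</strong><br/>This downtown Lansing restaurant offers ...</p>
  <p><strong>Room Service</strong><br/>If you prefer ...</p>
  <p><strong>Menus</strong></p>
</div>
"""


def title_description_pairs(soup):
    """Yield (title, description) for each <strong> followed by non-empty text."""
    for strong in soup.select("div.copy_left p strong"):
        for sibling in strong.next_siblings:
            # Skip tags such as <br/> and whitespace-only text nodes.
            if isinstance(sibling, NavigableString) and sibling.strip():
                yield strong.get_text(strip=True), sibling.strip()
                break


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
pairs = list(title_description_pairs(soup))
for title, desc in pairs:
    print("Title:  " + title)
    print(desc)
```

Iterating over `next_siblings` instead of chaining `next_sibling.next_sibling` means a title with no trailing text (like "Menus" in the sample) is simply skipped rather than producing `None` or an error.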

Answer 2

If you don't mind using XPath, this should work fine:

import requests
from lxml import html

url = "http://www.radisson.com/lansing-hotel-mi-48933/lansing/hotel/dining"
page = requests.get(url).text
tree = html.fromstring(page)

xp_t = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/text()"
xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()[not(following-sibling::strong)]"

titles = tree.xpath(xp_t)
descriptions = tree.xpath(xp_d)  # still contains garbage like '\r\n'
descriptions = [d.strip() for d in descriptions if d.strip()]

for t, d in zip(titles, descriptions):
    print("{title}: {description}".format(title=t, description=d))

Here descriptions contains 3 elements: "This downtown...", "For a glass...", "If you prefer...".

If you also need "When you are in ...", replace xp_d with:

xp_d = "//*[@class='copy_left']/descendant-or-self::node()/strong[not(following-sibling::a)]/../text()"
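
Pairing two separate XPath results with `zip` only lines up when every title has exactly one description. As a rough sketch of an alternative, the loop below visits each `<strong>` and reads the text nodes that follow it inside the same parent, so each title stays attached to its own description. The inline sample markup is an assumption about the page structure, not a copy of it.

```python
from lxml import html

# Hypothetical markup, assumed to resemble the page structure.
SAMPLE_HTML = """
<html><body>
<div class="copy_left">
  <p>When you are in ...</p>
  <p><strong>Capitol City Grille</strong><br/>This downtown Lansing restaurant offers ...</p>
  <p><strong>Capitol City Grille Lounge</strong><br/>For a glass of wine or a ...</p>
  <p><strong>Menus</strong><br/><a href="#">Breakfast Menu</a></p>
</div>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

pairs = []
# Same filter as the answer: ignore <strong> tags that are followed by a link.
for strong in tree.xpath("//*[@class='copy_left']//strong[not(following-sibling::a)]"):
    # Text nodes that come after this <strong> within the same parent element.
    texts = strong.xpath("following-sibling::text()")
    description = " ".join(t.strip() for t in texts if t.strip())
    if description:
        pairs.append((strong.text_content().strip(), description))

for title, description in pairs:
    print("{title}: {description}".format(title=title, description=description))
```

Because the description is looked up relative to each title, a title without text is dropped on its own and cannot shift the remaining pairs out of alignment.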

Unable to scrape specific content from a website - BeautifulSoup 4
