Getting data from inner nested tags with BeautifulSoup

Time: 2016-09-22 11:40:38

Tags: python-2.7 beautifulsoup

Image to a nest of tags

I want to get the information inside the inner tags, but it keeps returning blank. Here is my code:

import requests
from bs4 import BeautifulSoup

url = "http://www.krak.dk/cafe/s%C3%B8g.cs?consumer=suggest?search_word=cafe"
r = requests.get(url)

soup = BeautifulSoup(r.content, 'html.parser')

genData = soup.find_all("ol", {"class": "hit-list"})
print(genData)
for infoX in genData:
    print(infoX.text)

What am I missing?

2 answers:

Answer 0 (score: 1)

The HTML is broken, so you need a different parser. You can use lxml:

soup = BeautifulSoup(r.content, 'lxml')

or use html5lib:

soup = BeautifulSoup(r.content, 'html5lib')

lxml has dependencies such as libxml, while html5lib can be installed with pip.
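Which of these parsers is installed varies from machine to machine. As a minimal sketch (the helper name make_soup and the preference order are assumptions for illustration, not part of bs4, and the code is written for Python 3), one way to fall back gracefully is to catch bs4.FeatureNotFound, which BeautifulSoup raises when the requested parser is not available:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical preference order: the more lenient parsers first, the built-in one last.
PARSERS = ("lxml", "html5lib", "html.parser")


def make_soup(markup, parsers=PARSERS):
    """Return (soup, parser_name) using the first parser that is installed."""
    for name in parsers:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used = make_soup("<ol class='hit-list'><li>cafe</li></ol>")
print(used, len(soup.find_all("ol", {"class": "hit-list"})))
```

With the real page, r.content would be passed in place of the inline string. Which parser is picked depends on what is installed, so the first printed value will differ between environments.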

In [9]: url = "http://www.krak.dk/cafe/s%C3%B8g.cs?consumer=suggest?search_word=cafe"

In [10]: r = requests.get(url)
In [11]: soup = BeautifulSoup(r.content, 'html.parser')
In [12]: len(soup.find_all("ol", {"class": "hit-list"}))
Out[12]: 0

In [13]: soup = BeautifulSoup(r.content, 'lxml')
In [14]: len(soup.find_all("ol", {"class": "hit-list"}))
Out[14]: 1

In [15]: soup = BeautifulSoup(r.content, 'html5lib')

In [16]: len(soup.find_all("ol", {"class": "hit-list"}))
Out[16]: 1

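There is only one hit-list, so you can use find instead of find_all; you can also look it up by id with soup.find(id="hit-list"). If you run the HTML through the W3C's HTML validator, you can see that it has a lot of problems.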
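As a small illustration of the find and id lookups, here is a sketch on a made-up snippet. The markup and the cafe names are hypothetical (the real page is not reproduced here); it only assumes, as stated above, that the ol element carries hit-list as both class and id. It is written for Python 3 and uses the built-in html.parser, which is enough because this snippet is well-formed:

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed stand-in for the page's result list.
html = """
<ol class="hit-list" id="hit-list">
  <li><h2> Cafe A </h2></li>
  <li><h2> Cafe B </h2></li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

by_class = soup.find("ol", {"class": "hit-list"})  # first match, or None
by_id = soup.find(id="hit-list")                   # lookup by id attribute
missing = soup.find("ol", {"class": "no-such-class"})

# Walk the nested tags inside each list item and collect their text.
names = [li.h2.get_text(strip=True) for li in by_id.find_all("li")]
print(names)
print(missing)
```

Unlike find_all, which returns a list that may be empty, find returns a single tag or None, so it is worth checking for None before reading .text from the result.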

Answer 1 (score: 0)

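The problem was the UTF-8 character encoding, because the web page includes the special Danish characters Åå, Øø, Ææ. Thanks Padraic, I would not have noticed the broken address.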

Adding # -*- coding: utf-8 -*- as the first line solved the problem.

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

url = "http://www.krak.dk/cafe/søg.cs?consumer=suggest?search_word=cafe"
r = requests.get(url).content 
soup = BeautifulSoup(r, 'html5lib')
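
The two URLs in this thread differ only in how ø is written: literally in the snippet above, and percent-encoded as %C3%B8 in the question. A minimal sketch using only the standard library shows that both spell the same UTF-8 bytes. It assumes Python 3, where source files are read as UTF-8 by default; on Python 2.7 the coding line is what allows the literal ø in the source file.

```python
from urllib.parse import quote, unquote

word = "søg"  # the non-ASCII path segment from the URL

encoded = quote(word)            # percent-encodes the UTF-8 bytes of ø
print(encoded)                   # s%C3%B8g
print(unquote(encoded) == word)  # True
print("ø".encode("utf-8"))       # b'\xc3\xb8'
```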