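I'm having trouble getting the Chinese text that immediately follows a tag like this. I looked at some examples, and this is my code so far. I'm not sure what to do with my div variable; print div gives me a blank.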
from bs4 import BeautifulSoup
import requests
page = 'http://sbj.speiyou.com/search/index/subject:/grade:12/gtype:time'
r = requests.get(page)
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text)
div = soup.findAll('div', {"class" : 'pagination mtop40'})
print div
I've already tried print div, print div.text, print div.string and print div[0].
Answer 0: (score: 2)
There is only one such tag, so use soup.find() rather than soup.findAll():
div = soup.find('div', class_='pagination')
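The difference between the two calls can be sketched on a hard-coded snippet. This is only an illustration: the HTML string is a shortened stand-in for the real page, and the explicit 'html.parser' argument is an assumption made here so the sketch does not depend on which parsers are installed.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page (an assumption, not the actual markup).
html = u'<div class="pagination mtop40">当前第1/17页 <a href="#">2</a></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all() (findAll() is its older alias) returns a list-like ResultSet,
# even when only one element matches.
matches = soup.find_all('div', class_='pagination')

# find() returns the first matching Tag directly, or None if nothing matches.
single = soup.find('div', class_='pagination')

n_matches = len(matches)
# A ResultSet has no .text attribute of its own; only the Tags inside it do.
result_set_has_text = hasattr(matches, 'text')
tag_has_text = hasattr(single, 'text')
# class_='pagination' matches any one of the element's classes.
classes = single['class']
```

So attributes such as .text or .string have to be read from an individual element (matches[0] or the result of find()), not from the list that findAll() returns.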
The div contains both text elements and tags; to get the first piece of text, use the .strings or .stripped_strings iterables. I prefer the stripped variant:
print next(div.stripped_strings, u'')
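As a rough sketch of how the two iterables differ, again on a shortened stand-in for the real markup (the HTML strings and the 'html.parser' argument are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the pagination div (an assumption, not the real page).
html = (u'<div class="pagination mtop40">\n'
        u'  当前第1/17页 【 首页 】 <span>1</span> <a href="#">2</a> </div>')
div = BeautifulSoup(html, 'html.parser').find('div', class_='pagination')

# .strings yields every text node as-is, including whitespace-only ones.
raw = list(div.strings)

# .stripped_strings strips each text node and skips the ones that end up empty.
clean = list(div.stripped_strings)

# next() with a default returns the first string, or u'' if there is none.
first = next(div.stripped_strings, u'')

# On a div with no text at all, the default avoids a StopIteration.
empty_div = BeautifulSoup(u'<div class="pagination"></div>', 'html.parser').find('div')
fallback = next(empty_div.stripped_strings, u'')
```

Here raw[0] still carries the leading newline and indentation, while first is the same text with the surrounding whitespace removed.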
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> page = 'http://sbj.speiyou.com/search/index/subject:/grade:12/gtype:time'
>>> r = requests.get(page)
>>> soup = BeautifulSoup(r.text)
>>> div = soup.find('div', class_='pagination')
>>> div
<div class="pagination mtop40">
当前第1/17页 【 首页 】 <span style="color:red;font-weight:bold;">1</span> <a href="/search/index/grade:12/level:/subject:/gtype:time/service:/time:/term:/period:/o:da/bg:n/curpage:2">2</a> <a href="/search/index/grade:12/level:/subject:/gtype:time/service:/time:/term:/period:/o:da/bg:n/curpage:3">3</a> <a href="/search/index/grade:12/level:/subject:/gtype:time/service:/time:/term:/period:/o:da/bg:n/curpage:4">4</a> <a href="/search/index/grade:12/level:/subject:/gtype:time/service:/time:/term:/period:/o:da/bg:n/curpage:5">5</a> <a href="/search/index/grade:12/level:/subject:/gtype:time/service:/time:/term:/period:/o:da/bg:n/curpage:6">6</a> <a href="/search/index/grade:12/level:/subject:/gtype:time/service:/time:/term:/period:/o:da/bg:n/curpage:7">7</a> <a href="/search/index/grade:12/level:/subject:/gtype:time/service:/time:/term:/period:/o:da/bg:n/curpage:8">8</a><a href="/search/index/grade:12/level:/subject:/gtype:time/service:/time:/term:/period:/o:da/bg:n/curpage:17">尾页</a> </div>
>>> print next(div.stripped_strings, u'')
当前第1/17页 【 首页 】
Note that there is no need to set r.encoding for this page; the server already gives you the encoding in the Content-Type header.
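For reference, requests exposes the helper it uses to derive an encoding from response headers. The sketch below calls it on hypothetical header dictionaries; the values are made up for illustration and were not captured from the server in question.

```python
from requests.utils import get_encoding_from_headers

# Hypothetical header values (assumptions for illustration only).
with_charset = {'content-type': 'text/html; charset=utf-8'}
without_charset = {'content-type': 'text/html'}

# When the header declares a charset, that value is used.
declared = get_encoding_from_headers(with_charset)

# For text/* content without a charset, requests falls back to ISO-8859-1.
fallback = get_encoding_from_headers(without_charset)
```

The second case is where overriding r.encoding by hand, as the question does, would still be useful.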