Can't use Unicode data with the get() function in BeautifulSoup?

Asked: 2017-01-19 11:26:24

Tags: python beautifulsoup

I ran into a problem with Beautiful Soup while parsing a web page with the find_all() and get() functions, as shown below:

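# Python 3 imports (assumed; the original snippet did not show them)
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request("http://www.some_site.com/page/2", headers={'User-Agent': 'Mozilla/5.0'})
page = urlopen(req).read()  # raw bytes of the response body
soup = BeautifulSoup(page, 'html.parser')

for token in soup.find_all('a'):
    print(token.get('title'))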

I get this error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 30: character maps to <undefined>

The problem comes from this line:

print(token.get('title'))

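I don't know what to do. I could change the code of the get function in the Beautiful Soup source, or I could strip all the Unicode data from the web page, but is there a solution that uses only Beautiful Soup?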

Thanks

2 answers:

Answer 0: (score: 0)

import requests, bs4
req = requests.get("http://www.jazzadvice.com/page/2", headers={'User-Agent': 'Mozilla/5.0'})
page = req.text
soup = bs4.BeautifulSoup(page, 'html.parser')
for token in soup.find_all('a', title=True):
    print(token.get('title'))

Output:

Login
Permanent Link to How to Play the Blues Like a Pro: A Lesson with Wynton Kelly
Permanent Link to The Talent Myth: Why Exceptional Musical Ability Is Within Your Reach
Permanent Link to The Beginner’s Guide to Jazz Articulation: Coltrane Techniques Demystified
Permanent Link to 4 Steps to Mastering the Solo Break: A Lesson With Clifford Brown
Permanent Link to How to Completely Change How You Think About Practicing: Words of Wisdom from Harold Mabern
Permanent Link to Killer Triadic & Pentatonic Concepts Made Easy: A Lesson With Kenny Garrett
Permanent Link to How Thinking Like a Writer Will Make You a Better Jazz Improvisor
Permanent Link to How to Learn Chord Changes Straight Off a Recording: A Handbook [Free Download]
Permanent Link to Why This Two-Step Approach to Jazz Language Will Take Your Improvising from Good to Great
Permanent Link to 2 Simple and Effective Practice Plans for Jazz Improvisation [Free Download]
October 10, 2011
April 7, 2011
April 20, 2011
January 26, 2014
November 24, 2010
April 13, 2014
June 22, 2011
November 20, 2011
March 20, 2012
April 14, 2010
June 15, 2010
April 10, 2013
April 29, 2011
December 8, 2010
June 3, 2010
December 17, 2010
April 18, 2010
December 22, 2010
February 18, 2011
June 9, 2011
May 27, 2010
July 16, 2010
January 7, 2011
November 9, 2010
July 7, 2010
May 17, 2010
May 18, 2010
February 3, 2011
January 19, 2011
November 12, 2010

Answer 1: (score: 0)

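The problem is not in the get method but in print. BeautifulSoup correctly handles everything as Unicode, but to print the result, the Unicode string has to be converted to bytes (you may have to do this yourself).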

Depending on your system, the common encodings are Latin1, cp1252 or utf8. Assuming utf8, you should use:

print(token.get('title').encode('utf8'))
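On Python 3, calling print() on a bytes object shows its repr (b'...') rather than the text, so one possible variant is to write the encoded bytes to a binary stream instead. The sketch below is only an illustration: write_encoded is a hypothetical helper, and an io.BytesIO buffer stands in for sys.stdout.buffer so the result can be inspected.

```python
import io


def write_encoded(text, stream, encoding="utf-8", errors="strict"):
    """Encode text explicitly and write the raw bytes to a binary stream.

    Hypothetical helper for illustration; `stream` would typically be
    sys.stdout.buffer on Python 3.
    """
    stream.write(text.encode(encoding, errors) + b"\n")


# A title containing U+2019, the character from the error message.
title = "The Beginner\u2019s Guide"

buf = io.BytesIO()  # stands in for sys.stdout.buffer
write_encoded(title, buf)
```

In a real script you would pass sys.stdout.buffer as the stream. On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') is another option, assuming the terminal can actually display UTF-8.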

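Unfortunately, single-byte character sets are harder, because the Unicode character U+2019 (right single quotation mark: ’) does not exist in latin1. (cp1252 does map it, to byte 0x92, but the error above shows that the output encoding in use cannot represent it.)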

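You have to choose one of the codec error handlers:

  • strict (the default): raises a UnicodeEncodeError
  • replace: replaces the offending character with '?'
  • ignore: simply skips the offending character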
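As a quick illustration of the three error handlers, here is a minimal sketch that encodes a made-up sample title (not taken from the page) to latin1, which has no mapping for U+2019:

```python
# Sample string containing U+2019 (right single quotation mark).
title = "Beginner\u2019s Guide"

# strict (the default): the encoder raises UnicodeEncodeError.
try:
    strict_result = title.encode("latin1")
except UnicodeEncodeError as exc:
    strict_result = exc

# replace: the offending character becomes '?'.
replaced = title.encode("latin1", "replace")

# ignore: the offending character is dropped.
ignored = title.encode("latin1", "ignore")
```

Both replace and ignore lose information, which is why translating the characters first can give nicer output.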

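My advice is to first translate every character you can. For example, the right quote u'\u2019' can be replaced with the plain apostrophe '\x27'. The problem is that I know of no official list of such simplifications. The best I can suggest is to start with whatever is enough to handle that page, and to grow the translation table as new characters show up, with something like: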

try:
    txt = token.get('title').encode('latin1')
except UnicodeEncodeError as e:
    # store the error message that will be used to increase the translation map
    # e.object[e.start:e.end] contains the offending character(s)
    txt = token.get('title').encode('latin1', 'ignore')  # or 'replace'
print(txt)
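The translation-table idea could be sketched with str.translate as follows. The table contents, the to_latin1 helper and the unmapped set are all hypothetical names chosen for illustration; the mapping is a hand-picked starting point, not an official list.

```python
# Hypothetical, hand-maintained table: extend it as new characters show up.
TRANSLATION = {
    0x2018: "'",  # left single quotation mark -> apostrophe
    0x2019: "'",  # right single quotation mark -> apostrophe
    0x201C: '"',  # left double quotation mark -> plain double quote
    0x201D: '"',  # right double quotation mark -> plain double quote
}

# Characters seen so far that the table does not cover yet.
unmapped = set()


def to_latin1(text):
    """Simplify known characters, then encode to latin1.

    Anything still not encodable is recorded in `unmapped` and
    replaced with '?' in the output.
    """
    simplified = text.translate(TRANSLATION)
    try:
        return simplified.encode("latin1")
    except UnicodeEncodeError:
        # latin1 covers code points 0x00-0xFF only.
        unmapped.update(ch for ch in simplified if ord(ch) > 0xFF)
        return simplified.encode("latin1", "replace")


first = to_latin1("Beginner\u2019s Guide")  # quote is covered by the table
second = to_latin1("caf\u00e9 \u2603")      # snowman is not covered
```

After a run over the page, the contents of unmapped show which characters to add to the table next.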