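I have tried everything I can think of to encode the page before passing it to BeautifulSoup. However, when I run the script, the output still contains stray Unicode characters. Can anyone help me handle the encoding correctly with BeautifulSoup?
My code: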
import httplib
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
import HTMLParser
headers={
'Host': 'digitalvita.pitt.edu',
'Connection': 'keep-alive',
'Origin': 'https://digitalvita.pitt.edu',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
'Content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Accept': 'text/javascript, text/html, application/xml, text/xml, */*',
'Referer': 'https://digitalvita.pitt.edu/index.php',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'en-US,en;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Cookie': 'PHPSESSID=lvetilatpgs9okgrntk1nvn595'
}
data={
'action':'search',
'xdata':'<search id="1"><context type="all" /><results><ordering>familyName</ordering><pagesize>100000</pagesize><page>1</page></results><terms><name>d</name><school>All</school></terms></search>',
'request':'search'
}
data = urllib.urlencode(data)
print data
req = urllib2.Request('https://digitalvita.pitt.edu/dispatcher.php', data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()
def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
    display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s
s=htmlEncode(the_page,codes=htmlCodes)
h = HTMLParser.HTMLParser()
s=h.unescape(s)
s.encode("utf-8")
soup=BeautifulSoup(s,convertEntities=BeautifulSoup.HTML_ENTITIES)
print soup
A sample of the result:
 <a href="#local" onclick="dvSearch.ToggleInterests(141432);"><span class="iToggle" id="toggle_141432">more...</span></a></span></div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Znati, Taieb</span><span class="email"> (<a href="mailto:znati@pitt.edu">znati@pitt.edu</a>) </span></div><div class="professionalPosition">Computer Science, University of Pittsburgh</div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zoffer, H</span><span class="email"> (<a href="mailto:zoffer@pitt.edu">zoffer@pitt.edu</a>) </span></div><div class="professionalPosition">"KGSB-Dean, Office of", University of Pittsburgh</div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zorn, Kristin</span><span class="email"> (<a href="mailto:kzorn@mail.magee.edu">kzorn@mail.magee.edu</a>) </span></div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zou, Chunbin</span><span class="email"> (<a href="mailto:chz4@pitt.edu">chz4@pitt.edu</a>) </span></div><div class="researchInterest"><b>Research Interests: </b>fatty liver disease; tyrosine kinase receptor; proteasome endopeptidase complex; phosphatidylcholines; trypanosome; Fas; ubiquitin; pulmonary surfactants; HGF/Met</div></td></tr></table></div><div><table 
width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zou, Xiuying</span><span class="email"> (<a href="mailto:xiz42@pitt.edu">xiz42@pitt.edu</a>) </span></div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zrust, Marilyn</span><span class="email"> (<a href="mailto:zrustm@pitt.edu">zrustm@pitt.edu</a>) </span></div><div class="professionalPosition">Clinical Instructor, Acute/Tertiary Care, University of Pittsburgh School of Nursing</div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zubieta, Juan</span><span class="email"> (<a href="mailto:zubietajc@upmc.edu">zubietajc@upmc.edu</a>) </span></div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zuccoli, Giulio</span><span class="email"> (<a href="mailto:giz3@pitt.edu">giz3@pitt.edu</a>) </span></div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zuckerman, Daniel</span><span class="email"> (<a href="mailto:ddmmzz@pitt.edu">ddmmzz@pitt.edu</a>) </span></div><div class="professionalPosition">Computational Biology, University of Pittsburgh</div><div 
class="researchInterest"><b>Research Interests: </b>structural biology; stochastic processes; computer simulation; coarse-grained models; protein dynamics and fluctuations; models, theoretical</div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zuckoff, Allan</span><span class="email"> (<a href="mailto:zuckoffa@pitt.edu">zuckoffa@pitt.edu</a>) </span></div><div class="professionalPosition">Psychology, University of Pittsburgh</div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zuckoff, Allan</span><span class="email"> (<a href="mailto:ZuckoffAM@UPMC.EDU">ZuckoffAM@UPMC.EDU</a>) </span></div><div class="professionalPosition">Psychiatry, University of Pittsburgh</div><div class="researchInterest"><b>Research Interests: </b>psychotherapy; substance-related disorders; motivational interviewing; grief treatment ; diagnosis, dual (psychiatry); treatment adherence; patient compliance; traumatic grief and substance abuse</div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zukor, Tevya</span><span class="email"> (<a href="mailto:tez5@pitt.edu">tez5@pitt.edu</a>) </span></div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zuley, Margarita</span><span class="email"> (<a 
href="mailto:zuleyml@upmc.edu">zuleyml@upmc.edu</a>) </span></div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zunino, Paolo</span><span class="email"> (<a href="mailto:paz13@pitt.edu">paz13@pitt.edu</a>) </span></div><div class="professionalPosition">Mech Eng and Materials Sci, University of Pittsburgh</div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zureikat, Amer</span><span class="email"> (<a href="mailto:zureikatah@upmc.edu">zureikatah@upmc.edu</a>) </span></div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zutter, Chad</span><span class="email"> (<a href="mailto:czutter@pitt.edu">czutter@pitt.edu</a>) </span></div><div class="professionalPosition">KGSB-Business Admin, University of Pittsburgh</div></td></tr></table></div><div><table width="100%" cellspacing="5" cellpadding="0"><tr valign="top"><td><img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /></td><td width="99%"><div><span class="name"> Zyczynski, Halina</span><span class="email"> (<a href="mailto:hzyczynski@mail.magee.edu">hzyczynski@mail.magee.edu</a>) </span></div><div class="professionalPosition">Obstetrics, Gynecology and Reproductive Sciences, University of Pittsburgh</div><div class="researchInterest"><b>Research Interests: </b>pelvic floor reconstruction; rectocele; uterine prolapse; sacralcolpopexy; bladder diseases; colpocleisis; pelic organ 
prolapse; urinary incontinence</div></td></tr></table></div>
]]&GT;
Answer 0 (score: 1)
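It looks like the problem is that you are mixing up character sets.
The first thing I would do is change your Accept-Charset header so that utf-8 is the preferred charset. As written, ISO-8859-1 has no q-value, so it defaults to the highest preference: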
'Accept-Charset': 'utf-8;q=0.7,*;q=0.3',
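To see why the original header favoured ISO-8859-1, here is a minimal sketch that ranks the entries of an Accept-Charset value by q-value. It is written for Python 3 and uses no network access. The helper name `parse_accept_charset` is made up for this illustration, and a real server may apply its own negotiation rules.

```python
def parse_accept_charset(header):
    """Return (charset, q) pairs sorted from most to least preferred."""
    prefs = []
    for item in header.split(','):
        parts = [p.strip() for p in item.split(';')]
        name = parts[0].lower()
        q = 1.0  # an entry without a q parameter defaults to q=1
        for param in parts[1:]:
            key, _, value = param.partition('=')
            if key.strip().lower() == 'q':
                q = float(value)
        prefs.append((name, q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


original = parse_accept_charset('ISO-8859-1,utf-8;q=0.7,*;q=0.3')
changed = parse_accept_charset('utf-8;q=0.7,*;q=0.3')
print(original[0])  # the charset the original header asks for first
print(changed[0])   # the charset the changed header asks for first
```

With the original header the top entry is ISO-8859-1; with the changed header it is utf-8.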
Next, the result of response.read() is an 8-bit string, which you have to decode. Since we now know it is utf-8, you can do this:
the_page = response.read().decode('utf-8')
With those two changes, when I run your script, the same snippet comes back as:
… Self Care</span>
<a href="#local" onclick="dvSearch.ToggleInterests(…
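No more garbage Unicode characters.
Of course, this only works because the server is willing to return utf-8. In the more general case, where some servers can only produce utf-8 and others only Latin-1, you need to do something more involved. Leave the Accept-Charset header alone, and change the read to look at the response headers. Something like this: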
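As an illustration of where such garbage characters come from, the sketch below (Python 3, no network access) decodes the same bytes with the wrong and the right charset. The sample string is hypothetical: it assumes the stray character was a non-breaking space, which the real page may or may not have contained.

```python
# Hypothetical sample text that starts with a non-breaking space (U+00A0).
text = u'\u00a0Znati, Taieb'

# What a server sending utf-8 would put on the wire.
raw = text.encode('utf-8')

# Decoding with the wrong charset turns the two-byte sequence into two characters.
wrong = raw.decode('latin-1')

# Decoding with the charset the bytes were written in recovers the original text.
right = raw.decode('utf-8')

print(repr(raw))
print(repr(wrong))
print(repr(right))
```

The wrongly decoded string is one character longer than the original, because the two utf-8 bytes of the non-breaking space each become a separate Latin-1 character.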
response = urllib2.urlopen(req)
charset = response.info().getparam('charset')  # charset from Content-Type; None if the server omits it
the_page = response.read().decode(charset)
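There are many misconfigured servers that do not actually return a charset, even when they are not returning plain 7-bit ASCII. In that case, you need to inspect what the server returns and hard-code the right answer, or write code that tries to detect the correct charset dynamically. Hopefully you never run into that...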
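One possible way to handle a missing charset is sketched below, written for Python 3 and runnable without a network connection. The helper names `declared_charset` and `decode_body`, and the fallback order of utf-8 then Latin-1, are assumptions made for this example rather than part of any library. It takes the Content-Type header value and the raw body as plain arguments, so it can be fed from whatever HTTP client you use.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset()


def decode_body(body, content_type, fallbacks=('utf-8', 'latin-1')):
    """Decode bytes using the declared charset, then the fallbacks in order.

    Returns a (text, charset_used) tuple.
    """
    candidates = []
    charset = declared_charset(content_type)
    if charset:
        candidates.append(charset)
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            return body.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name or bytes that do not fit: try the next one.
            continue
    raise ValueError('no candidate charset could decode the body')


# Server declares its charset: the declared value is used.
print(decode_body(b'caf\xc3\xa9', 'text/html; charset=UTF-8'))

# Server omits the charset and sends Latin-1 bytes: utf-8 fails, Latin-1 succeeds.
print(decode_body(b'caf\xe9', 'text/html'))
```

Latin-1 is placed last because it accepts every byte sequence, so it never raises but can silently produce the wrong text. Treat it as a last resort, not as detection.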