Question

When I scrape certain elements of a web page, I get some strange characters. The characters that seem to cause the error are:

？ ?? ??Á¢¢Á？ /？ /＆GT ;? / ??? ？/¢¥?? ?? %%？Á？ ????Á？？＆GT; / ???¥??＆GT; ¥？ ¥©Á？＆gt;¢¥/ %% /¥??＆gt; Â＆gt;Á？ Â？Á？???¢％Á？¥??? /％Á％Á？¥??＆gt; ?? /＆GT ;? Â??了？ ??¥?? ??¢¥????¥??＆GT; ¢`¢¢?? %%？Á??À？/？Á？ ¥？ _Á¥？＆gt; ??Á/¢？＆gt;ÀÁ????????????????????? /＆GT ;? ?? __？＆gt; ?? /¥??＆gt;¢？Á

My code is as follows:

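import time

import requests
from bs4 import BeautifulSoup

# This fragment runs inside the scraping loop; flagcnt is initialized before it.
url = "http://www.nsf.gov#######@#@#@##"
# webbrowser.open(url, new=new)
flagcnt += 1
if flagcnt % 20 == 0:  # sleep every 20 requests to avoid being shut out
    print("flagcount: %d" % flagcnt)
    time.sleep(5)
# Program code extraction
r = requests.get(url)
sp = BeautifulSoup(r.content)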

Page: http://www.nsf.gov/awardsearch

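I have read every page I could find about this error. Some of them suggest decoding and encoding, but none of that seemed to help. I don't know which encoding is being used here. I have already downgraded the BS version, and that did not help either. Any help is appreciated. Python 2.7, BS 4.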

Answer 1

This worked for me:

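# Encode the decoded response text to UTF-8 bytes, then decode as ASCII,
# silently dropping every byte that is not plain ASCII.
page_text = r.text.encode('utf-8').decode('ascii', 'ignore')
# BeautifulSoup 3 import style; with bs4 the call is BeautifulSoup(page_text).
page_soupy = BeautifulSoup.BeautifulSoup(page_text)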
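
To make it clearer what that encode/decode round trip does, here is a minimal sketch that needs no network access. The helper name `strip_non_ascii` and the sample string are made up for illustration and are not part of the original code; the only real calls are the standard `encode` and `decode` string methods with the `'ignore'` error handler.

```python
# -*- coding: utf-8 -*-

def strip_non_ascii(text):
    """Return `text` with every non-ASCII character dropped.

    Hypothetical helper: it wraps the same round trip used above.
    Non-ASCII characters become multi-byte UTF-8 sequences, and the
    'ignore' handler discards those bytes when decoding as ASCII.
    """
    return text.encode("utf-8").decode("ascii", "ignore")


# Made-up sample containing a yen sign (u'\xa5') and an accented e (u'\xe9').
sample = u"Price: \xa5100 caf\xe9"
cleaned = strip_non_ascii(sample)
print(cleaned)  # prints: Price: 100 caf
```

Note that this approach throws the offending characters away rather than decoding them correctly, so it is only a reasonable fit when the non-ASCII content does not matter for what you are extracting.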
