Why is the output the wrong Unicode?

Time: 2016-11-24 20:40:19

Tags: python unicode beautifulsoup

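I am trying to append the href of each link in a to results, and I expected plain http:// links to be printed. I would like to be able to print a slice such as results[:4]. I appreciate any help! Thanks!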

Here is the code:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

results = []

def extract(soup):
    section = soup.find('section', {'class' : 'content left'})
    for post in section.findAll('article'):
        header = post.find('header', {'class' : 'loop-data'})
        a = header.findAll('a', href=True)
        for x in a:
            results.append(x.get('href'))
    print results

br = Browser()
url = "http://www.hotglobalnews.com/category/politics/"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)

Here is my result:

[u'http://www.hotglobalnews.com/canada-just-legalized-heroin-to-control-drug-addiction/', u'http://www.hotglobalnews.com/justin-trudeau-announces-deal-with-uber-uberweed/', u'http://www.hotglobalnews.com/donald-trump-to-legalize-marijuana-in-all-50-states/', u'http://www.hotglobalnews.com/obama-to-create-law-banning-words/', u'http://www.hotglobalnews.com/trudeau-says-trump-is-a-racist-bastard/', u'http://www.hotglobalnews.com/donald-trump-to-build-replica-of-guantanamo-bay-for-mexicans/', u'http://www.hotglobalnews.com/donald-trump-to-legalize-incest-marriages-if-elected/', u'http://www.hotglobalnews.com/justin-trudeau-to-build-statue-of-trudeau-in-2017/', u'http://www.hotglobalnews.com/donald-trump-muslims-invented-global-warming-to-destroy-u-s-economy/', u'http://www.hotglobalnews.com/isis-member-found-disguised-as-syrian-refugee-in-canada/', u'http://www.hotglobalnews.com/donald-trump-says-he-is-more-influential-than-martin-luther-king-jr/', u'http://www.hotglobalnews.com/obama-wears-fuck-trump-tshirt-to-white-house-barbecue/', u'http://www.hotglobalnews.com/donald-trump-says-he-could-shoot-somebody/', u'http://www.hotglobalnews.com/donald-trump-says-black-history-month-is-too-long/', u'http://www.hotglobalnews.com/justin-trudeau-to-ban-uber-in-canada/', u'http://www.hotglobalnews.com/justin-trudeau-accepts-comedy-central-new-years-roast/', u'http://www.hotglobalnews.com/donald-trumps-muslim-comment-disqualifies-him-from-presidency/', u'http://www.hotglobalnews.com/paris-terrorist-spotted-live-on-news-after-terror-attacks-on-paris/', u'http://www.hotglobalnews.com/anonymus-hacker-collective-declares-war-on-islamic-sate-group/', u'http://www.hotglobalnews.com/paris-attacks-over-100-killed-in-gunfire-and-blasts2/']

1 answer:

Answer 0: (score: 0)

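There is nothing wrong with the list you get. The u sigil tells you that the strings are Unicode, but that is not "wrong" in any way. You see it only because printing a list displays the repr of each element. Printing the strings themselves will produce the desired result (assuming your operating system is correctly configured to display the characters; for strings that look like essentially pure ASCII, this should not be a problem).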
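To illustrate the difference, here is a minimal sketch that uses three of the links from the question. The helper name format_links is made up for this example and is not part of BeautifulSoup or mechanize. The code is written so that it should run on both Python 2 and Python 3.

```python
# A few of the links from the question, stored as Unicode strings.
results = [
    u'http://www.hotglobalnews.com/justin-trudeau-announces-deal-with-uber-uberweed/',
    u'http://www.hotglobalnews.com/donald-trump-to-legalize-marijuana-in-all-50-states/',
    u'http://www.hotglobalnews.com/obama-to-create-law-banning-words/',
]


def format_links(links):
    """Hypothetical helper: join the links one per line, as plain text."""
    return u"\n".join(links)


# Printing the list shows the repr of each item:
# u'...' on Python 2, '...' on Python 3.
print(results[:2])

# Printing the strings themselves shows only their text,
# with no quotes and no u prefix.
print(format_links(results[:2]))
```

Slicing works the same way in both cases; results[:4] is still a list, so print it element by element (or join it first) if you want bare links.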

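Python 3 changes some of this, but generally for the better. You still need to understand the difference between byte strings and Unicode strings (at least if you also need to work with byte strings), but all strings are Unicode by default, which makes sense in this day and age.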
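As a rough sketch of that distinction in Python 3 (this snippet assumes Python 3 and reuses the URL from the question only as sample text):

```python
# In Python 3 a plain string literal is already Unicode text (type str).
url = "http://www.hotglobalnews.com/category/politics/"

# Encoding turns text into a byte string; decoding turns it back into text.
encoded = url.encode("utf-8")      # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(type(url).__name__)      # the text type
print(type(encoded).__name__)  # the byte string type
print(repr(encoded)[:2])       # byte strings carry a b prefix in their repr
print(decoded == url)          # the round trip preserves the text
```

The b prefix on a byte string plays the same role as the u prefix in Python 2: it is part of the repr, not part of the data.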

https://nedbatchelder.com/text/unipain.html is still a good starting point, especially if you have not yet made the transition to Python 3.