Question

Hey all, I am using BeautifulSoup (after two days of struggling and failing with Scrapy) to scrape StarCraft 2 league data, but I have run into a problem.

I have this table of results, and I want the string content of all the tags in it, which I get like this:

from BeautifulSoup import *
from urllib import urlopen

def parseWithSoup(url):
    print "Reading:" , url
    html = urlopen(url).read().lower()
    bs = BeautifulSoup(html)
    # The page was lowercased above, so the id is matched in lowercase.
    table = bs.find('table', {'id': 'tblt_table'})
    rows = table.findAll('tr')

    rows.pop(0) #first row is header
    for row in rows:
        # Collect the text of every <a> tag in this row.
        content = [tag.string for tag in row.findAll('a')]
        print content

if __name__ == '__main__':
    content = "http://www.teamliquid.net/tlpd/sc2-international/games#tblt-5018-1-1-DESC"
    metSoup = parseWithSoup(content)

However, the output looks like this:

[u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'metalopolis 1.1', u'naniwa', u'socke']
[u'+', u'gadget show live i..', u'shakuras plateau 2.0', u'socke', u'select']
etc...

My question is: where does the u come from (is it from Unicode?), and how can I remove it? I only need the strings that follow the u...

Answer 1

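The u means the values are Unicode strings. It changes nothing for you as a programmer, and you should simply ignore it. Treat them like normal strings. You actually want the u there.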

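Note that all Beautiful Soup output is Unicode. This is a good thing, because if you run into any non-ASCII characters while scraping, you won't have any problems. If you really want to get rid of the u (I don't recommend it), you can use the Unicode string's encode() method, which returns a plain byte string.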
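To illustrate the point, here is a minimal sketch using one row copied from the question's output. It assumes nothing beyond the standard library; the variable names `row`, `line`, and `player_bytes` are made up for the example. The u only shows up because printing a list displays the repr() of each element. On Python 3 the repr has no prefix at all.

```python
# One row as printed in the question's output.
row = [u'+', u'gadget show live i..', u'crevasse', u'naniwa', u'socke']

# Printing a list shows the repr() of each element. On Python 2 the repr of
# a Unicode string carries the u prefix; the prefix is not part of the data.
print(row)

# Printing or joining the strings themselves yields only their text.
line = u', '.join(row)
print(line)

# Not recommended, per the answer: encode() turns the text into a byte string.
player_bytes = row[3].encode('utf-8')
print(player_bytes)
```

In other words, there is nothing to strip: the strings already contain only the text, and joining or printing them one by one shows that.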

Answer 2

What you are seeing are Python Unicode strings.

See the Python documentation

http://docs.python.org/howto/unicode.html

to learn how to handle Unicode strings correctly.
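As a rough sketch of the usual pattern (decode bytes on input, work with Unicode text, encode on output), assuming the input is UTF-8. The byte string below is a hypothetical example, not data from the scraped page.

```python
# Hypothetical raw bytes as they might arrive from a web page
# (the UTF-8 encoding of a name with two accented letters).
raw = b'J\xc3\xa9r\xc3\xb4me'

# Decode once at the boundary to get Unicode text.
text = raw.decode('utf-8')

# Bytes and text have different lengths: the accented letters take
# two bytes each in UTF-8 but count as one character each in the text.
print(len(raw))
print(len(text))

# Work with the text, then encode only when writing bytes back out.
out = text.upper().encode('utf-8')
```

Keeping everything as Unicode in between is why the strings Beautiful Soup returns can be used as they are.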
