Question

I am using urllib2, BeautifulSoup, and the topia.termextract module in Python 2.7 to extract terms from paragraphs read from a website.

>>> extractor("he is Programmer, Visionary Entrepreneur and Investor ")
[('Entrepreneur', 1, 1), ('Programmer', 1, 1), ('Visionary', 1, 1), ('Investor', 1, 1), ('Visionary Entrepreneur', 1, 2)]

This works when a paragraph is passed in directly.

But in the loop below:

>>> def getTerms(website):
        page = urllib2.urlopen(website)
        text = page.read()
        soup = BeautifulSoup(text)

        for para in soup.findAll('p'):
            print extractor(para.text)

passing a web page URL to the function above prints:

[(u'Entrepreneur', 1, 1), (u'Programmer', 1, 1), (u'Visionary', 1, 1), (u'Investor', 1, 1), (u'Visionary Entrepreneur', 1, 2)] .....

Why is a u printed in front of each string in the tuples? How can I get the plain tuple form?

Note: printing just para.text inside the loop above prints the paragraph as plain text.

Answer 1

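These are Unicode strings (hence the u'' notation). The 'u' is not part of the string; it only indicates the string's type. It shows up here because printing a list displays the repr of each element.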

>>> s='abc'
>>> type(s)
<type 'str'>
>>> s=u'abc'
>>> type(s)
<type 'unicode'>
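
A minimal sketch of the same point, using the list from the question: the u prefix appears only when Python prints the repr of a container, so formatting each tuple yourself gives plain text. `format_terms` is a hypothetical helper, and reading the tuple fields as (term, occurrences, strength) is an assumption about what the extractor returns.

```python
# Sample data copied from the output shown in the question.
sample_terms = [
    (u'Entrepreneur', 1, 1),
    (u'Programmer', 1, 1),
    (u'Visionary', 1, 1),
    (u'Investor', 1, 1),
    (u'Visionary Entrepreneur', 1, 2),
]


def format_terms(terms):
    """Return one plain-text line per tuple (hypothetical helper).

    Assumes each tuple is (term, occurrences, strength).
    """
    return [u"%s %d %d" % (term, count, strength)
            for term, count, strength in terms]


# print(x) with a single argument behaves the same in Python 2.7 and 3.
for line in format_terms(sample_terms):
    print(line)
```

Printing each formatted line, rather than the whole list, is what removes the prefix; the strings themselves are unchanged.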

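If you are dealing with third-party websites, you will need to handle Unicode (because sooner or later you will come across a site that is not in US English).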
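
One common way to handle this, sketched under assumptions: keep the terms as Unicode strings inside the program and pick an encoding only when writing them out. `save_terms` and `load_lines` are hypothetical helper names, and UTF-8 and the tab-separated layout are choices made for this example, not something the question requires.

```python
import io
import os
import tempfile


def save_terms(terms, path):
    """Write one tab-separated line per (term, count, strength) tuple as UTF-8."""
    # io.open takes an encoding argument in both Python 2.7 and Python 3.
    with io.open(path, "w", encoding="utf-8") as handle:
        for term, count, strength in terms:
            handle.write(u"%s\t%d\t%d\n" % (term, count, strength))


def load_lines(path):
    """Read the file back as a list of Unicode lines."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read().splitlines()


# Usage: a non-ASCII term survives the round trip through the file.
demo_path = os.path.join(tempfile.mkdtemp(), "terms.tsv")
save_terms([(u"caf\u00e9", 1, 1)], demo_path)
round_trip = load_lines(demo_path)
```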

Please read this part of the Python documentation carefully: https://docs.python.org/2/howto/unicode.html

Or better yet, switch to Python 3, where Unicode is the default string type.
