Question

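I'm running Scrapy on Vista 64-bit with the Python.org build of Python 2.7 (64-bit). I'm trying to scrape some text from a web page and have managed to clean up most of it, removing the newlines and the HTML tags. However, 'u' markers still appear in the text output in the Command Shell: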

u' British Grand Prix practice results ', u'

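This comes from the following web page:

http://www.bbc.co.uk/sport/0/formula1/28166984

The string above is a hyperlink to another page. I tried to remove the 'u' markers with the following regular expression, but it didn't work: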

body = response.xpath("//p").extract()
body2 = str(body)
body3 = re.sub(r'(\\[u]|\s){2,}', ' ', body2)

Can anyone recommend a way to remove these markers? Also, if possible, can a regular expression be used to remove everything between two tags?

Thanks

Answer 1

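The u is only information from Python that this text is a Unicode string. It is not part of the text itself.

You have to print the text the right way to get it without this marker.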

a = [ u'hello', u'world' ]

print a

[u'hello', u'world']

for x in a:
    print x

hello
world
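
The difference comes from how a list is turned into text: converting a list with str() (or printing it) uses the repr() of each element, and on Python 2 the repr() of a Unicode string includes the u prefix and the quotes. A minimal sketch of this, written so that it runs on both Python 2 and Python 3 (on Python 3 the u prefix no longer appears, but the quotes and brackets still do); the variable names are only for illustration:

```python
a = [u'hello', u'world']

# str() of a list is assembled from the repr() of each element,
# which is where the quotes (and, on Python 2, the u prefix) come from.
as_list_text = str(a)
rebuilt = "[" + ", ".join(repr(x) for x in a) + "]"
print(as_list_text == rebuilt)

# Printing each element on its own shows only the text.
for x in a:
    print(x)
```

This is why running a regular expression over str(body) fights the symptom: the markers are created by the str() call itself.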

In your case, body is probably a list of strings. You can check with:

print type(body)

If it is, do this:

body2 = ''

for x in body:
    body2 += x

print body2

or even better:

body2 = "".join(body)

print body2
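
Putting this together with the second part of the question, here is a rough sketch that joins the fragments first and only then applies regular expressions. The sample list stands in for what response.xpath("//p").extract() might return, and clean_fragments and its drop_links parameter are hypothetical names, not part of Scrapy. Regular expressions are fragile on real HTML, so treat this as an illustration rather than a robust parser:

```python
import re

# Hypothetical stand-in for response.xpath("//p").extract()
body = [
    u'<p> British Grand Prix practice results </p>',
    u'<p>Report\n<a href="/x">link text</a> here</p>',
]

def clean_fragments(fragments, drop_links=False):
    """Join extracted fragments and strip markup (illustrative helper)."""
    # Join the list instead of calling str() on it, so no repr() markers appear.
    text = u" ".join(fragments)
    if drop_links:
        # Remove each <a ...>...</a> element together with its content.
        text = re.sub(r'<a\b[^>]*>.*?</a>', u' ', text, flags=re.DOTALL)
    # Remove any remaining tags but keep the text between them.
    text = re.sub(r'<[^>]+>', u' ', text)
    # Collapse newlines and runs of whitespace into single spaces.
    text = re.sub(r'\s+', u' ', text)
    return text.strip()

kept = clean_fragments(body)
dropped = clean_fragments(body, drop_links=True)
print(kept)
print(dropped)
```

With drop_links=True the hyperlink text disappears along with its tags; with the default it is kept as plain text.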

Answer 2

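As furas said, the u only indicates that the string is Unicode. By default, Python 2.7.x strings are ASCII byte strings, so when a string is Unicode it is displayed with a u prefix. You can convert it back from unicode with encode('utf-8'):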

>>> a = 's'
>>> a
's'
>>> a = unicode('s')
>>> a
u's'
>>> a = a.encode('utf-8')
>>> a
's'

Here is how to do it with a list:

>>> ul = []
>>> ul.append(unicode('British Grand Prix practice results'))
>>> ul.append(unicode('some other string'))
>>> ul
[u'British Grand Prix practice results', u'some other string']
>>> l = []
>>> for s in ul:
...    l.append(s.encode('utf-8'))
...
>>> l
['British Grand Prix practice results', 'some other string']
>>>
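
The loop above can also be written as a list comprehension. This is a small sketch under the same assumptions as the session above; the names encoded and decoded are only for illustration. Note that on Python 3 encode() returns bytes objects, which are displayed with a b prefix, so this conversion is mainly useful on Python 2:

```python
ul = [u'British Grand Prix practice results', u'some other string']

# Encode every Unicode string to UTF-8 bytes in one pass.
encoded = [s.encode('utf-8') for s in ul]

# Decoding with the same codec gives back the original Unicode strings.
decoded = [b.decode('utf-8') for b in encoded]
print(decoded == ul)
```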

Removing 'u' characters from text with Scrapy
