Question

I know there are similar questions on this topic, and I have gone through them, but I still can't get it to work.

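My Python program uses regular expressions to retrieve sub-sections of HTML from a page. I just realized I had not accounted for HTML special characters getting in the way.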

Say I have:

regex_title = ['I went to the store', 'Itlt&#039;s a nice day today', 'I went home for a rest']

I obviously want to change lt&#039; to a single quote '.

I have tried variations on:

for each in regex_title:
    if 'lt&#039;' in regex_title:
        str.replace("lt&#039;", "'")

but with no success. What am I missing?

Note: the goal is to do this without importing any modules.

Answer 1

str.replace does not replace in place. It returns the replaced string, so you need to assign the return value.

>>> regex_title = ['I went to the store', 'Itlt&#039;s a nice day today',
...                'I went home for a rest']
>>> regex_title = [s.replace("lt&#039;", "'") for s in regex_title]
>>> regex_title
['I went to the store', "It's a nice day today", 'I went home for a rest']
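
A minimal sketch of why the loop in the question has no visible effect, assuming Python 3 (the variable names are illustrative only): strings are immutable, so replace() hands back a new string and leaves the original untouched.

```python
title = 'Itlt&#039;s a nice day today'

# replace() returns a new string; the original string is left unchanged
result = title.replace("lt&#039;", "'")

print(title)   # still contains lt&#039;
print(result)  # It's a nice day today
```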

Answer 2

If your task is to unescape HTML, it is better to use an unescape function:

>>> ll = ['I went to the store', 'Itlt&#039;s a nice day today', 'I went home for a rest']
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> print map(h.unescape, ll)
['I went to the store', u"Itlt's a nice day today", 'I went home for a rest']

Answer 3

You need to change your code to this:

for each in regex_title:
    if 'lt&#039;' in each:
        each.replace("lt&#039;", "'")

But that does not change your list, so you need to assign the replaced string back to the list by index:

>>> for each in regex_title:
...         if 'lt&#039;' in each:
...             regex_title[regex_title.index(each)]=each.replace("lt&#039;", "'")
... 
>>> regex_title
['I went to the store', "It's a nice day today", 'I went home for a rest']
>>>
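
As a possible variation on the same idea, enumerate() yields each position directly, which avoids the repeated list search that index() performs. This is only a sketch; the extra duplicate title is an assumption added here to show that repeated entries are handled too.

```python
regex_title = ['I went to the store', 'Itlt&#039;s a nice day today',
               'I went home for a rest', 'Itlt&#039;s a nice day today']

# enumerate() gives (position, value) pairs, so no index() lookup is needed
for i, title in enumerate(regex_title):
    # replace() is a no-op when the substring is absent, so no "if" is required
    regex_title[i] = title.replace("lt&#039;", "'")

print(regex_title)
```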

Answer 4

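You don't explain why you want to avoid importing a standard library module. There is rarely a good reason to deny yourself Python's included batteries; unless you have such a reason (and if you do, you should state it), you should use the functionality provided to you.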

In this case, that is the unescape() function from the html module:¹

from html import unescape

titles = [
    'I went to the store',
    'It&#039;s a nice day today',
    'I went home for a rest'
]

fixed = [unescape(s) for s in titles]

>>> fixed
['I went to the store', "It's a nice day today", 'I went home for a rest']

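Reimplementing html.unescape() yourself is:

Pointless.
Error-prone.
Going to mean constantly going back and adding new cases whenever a new HTML entity shows up in your data.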
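
To illustrate the last point, here is a small sketch (the sample string is made up for this example) comparing a hand-written fix that knows about one entity with html.unescape(), assuming Python 3.4 or later:

```python
from html import unescape  # available from Python 3.4

# Hypothetical input containing several different entities
sample = 'Tom &amp; Jerry&#039;s &quot;day&quot; &lt;out&gt;'

# A hand-written fix that only handles the one entity it was written for
manual = sample.replace('&#039;', "'")

# html.unescape() decodes all of the entities in a single call
decoded = unescape(sample)

print(manual)
print(decoded)
```

The hand-written version leaves &amp;, &quot;, &lt; and &gt; in place, so each one would need its own extra case.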

¹From Python 3.4 onward, anyway. For earlier versions, use HTMLParser.HTMLParser.unescape() as in @stalk's answer.

Answer 5

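As explained in https://stackoverflow.com/a/2087433/2314532, you are better off using the HTMLParser library than rolling your own. Read that question and answer for all the details, but the summary is: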

import HTMLParser
parser = HTMLParser.HTMLParser()
print parser.unescape('&#039;')
# Will print a single ' character

So in your case, you would want to do the following:

import HTMLParser
parser = HTMLParser.HTMLParser()
new_titles = [parser.unescape(s) for s in regex_title]

This will unescape any HTML escape, not just the &#039; escape you asked about, and it handles the whole list at once.
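
The snippets above are written for Python 2. One possible way to run the same idea on either Python 2 or Python 3.4+ is sketched below; the try/except fallback is an assumption added here, not part of the original answer.

```python
try:
    from html import unescape  # Python 3.4 and later
except ImportError:
    # Assumed fallback for Python 2, where the module is named HTMLParser
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

regex_title = ['I went to the store', 'Itlt&#039;s a nice day today',
               'I went home for a rest']

# Decode every HTML entity in every title
new_titles = [unescape(s) for s in regex_title]
print(new_titles)
```

Note that the stray "lt" in the sample data is not an HTML entity, so unescaping leaves it in place; only the &#039; part is decoded.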

Answer 6

Try it like this:

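regex_title = ['I went to the store', 'Itlt&#039;s a nice day today', 'I went home for a rest']
# Join with a separator, replace once, then split on the same separator.
# This assumes no title contains a comma.
joined = ','.join(regex_title)
fixed = joined.replace("lt&#039;", "'")
print(fixed.split(','))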

Replace part of a string in a list with Python

6 answers: