In HTML files, it is common to find emoticons inserted as image tags. Typically one looks like this:
<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>
If there is only one such emoticon img, it is easy to replace it with its title. For example:
import re

def remove_single_img_tags(data):
    p = re.compile(r'<img.*?/>')
    img = re.findall(p, data)
    # Take the text between 'title=' and the next '/', quotes included
    emotion = img[0].split('title=')[1].split('/')[0]
    return p.sub(emotion, data)

test1 = u'I love you.<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>.I hate bad men.'
remove_single_img_tags(test1)
However, when there are several emoticons in the form of img tags, it is not so easy.
def remove_img_tags(data):
    p = re.compile(r'<img.*?/>')
    img = re.findall(p, data)
    emotions = ()
    for i in img:
        emotion = i.split('title=')[1].split('/')[0]
        emotions[i] = emotion
    return p.sub(emotions, data)

test2 = u'I love you<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>I hate bad men <img alt="" border="0" class="inlineimg" src="images/smilies/mad.png" title="Mad"/>'
remove_img_tags(test2)
The Python script above does not work. It fails with: TypeError: 'tuple' object does not support item assignment
Answer 0 (score: 2)
Your problem is here (a tuple is immutable, so items cannot be assigned to it):
emotions = ()
If you change it to
emotions = []
and then change
emotions[i] = emotion
to
emotions.append(emotion)
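Then change
return p.sub(emotions, data)
to
titles = iter(emotions)
return p.sub(lambda match: next(titles), data)
because the replacement passed to p.sub must be a string or a callable; a list or a tuple such as tuple(emotions) is not accepted. The callable hands back the collected titles one per match, in order.
Then everything will work.
Here is your updated code:
import re

def remove_img_tags(data):
    p = re.compile(r'<img.*?/>')
    img = re.findall(p, data)
    emotions = []
    for i in img:
        # Text between 'title=' and the next '/', quotes included
        emotion = i.split('title=')[1].split('/')[0]
        emotions.append(emotion)
    # Replace each tag with its own title, in document order
    titles = iter(emotions)
    return p.sub(lambda match: next(titles), data)

test2 = u'I love you<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>I hate bad men <img alt="" border="0" class="inlineimg" src="images/smilies/mad.png" title="Mad"/>'
remove_img_tags(test2)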
>>> x = ()
>>> x[0] = 'hello'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> x = []
>>> x.append('hello')
>>> x
['hello']
>>>
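The indexing in the question, emotions[i] = emotion with i being the tag text, reads like an attempt at a dictionary. Below is a minimal sketch of that variant, assuming the same title-extraction logic as the question. The names IMG_TAG and replace_img_tags_with_dict are hypothetical and do not come from the thread.

```python
import re

IMG_TAG = re.compile(r'<img.*?/>')

def replace_img_tags_with_dict(data):
    # Map each distinct tag's text to the title it carries (quotes kept).
    emotions = {}
    for tag in IMG_TAG.findall(data):
        emotions[tag] = tag.split('title=')[1].split('/')[0]
    # The callable looks up the matched tag text in the mapping.
    return IMG_TAG.sub(lambda match: emotions[match.group()], data)

test2 = ('I love you<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>'
         'I hate bad men <img alt="" border="0" class="inlineimg" src="images/smilies/mad.png" title="Mad"/>')
print(replace_img_tags_with_dict(test2))
```

Like the original code, this assumes every img tag has a title attribute; a tag without one would raise IndexError in the split.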
Answer 1 (score: 1)
From help(re.sub):
Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used.
You can supply a callable as the replacement; it receives the match object as its argument and returns the replacement text.
>>> import re
>>> p = re.compile(r'<img.*?/>')
>>> # repeat the test string 5 times as input data
>>> data = '<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>' * 5
>>> p.sub(lambda match: match.group().split('title=')[1].split('/')[0], data)
'"Smile""Smile""Smile""Smile""Smile"'
Edit: here are the other examples:
>>> test1 = u'I love you.<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>.I hate bad men.'
>>> p.sub(lambda match: match.group().split('title=')[1].split('/')[0], test1)
u'I love you."Smile".I hate bad men.'
>>> test2 = u'I love you<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>I hate bad men <img alt="" border="0" class="inlineimg" src="images/smilies/mad.png" title="Mad"/>'
>>> p.sub(lambda match: match.group().split('title=')[1].split('/')[0], test2)
u'I love you"Smile"I hate bad men "Mad"'
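The same callable can also be written as a named function, which leaves room for a comment and is easier to reuse. This is only a sketch; title_of and replace_img_tags are illustrative names that do not appear in the original answer.

```python
import re

IMG_TAG = re.compile(r'<img.*?/>')

def title_of(match):
    # match.group() is the whole <img .../> tag; keep the text after
    # 'title=' up to the next '/', which includes the double quotes.
    return match.group().split('title=')[1].split('/')[0]

def replace_img_tags(data):
    # The callable is invoked once per non-overlapping match.
    return IMG_TAG.sub(title_of, data)

print(replace_img_tags('a<img src="x.png" title="Smile"/>b'))
```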
I would also suggest adding the title match to the regular expression so that you can extract it by group index:
>>> p = re.compile(r'<img.*?title=(".*?")/>')
>>> p.sub(lambda match: match.group(1), test2)
u'I love you"Smile"I hate bad men "Mad"'
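If the surrounding double quotes are not wanted in the output, one option is to move them outside the capturing group. The sketch below does that and also uses [^>] so that a match cannot run past the end of a tag; both choices are assumptions about what is desired, and the names IMG_WITH_TITLE and img_tags_to_titles are hypothetical.

```python
import re

# The quotes sit outside the group, so group(1) is the bare title.
# [^>]*? keeps each match inside a single tag.
IMG_WITH_TITLE = re.compile(r'<img[^>]*?title="(.*?)"[^>]*?/>')

def img_tags_to_titles(data):
    # Tags that have no title attribute do not match and are left as-is.
    return IMG_WITH_TITLE.sub(lambda match: match.group(1), data)

test2 = ('I love you<img alt="" border="0" class="inlineimg" src="images/smilies/smile.png" title="Smile"/>'
         'I hate bad men <img alt="" border="0" class="inlineimg" src="images/smilies/mad.png" title="Mad"/>')
print(img_tags_to_titles(test2))
```

For HTML that is less regular than these forum smilies (attributes in single quotes, a ">" inside an attribute value, tags not closed with "/>"), a real HTML parser is likely a safer choice than a regular expression.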