I want to add a link to every word in a text.
Example text:
"He's <i>certain</i> in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.
Desired result:
"<a href='xxx.com?word=he'>He</a>'s
<i><a href='xxx.com?word=certain'>certain</a></i>
<a href='xxx.com?word=in'>in</a>
<a href='xxx.com?word=america'>America</a>'s
“<a href='xxx.com?word=west'>West</a>,”
<a href='xxx.com?word=it'>it</a>
<a href='xxx.com?word=could'>could</a>'ve
.... etc
(I split the output across several lines to make it easier to read here. The actual output should be a single string, e.g.:
"<a href='xxx.com?word=he'>He</a>'s <i><a href='xxx.com?word=certain'>certain</a></i> <a href='xxx.com?word=in'>in</a> <a href='xxx.com?word=america'>America</a>'s “<a href='xxx.com?word=west'>West</a>,” <a href='xxx.com?word=it'>it</a> <a href='xxx.com?word=could'>could</a>'ve ... etc)
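Every word should get a link, with the word itself stripped of punctuation and contractions. The word in the link is lowercased. Punctuation and contractions should not get links. The words and punctuation are UTF-8 and include many Unicode characters. The only HTML elements the text will contain are <i> and </i>, so this is not HTML parsing, just one tag pair. The link should go around the word inside the <i>...</i> tags.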
My code below works for simple test cases, but has problems with real texts that are longer and contain repeated words and <i> tags:
# -*- coding: utf-8 -*-
import re

def addLinks(s):
    # adds a link to dictionary for every word in text
    link = "xxx.com?word="
    # strip out 's, 'd, 'l, 'm, 've, 're
    # then split on punctuation
    words = filter(None, re.split("[, \-!?:_;\"“”‘’‹›«»]+", re.sub("'[(s|d|l|m|(ve)|(re)]? ", " ", s)))
    for w in words:
        linkedWord = "<a href=#'" + link + w.lower() + "'>" + w + "</a>"
        s = s.replace(w, linkedWord, 1)
    return s

s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.
"""
print addLinks(s)
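The trouble with repeated words can be reproduced in isolation. The following is a minimal Python 3 sketch, assuming a hypothetical helper `link_first` that mirrors the loop body above: `str.replace(w, linkedWord, 1)` always rewrites the first occurrence in the whole string, and that occurrence may sit inside markup inserted on an earlier iteration.

```python
# Python 3 sketch; link_first is a hypothetical helper mirroring the loop body.
def link_first(text, word):
    linked = "<a href='xxx.com?word=" + word.lower() + "'>" + word + "</a>"
    # The count argument 1 means: replace only the FIRST occurrence in text.
    return text.replace(word, linked, 1)

text = "a cat a"
text = link_first(text, "a")    # links the first "a" as intended
text = link_first(text, "cat")  # links "cat" as intended
text = link_first(text, "a")    # hits the "a" of the earlier "<a href", not the last word
print(text)
```

The printed string begins with `<<a href=` and the final `a` is left without a link, which suggests that any approach based on repeated `replace` calls over the already modified string will run into the same issue.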
My problem: handling the <i> tags, for example turning <i>certain</i> into <i><a href="xxx.com?word=certain">certain</a></i>.
I'm doing this in Python 2.7. This answer for JavaScript is similar and works with Unicode, but it doesn't solve my problems, such as punctuation.
Answer 0 (score: 1)
Regular expressions can help you.
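To match a word of any length, you can use \w+. To skip the lone tags <i> and </i>, you can add a negative lookahead: (?!i>). Because it only looks at the i and the > that follows it, it covers both the opening and the closing tag. Finally, to skip the right-hand side of a contraction, you can add a negative lookbehind in front of the match: (?<!').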
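The three pieces can be tried one at a time. The following is a minimal sketch in Python 3, where `str` patterns are Unicode-aware by default (on Python 2.7 you would presumably need `unicode` strings and the `re.UNICODE` flag). The sample string and the variable names are only illustrative.

```python
import re

sample = "I'm <i>certain</i> in America's West"

# \w+ alone also picks up the tag name "i" and the contraction tails "m" and "s".
plain = re.findall(r"\w+", sample)

# (?!i>) refuses to start a match at an "i" directly followed by ">",
# which covers both <i> and </i>.
no_tags = re.findall(r"(?!i>)\w+", sample)

# (?<!') additionally refuses to start a match right after a straight apostrophe.
words = re.findall(r"(?<!')(?!i>)\w+", sample)

print(plain)    # ['I', 'm', 'i', 'certain', 'i', 'in', 'America', 's', 'West']
print(no_tags)  # ['I', 'm', 'certain', 'in', 'America', 's', 'West']
print(words)    # ['I', 'certain', 'in', 'America', 'West']
```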
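To insert a lowercase version of the match, use a callback function (from Using a regular expression to replace upper case repeated letters in python with a single lowercase letter). The callback lambda takes the match, wraps it in the <a href=...> markup with the lowercased word in the URL, and builds the whole replacement string in one go.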
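The callback mechanism on its own, spelled out as a named function instead of a lambda, might look like the sketch below (Python 3; `link_word` is a hypothetical name). `re.sub` accepts a function as its replacement argument, calls it once per match with the match object, and substitutes whatever string it returns.

```python
import re

def link_word(match):
    # re.sub calls this once per match and substitutes the returned string.
    word = match.group(0)
    return '<a href="xxx.com?word=' + word.lower() + '">' + word + '</a>'

linked = re.sub(r"\w+", link_word, "West")
print(linked)  # <a href="xxx.com?word=west">West</a>
```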
This leads us to
import re
s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights
to erupt at any time anywhere," he said holding a gun in his hand.
"""
callback = lambda pat: '<a href="xxx.com?word='+pat.group(1).lower()+'">'+pat.group(1)+'</a>'
result = re.sub(r"(?<!')(?!i>)(\w+)", callback, s)
result
which ends up as
"<a href="xxx.com?word=i">I</a>'m <i><a href="xxx.com?word=certain">certain</a></i> <a href="xxx.com?word=in">in</a> <a href="xxx.com?word=america">America</a>'s "<a href="xxx.com?word=west">West</a>," ...
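One caveat: the lookbehind (?<!') only blocks a match that starts directly after a straight apostrophe. In could’ve (curly apostrophe) the ve would still be linked, and in a straight 've the e would still match on its own. A possible variation, not the pattern from the answer above, is to let the match consume an optional contraction tail and link only the word part. The sketch below is Python 3; `add_links`, `WORD` and `LINK` are hypothetical names, and on Python 2.7 the same pattern would presumably need `unicode` strings and `re.UNICODE`.

```python
import re

# Base URL taken from the question's example.
LINK = "xxx.com?word="

# group(1): the word; group(2): an optional contraction tail such as 's or ’ve,
# introduced by a straight or a curly apostrophe.
# (?!i>) skips the tag name in <i> and </i>.
WORD = re.compile(r"(?!i>)(\w+)((?:['’]\w+)?)")

def add_links(text):
    def link(match):
        word, tail = match.group(1), match.group(2)
        # Only the word is wrapped; the tail is put back unchanged after the link.
        return "<a href='{}{}'>{}</a>{}".format(LINK, word.lower(), word, tail)
    return WORD.sub(link, text)

print(add_links("He's <i>certain</i> it could’ve been “West,”"))
```

Because `re.sub` does not rescan text it has already inserted, the `a` and `href` of earlier links are never matched again. Note that \w also matches digits and underscores, so a stricter character class may be preferable if those should not count as words.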