Question

I want to add a link to every word in a piece of text.

Sample text:
"He's <i>certain</i> in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.

Desired result:

"<a href='xxx.com?word=he'>He</a>'s
 <i><a href='xxx.com?word=certain'>certain</a></i>
 <a href='xxx.com?word=in'>in</a>
 <a href='xxx.com?word=america'>America</a>'s
 “<a href='xxx.com?word=west'>West</a>,”
 <a href='xxx.com?word=it'>it</a>
 <a href='xxx.com?word=could'>could</a>'ve
.... etc

I have split the output across several lines to make it easier to read here. The actual output should be a single string, for example:

 "<a href='xxx.com?word=he'>He</a>'s <i><a href='xxx.com?word=certain'>certain</a></i> <a href='xxx.com?word=in'>in</a> <a href='xxx.com?word=america'>America</a>'s “<a href='xxx.com?word=west'>West</a>,” <a href='xxx.com?word=it'>it</a> <a href='xxx.com?word=could'>could</a>'ve ... etc

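Every word should get a link, with the word itself stripped of punctuation and contractions. The link is lowercase. Punctuation and contractions should not get links. The words and punctuation are UTF-8, with many Unicode characters. The only HTML elements the text will contain are <i> and </i>, so this is not HTML parsing, just a single tag pair. The link should go on the word inside the <i>...</i> tags.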

My code below works for simple test cases, but it has problems with longer, real text that contains repeated words and tags:

# -*- coding: utf-8 -*-
import re

def addLinks(s):
    #adds a link to dictionary for every word in text
    link = "xxx.com?word="

    #strip out 's, 'd, 'l, 'm, 've, 're
    #then split on punctuation
    words = filter(None, re.split("[, \-!?:_;\"“”‘’‹›«»]+",  re.sub("'[(s|d|l|m|(ve)|(re)]? ", " ", s)))
    for w in words:
        linkedWord = "<a href=#'" + link + w.lower() + "'>" + w + "</a>"
        s = s.replace(w,linkedWord,1)
    return s

s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights to erupt at any time anywhere," he said holding a gun in his hand.
"""
print addLinks(s)

My problems:

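How do I deal with repeated words in a sentence, whether they are exact repeats ("in" <-> "in"), differ in punctuation and/or case ("He" <-> "he"), or are partial words ("gun" <-> "gunfights", "any" <-> "anywhere")? It would be easier if the text split cleanly on spaces, but after stripping contractions and then splitting on punctuation, I can't figure out how to cleanly substitute the linked words back into the string.
My regex for getting rid of contractions works for single letters such as 'm and 'd, but not for 've and 're.
I can't figure out how to handle the tags, for example turning <i>certain</i> into <i><a href="xxx.com?word=certain">certain</a></i>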
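
The first problem can be reproduced in isolation. The sketch below assumes Python 3 and uses a hypothetical helper, naive_link, that mirrors the replace loop in the code above; it is only an illustration of the failure, not part of the original code.

```python
def naive_link(text, words):
    # Replace the first occurrence of each word in turn, as the loop above does.
    for w in words:
        linked = "<a href='xxx.com?word=" + w.lower() + "'>" + w + "</a>"
        text = text.replace(w, linked, 1)
    return text

out = naive_link("gunfights gun", ["gunfights", "gun"])

# The second replace finds "gun" inside the href that was just inserted
# for "gunfights", so the markup is corrupted and the standalone "gun"
# at the end of the string is never linked.
print(out)
```

Because each replace rescans the whole string, including links inserted earlier, any word that is a substring of an earlier word (or of the link markup itself) can be hit in the wrong place.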

I'm doing this in Python 2.7. This answer for JavaScript is similar and works with Unicode, but it doesn't solve my problems, such as the punctuation.

Answer 1

Regular expressions can help you here.

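To match words of any length, you can use \w+. To ignore the single tags <i> and </i>, you can add a negative lookahead: (?!i>). This covers both the opening and the closing tag. Finally, to ignore the right-hand side of a contraction, you can add a negative lookbehind before the match: (?<!').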
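
The effect of each piece can be checked separately with re.findall. This is a small sketch assuming Python 3 (where \w is Unicode-aware for str patterns); the sample string and variable names are made up for illustration.

```python
import re

text = "I'm <i>certain</i> it's fine"

# \w+ alone also picks up the tag name "i" and the contraction tails.
plain = re.findall(r"\w+", text)

# (?!i>) rejects a match that would start at the "i" of <i> or </i>.
no_tags = re.findall(r"(?!i>)\w+", text)

# (?<!') additionally rejects a match that starts right after an apostrophe.
words = re.findall(r"(?<!')(?!i>)\w+", text)

print(plain)    # ['I', 'm', 'i', 'certain', 'i', 'it', 's', 'fine']
print(no_tags)  # ['I', 'm', 'certain', 'it', 's', 'fine']
print(words)    # ['I', 'certain', 'it', 'fine']
```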

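To insert a lowercase version of the matched text, use a callback function (from Using a regular expression to replace upper case repeated letters in python with a single lowercase letter). The callback lambda inserts the lowercase version of the match, surrounded by the <a href=...> markup, and builds the entire replacement string in one go.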
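
As a minimal sketch of the callback mechanism on its own (assuming Python 3; to_link is a hypothetical name and the pattern is reduced to plain \w+ to keep the focus on the callback):

```python
import re

def to_link(match):
    # re.sub calls this once per match and inserts the returned string.
    word = match.group(0)
    return '<a href="xxx.com?word=' + word.lower() + '">' + word + '</a>'

linked = re.sub(r"\w+", to_link, "in In")
print(linked)
# <a href="xxx.com?word=in">in</a> <a href="xxx.com?word=in">In</a>
```

re.sub scans the original string once from left to right and does not rescan the text it has inserted, so repeated words and words that differ only in case each get their own link.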

This leads us to

import re

s = """
"I'm <i>certain</i> in America's “West,” it could’ve been possible for gunfights
to erupt at any time anywhere," he said holding a gun in his hand.
"""

callback = lambda pat: '<a href="xxx.com?word='+pat.group(1).lower()+'">'+pat.group(1)+'</a>'
result = re.sub(r"(?<!')(?!i>)(\w+)", callback, s)

result ends up as

"<a href="xxx.com?word=i">I</a>'m <i><a href="xxx.com?word=certain">
certain</a></i> <a href="xxx.com?word=in">in</a> <a href="xxx.com?
word=america">America</a>'s "<a href="xxx.com?word=west">West</a>," ...
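
One caveat: with the pattern as posted, the lookbehind only guards the character directly after a straight apostrophe, so a multi-letter contraction such as 've still has its tail matched, and the curly apostrophe in could’ve is not covered at all. The following is a hedged variant, not the answer's exact pattern: it assumes Python 3, adds a word boundary and the curly apostrophe, and uses hypothetical names (WORD, link_word, add_links).

```python
import re

# Assumed variant of the pattern:
#   (?<!['’])  do not start right after a straight or curly apostrophe
#   \b         only start at a word boundary, so the "e" of "'ve" is skipped
#   (?!i>)     do not match the tag name in <i> and </i>
WORD = re.compile(r"(?<!['’])\b(?!i>)\w+")

def link_word(match):
    word = match.group(0)
    return '<a href="xxx.com?word=' + word.lower() + '">' + word + '</a>'

def add_links(text):
    return WORD.sub(link_word, text)

# For comparison: the posted pattern also matches the "e" of "'ve".
posted_matches = re.findall(r"(?<!')(?!i>)(\w+)", "they've")

linked = add_links("they've <i>certain</i> could’ve naïve")
print(posted_matches)
print(linked)
```

In Python 2.7 the same idea would presumably need unicode strings and the re.UNICODE flag so that \w also covers non-ASCII letters.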

Add a link to every word, accounting for punctuation, contractions, and Unicode
