Question

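How do I remove the text inside <ref> *some text*</ref>, together with the ref tags themselves, from a string like this one?

'...and so on<ref>Oxford University Press</ref>.'

re.sub(r'<ref>.+</ref>', '', string) only removes the <ref> if the <ref> is followed by whitespace.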

Edit: I think it has something to do with word boundaries... or does it?

EDIT2: What I need is for it to match up to the last (closing) </ref>, even if that tag is on a new line.

Answer 1

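I don't really see your problem, because the code you pasted does remove the <ref>...</ref> part of the string. But if what you mean is that it does not remove empty ref tags: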

re.sub(r'<ref>.+</ref>', '', '...and so on<ref></ref>.')

then what you need to do is change the .+ to .*

A + means one or more, while a * means zero or more.

From http://docs.python.org/library/re.html:

'.' (Dot.) In the default mode, this matches any character except a newline.
    If the DOTALL flag has been specified, this matches any character including
    a newline.
'*' Causes the resulting RE to match 0 or more repetitions of the preceding
    RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’
    followed by any number of ‘b’s.
'+' Causes the resulting RE to match 1 or more repetitions of the preceding
    RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
    not match just ‘a’.
'?' Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
    ab? will match either ‘a’ or ‘ab’.
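
The quoted rules can be checked directly. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not from the original posts. It shows that + leaves an empty tag pair alone while * removes it, and that . only crosses a newline when DOTALL is set.

```python
import re

empty = '...and so on<ref></ref>.'
multiline = '...and so on<ref>Oxford\nUniversity Press</ref>.'

# '+' needs at least one character between the tags, so the empty pair survives.
plus_result = re.sub(r'<ref>.+</ref>', '', empty)

# '*' also accepts zero characters, so the empty pair is removed.
star_result = re.sub(r'<ref>.*</ref>', '', empty)

# Without DOTALL, '.' cannot cross the newline, so nothing matches.
no_dotall = re.sub(r'<ref>.*</ref>', '', multiline)

# With DOTALL, '.' matches the newline as well.
with_dotall = re.sub(r'<ref>.*</ref>', '', multiline, flags=re.DOTALL)

print(repr(plus_result))   # '...and so on<ref></ref>.'
print(repr(star_result))   # '...and so on.'
print(repr(no_dotall))     # unchanged input
print(repr(with_dotall))   # '...and so on.'
```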

Answer 2

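You may want to be careful not to delete a large chunk of text just because there is more than one closing </ref>. In my opinion this regex is more accurate: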

r'<ref>[^<]*</ref>'

This prevents a "greedy" match from running past the first closing tag.
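
As a rough illustration of how the negated character class behaves (the sample strings are invented for this sketch), note that [^<] also matches newlines, so no extra flag is needed for multi-line references. The trade-off is that any '<' inside the reference stops the match.

```python
import re

text = 'one<ref>A</ref> two<ref>B\nC</ref> three'

# [^<]* matches any run of characters other than '<' (newlines included),
# so each match has to end at the first closing tag after its opening tag.
cleaned = re.sub(r'<ref>[^<]*</ref>', '', text)
print(cleaned)  # one two three

# Limitation: a '<' inside the reference blocks the match entirely.
nested = 'x<ref>see <i>Title</i></ref>y'
unchanged = re.sub(r'<ref>[^<]*</ref>', '', nested)
print(unchanged == nested)  # True
```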

BTW: there is a great tool called The Regex Coach for analyzing and testing your regexes. You can find it at http://www.weitz.de/regex-coach/

Edit: forgot to add code tags in the first paragraph.

Answer 3

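You could build a fancy regex that does exactly what you want, but you would need to use DOTALL and non-greedy matching, and you would need to understand how regexes work in general, which you don't.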
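
For reference, the regex alluded to here might look like the sketch below. REF_RE and strip_refs are hypothetical names chosen for this example, not something from the original posts. DOTALL lets . cross newlines, and the non-greedy .*? stops at the nearest closing tag.

```python
import re

# Hypothetical names for illustration only.
# re.DOTALL lets '.' match newlines; '.*?' is the non-greedy form of '.*'.
REF_RE = re.compile(r'<ref>.*?</ref>', re.DOTALL)

def strip_refs(text):
    """Remove every <ref>...</ref> span, including empty and multi-line ones."""
    return REF_RE.sub('', text)

sample = 'a<ref>one\ntwo</ref>b<ref></ref>c'
print(strip_refs(sample))  # abc
```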

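Your best option is to use string methods rather than regexes, which is more Pythonic anyway:

# Remove each <ref>...</ref> span in turn, using only string methods.
while '<ref>' in string:
    begin, end = string.split('<ref>', 1)
    # Discard the reference text up to the first closing tag.
    trash, end = end.split('</ref>', 1)
    string = begin + end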
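
One caveat with the loop above is that the second split raises ValueError when an opening tag has no closing tag. A possible variant, sketched here with str.partition, leaves an unmatched tag in place instead; the function name remove_tagged and its parameters are assumptions made for this example.

```python
def remove_tagged(text, open_tag='<ref>', close_tag='</ref>'):
    """Drop every open_tag...close_tag span using only string methods."""
    parts = []
    while True:
        before, found_open, rest = text.partition(open_tag)
        parts.append(before)
        if not found_open:
            break  # no more opening tags
        _, found_close, after = rest.partition(close_tag)
        if not found_close:
            # Unmatched opening tag: keep the remainder verbatim.
            parts.append(found_open + rest)
            break
        text = after
    return ''.join(parts)

print(remove_tagged('a<ref>x</ref>b<ref>y\nz</ref>c'))  # abc
print(remove_tagged('a<ref>x'))                         # a<ref>x
```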

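If you want to be very general and allow tags within tags, odd capitalization, or whitespace and attributes inside the tags, you shouldn't do this either. Invest instead in learning an HTML/XML parsing library. lxml currently seems to be widely recommended and well supported.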

Answer 4

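If you try to do this with regular expressions, you are in for a world of trouble. You are effectively trying to parse something, but your parser isn't up to the task.

Matching greedily between two strings will probably eat too much, as in the following example: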

<ref>SDD</ref>...<ref>XX</ref>

You end up wiping out the whole middle part as well.

You really want a parser, such as Beautiful Soup.
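
The over-matching is easy to reproduce with the pattern from the question. This is only a small demonstration; the variable names are made up for it.

```python
import re

s = '<ref>SDD</ref>...<ref>XX</ref>'

# The greedy '.+' runs to the LAST '</ref>', so a single match spans both
# references and the '...' between them.
matched = re.search(r'<ref>.+</ref>', s).group(0)
greedy = re.sub(r'<ref>.+</ref>', '', s)

print(matched == s)   # True
print(repr(greedy))   # ''
```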

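# Beautiful Soup 4 (package bs4); the old BeautifulSoup 3 module is Python 2 only.
from bs4 import BeautifulSoup

s = "<a>sfsdf</a> <ref>XX</ref> || <ref>YY</ref>"
soup = BeautifulSoup(s, "html.parser")
# Replace every <ref> element, text included, with a placeholder.
for ref in soup.find_all("ref"):
    ref.replace_with("!")
print(soup)  # <a>sfsdf</a> ! || !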
