Question

I am trying to avoid matching words that come directly after or before an XML tag.

import re

strTest = "<random xml>hello this was successful price<random xml>"

for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
     c1 = c.group(1)
     c2 = c.group(2)
     if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
          print(c1)

The result is:

xml
this
was
successful
xml

Desired result:

this
was
successful

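I have been experimenting with negative lookahead and negative lookbehind assertions. I am not sure whether this is the right approach, and I would appreciate any help.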

Answer 1

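First, to answer your question directly:

I do this by examining each 'word', where a 'word' is a sequence of characters consisting of (mainly) letters, '<' or '>'. When the regex hands each one to some_only, I look for either of those last two characters. If neither is present, I print the 'word'.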

>>> import re
>>> strTest = "<random xml>hello this was successful price<random xml>"
>>> def some_only(matchobj):
...     if '<' in matchobj.group() or '>' in matchobj.group():
...         pass
...     else:
...         print (matchobj.group())
...         pass
... 
>>> ignore = re.sub(r'[<>\w]+', some_only, strTest)
this
was
successful

This works for your test string; however, as others have already mentioned, using regular expressions on XML usually leads to a lot of trouble.
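For what it's worth, the same filtering can also be written with lookarounds alone, which is closer to what the question was attempting. This is only a sketch: it assumes a 'word' is a run of \w characters and that a word should be dropped whenever it touches '<' or '>' on either side. The pattern and the variable names are my own, not taken from the question.

```python
import re

strTest = "<random xml>hello this was successful price<random xml>"

# (?<![<>\w])  the word must not be preceded by '<', '>' or another word character
# (?![<>\w])   the word must not be followed by '<', '>' or another word character
# Including \w in both lookarounds stops the engine from backtracking
# and returning a fragment such as 'pric' out of 'price<'.
pattern = re.compile(r'(?<![<>\w])\w+(?![<>\w])')

matches = pattern.findall(strTest)
print(matches)
```

Note that 'xml' is rejected here because it is followed by '>', and 'hello' because it is preceded by '>'.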

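To use a more conventional approach, I had to tidy up some errors in that XML string, namely changing random xml to random_xml and using a proper closing tag.

I prefer to use the lxml library.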

>>> strTest = "<random_xml>hello this was successful price</random_xml>"
>>> from lxml import etree
>>> tree = etree.fromstring(strTest)
>>> tree.text
'hello this was successful price'
>>> tree.text.split(' ')
['hello', 'this', 'was', 'successful', 'price']
>>> tree.text.split(' ')[1:-1]
['this', 'was', 'successful']
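If lxml is not installed, the standard library's xml.etree.ElementTree should behave the same way for this case. A minimal sketch, assuming the string has already been corrected into well-formed XML as described above; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

xml_text = "<random_xml>hello this was successful price</random_xml>"

# Parse the string; fromstring returns the root element.
root = ET.fromstring(xml_text)

# root.text is the text between the opening and closing tag.
words = root.text.split()

# Drop the first and last word, i.e. the ones adjacent to the tags.
inner_words = words[1:-1]
print(inner_words)
```

Calling split() without an argument splits on any run of whitespace, so it also copes with double spaces or line breaks inside the element.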

Answer 2

I'll give it a try. Since we are already doing more than just a regex, put the matches into a list and remove the first and last items:

import re

strTest = "<random xml>hello this was successful price<random xml>"

thelist = []

for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
     c1 = c.group(1)
     c2 = c.group(2)
     if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
          thelist.append(c1)

thelist = thelist[1:-1]

print (thelist)

Result:

['this', 'was', 'successful']

Personally I would try to parse the XML, but since you already have this code, a small modification will do.

Answer 3

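A simple way to do this with a list, although I am assuming that the words directly after or before an XML tag are not separated from the tag itself by whitespace: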

test = "<random xml>hello this was successful price<random xml>"

test = test.split()

new_test = []
for val in test:
    if "<" not in val and ">" not in val:
        new_test.append(val)

print(new_test)

The result will be:

['this', 'was', 'successful']
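To make the whitespace assumption concrete, here is a small sketch comparing the original string with a variant where the tags are separated from the text by spaces. The helper name words_not_touching_tags and the second test string are made up for illustration.

```python
def words_not_touching_tags(text):
    """Keep whitespace-separated tokens that contain neither '<' nor '>'."""
    return [w for w in text.split() if "<" not in w and ">" not in w]

# Tags glued to the neighbouring words: 'xml>hello' and 'price<random'
# are single tokens, so they are filtered out.
attached = "<random xml>hello this was successful price<random xml>"

# Hypothetical variant with spaces around the tags: 'hello' and 'price'
# become standalone tokens and survive the filter.
spaced = "<random xml> hello this was successful price <random xml>"

print(words_not_touching_tags(attached))
print(words_not_touching_tags(spaced))
```

So with spaced input this approach no longer excludes the words next to the tags, and a regex with lookarounds or a real XML parser would be needed instead.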

Answer 4

My solution...

I don't think there is any need to use a regex at all; you can solve it with a one-line list comprehension:

words = [w for w in test.split() if "<" not in w and ">" not in w]
