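Match a word only if it is not followed or preceded by < or >

Time: 2017-07-26 15:10:27

Tags: python regex

I am trying to avoid matching words that come directly after or before an XML tag.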

import re

strTest = "<random xml>hello this was successful price<random xml>"

for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
    c1 = c.group(1)
    c2 = c.group(2)
    if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
        print(c1)

The result is:

xml
this
was
successful
xml

Wanted result:

this
was
successful

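I have been experimenting with negative lookahead and negative lookbehind assertions. I am not sure whether this is the right approach, and I would appreciate any help.

4 answers:

Answer 0: (score: 2)

First, to answer your question directly:

I do this by examining each 'word', where a word is a sequence of characters made up (mainly) of letters, '<' or '>'. As the regex hands each one to some_only, I look for either of the latter two characters. If neither is present, I print the 'word'.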

>>> import re
>>> strTest = "<random xml>hello this was successful price<random xml>"
>>> def some_only(matchobj):
...     if '<' in matchobj.group() or '>' in matchobj.group():
...         pass
...     else:
...         print (matchobj.group())
...         pass
... 
>>> ignore = re.sub(r'[<>\w]+', some_only, strTest)
this
was
successful
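
The callback above is called only for its printing side effect, and the string returned by re.sub is discarded. As a rough sketch of the same idea, assuming a list is more convenient than printed output, re.findall can collect the tokens and a comprehension can filter them (the names tokens and clean_words are illustrative, not from the answer):

```python
import re

strTest = "<random xml>hello this was successful price<random xml>"

# Same character class as above: runs of word characters, '<' or '>'.
tokens = re.findall(r'[<>\w]+', strTest)

# Keep only the tokens that do not touch an angle bracket.
clean_words = [t for t in tokens if '<' not in t and '>' not in t]
print(clean_words)
```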

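This works for your test string; however, as others have already mentioned, using regex on XML usually leads to a lot of trouble.

To use a more conventional approach, I had to tidy up some errors in that XML string, namely changing random xml to random_xml and using a proper closing tag.

I prefer to use the lxml library.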

>>> strTest = "<random_xml>hello this was successful price</random_xml>"
>>> from lxml import etree
>>> tree = etree.fromstring(strTest)
>>> tree.text
'hello this was successful price'
>>> tree.text.split(' ')
['hello', 'this', 'was', 'successful', 'price']
>>> tree.text.split(' ')[1:-1]
['this', 'was', 'successful']
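
For readers who do not have lxml installed, the same steps can be sketched with the standard library's xml.etree.ElementTree. This is a swap made for illustration, not the library this answer uses. It assumes the same repaired string and that all the words sit directly in the root element's text:

```python
import xml.etree.ElementTree as ET

str_test = "<random_xml>hello this was successful price</random_xml>"

# Parse the (repaired) XML string; fromstring returns the root element.
root = ET.fromstring(str_test)

# root.text is the text between the opening and closing tags.
words = root.text.split()

# Drop the first and last word, which are the ones adjacent to the tags.
inner_words = words[1:-1]
print(inner_words)
```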

Answer 1: (score: 0)

I'll give it a try. Since we are already doing more than just regex, put the matches into a list and drop the first and last items:

import re

strTest = "<random xml>hello this was successful price<random xml>"

thelist = []

for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
     c1 = c.group(1)
     c2 = c.group(2)
     if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
          thelist.append(c1)

thelist = thelist[1:-1]

print (thelist)

Result:

['this', 'was', 'successful']

Personally I would try to parse the XML, but since you already have this code, a small modification is enough.
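
If the slicing step feels fragile, a lookaround-only pattern is another option. The following is a sketch rather than a fix to the original pattern: it assumes a word should be dropped whenever it is directly adjacent to < or >, and it puts \w in both lookarounds so the engine cannot fall back to a partial word such as ello.

```python
import re

strTest = "<random xml>hello this was successful price<random xml>"

# (?<![<>\w])  the match must not start right after '<', '>' or a word character
# (?![<>\w])   the match must not end right before '<', '>' or a word character
pattern = re.compile(r'(?<![<>\w])\w+(?![<>\w])')

words = pattern.findall(strTest)
print(words)
```

Unlike the slice, this does not depend on how many unwanted matches sit at each end of the list. It is still not a parser: a word in the middle of a tag with three or more words touches no bracket and would be matched.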

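Answer 2: (score: 0)

A simple way to do this with a list, although I am assuming that a word directly after or before an XML tag is not separated from that tag by a space: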

test = "<random xml>hello this was successful price<random xml>"

test = test.split()

new_test = []
for val in test:
    if "<" not in val and ">" not in val:
        new_test.append(val)

print(new_test)

The result will be:

['this', 'was', 'successful']

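Answer 3: (score: 0)

My solution...

I don't think regex is needed at all; you can solve it with a one-line list comprehension:

test = "<random xml>hello this was successful price<random xml>"
words = [w for w in test.split() if "<" not in w and ">" not in w]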
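
The split-based answers rely on a tag and its neighbouring word being glued together with no space in between. Here is a small sketch of what happens when that assumption does not hold (the function name and the spaced sample string are made up for illustration):

```python
def words_not_touching_tags(text):
    """Return whitespace-separated tokens that contain neither '<' nor '>'."""
    return [w for w in text.split() if "<" not in w and ">" not in w]

glued = "<random xml>hello this was successful price<random xml>"
spaced = "<random xml> hello this was successful price <random xml>"

glued_words = words_not_touching_tags(glued)
spaced_words = words_not_touching_tags(spaced)

print(glued_words)   # 'hello' and 'price' are dropped with the tags
print(spaced_words)  # 'hello' and 'price' survive because they are separate tokens
```

If the input may contain such spaces, parsing the XML and working on the element text is likely the safer route.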