I am trying to avoid matching words that come directly after or directly before an XML tag.
import re
strTest = "<random xml>hello this was successful price<random xml>"
for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
    c1 = c.group(1)
    c2 = c.group(2)
    if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
        print(c1)
The result is:
xml
this
was
successful
xml
Desired result:
this
was
successful
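I have been experimenting with negative lookahead and negative lookbehind assertions. I'm not sure this is the right approach, and I would appreciate any help.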
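Answer 0: (score: 2)
First, to answer your question directly:
I do this by examining each 'word', where a word is a run of characters made up of (mainly) letters, '<' or '>'. As the regex hands each one to some_only, I look for either of those last two characters. If neither appears, I print the 'word'.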
>>> import re
>>> strTest = "<random xml>hello this was successful price<random xml>"
>>> def some_only(matchobj):
...     if '<' in matchobj.group() or '>' in matchobj.group():
...         pass
...     else:
...         print (matchobj.group())
...     pass
...
>>> ignore = re.sub(r'[<>\w]+', some_only, strTest)
this
was
successful
This works for your test string; however, as others have already mentioned, using regex on XML usually leads to many difficulties.
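If a single pattern is still preferred, the lookaround idea from the question can be made to work for this test string. The following is only a sketch: the names str_test, pattern and words are chosen here for illustration, and the pattern is an assumption about what "next to a tag" means (a word touching '<' or '>' on either side), not a general XML solution.

```python
import re

str_test = "<random xml>hello this was successful price<random xml>"

# A word qualifies only if the character on each side is neither
# an angle bracket nor another word character. Including \w in the
# lookarounds stops the engine from matching a fragment of a longer word.
pattern = re.compile(r'(?<![<>\w])\w+(?![<>\w])')

words = pattern.findall(str_test)
print(words)
```

Note that a word in the middle of a longer tag, such as b in `<a b c>`, touches no bracket and would still be matched, which is one more reason to parse the XML when the input is real XML.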
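To use a more conventional approach, I had to tidy up some errors in that XML string, namely changing random xml to random_xml and using a proper closing tag.
I prefer to use the lxml library.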
>>> strTest = "<random_xml>hello this was successful price</random_xml>"
>>> from lxml import etree
>>> tree = etree.fromstring(strTest)
>>> tree.text
'hello this was successful price'
>>> tree.text.split(' ')
['hello', 'this', 'was', 'successful', 'price']
>>> tree.text.split(' ')[1:-1]
['this', 'was', 'successful']
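If lxml is not installed, roughly the same steps can be sketched with the standard library's xml.etree.ElementTree instead. This swaps in a different parser from the one used above and assumes the corrected string with the random_xml element; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Corrected input: a valid element name and a proper closing tag
str_test = "<random_xml>hello this was successful price</random_xml>"

root = ET.fromstring(str_test)

# All words in the element's text, split on whitespace
inner_words = root.text.split()

# Drop the first and last word, i.e. the ones adjacent to the tags
middle_words = inner_words[1:-1]
print(middle_words)
```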
Answer 1: (score: 0)
I'll give it a try. Since we are already doing more than just a regex, put the matches into a list and remove the first and last items:
import re
strTest = "<random xml>hello this was successful price<random xml>"
thelist = []
for c in re.finditer(r'(?<![<>])(\b\w+\b)(?<!=[<>])(\W+)',strTest):
    c1 = c.group(1)
    c2 = c.group(2)
    if ('<' != c2[0]) and ('<' != c.group(1)[len(c.group(1))-1]):
        thelist.append(c1)
thelist = thelist[1:-1]
print (thelist)
Result:
['this', 'was', 'successful']
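Personally I would try parsing the XML, but since you already have this code, a small modification is enough.
Answer 2: (score: 0)
A simple way to do this with a list, although I am assuming that the words directly after or before an XML tag are not separated from the tag by a space: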
test = "<random xml>hello this was successful price<random xml>"
test = test.split()
new_test = []
for val in test:
    if "<" not in val and ">" not in val:
        new_test.append(val)
print(new_test)
The result will be:
['this', 'was', 'successful']
Answer 3: (score: 0)
I don't think there is any need to use regex at all; you can solve it with a one-line list comprehension:
words = [w for w in test.split() if "<" not in w and ">" not in w]
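Filtering on whitespace-separated tokens depends on the neighbouring words being glued to the tags. The sketch below illustrates where that holds and where it does not; the function name words_away_from_tags and the second input string are made up for this illustration.

```python
def words_away_from_tags(text):
    """Keep whitespace-separated tokens that contain no angle bracket."""
    return [w for w in text.split() if "<" not in w and ">" not in w]

# Words adjacent to the tags are glued to them, so they get filtered out
glued = "<random xml>hello this was successful price<random xml>"

# Hypothetical variant: spaces between the tags and the neighbouring words
spaced = "<random xml> hello this was successful price <random xml>"

glued_result = words_away_from_tags(glued)
spaced_result = words_away_from_tags(spaced)
print(glued_result)
print(spaced_result)
```

With the spaced input, hello and price survive the filter because they no longer share a token with a tag.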