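I want to get all the elements that do not have a pair.
This is a list of XML tags, read from top to bottom, with the angle brackets removed.
I want to find pairs (for example the opening tag 'note' and the closing tag '/note'), remove them from the list, and be left with the tags that have no pair.
How can I walk through the list, compare each tag with all the other tags, and say something like: aha, I found another tag that is the same but starts with a forward slash?
Thanks.
Are there other, better ideas for finding unmatched tags?
PS: I do want to preserve the order of the list and, if possible, use equality when comparing a tag with another tag in the list. A substring search with the 'in' operator will not work, because if a tag name is a single letter such as 'a', the search returns every element that contains an a rather than only the exact match 'a'.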
tags = ['note', 'to', 'bbb', 'bbb', 'firstname', '/firstname', 'lastname', '/lastname', 'from', 'hello', 'hello', 'hello', 'hello', 'hello', 'l', '/from', '/to', 'elephant', 'll', 'from', '/from', 'a1', 'img', 'a2', 'from', 'from', '/from', '/from', '/a2', '/img', '/a1', 'heading', '/heading', 'body', '/body', '/note']
Answer 0 (score: 0)
You can create a set of all the closing tags and then use that set to filter the tags.
>>> closing = set([t for t in tags if t.startswith("/")])
>>> [t for t in tags if "/" + t not in closing and t not in closing]
['bbb', 'bbb', 'hello', 'hello', 'hello', 'hello', 'hello', 'l', 'elephant', 'll']
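Note, however, that this does not really respect "pairs" of tags; it only checks whether a "closing" variant of the same tag exists anywhere in the list. For example, given tags = ["a", "a", "/a"] or tags = ["a", "/a", "a"], it removes both instances of a from the list.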
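A pair-aware variant can be sketched as follows. This is only an illustration, not part of the answer above: `unmatched_tags` is a hypothetical helper name, and the sketch assumes that a closing tag should cancel the most recent unpaired opening tag with exactly the same name, without checking nesting. Names are compared as dictionary keys, so 'a' never matches 'a1', and the original order is preserved.

```python
from collections import defaultdict


def unmatched_tags(tags):
    """Return the tags that have no partner, in their original order.

    Hypothetical helper: nesting is NOT checked, only one-to-one pairing.
    """
    open_positions = defaultdict(list)  # tag name -> indices of unpaired openers
    matched = set()                     # indices of tags that found a partner

    for i, tag in enumerate(tags):
        if tag.startswith("/"):
            name = tag[1:]
            if open_positions[name]:
                # Pair this closing tag with the most recent unpaired opener.
                matched.add(open_positions[name].pop())
                matched.add(i)
        else:
            open_positions[tag].append(i)

    return [tag for i, tag in enumerate(tags) if i not in matched]


print(unmatched_tags(["a", "a", "/a"]))  # ['a']
print(unmatched_tags(["a", "/a", "a"]))  # ['a']
```

In both edge cases one a survives, because each closing tag cancels exactly one opening tag instead of every tag with that name.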
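Answer 1 (score: 0)
The first part of the program extracts all the tags from the document. Seen this way, it is the same problem as finding unmatched brackets. It can be solved by treating a list as a stack and working out which tags are faulty while iterating over the tags.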
import re


def clean_attr(attr):
    # Strip attributes from an opening tag: '<book id="bk101">' -> '<book>'.
    attr_list = re.split(r'\s+', attr)
    if len(attr_list) == 1:
        return attr
    else:
        return attr_list[0] + '>'

line="""
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertant trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.
</book>
</catalog>
"""
# Opening tags (possibly with attributes) and closing tags, collected
# separately for inspection; only all_attrs is used below.
attr_open = re.findall(r'<[\w+\s=\"]+>', line)
attr_closed = re.findall(r'<\/\w+>', line)
# All tags in document order, with attributes stripped.
all_attrs = re.findall(r'<[\w+\s=\"]+>|<\/\w+>', line)
all_attrs_cleaned = [clean_attr(attr) for attr in all_attrs]

list_as_stack = []  # opening tags still waiting for their closing tag
not_closed = []     # closing tags that did not match the top of the stack

for an_attr in all_attrs_cleaned:
    if not an_attr.startswith('</'):
        list_as_stack.append(an_attr)
    elif list_as_stack and (re.search(r'\w+', list_as_stack[-1]).group(0)
                            == re.search(r'\w+', an_attr).group(0)):
        # The closing tag matches the most recent opening tag.
        list_as_stack.pop()
    else:
        # Mismatch, or nothing left on the stack to match against.
        not_closed.append(an_attr)

print(list_as_stack)
print(not_closed)
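In the program above, the first list tells you which tags were never closed, and the second list tells you which closing tags have no matching opening tag.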
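The same stack idea can also be applied directly to the flat tag list from the question. The sketch below is an assumption-laden illustration rather than the answer's method: `unmatched_with_nesting` is a hypothetical name, and it assumes that when a closing tag matches an opener deeper in the stack, every opener above it should be reported as unclosed. Tags are compared with ==, and the result keeps the original order.

```python
def unmatched_with_nesting(tags):
    """Return tags without a properly nested partner, in original order.

    Hypothetical helper working on bracket-free names such as 'note', '/note'.
    """
    stack = []      # indices of opening tags still waiting to be closed
    unmatched = []  # indices of tags without a properly nested partner

    for i, tag in enumerate(tags):
        if not tag.startswith("/"):
            stack.append(i)
            continue
        name = tag[1:]
        if any(tags[j] == name for j in stack):
            # Openers above the nearest matching one were never closed.
            while tags[stack[-1]] != name:
                unmatched.append(stack.pop())
            stack.pop()  # the matching opener itself
        else:
            unmatched.append(i)  # closing tag with no opener on the stack

    unmatched.extend(stack)  # openers that were never closed
    return [tags[i] for i in sorted(unmatched)]


print(unmatched_with_nesting(["a", "b", "/a", "/b"]))  # ['b', '/b']
```

Unlike simple pairing, overlapping tags such as a, b, /a, /b are reported here, because b is still open when /a arrives and /b then has nothing left to close.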