Question

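I want to parse strings that may contain XML/HTML-like markup tags. However, I'd like to avoid third-party modules such as lxml or BeautifulSoup, because the tags are very simple: each consists of just a name, and they cannot overlap, nest, or carry attributes.

For these reasons I've been trying to do it with only the built-in re module and regular expressions.

Here's what I've tried so far: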

import re

pattern = r'<(?P<tag>\w+)>(?P<content>.+)</(?P=tag)>'
my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")    
print(re.findall(pattern, my_str))

Result:

[('first', 'sample stuff'), ('second', 'middle')]

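This is fine, it gives me all the matched tags and the information about them, but I also need the text that doesn't match the pattern, because it has to be processed too (in the order encountered). So next I tried the module's split() function to break the string into tagged and untagged parts, like this:

print(re.split(pattern, my_str))

Result: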

["Here's some ", 'first', 'sample stuff', ' in the ', 'second', 'middle',
 ' of some other text.']

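This looks promising, since the result now contains everything in the string, both the parts that match the pattern and the parts that don't. However, it's hard to tell which is which in the flat list of strings it returns.

So my question is whether these shortcomings can be fixed, and if so how (without resorting to third-party modules).

Ideally I'd get something simple like the following, in which the information for any tagged content is easy to distinguish: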

["Here's some ", ('first', 'sample stuff'), ' in the ', ('second', 'middle'),
 ' of some other text.']

Answer 1

You can use finditer and perform the split manually based on the match spans, something like this (disclaimer: edge cases untested!):

def split_and_keep(pattern, my_str):
    index = 0
    for match in re.finditer(pattern, my_str):
        span = match.span()
        if span[0] > index:
            yield my_str[index:span[0]]
        index = span[1]
        yield match.groupdict()
    left = my_str[index:]
    if left:
        yield left

This gives:

>>> for part in split_and_keep(pattern, my_str):
...     print(repr(part))
...     
"Here's some "
{'content': 'sample stuff', 'tag': 'first'}
' in the '
{'content': 'middle', 'tag': 'second'}
' of some other text.'

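Here you can tell matches and non-matches apart by their type (dict versus str), but obviously you can adapt this into something more sensible.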
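As a rough illustration of telling the parts apart by type, here is a minimal sketch. It repeats split_and_keep from above so that it runs on its own; render and its bracketed output format are hypothetical, invented only for this example.

```python
import re

pattern = r'<(?P<tag>\w+)>(?P<content>.+)</(?P=tag)>'

def split_and_keep(pattern, my_str):
    # Same generator as above: str for unmatched text, dict for matches.
    index = 0
    for match in re.finditer(pattern, my_str):
        start, end = match.span()
        if start > index:
            yield my_str[index:start]
        index = end
        yield match.groupdict()
    if my_str[index:]:
        yield my_str[index:]

def render(parts):
    # Hypothetical consumer: dispatch on the type of each part.
    out = []
    for part in parts:
        if isinstance(part, dict):
            # Tagged content arrives as a dict with 'tag' and 'content' keys.
            out.append('[{tag}: {content}]'.format(**part))
        else:
            # Untagged text arrives as a plain string.
            out.append(part)
    return ''.join(out)

my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")
rendered = render(split_and_keep(pattern, my_str))
print(rendered)
```

The isinstance check is the only place that needs to know how matches are represented, so swapping the dict for a tuple or a small class would only touch that branch.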

Answer 2

You can use the match object's start() and end() methods.

last_match = 0
for match in re.finditer(pattern, my_str):
    print('this text matched:', match.group())
    # Everything between the previous match and this one did not match.
    print("this text didn't:", my_str[last_match:match.start()])
    last_match = match.end()
print('remaining text:', my_str[last_match:])
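The same start()/end() bookkeeping can collect the pieces into the mixed list the question asks for. This is only a sketch: split_tagged is a hypothetical name, and it assumes the pattern defines the named groups tag and content, as the one in the question does.

```python
import re

pattern = r'<(?P<tag>\w+)>(?P<content>.+)</(?P=tag)>'

def split_tagged(pattern, text):
    # Hypothetical helper: plain strings for untagged text,
    # (tag, content) tuples for tagged text, in document order.
    parts = []
    last_end = 0
    for match in re.finditer(pattern, text):
        # Skip the gap when two tags are adjacent or the text starts with a tag.
        if match.start() > last_end:
            parts.append(text[last_end:match.start()])
        parts.append((match.group('tag'), match.group('content')))
        last_end = match.end()
    # Keep whatever follows the final match, if anything.
    if last_end < len(text):
        parts.append(text[last_end:])
    return parts

my_str = ("Here's some <first>sample stuff</first> in the "
          "<second>middle</second> of some other text.")
print(split_tagged(pattern, my_str))
```

Skipping empty gaps means a string that begins or ends with a tag produces no empty strings in the result.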

Answer 3

How about the ElementTree module? It's built in, so there's no third-party dependency, and it handles this easily.

from xml.etree import ElementTree as ET

data = ("Here's some <first>sample stuff</first> in the "
        "<second>middle</second> of some other text.")

root = ET.fromstring('<x>%s</x>' % data)

# First block of text (before any tags)
print(root.text)

for child in root:
    # Tag, and text within tag
    print((child.tag, child.text))
    # Next block of text outside tags
    print(child.tail)

Output:

Here's some
('first', 'sample stuff')
 in the
('second', 'middle')
 of some other text.

It would be easy to rearrange this to output a list, if that's what you need, or to write it as a generator:

def parse(data):
    root = ET.fromstring('<x>%s</x>' % data)
    yield root.text
    for child in root:
        yield (child.tag, child.text)
        yield child.tail
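One thing to watch for with the generator above: ElementTree sets text and tail to None when there is nothing there, for example when the string starts with a tag or a tag is empty. A minimal sketch of a variant that drops those gaps follows; parse_clean is a hypothetical name, and mapping an empty tag's content to '' is an assumption about what the caller wants.

```python
from xml.etree import ElementTree as ET

def parse_clean(data):
    # Wrap the fragment in a dummy root element so it parses as one document.
    root = ET.fromstring('<x>%s</x>' % data)
    # root.text is None when the data starts with a tag.
    if root.text:
        yield root.text
    for child in root:
        # child.text is None for an empty element such as <first></first>.
        yield (child.tag, child.text or '')
        # child.tail is None when nothing follows the closing tag.
        if child.tail:
            yield child.tail

sample = "<first>a</first> mid <second></second>"
print(list(parse_clean(sample)))
```

Note that ET.fromstring raises xml.etree.ElementTree.ParseError on input that is not well-formed XML, such as text containing a bare & character, so this approach suits input known to be clean.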

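Regarding the comment about YAGNI: the statement is "Always implement things when you actually need them, never when you just foresee that you need them." The key word is "implement". You aren't implementing an XML parser, just using one. I completely agree with the principle as it applies to your own code, but it doesn't apply to every library you use, or where would it stop? Python includes many functions you'll never use; does that mean you should compile your own Python with those functions removed? The YAGNI principle is a great approach to writing code, but not to using other people's code. In fact, if you follow the reasoning behind the principle, you should use a prebuilt library rather than write your own. The reasoning is:

You save time, because you avoid writing code that you turn out not to need.
Your code is better, because you avoid polluting it with "guesses" that turn out to be more or less wrong but stick around anyway.

So, to save time, avoid writing code when you can use code that has already been written. To make your code better, avoid polluting it with code that duplicates existing functionality.

Also consider that writing your own mini XML parser to strip out unused code and (possibly) improve performance could easily be regarded as premature optimization.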

Answer 4

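Here's something that looks fairly simple. It uses re.findall() and a slightly modified version of the regex Rawing suggested, which also captures any untagged text preceding a tag.

I've extended the set of test strings to include all the edge cases I could think of. The slight modification to Rawing's regex is changing (?P<content>.+) to (?P<content>.*), so that empty constructs like '<abc></abc>' are also treated as valid tags.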

from __future__ import print_function
import re

pattern = r'(?P<text>.*?)(?:<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>|$)'

testcases = [ "Here's some <first>sample stuff</first> in the "
                "<second>middle</second> of some other text.",
              "<first>sample stuff</first> in the "
                "<second>middle</second> of some other text.",
              "Here's some <first>sample stuff</first> in the "
                "<second>middle</second>",
              "<first>sample stuff</first> in the <second>middle</second>",
              "Here's some ",
              "<first>sample stuff</first>",
              "<first></first>",
]

for my_str in testcases:
    print(' my_str: {!r}'.format(my_str))
    # nitty-gritty of conversion from match objects to list
    results = []
    for text, tag, content in re.findall(pattern, my_str):
        if text: results.append(text)
        if tag: results.append((tag, content))
    print('results: {}\n'.format(results))

Output:

 my_str: "Here's some <first>sample stuff</first> in the <second>middle</second> of some other text."
results: ["Here's some ", ('first', 'sample stuff'), ' in the ', ('second', 'middle'), ' of some other text.']

 my_str: '<first>sample stuff</first> in the <second>middle</second> of some other text.'
results: [('first', 'sample stuff'), ' in the ', ('second', 'middle'), ' of some other text.']

 my_str: "Here's some <first>sample stuff</first> in the <second>middle</second>"
results: ["Here's some ", ('first', 'sample stuff'), ' in the ', ('second', 'middle')]

 my_str: '<first>sample stuff</first> in the <second>middle</second>'
results: [('first', 'sample stuff'), ' in the ', ('second', 'middle')]

 my_str: "Here's some "
results: ["Here's some "]

 my_str: '<first>sample stuff</first>'
results: [('first', 'sample stuff')]

 my_str: '<first></first>'
results: [('first', '')]
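One edge case the test strings above don't cover is the same tag name appearing twice. Because (?P<content>.*) is greedy, it runs on to the last matching closing tag. Since the question says tags never nest, a lazy quantifier may be the safer choice; the sketch below compares the two on a made-up sample string, using only the tag part of the pattern.

```python
import re

# Greedy content, as in the pattern above (tag part only).
greedy = r'<(?P<tag>\w+)>(?P<content>.*)</(?P=tag)>'
# Lazy content: stops at the first matching closing tag.
lazy = r'<(?P<tag>\w+)>(?P<content>.*?)</(?P=tag)>'

sample = "<a>x</a> and <a>y</a>"

greedy_result = re.findall(greedy, sample)
lazy_result = re.findall(lazy, sample)

print(greedy_result)  # [('a', 'x</a> and <a>y')]
print(lazy_result)    # [('a', 'x'), ('a', 'y')]
```

The lazy form still accepts empty constructs such as '<first></first>', so it keeps the behaviour the .* change was meant to add.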

Find the text in a string that matches a pattern, as well as the parts that don't
