Question

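Consider a list of strings. I want to find every substring that starts with < and ends with >.

How can I do this?

I have tried adapting the regular expression from this question: Regular expression to return text between parenthesis

However, since I am not familiar with regular expressions, none of my attempts worked.

Note 1: I am not set on regular expressions; any working solution is welcome.

Note 2: I am not parsing HTML or any other markup language.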

Answer 1

Use re.findall:

import re
s = "x<first>x<second>x"  # example input; substitute your own string
matches = re.findall(r"<(.*?)>", s)

I have found RegExr to be a great site for experimenting with regular expressions.
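
Since the question is about a list of strings, here is a minimal sketch of applying the same re.findall pattern to each string in turn. The helper name find_bracketed and the sample list are made up for illustration and do not come from the question. Because the pattern contains a capturing group, re.findall returns only the captured text, without the angle brackets.

```python
import re

def find_bracketed(strings):
    """Return, for each input string, the list of texts found between < and >."""
    pattern = re.compile(r"<(.*?)>")  # non-greedy: stops at the first ">"
    return [pattern.findall(s) for s in strings]

# Hypothetical sample data
strings = ["x<first>x<second>x", "x<third>x", "no brackets here"]
print(find_bracketed(strings))
# [['first', 'second'], ['third'], []]
```

A string with no match simply contributes an empty list, so the result lines up one-to-one with the input list.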

Answer 2

This should do what you need.

import re

strings = ["x<first>x<second>x", "x<third>x"]
result = [substring for string in strings for substring in re.findall(r"<.*?>", string)]
print(result)

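Here, re.findall finds every substring of the string that matches the regular expression <.*?>. The list comprehension iterates over all strings in the list and over all matches within each string.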
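
The for clauses of a list comprehension are read left to right, in the same order as the equivalent nested loops, so the loop over the strings has to come before the loop over the matches. As a sketch, the comprehension can be spelled out like this, using the same sample list as in the answer:

```python
import re

strings = ["x<first>x<second>x", "x<third>x"]

# Spelled-out form of:
# [substring for string in strings for substring in re.findall(r"<.*?>", string)]
result = []
for string in strings:                              # outer loop: each string in the list
    for substring in re.findall(r"<.*?>", string):  # inner loop: each match in that string
        result.append(substring)

print(result)
# ['<first>', '<second>', '<third>']
```

This pattern has no capturing group, so the matches keep their angle brackets.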

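By the way, why do you want to match angle brackets like this? If you are parsing HTML or XML, you are better off with a dedicated parser: hand-written regular expressions are error-prone, and regular expressions alone cannot handle arbitrarily nested elements.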
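
To illustrate the nesting problem, here is a small sketch on a made-up input with one pair of brackets inside another. Neither pattern recovers the full outer pair; the variable names are assumptions chosen for this example.

```python
import re

nested = "<outer <inner> tail>"  # hypothetical input with nested brackets

# Non-greedy match stops at the first ">", cutting the outer pair short
lazy = re.findall(r"<.*?>", nested)
print(lazy)
# ['<outer <inner>']

# Excluding both bracket characters finds only the innermost pair
innermost = re.findall(r"<[^<>]*>", nested)
print(innermost)
# ['<inner>']
```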

Answer 3

You can do this with a regular expression:

import re

regex = r"<([^>]*)>"

test_list = ["<hi how are you> I think <not anymore> whatever <amazing hi>", "second <first> <third>"]

for test_str in test_list:
    matches = re.finditer(regex, test_str, re.MULTILINE)

    for matchNum, match in enumerate(matches, start=1):

        print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(),
                                                                        end=match.end(), match=match.group()))

Output:

Match 1 was found at 0-16: <hi how are you>
Match 2 was found at 25-38: <not anymore>
Match 3 was found at 48-60: <amazing hi>
Match 1 was found at 7-14: <first>
Match 2 was found at 15-22: <third>

If you want to remove the "<" and ">", you can do a string replacement.
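
Because the pattern in this answer already wraps the inner text in a capturing group, one alternative to a string replacement is to read group 1 of each match. A minimal sketch, reusing the regex and one of the test strings from the answer:

```python
import re

regex = r"<([^>]*)>"
test_str = "second <first> <third>"

for match in re.finditer(regex, test_str):
    # group() is the whole match; group(1) is only the text inside the brackets
    print(match.group(), "->", match.group(1))

# Collect just the inner texts
inner = [m.group(1) for m in re.finditer(regex, test_str)]
print(inner)
# ['first', 'third']
```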

However, if you are dealing with structured text such as HTML or XML, use a proper parser.

Python: searching for strings enclosed in < and >
