Question

I have a string in which tags are defined around particular words or substrings. For example:

text = 'Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss> <hhh>10</hhh>'

How can I get the strings <xxx>ibis and the</xxx>, <ccc>NW</ccc>, <sss>Jan</sss>, and <hhh>10</hhh>? The tags can be anything, but the tags wrapping a word or a few words will be similar.

Answer 1

In general, you don't want to parse (X)HTML with regular expressions (more info in this answer). A better option is to use a parser. This example uses BeautifulSoup:

data = '''Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> and the <sss>Jan</sss>
<hhh>10</hhh>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

for tag in soup.select('xxx, ccc, sss, hhh'):
    print(tag.get_text(strip=True))

Prints:

ibis and the
NW
Jan
10

Edit: To get the whole tag string:

for tag in soup.select('xxx, ccc, sss, hhh'):
    print(tag)

Prints:

<xxx>ibis and the</xxx>
<ccc>NW</ccc>
<sss>Jan</sss>
<hhh>10</hhh>
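
If installing BeautifulSoup is not an option, roughly the same result can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name TagTextCollector and its attributes are made up for illustration, and it assumes the tags are flat (not nested) and carry no attributes, as in the question's sample.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect (tag_name, inner_text) pairs for flat, non-nested tags."""

    def __init__(self):
        super().__init__()
        self.current_tag = None  # name of the tag we are inside, if any
        self.buffer = []         # text chunks seen inside the current tag
        self.found = []          # collected (tag_name, inner_text) pairs

    def handle_starttag(self, tag, attrs):
        # A new tag opens: start collecting its text.
        self.current_tag = tag
        self.buffer = []

    def handle_data(self, data):
        # Keep only text that sits inside a tag; ignore text between tags.
        if self.current_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        # The matching close tag ends the current span.
        if tag == self.current_tag:
            self.found.append((tag, ''.join(self.buffer).strip()))
            self.current_tag = None


text = ('Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> '
        'and the <sss>Jan</sss> <hhh>10</hhh>')

collector = TagTextCollector()
collector.feed(text)
collector.close()

# Rebuild the whole tag string from each collected pair.
for name, inner in collector.found:
    print(f'<{name}>{inner}</{name}>')
```

Because the collector never looks at specific tag names, it also covers the case where the names are not known in advance. Nested or attribute-bearing tags would need more bookkeeping than this sketch does.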

Edit II: If you have a list of tags to look for:

list_of_tags = ['xxx', 'ccc', 'sss', 'hhh']
for tag in soup.find_all(list_of_tags):
    print(tag)
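
The question says the tags can be anything, so the names may not be known ahead of time. In that case, one possible approach is to pass True to find_all, which matches every tag regardless of name. The sketch below assumes the sample string from the question and flat, non-nested tags; the variable name tagged is just for illustration.

```python
from bs4 import BeautifulSoup

data = ('Bring me to <xxx>ibis and the</xxx> in <ccc>NW</ccc> '
        'and the <sss>Jan</sss> <hhh>10</hhh>')

soup = BeautifulSoup(data, 'html.parser')

# find_all(True) matches every tag in the document, whatever its name.
tagged = [str(tag) for tag in soup.find_all(True)]

for item in tagged:
    print(item)
```

If the tags were nested, each inner tag would be returned on its own as well as inside its parent, so the results might need filtering.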
