Question

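I want to use BeautifulSoup to identify a split point in a large text document. To do that, I wrote a regular expression to find the Tag in which a specific string occurs. The problem is that this does not work if the string I am searching for has additional formatting/child nodes inside it.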

import re
from bs4 import BeautifulSoup

t1 = BeautifulSoup("<p class=\"p p8\"><strong>Question-And-Answer</strong></p>", "html.parser")

t2 = BeautifulSoup("<p class=\"p p8\"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>", "html.parser")

t1.find(text = re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
>>> 'Question-And-Answer'

t2.find(text = re.compile("Question[s]*-And-Answer[s]*", re.IGNORECASE))
>>> None

The output should be the p tag object.

Answer 1

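The problem here is that the text you are looking for is split up by the strong nodes inside the p tag, so a regex search using the text argument of .find does not work. The text argument is matched against individual strings in the tree, and in t2 no single string contains the whole phrase. That is simply how it is implemented in BS.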
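To make the cause visible, here is a small sketch that inspects the strings inside the second paragraph. It assumes the `html.parser` backend; the variable names (`html`, `soup`, `p`, `pieces`) are only illustrative and do not come from the question.

```python
from bs4 import BeautifulSoup

html = ('<p class="p p8"><strong>Question</strong>-'
        '<strong>And</strong>-<strong>Answer</strong></p>')
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption
p = soup.find("p")

# The paragraph is made of five separate text nodes,
# so no single node contains the whole phrase.
pieces = list(p.strings)
print(pieces)

# .string is None because the tag has more than one child.
print(p.string)

# get_text() joins all descendant strings into one piece of text.
print(p.get_text())
```

A regex passed through the text argument only ever sees one of those pieces at a time, which is why the search on t2 comes back empty.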

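If you know the text is inside a p node, you can pass a lambda expression to the .find call and run a regex search against the text property of each p tag to find the element you need:

print(t2.find(lambda t: t.name == "p" and re.search(r'Questions*-And-Answers*', t.text)))
# => <p class="p p8"><strong>Question</strong>-<strong>And</strong>-<strong>Answer</strong></p>

Note that [s]* in a regex is the same as s*.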
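If the tag name is not known in advance, one possible variation is to look for the innermost tag whose combined text matches. This is only a sketch under assumptions: `find_innermost_match` and `PATTERN` are hypothetical names, not part of BeautifulSoup, and the `html.parser` backend is assumed.

```python
import re
from bs4 import BeautifulSoup

PATTERN = re.compile(r"Questions*-And-Answers*", re.IGNORECASE)

def find_innermost_match(soup, pattern):
    """Return the deepest tag whose combined text matches pattern, or None."""
    def matches(tag):
        return pattern.search(tag.get_text()) is not None

    def is_innermost(tag):
        # The tag matches, but none of its direct child tags does.
        children = tag.find_all(True, recursive=False)
        return matches(tag) and not any(matches(c) for c in children)

    return soup.find(is_innermost)

t1 = BeautifulSoup(
    '<p class="p p8"><strong>Question-And-Answer</strong></p>', "html.parser")
t2 = BeautifulSoup(
    '<p class="p p8"><strong>Question</strong>-<strong>And</strong>-'
    '<strong>Answer</strong></p>', "html.parser")

r1 = find_innermost_match(t1, PATTERN)
r2 = find_innermost_match(t2, PATTERN)
print(r1.name)  # strong: the phrase sits entirely inside one strong tag
print(r2.name)  # p: the phrase only exists once the children are joined
```

Note that for t1 this returns the strong tag rather than the p tag, so take `.parent` if the enclosing paragraph is what you want. On a very large document, calling get_text() on every tag can be slow, so restricting the search to p tags as in the lambda above is cheaper when that is possible.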

Ignore child nodes when searching with a regex
