Say I have the following HTML:
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>
<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
I'd like to be able to find all the tags that contain every keyword I'm looking for. For example (examples 2 and 3 below don't work):
>>> len(soup.find_all(text="world"))
2
>>> len(soup.find_all(text="world puzzle"))
1
>>> len(soup.find_all(text="world puzzle book"))
0
I've been trying to come up with a regular expression that would let me search for all the keywords, but it seems that ANDing is not possible (only ORing).
Thanks in advance!
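For reference, the snippets above and below assume a soup object built roughly like this (the bs4 import and default parser are assumptions on my part, not stated in the question):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)  # html holds the markup shown above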
Answer 0 (score: 5)
The simplest way to do complex matching like this is to write a function that performs the match, and pass that function as the value of the text argument.
def must_contain_all(*strings):
    def must_contain(markup):
        return markup is not None and all(s in markup for s in strings)
    return must_contain
Now you can get the matching strings:
print soup.find_all(text=must_contain_all("world", "puzzle"))
# [u"\nWho in the world am I? Ah, that's the great puzzle.\n"]
To get the tags that contain those strings, use the .parent attribute:
print [text.parent for text in soup.find_all(text=must_contain_all("world", "puzzle"))]
# [<p>Who in the world am I? Ah, that's the great puzzle.</p>]
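As a side note (not part of the original answer): in newer BeautifulSoup 4 releases the same matcher function can also be passed as the string argument, the newer name for text; a minimal sketch, assuming bs4 4.4 or later:
print [t.parent for t in soup.find_all(string=must_contain_all("world", "puzzle"))]
# same tags as with text= above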
Answer 1 (score: 1)
You might want to consider using lxml instead of BeautifulSoup. lxml lets you find elements with XPath.
With this boilerplate setup:
import lxml.html as LH
import re
html = """
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>
<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
"""
doc = LH.fromstring(html)
This finds the text in all <p> tags that contain the string world:
print(doc.xpath('//p[contains(text(),"world")]/text()'))
['\nIf everybody minded their own business, the world would go around a great deal faster than it does.\n', "\nWho in the world am I? Ah, that's the great puzzle.\n"]
This finds the text in all <p> tags that contain both world and puzzle:
print(doc.xpath('//p[contains(text(),"world") and contains(text(),"puzzle")]/text()'))
["\nWho in the world am I? Ah, that's the great puzzle.\n"]
Answer 2 (score: 0)
This may not be the most efficient way, but you could try a set intersection:
len(set(soup.find_all(text="world"))
    & set(soup.find_all(text="book"))
    & set(soup.find_all(text="puzzle")))
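If each keyword lookup needs substring matching rather than an exact string match, one variation (an assumption about the intent, not part of the original answer) is to pass compiled regular expressions to text= and intersect the results the same way:
import re

matches = (set(soup.find_all(text=re.compile("world")))
           & set(soup.find_all(text=re.compile("puzzle"))))
print(len(matches))
Note that the intersection compares the strings by value, so identical text appearing under different tags would collapse into a single entry.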
Answer 3 (score: 0)
A little skeleton (I'm using lxml rather than BeautifulSoup, but you can adapt it with soup.findAll):
html = """
<p>
If everybody minded their own business, the world would go around a great deal faster than it does.
</p>
<p>
Who in the world am I? Ah, that's the great puzzle.
</p>
"""
import lxml.html
import re
fragment = lxml.html.fromstring(html)
d = dict(
    (node, set(re.findall(r'\S+', node.text_content())))
    for node in fragment.xpath('//p'))
for node, it in d.iteritems():
    # then use set logic to go from here...
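One possible way to finish the skeleton, sketched under the assumption that the keywords of interest are world and puzzle (the punctuation strip is needed because re.findall(r'\S+', ...) keeps trailing punctuation attached to tokens):
keywords = set(["world", "puzzle"])  # hypothetical keywords to require
for node, words in d.iteritems():
    # subset test: keep the node only if every keyword appears among its tokens
    if keywords <= set(w.strip('.,!?;:') for w in words):
        print(node.text_content())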