Question

With jQuery selectors, you can select divs whose innerText contains "John" using $("div:contains('John')"), which would match the second <div> here:

<div>Bill</div>
<div>John</div>
<div>Joe</div>

How can I do this in Python's Beautiful Soup or another Python module?

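I just watched a lecture on scraping from PyCon 2010, and the speaker mentioned that you can use CSS selectors with lxml. Do I have to use that, or is there a way to do it with Soup alone?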

Background: I need to parse scraped web pages.

Answer 1

A more concise way using BeautifulSoup:

>>> soup('div', text='John')
[u'John']
>>> import re
>>> soup('div', text=re.compile('Jo'))
[u'John', u'Joe']

soup() is equivalent to soup.findAll(). You can use a string, a regular expression, or an arbitrary function to select what you need.
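The "arbitrary function" case is not shown above, so here is a minimal sketch of it. It assumes the newer bs4 package (Beautiful Soup 4) rather than the BeautifulSoup 3 module; there the keyword is string (text is the older name), and a search that also names a tag returns the matching tags rather than the bare strings. The helper name mentions_jo is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<div>Bill</div><div>John</div><div>Joe</div>"
soup = BeautifulSoup(html, "html.parser")

def mentions_jo(s):
    # Called with each candidate tag's .string, which is None when the
    # tag has no single string child, so guard against that first.
    return s is not None and "Jo" in s

matches = soup.find_all("div", string=mentions_jo)
names = [tag.get_text() for tag in matches]
print(names)
```

A function gives you room for checks that are awkward as a regular expression, such as case-insensitive comparison against a list of names.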

The stdlib's ElementTree is enough, too:

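from xml.etree import ElementTree as etree

xml = """
    <div>Bill</div>
    <div>John</div>
    <div>Joe</div>
"""
# Wrap the fragment in a single root element so it parses as XML
root = etree.fromstring("<root>%s</root>" % xml)
for div in root.iter('div'):
    # div.text is None for an empty element, so check it first
    if div.text and "John" in div.text:
        print(etree.tostring(div, encoding="unicode"))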
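One difference from jQuery worth keeping in mind: :contains also matches text inside nested elements, while div.text only covers the text before the first child element. A rough sketch that gets closer to that behaviour using only the standard library follows; the function name find_containing and the sample markup are assumptions for illustration.

```python
from xml.etree import ElementTree as etree

def find_containing(root, tag, needle):
    """Return elements named `tag` whose text, nested text included, contains `needle`."""
    return [el for el in root.iter(tag) if needle in "".join(el.itertext())]

# The second div holds "John" inside a child <b>, so div.text alone is "Hi ".
root = etree.fromstring(
    "<root><div>Bill</div><div>Hi <b>John</b></div><div>Joe</div></root>"
)
found = find_containing(root, "div", "John")
texts = ["".join(el.itertext()) for el in found]
print(texts)
```

This only works on well-formed markup; for real-world scraped HTML a tolerant parser such as Beautiful Soup or lxml is usually the safer choice.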

Answer 2

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("""
... <div>Bill</div>
... <div>John</div>
... <div>Joe</div>
... """)
# equality
>>> [tag for tag in soup.findAll('div') if tag.text == 'John']
[<div>John</div>]
# containment
>>> [tag for tag in soup.findAll('div') if 'John' in tag.text]
[<div>John</div>]

Answer 3

Beautiful Soup now supports the :contains selector!

To search for a div that contains the text John, try:

from bs4 import BeautifulSoup

html = """
<div>Bill</div>
<div>John</div>
<div>Joe</div>
"""
soup = BeautifulSoup(html, "html.parser")

>>> print(soup.select_one("div:contains('John')"))
<div>John</div>

Note: to use selectors, call .select_one() instead of .find(), or .select() instead of .find_all().
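For reference, here is a self-contained sketch of both selector calls. It assumes bs4 with soupsieve 2.1 or later, where :contains is deprecated in favor of the :-soup-contains spelling; on older versions, fall back to :contains.

```python
from bs4 import BeautifulSoup

html = "<div>Bill</div><div>John</div><div>Joe</div>"
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match (or None), like find().
first = soup.select_one("div:-soup-contains('John')")

# select() returns a list of every match, like find_all().
all_jo = soup.select("div:-soup-contains('Jo')")

print(first)
print([tag.get_text() for tag in all_jo])
```

As with jQuery, the match is a substring test, so 'Jo' picks up both John and Joe.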

Equivalent of the contains() selector in BeautifulSoup / Python
