Question

<h2 class="hello-word"><a href="http://www.google.com">Google</a></h2>

How do I get the value of the a tag (Google)?

print(soup.select("h2 > a"))

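This returns the whole tag, but I only want the value. Also, there may be more than one H2 on the page. How do I filter for the class hello-word?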

Answer 1

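You can use h2.hello-word in the CSS selector to select only the h2 tags that have the class hello-word, and then select their child a. Also, soup.select() returns a list of all matching elements, so you can simply iterate over it and read .text on each element to get its text. Example -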

for i in soup.select("h2.hello-word > a"):
    print(i.text)

Example/demo (I added a few elements of my own, one with a slightly different class, to show how the selector works) -

>>> from bs4 import BeautifulSoup
>>> s = """<h2 class="hello-word"><a href="http://www.google.com">Google</a></h2>
... <h2 class="hello-word"><a href="http://www.google.com">Google12</a></h2>
... <h2 class="hello-word2"><a href="http://www.google.com">Google13</a></h2>"""

>>> soup = BeautifulSoup(s,'html.parser')

>>> for i in soup.select("h2.hello-word > a"):
...     print(i.text)
...
Google
Google12
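
If the texts are wanted as a list rather than printed, or only the first match is needed, a minimal sketch along the same lines might look like this. It assumes the same markup as the demo above; select_one() and get_text() are standard BeautifulSoup methods, while the names html, texts, and first are only illustrative.

```python
from bs4 import BeautifulSoup

# Same three headings as in the demo above (assumed sample markup).
html = (
    '<h2 class="hello-word"><a href="http://www.google.com">Google</a></h2>'
    '<h2 class="hello-word"><a href="http://www.google.com">Google12</a></h2>'
    '<h2 class="hello-word2"><a href="http://www.google.com">Google13</a></h2>'
)
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <a> that is a direct child of <h2 class="hello-word">.
texts = [a.get_text() for a in soup.select("h2.hello-word > a")]
print(texts)  # ['Google', 'Google12']

# select_one() returns the first match, or None if nothing matches.
first = soup.select_one("h2.hello-word > a")
if first is not None:
    print(first.get_text())  # Google
```

The None check matters on real pages, where the heading may be missing.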

Answer 2

Try this:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<h2 class="hello-word"><a href="http://www.google.com">Google</a></h2>', 'html.parser')
>>> soup.text
'Google'
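
Note that soup.text returns the text of the whole document, so it yields just 'Google' here only because the snippet contains a single element. A small sketch, assuming a page with a second heading (the "other" class and its link text are made up for illustration), shows how that differs from the class-filtered selector:

```python
from bs4 import BeautifulSoup

# Two headings; only the first has the class hello-word (hypothetical sample).
html = (
    '<h2 class="hello-word"><a href="http://www.google.com">Google</a></h2>'
    '<h2 class="other"><a href="http://www.google.com">Other</a></h2>'
)
soup = BeautifulSoup(html, "html.parser")

# soup.text concatenates every text node in the document.
whole_text = soup.text
print(whole_text)  # GoogleOther

# The selector keeps only links inside <h2 class="hello-word">.
filtered = [a.text for a in soup.select("h2.hello-word > a")]
print(filtered)  # ['Google']
```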

You can also use the lxml.html library:

    >>> import lxml.html
    >>> from lxml.cssselect import CSSSelector
    >>> txt = '<h2 class="hello-word"><a href="http://www.google.com">Google</a></h2>'
    >>> tree = lxml.html.fromstring(txt)
    >>> sel = CSSSelector('h2 > a')
    >>> element = sel(tree)[0]
    >>> element.text
    'Google'
