Handling a colon in a BeautifulSoup CSS selector

Date: 2016-01-01 04:51:06

Tags: python html css-selectors beautifulsoup html-parsing

Input HTML:

<div style="display: flex">
    <div class="half" style="font-size: 0.8em;width: 33%;"> apple </div>
    <div class="half" style="font-size: 0.8em;text-align: center;width: 28%;"> peach </div>
    <div class="half" style="font-size: 0.8em;text-align: right;width: 33%;" title="nofruit"> cucumber </div>
</div>

Desired output: all the div elements located directly under <div style="display: flex">.

I tried to locate the parent div with a CSS selector:

div[style="display: flex"]

This raises an error:

>>> soup.select('div[style="display: flex"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/bs4/element.py", line 1400, in select
    'Only the following pseudo-classes are implemented: nth-of-type.')
NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.

It looks like BeautifulSoup tries to interpret the colon as pseudo-class syntax.

I have tried following the advice in "Handling a colon in an element ID in a CSS selector", but it still fails:

>>> soup.select('div[style="display\: flex"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/bs4/element.py", line 1400, in select
    'Only the following pseudo-classes are implemented: nth-of-type.')
NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
>>> soup.select('div[style="display\3A flex"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/bs4/element.py", line 1426, in select
    'Unsupported or invalid CSS selector: "%s"' % token)
ValueError: Unsupported or invalid CSS selector: "div[style="displayA"
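A side note on the last attempt: the `displayA` in that message is probably produced by Python rather than by BeautifulSoup. In a regular (non-raw) string literal, `\3` is an octal escape, so the selector parser never sees a backslash. The sketch below only demonstrates that string behaviour; the names `plain` and `raw` are illustrative, and whether the old selector parser would have accepted the escape from a raw string is not shown here.

```python
# Non-raw literal: Python consumes "\3" as an octal escape (the control
# character U+0003), so no backslash is left for the selector parser.
plain = 'div[style="display\3A flex"]'

# Raw literal: the backslash survives and the CSS escape \3A stays intact.
raw = r'div[style="display\3A flex"]'

print(repr(plain))  # 'div[style="display\x03A flex"]'
print(repr(raw))    # 'div[style="display\\3A flex"]'
```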

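Question:

What is the correct way to use or escape a colon inside an attribute value in a BeautifulSoup CSS selector?

Note that I can work around it with a partial attribute match: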

soup.select("div[style$=flex]")

Or by using find_all():

soup.find_all("div", style="display: flex")

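Also note that I understand that locating elements by their style attribute is far from a good technique, but the question itself is generic and the HTML provided is just an example.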
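For completeness, the two workarounds mentioned above can be combined into a small self-contained script that also collects the child divs. This is only a sketch: the `html.parser` backend and the variable names (`parent`, `children`, `names`, `by_suffix`) are my own choices, not part of the original snippets.

```python
from bs4 import BeautifulSoup

html = """
<div style="display: flex">
    <div class="half" style="font-size: 0.8em;width: 33%;"> apple </div>
    <div class="half" style="font-size: 0.8em;text-align: center;width: 28%;"> peach </div>
    <div class="half" style="font-size: 0.8em;text-align: right;width: 33%;" title="nofruit"> cucumber </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Workaround 1: exact attribute match through find_all(), which does not
# go through the CSS selector parser at all.
parent = soup.find_all("div", style="display: flex")[0]

# Only the divs directly under the parent, not deeper descendants.
children = parent.find_all("div", recursive=False)
names = [child.get_text(strip=True) for child in children]
print(names)

# Workaround 2: suffix match, which avoids putting the colon in the selector.
by_suffix = soup.select("div[style$=flex]")
print(len(by_suffix))
```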

2 answers:

Answer 0 (score: 2)

Update: this issue has been fixed in BeautifulSoup 4.5.0. Upgrade if needed:

pip install --upgrade beautifulsoup4
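After upgrading, a quick check along these lines should confirm that the original selector works. This is a sketch that assumes a BeautifulSoup release with the fix (4.5.0 or later) is installed; the `html.parser` backend, the `> div` child combinator and the variable names are my additions for illustration.

```python
import bs4
from bs4 import BeautifulSoup

html = """
<div style="display: flex">
    <div class="half" style="font-size: 0.8em;width: 33%;"> apple </div>
    <div class="half" style="font-size: 0.8em;text-align: center;width: 28%;"> peach </div>
    <div class="half" style="font-size: 0.8em;text-align: right;width: 33%;" title="nofruit"> cucumber </div>
</div>
"""

# Show which version is actually installed before relying on the fix.
print(bs4.__version__)

soup = BeautifulSoup(html, "html.parser")

# The quoted attribute value keeps its colon and space; "> div" then
# selects the divs directly under the matched parent.
children = soup.select('div[style="display: flex"] > div')
names = [child.get_text(strip=True) for child in children]
print(names)
```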

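Old answer:

An issue has been created on the BeautifulSoup issue tracker (Launchpad).

I will update this answer if there are any updates on the Launchpad issue.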

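Answer 1 (score: 1)

I'm not sure this fully counts as an answer, since the selector is definitely broken. Oddly, though, the error is not triggered by the : itself but by the : followed by a space. The error message suggests that BeautifulSoup is trying to use whatever comes after the space as a CSS selector of its own.

For example, editing the HTML to remove the space makes the block selectable again: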

>>> from bs4 import BeautifulSoup
>>> html = """
... <div style="display:flex">
...     <div class="half" style="font-size: 0.8em;width: 33%;"> apple </div>
...     <div class="half" style="font-size: 0.8em;text-align: center;width: 28%;"> peach </div>
...     <div class="half" style="font-size: 0.8em;text-align: right;width: 33%;" title="nofruit"> cucumber </div>
... </div>
... """

>>> soup = BeautifulSoup(html)
>>> soup.select('div[style="display: flex"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.4/dist-packages/bs4/element.py", line 1313, in select
    'Unsupported or invalid CSS selector: "%s"' % token)
ValueError: Unsupported or invalid CSS selector: "flex"]"

>>> soup.select('div[style="display:flex"]')
[<div style="display:flex">
<div class="half" style="font-size: 0.8em;width: 33%;"> apple </div>
<div class="half" style="font-size: 0.8em;text-align: center;width: 28%;"> peach </div>
<div class="half" style="font-size: 0.8em;text-align: right;width: 33%;" title="nofruit"> cucumber </div>
</div>]

Unfortunately, writing a space after the colon is the usual style, so this probably won't get you very far!
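To illustrate why the space matters, here is a deliberately simplified sketch. It assumes the selector is cut into steps at whitespace, which is my guess based on the error message above and not BeautifulSoup's actual code.

```python
# Hypothetical simplification: treat any whitespace as a separator between
# selector steps, ignoring the quotes around the attribute value.
with_space = 'div[style="display: flex"]'
tokens = with_space.split()
print(tokens)  # ['div[style="display:', 'flex"]']

# Without the space there is nothing to split on, so the selector stays whole.
no_space = 'div[style="display:flex"]'
print(no_space.split())  # ['div[style="display:flex"]']
```

The second token in the first case, flex"], is exactly the fragment quoted in the ValueError above, which is consistent with the space being the trigger.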