BeautifulSoup find_all() does not find all requested elements

Time: 2018-03-17 15:11:36

Tags: python python-2.7 beautifulsoup

I am seeing some strange behavior from BeautifulSoup, as shown in the example below.

import re
from bs4 import BeautifulSoup
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('color', flags=re.UNICODE+re.IGNORECASE)
paras = soup.find_all('p', string=pattern)
print(len(paras))  # expected to find 3 paragraphs containing the word "color"
# 2
print(paras[0].prettify())
# <p class="blue">
#  This paragraph has a color of blue.
# </p>

print(paras[1].prettify())
# <p>
#  This paragraph does not have a color.
# </p>

As you can see, the first paragraph, <p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>, is not selected by find_all(...), and I cannot figure out why.

3 answers:

Answer 0: (score: 2)

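The string argument is matched against a tag's .string attribute, which requires the tag to contain only text and no child tags. If you print .string for the first p tag, it returns None, because that paragraph contains a tag.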

Or, to explain it better, the documentation says:

  

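  If a tag has only one child, and that child is a NavigableString, the child is made available as .string

  If a tag contains more than one thing, then it's not clear what .string should refer to, so .string is defined to be None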
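To make the link between .string and the string argument concrete, here is a minimal sketch. It assumes the same sample HTML as the question; the variable names strings, manual and builtin are only illustrative. Filtering the paragraphs by hand on .string reproduces what find_all('p', string=pattern) returns.

```python
import re
from bs4 import BeautifulSoup

html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('color', flags=re.IGNORECASE)

# .string is None for the first <p> because it has several children
# (text, a <b> tag, more text); the other two have a single text child.
strings = [p.string for p in soup.find_all('p')]
print(strings[0])
# None

# Filtering by hand on .string skips the first paragraph...
manual = [p for p in soup.find_all('p')
          if p.string is not None and pattern.search(p.string)]

# ...which is the same result the string= argument gives.
builtin = soup.find_all('p', string=pattern)
print(len(manual), len(builtin))
# 2 2
```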

One way to overcome this is to use a lambda function.

from bs4 import BeautifulSoup

html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')

first_p = soup.find('p')
print(first_p)
# <p style="color: red;">This has a <b>color</b> of red. Because it likes the color red</p>
print(first_p.string)
# None
print(first_p.text)
# This has a color of red. Because it likes the color red

paras = soup.find_all(lambda tag: tag.name == 'p' and 'color' in tag.text.lower())
print(paras)
# [<p style="color: red;">This has a <b>color</b> of red. Because it likes the color red</p>, <p class="blue">This paragraph has a color of blue.</p>, <p>This paragraph does not have a color.</p>]
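As a possible alternative to the lambda, the search can start from the text nodes and walk up to the enclosing paragraph. This is only a sketch under the same sample HTML. It relies on find_all(string=...) without a tag name returning NavigableString objects, and on find_parent('p') returning the nearest enclosing p tag. The seen set is an illustrative detail that avoids listing a paragraph twice when several of its text nodes match.

```python
import re
from bs4 import BeautifulSoup

html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('color', flags=re.IGNORECASE)

# With no tag name, string= is matched against individual text nodes.
text_nodes = soup.find_all(string=pattern)

paras = []
seen = set()
for node in text_nodes:
    p = node.find_parent('p')  # nearest enclosing <p>, or None
    if p is not None and id(p) not in seen:
        seen.add(id(p))
        paras.append(p)

print(len(paras))
# 3
```

Unlike the lambda, this only matches when the pattern occurs inside a single text node, so a word split across child tags would be missed.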

Answer 1: (score: 0)

If you want to grab the 'p' tags, you can do it like this:

import re
from bs4 import BeautifulSoup
html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')

paras = soup.find_all('p')
for p in paras:
    print(p.get_text())

Answer 2: (score: 0)

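I still have not figured out why specifying the string argument (or text, in older versions of BeautifulSoup) to find_all(...) does not give me what I want. However, the following does give me a general solution.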

pattern = re.compile('color', flags=re.UNICODE+re.IGNORECASE)
desired_tags = [tag for tag in soup.find_all('p') if pattern.search(tag.text) is not None]
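The list comprehension above depends on soup and re from the question. A self-contained version might look like the sketch below, where find_tags_with_text is a hypothetical helper name and not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup


def find_tags_with_text(soup, name, pattern):
    """Return tags called `name` whose full text, descendants included, matches `pattern`."""
    return [tag for tag in soup.find_all(name) if pattern.search(tag.get_text())]


html = """<p style='color: red;'>This has a <b>color</b> of red. Because it likes the color red</p>
<p class='blue'>This paragraph has a color of blue.</p>
<p>This paragraph does not have a color.</p>"""
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile('color', flags=re.IGNORECASE)

matches = find_tags_with_text(soup, 'p', pattern)
print(len(matches))
# 3
```

Because get_text() joins the text of all descendants, the pattern can also match across child-tag boundaries, which may or may not be what you want.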