Question

I'm trying to use BeautifulSoup 4 to extract text from specific tags in an HTML document. My HTML has a bunch of div tags like the following:

<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:90px; width:195px; height:24px;">
  <span style="font-family: FIPXQM+Arial-BoldMT; font-size:12px">
    Futures Daily Market Report for Financial Gas
    <br/>
    21-Jul-2015
    <br/>
   </span>
</div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:54px; top:135px; width:46px; height:10px;">
  <span style="font-family: FIPXQM+Arial-BoldMT; font-size:10px">
    COMMODITY
    <br/>
   </span>
</div>

I'm trying to get the text from all span tags inside any div tag whose style contains "left:54px".

I can get a single div if I use:

soup = BeautifulSoup(open(extracted_html_file))
print soup.find_all('div',attrs={"style":"position:absolute; border: textbox 1px solid; "
                                         "writing-mode:lr-tb; left:42px; top:90px; "
                                         "width:195px; height:24px;"})

It returns:

[<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:90px; width:195px; height:24px;"><span style="font-family: FIPXQM+Arial-BoldMT; font-size:12px">Futures Daily Market Report for Financial Gas
<br/>21-Jul-2015
<br/></span></div>]

But that only gets me the one div whose style matches exactly. I want all divs whose style contains "left:54px".

To do that, I tried a few different approaches:

soup = BeautifulSoup(open(extracted_html_file))
print soup.find_all('div',style='left:54px')
print soup.find_all('div',attrs={"style":"left:54px"})
print soup.find_all('div',attrs={"left":"54px"})

But all of these print statements return empty lists.

Any ideas?

Answer 1

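According to the documentation, you can pass a regular expression instead of a string: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments

A plain string has to equal the entire style attribute value, which is why 'left:54px' on its own finds nothing. A compiled regular expression is searched for anywhere inside the value.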

So I would try this:

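import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(open(extracted_html_file))
# The compiled pattern is searched for anywhere inside each div's style attribute
soup.find_all('div', style=re.compile('left:54px'))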
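To go from the matching divs to the span text the question asks for, something like the following might work. It is only a sketch: the inline `html` string is a copy of the two divs from the question, the `left_54` pattern and `span_texts` list are names made up for illustration, and the explicit "html.parser" argument is an assumption added so that the parser choice does not depend on what is installed. The pattern is slightly stricter than 'left:54px': it tolerates a space after the colon and avoids matching properties such as padding-left.

```python
import re
from bs4 import BeautifulSoup

# Copy of the two divs from the question, inlined so the example is self-contained.
html = """
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:42px; top:90px; width:195px; height:24px;">
  <span style="font-family: FIPXQM+Arial-BoldMT; font-size:12px">
    Futures Daily Market Report for Financial Gas
    <br/>
    21-Jul-2015
    <br/>
  </span>
</div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:54px; top:135px; width:46px; height:10px;">
  <span style="font-family: FIPXQM+Arial-BoldMT; font-size:10px">
    COMMODITY
    <br/>
  </span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Hypothetical pattern: "left:54px" with optional whitespace after the colon.
# The lookbehind rejects names like "padding-left" or "margin-left".
left_54 = re.compile(r"(?<![\w-])left:\s*54px")

span_texts = []
for div in soup.find_all("div", style=left_54):
    for span in div.find_all("span"):
        # strip=True trims surrounding whitespace and drops empty strings
        span_texts.append(span.get_text(strip=True))

print(span_texts)
```

With the sample HTML above, only the second div matches, so `span_texts` should hold the single string COMMODITY. If CSS selectors are preferred, `soup.select('div[style*="left:54px"] span')` appears to be a substring-matching alternative, though it does not guard against names like padding-left.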

Locating tags by style - using Python 2 and BeautifulSoup 4
