How to identify the correct td tag with bs4 in Python when the HTML is inconsistent

Asked: 2016-08-12 23:37:06

Tags: python html beautifulsoup bs4

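I am using BeautifulSoup4 in Python to parse some HTML. I have managed to drill down to the right table and identify the td tags, but the style attribute on those tags is applied inconsistently, which makes picking out the correct td tag a real challenge.

The data I am trying to extract is a date field, but at any given time there are several td tags hidden with CSS (which one is visible depends on an option value selected elsewhere in the HTML).

Actual examples: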

<td style="display: none;">01/03/2016</td>
<td style="display: table-cell;">27/10/2015</td> <-- this is the tag I want

<td style="display:none">23/02/2016</td>
<td style="">09/05/2011</td> <-- this is the tag I want
<td style="display: none;">29/03/2011</td>
<td style="display:none">19/10/2010</td>

<td>27/10/2015</td> <-- this is the tag I want
<td style="display: none">01/03/2016</td>
<td style="display: none">22/03/2016</td>

<td style="display:none">11/04/2015</td>
<td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
<td style="display: none">18/10/2013</td>

How can I exclude/remove the incorrect items (those whose style contains display:none or display: none), so that I am left with only the items I actually want?

1 answer:

Answer 0 (score: 1)

Use a list comprehension to filter the tds, keeping a td only if its style attribute is not in the set {"display:none", "display: none;", "display: none"}:

In [7]: from bs4 import BeautifulSoup

In [8]: h1 = """<td style="display: none;">01/03/2016</td>
   ...: <td style="display: table-cell;">27/10/2015</td>"""

In [9]: h2 = """<td style="display:none">23/02/2016</td>
   ...: <td style="">09/05/2011</td> <-- this is the tag I want
   ...: <td style="display: none;">29/03/2011</td>
   ...: <td style="display:none">19/10/2010</td>"""

In [10]: h3 = """<td>27/10/2015</td> <-- this is the tag I want
   ....: <td style="display: none">01/03/2016</td>
   ....: <td style="display: none">22/03/2016</td>"""

In [11]: h4 = """<td style="display:none">11/04/2015</td>
   ....: <td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
   ....: <td style="display: none">18/10/2013</td>"""

In [12]: ignore = {"display:none", "display: none;", "display: none"}

In [13]: for html in [h1, h2, h3, h4]:
   ....:         soup = BeautifulSoup(html, "html.parser")
   ....:         # keep only the tds whose style is not one of the hidden variants
   ....:         print([td for td in soup.find_all("td") if td.get("style") not in ignore])
   ....:     
[<td style="display: table-cell;">27/10/2015</td>]
[<td style="">09/05/2011</td>]
[<td>27/10/2015</td>]
[<td style="display: table-cell;">02/02/2016</td>]
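
The set lookup only catches the exact spellings listed in it. If the markup might contain other variants (extra spaces, upper case, or additional declarations in the same style attribute), one possible alternative is to test the style with a regular expression instead. The following is only a sketch under that assumption: the names HIDDEN_RE, visible_tds and sample are made up for illustration, and the day/month/year date format is inferred from the values in the question.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

# Hypothetical helper pattern: matches a "display: none" declaration
# regardless of spacing or letter case.
HIDDEN_RE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def visible_tds(html):
    """Return the <td> tags whose inline style does not hide them."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        td
        for td in soup.find_all("td")
        # A missing style attribute yields "" here, so the tag counts as visible.
        if not HIDDEN_RE.search(td.get("style", ""))
    ]


# Illustrative input (not from the original page) with inconsistent spellings.
sample = """<td style="display:none">23/02/2016</td>
<td style="">09/05/2011</td>
<td style="DISPLAY :  none ;">29/03/2011</td>
<td style="color: red; display:none">19/10/2010</td>"""

cells = visible_tds(sample)

# Assumes the cell text is a date written as day/month/year.
dates = [datetime.strptime(td.get_text(strip=True), "%d/%m/%Y").date() for td in cells]
print(dates)  # [datetime.date(2011, 5, 9)]
```

Like the set-based version, this only looks at the inline style attribute of each td. It will not notice cells hidden by a stylesheet rule, by a class, or by a hidden parent element.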