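I'm using BeautifulSoup4 in Python to parse some HTML. I've managed to drill down to the correct table and identify the td tags, but the style attribute on those tags is applied inconsistently, which makes picking out the correct td a real challenge.
The data I'm trying to extract is a date field, but at any given time there are several td tags hidden with CSS (which one is visible depends on an option value selected elsewhere in the HTML).
Actual examples: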
<td style="display: none;">01/03/2016</td>
<td style="display: table-cell;">27/10/2015</td> <-- this is the tag I want
and
<td style="display:none">23/02/2016</td>
<td style="">09/05/2011</td> <-- this is the tag I want
<td style="display: none;">29/03/2011</td>
<td style="display:none">19/10/2010</td>
and
<td>27/10/2015</td> <-- this is the tag I want
<td style="display: none">01/03/2016</td>
<td style="display: none">22/03/2016</td>
and
<td style="display:none">11/04/2015</td>
<td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
<td style="display: none">18/10/2013</td>
How can I exclude/remove the incorrect items (those whose style contains display:none or display: none), leaving only the item I actually want?
Answer 0 (score: 1)
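Use a list comprehension to filter the tds, keeping a td only if its style attribute is not in the set {"display:none", "display: none;", "display: none"}. A td with no style attribute at all is kept as well, because td.get("style") returns None for it, and None is not in the set: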
In [7]: from bs4 import BeautifulSoup
In [8]: h1 = """<td style="display: none;">01/03/2016</td>
   ...: <td style="display: table-cell;">27/10/2015</td>"""
In [9]: h2 = """<td style="display:none">23/02/2016</td>
   ...: <td style="">09/05/2011</td> <-- this is the tag I want
   ...: <td style="display: none;">29/03/2011</td>
   ...: <td style="display:none">19/10/2010</td>"""
In [10]: h3 = """<td>27/10/2015</td> <-- this is the tag I want
   ....: <td style="display: none">01/03/2016</td>
   ....: <td style="display: none">22/03/2016</td>"""
In [11]: h4 = """<td style="display:none">11/04/2015</td>
   ....: <td style="display: table-cell;">02/02/2016</td> <-- this is the tag I want
   ....: <td style="display: none">18/10/2013</td>"""
In [12]: # every spelling of the hidden style that appears in the samples
   ....: ignore = {"display:none", "display: none;", "display: none"}
In [13]: for html in [h1, h2, h3, h4]:
   ....:     soup = BeautifulSoup(html, "html.parser")
   ....:     # keep a td unless its style is one of the hidden variants
   ....:     print([td for td in soup.find_all("td") if td.get("style") not in ignore])
   ....:
[<td style="display: table-cell;">27/10/2015</td>]
[<td style="">09/05/2011</td>]
[<td>27/10/2015</td>]
[<td style="display: table-cell;">02/02/2016</td>]
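The set above only matches the exact spellings listed, so a variant such as display:none; or a style carrying an extra declaration would slip through. As a possible alternative, here is a minimal sketch that splits the style attribute into declarations and compares the display value after stripping whitespace and lowercasing. The helper names is_hidden and visible_cells, and the last two sample styles, are made up for illustration and are not part of the original answer; the sketch also assumes the value is a plain none (something like none !important would not be caught).

```python
from bs4 import BeautifulSoup


def is_hidden(td):
    """Return True if the tag's inline style sets display to none."""
    # A missing style attribute yields None, so fall back to an empty string.
    style = td.get("style") or ""
    for declaration in style.split(";"):
        prop, _, value = declaration.partition(":")
        if prop.strip().lower() == "display" and value.strip().lower() == "none":
            return True
    return False


def visible_cells(html):
    """Return the td tags that are not hidden by an inline display: none."""
    soup = BeautifulSoup(html, "html.parser")
    return [td for td in soup.find_all("td") if not is_hidden(td)]


# Hypothetical sample: the last two style values are invented variants.
html = """<td style="display:none">23/02/2016</td>
<td style="">09/05/2011</td>
<td style="DISPLAY : none ;">29/03/2011</td>
<td style="color: red; display: none">19/10/2010</td>"""

print([td.get_text(strip=True) for td in visible_cells(html)])
# ['09/05/2011']
```

This only looks at the inline style attribute; cells hidden through a class or an external stylesheet would need a different check.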