Question

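I have to scrape data from a thousand websites, saved as local HTML files. The complication is that these sites are built like it's the 90s: nearly identical nested table structures, no ids, no CSS classes, just nested tables. How can I select a specific table based on the text in one of its tr tags?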

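XPath is not a solution, because the sites mostly share the same structure but do not always have the tables in the same order. So I'm looking for a way to extract the data from all of these tables by searching for some text inside a table and then getting the parent tag from that match.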

Any ideas?

The code on each page is huge, so here is an example of the structure. The data is not always in the same table position.

Update: thanks to alecxe, I wrote this code:

# coding: utf-8
from bs4 import BeautifulSoup

html_content = """
<body>
 <table id="gotthistable">
     <tr>
         <table id="needthistable">
             <tr>
                 <td>text i'm searching</td>
             </tr>
             <tr>
                 <td>Some other text</td>
             </tr>
         </table>
     </tr>
     <tr>
         <td>
             <table>
                 <tr>
                     <td>Other text</td>
                 </tr>
                 <tr>
                     <td>Some other text</td>
                 </tr>
             </table>
         </td>
     </tr>
 </table>

 <table>
     <tr>
         <td>Different table</td>
     </tr>
 </table>
</body>
 """
soup = BeautifulSoup(html_content, "lxml")
table = soup.find(lambda tag: tag.name == "table" and "searching" in tag.text)
print(table)

Printing the table variable or the soup variable gives the same output:

<table>
    <tr>
        <table id="needthistable">
            <tr>
                <td>text i'm searching</td>
            </tr>
    </tr>
.
.
.

But with this code:

soup = BeautifulSoup(html_content, "lxml")
table = soup.find(lambda tag: tag.name == "td" and "searching" in tag.text).parent.parent
print(table)

I get the output I want:

<table id="needthistable">
    <tr>
        <td>text i'm searching</td>
    </tr>
    <tr>
        <td>Some other text</td>
    </tr>
</table>

But what if the table is not always exactly two parents up? I mean, given a td tag, how can I get the table it belongs to?

Answer 1

Use BeautifulSoup's regex filter:

If you pass in a regular expression object, Beautiful Soup will filter against that regular expression using its search() method.

Example:

soup.find_all(name='tr', text=re.compile('this is part or full text of tr'))
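
To make the regex filter concrete, here is a minimal sketch rather than a definitive recipe. It assumes bs4 4.4 or later, where `string=` is the current name for the `text=` argument, and it uses the built-in `html.parser` and a made-up HTML snippet so it runs without lxml. One caveat: when a tag name and a string filter are combined, Beautiful Soup tests the tag's `.string`, which is None for a tag with several children, so a `tr` laid out across lines (whitespace plus a `td`) does not match. The sketch therefore matches the `td` and walks up with `find_parent()`.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup only; not taken from the real pages.
html = """
<table>
    <tr>
        <td>this is part or full text of tr</td>
    </tr>
    <tr>
        <td>Some other text</td>
    </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"part or full text")

# The tr has several children (whitespace and a td), so its .string is None
# and the combined name + string filter finds nothing.
rows_direct = soup.find_all("tr", string=pattern)

# Each td holds a single text node, so the same filter matches here.
cells = soup.find_all("td", string=pattern)

# Walk up from every matching cell to its enclosing row.
rows = [cell.find_parent("tr") for cell in cells]
```

If the row itself is what you need, `rows` holds it; swapping `"tr"` for `"table"` in `find_parent()` returns the nearest enclosing table instead.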

Answer 2

You should use find() with a searching function and check that the table's .text contains the desired text:

soup.find(lambda tag: tag.name == "table" and "part of text" in tag.text)

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> data = """
... <body>
...     <table>
...         <tr>
...             <td>This text has a part of text</td>
...         </tr>
...         <tr>
...             <td>Some other text</td>
...         </tr>
...     </table>
... 
...     <table>
...         <tr>
...             <td>Different table</td>
...         </tr>
...     </table>
... </body>
... 
... """
>>> 
>>> soup = BeautifulSoup(data, 'lxml')
>>> 
>>> table = soup.find(lambda tag: tag.name == "table" and "part of text" in tag.text)
>>> print(table)
<table>
    <tr>
        <td>This text has a part of text</td>
    </tr>
    <tr>
        <td>Some other text</td>
    </tr>
</table>
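
With nested tables, the `.text` of an outer table also includes the text of every table inside it, so the lambda above can return the outermost table. A possible workaround, sketched here under assumptions, is to locate the text node itself and then climb to the nearest enclosing table with `find_parent()`, which works however many levels sit in between. The helper name `nearest_table`, the sample markup, and the use of the built-in `html.parser` are illustrative choices, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Illustrative markup: an inner table correctly nested inside a td.
html = """
<table id="outer">
    <tr>
        <td>
            <table id="inner">
                <tr>
                    <td>text i'm searching</td>
                </tr>
            </table>
        </td>
    </tr>
</table>
"""

def nearest_table(soup, needle):
    """Hypothetical helper: return the closest table around the first
    text node containing `needle`, or None if nothing matches."""
    text_node = soup.find(string=re.compile(re.escape(needle)))
    if text_node is None:
        return None
    # find_parent() climbs ancestors until it reaches the first <table>.
    return text_node.find_parent("table")

soup = BeautifulSoup(html, "html.parser")
table = nearest_table(soup, "searching")
print(table.get("id"))  # inner
```

Starting from a `td` tag instead of a text node works the same way: `td.find_parent("table")` replaces the chain of `.parent.parent` calls.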
