BeautifulSoup: parsing an HTML table whose elements have no classes

Time: 2016-09-21 13:44:41

Tags: python html beautifulsoup

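I have this HTML table. I need to get specific data from it and assign it to variables; I don't need all of the information. For example, flag = "UNITED ARAB EMIRATES", home_port = "SHARJAH", and so on. Since the th and td elements have no "class" attribute, how can I extract this data?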

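import requests
from bs4 import BeautifulSoup

imo_number = 9492749  # IMO number of the ship shown in the table below

r = requests.get('http://maritime-connector.com/ship/' + str(imo_number),
                 headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content, "lxml")
table = soup.find("table", {"class": "ship-data-table"})
for row in table.find_all("tr"):
    tname = row.find_all("th")  # header cells of this row
    cells = row.find_all("td")  # data cells of this row

    print(type(tname))
    print(type(cells))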

I am using the Python module BeautifulSoup.

<table class="ship-data-table" style="margin-bottom:3px">
                        <thead>
                        <tr>
                            <th>IMO number</th>
                            <td>9492749</td>
                        </tr>
                        <tr>
                            <th>Name of the ship</th>
                            <td>SHARIEF PILOT</td>
                        </tr>
                                                    <tr>
                            <th>Type of ship</th>
                            <td>ANCHOR HANDLING VESSEL</td>
                        </tr>
                                                                                <tr>
                            <th>MMSI</th>
                            <td>470535000</td>
                        </tr>
                                                                                <tr>
                            <th>Gross tonnage</th>
                            <td>499 tons</td>
                        </tr>
                                                                                <tr>
                            <th>DWT</th>
                            <td>222 tons</td>
                        </tr>
                                                                                <tr>
                            <th>Year of build</th>
                            <td>2008</td>
                        </tr>
                                                                                <tr>
                            <th>Builder</th>
                            <td>NANYANG SHIPBUILDING - JINGJIANG, CHINA</td>
                        </tr>
                                                                                <tr>
                            <th>Flag</th>
                            <td>UNITED ARAB EMIRATES</td>
                        </tr>
                                                                                                            <tr>
                            <th>Home port</th>
                            <td>SHARJAH</td>
                        </tr>
                                                                                                            <tr>
                            <th>Manager & owner</th>
                            <td>GLOBAL MARINE SERVICES - SHARJAH, UNITED ARAB EMIRATES</td>
                        </tr>
                                                                                                                                        <tr>
                            <th>Former names</th>
                            <td>SUPERIOR PILOT until 2008 Sep</td>
                        </tr>
                                                    </thead>
                    </table>

2 answers:

Answer 0 (score: 2)

Go over all the th elements in the table, and for each one get its text and the text of the td sibling that follows it:

from pprint import pprint

from bs4 import BeautifulSoup

data = """your HTML here"""

soup = BeautifulSoup(data, "html.parser")

result = {header.get_text(strip=True): header.find_next_sibling("td").get_text(strip=True)
          for header in soup.select("table.ship-data-table tr th")}
pprint(result)

This constructs a nice dictionary with the header texts as keys and the corresponding td texts as values:

{'Builder': 'NANYANG SHIPBUILDING - JINGJIANG, CHINA',
 'DWT': '222 tons',
 'Flag': 'UNITED ARAB EMIRATES',
 'Former names': 'SUPERIOR PILOT until 2008 Sep',
 'Gross tonnage': '499 tons',
 'Home port': 'SHARJAH',
 'IMO number': '9492749',
 'MMSI': '470535000',
 'Manager & owner': 'GLOBAL MARINE SERVICES - SHARJAH, UNITED ARAB EMIRATES',
 'Name of the ship': 'SHARIEF PILOT',
 'Type of ship': 'ANCHOR HANDLING VESSEL',
 'Year of build': '2008'}
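
Since the question asks for individual variables, the values can then be read out of the dictionary by header text. Here is a minimal sketch of that step, assuming an abbreviated three-row copy of the table from the question. The "Draught" key is hypothetical and is only there to show what happens when a header is absent:

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the table from the question (only three rows kept).
html = """
<table class="ship-data-table">
  <thead>
    <tr><th>IMO number</th><td>9492749</td></tr>
    <tr><th>Flag</th><td>UNITED ARAB EMIRATES</td></tr>
    <tr><th>Home port</th><td>SHARJAH</td></tr>
  </thead>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same idea as the dict comprehension above, written as a loop so that
# a th with no td next to it is skipped instead of raising an error.
result = {}
for header in soup.select("table.ship-data-table tr th"):
    cell = header.find_next_sibling("td")
    if cell is not None:
        result[header.get_text(strip=True)] = cell.get_text(strip=True)

# Pick out only the fields that are needed.
flag = result.get("Flag")
home_port = result.get("Home port")
draught = result.get("Draught")  # hypothetical header, not in the table

print(flag)
print(home_port)
print(draught)
```

Using `dict.get` rather than `result["Flag"]` means a ship page that lacks a given row yields `None` instead of a `KeyError`, which may or may not be what you want.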

Answer 1 (score: 0)

I would do something like this:

html = """
        <your table>
    """

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

flag = soup.find("th", string="Flag").find_next("td").get_text(strip=True)
home_port = soup.find("th", string="Home port").find_next("td").get_text(strip=True)


print(flag)
print(home_port)

This way I make sure to match only the text inside th elements and get the content of the next td.
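
One caveat with the lookups above: `soup.find` returns `None` when no th matches, so the chained `.find_next` call then raises `AttributeError`, and a plain string passed as `string=` has to equal the th text exactly. A more forgiving variant might look like the sketch below. The helper name `get_field` is hypothetical (it is not part of BeautifulSoup), and the sample HTML with padded header text is an assumption made to show the whitespace handling:

```python
from bs4 import BeautifulSoup


def get_field(soup, label):
    """Return the text of the td next to the th labelled `label`, or None."""
    for th in soup.find_all("th"):
        # Compare stripped text so surrounding whitespace does not matter.
        if th.get_text(strip=True) == label:
            td = th.find_next_sibling("td")
            return td.get_text(strip=True) if td is not None else None
    return None  # no th with that label


# Assumed sample: same structure as the question's table, with extra
# whitespace inside one header cell.
html = """
<table class="ship-data-table">
  <tr><th> Flag </th><td>UNITED ARAB EMIRATES</td></tr>
  <tr><th>Home port</th><td>SHARJAH</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

flag = get_field(soup, "Flag")
home_port = get_field(soup, "Home port")
missing = get_field(soup, "Former names")  # not in the sample, gives None

print(flag)
print(home_port)
print(missing)
```

This version uses `find_next_sibling` instead of `find_next` so that a th without a td in its own row does not pick up the td of a later row.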