如何将包含跨越多列的单元格的html表转换为Python 3中的列表列表?

时间:2016-04-06 15:17:45

标签: python html web-scraping python-3.5 bs4

我是Python的新手,已经开始了一个需要一些webscrapping的小项目。我开始使用BS4但是在尝试将带有跨越多列的单元格的html表转换为列表列表(在Python 3中)时,我遇到了一些困难。

我想将此html表转换为列表列表,以便能够在带有终端表的文本模式下打印它。所以,我试图获得一些空的列表单元格来填充行的其余部分,只要有一个跨越5列的HTML单元格。

我认为我可能在(流利的)Python中可以轻松完成的事情过于复杂。有人可以帮忙吗?

此时我的代码:

    #!/usr/local/bin/python3
    # encoding: utf-8



    # just did a lot of experiments, so I will need to clean these imports! (some of them are related to the rest of the project anyway)
import sys
    import os
    import os.path
    import csv
    import re
    from textwrap import fill as tw_fill
    from random import randint
    from datetime import datetime, timedelta
    from copy import deepcopy
    from platform import node


    from colorclass import Color
    from urllib3 import PoolManager
    from bleach import clean
    from bs4 import BeautifulSoup
    from terminaltables import SingleTable



    def obter_estado_detalhado(tracking_code):
        """ Verify detailed tracking status for CTT shipment
        Ex: obter_estado_detalhado("EA746000000PT")
        """    
        ctt_url = "http://www.cttexpresso.pt/feapl_2/app/open/cttexpresso/objectSearch/objectSearch.jspx?lang=def&objects=" + tracking_code + "&showResults=true"
        estado = "- N/A -"

        dados_tracking = [[
            "Hora",
            "Estado",
            "Motivo",
            "Local",
            "Recetor"
            ]
        ]

    #    try:
        http = PoolManager()
        r = http.urlopen('GET', ctt_url, preload_content=False)
        soup = BeautifulSoup(r, "html.parser")
        records = dados_tracking
        table2 = soup.find_all('table')[1]

        l = 1
        c = 0
        for linha in table2.find_all('tr')[1:]:
            records.append([])
            for celula in linha.find_all('td')[1:]:
                txt = clean(celula.string, tags=[], strip=True).strip()
                records[l].append(txt)
                c += 1
            l += 1
            tabela = SingleTable(records)
            print(tabela.table)

        print(records)
        tabela = SingleTable(records)
        print(tabela.table)
        exit()  # This exit is only for testing purposes...



    obter_estado_detalhado("EA746813946PT")

示例HTML代码(as in this link)

<table class="full-width">
                    <thead>
                        <tr>
                            <th>
                                Nº de Objeto 
                            </th>
                            <th>
                                Produto
                            </th>
                            <th>
                                Data
                            </th>
                            <th>
                                Hora
                            </th>
                            <th>
                                Estado
                            </th>
                            <th>
                                Info
                            </th>
                        </tr>
                    </thead>

                    <tbody><tr>
                        <td>

                                EA746813813PT                                   




                        </td>    
                        <td>19</td> 
                        <td>2016/03/31</td> 
                        <td>09:40</td>

                        <td>





                                        Objeto entregue



                        </td>

                        <td class="truncate">
                            <a id="detailsLinkShow_0" onclick="toggleObjectDetails('0', true);" class="hide">[+]Info</a>
                            <a id="detailsLinkHide_0" class="" onclick="toggleObjectDetails('0', false);">[-]Info</a>
                        </td>
                    </tr>
                <tr></tr>
                <tr id="details_0" class="">
                        <td colspan="6">










                            <div class="full-width-table-scroller"><table class="full-width">
                                <thead>
                                    <tr>
                                        <th>Hora</th>
                                        <th>Estado</th>
                                        <th>Motivo</th>

                                        <th>Recetor</th>
                                    </tr>
                                </thead>

                                    <tbody><tr>

                                            </tr>

                                        <tr class="group">                          
                                                <td colspan="5">quinta-feira, 31 Março 2016</td>
                                            </tr><tr><td>09:40</td>  
                                        <td>Entrega conseguida</td> 
                                        <th>Local</th><td>-</td> 
                                        <td>4470 - MAIA</td>
                                        <td>DONIEL MARQUES</td> 
                                    </tr>   

                                    <tr>

                                        <td>08:32</td>   
                                        <td>Em distribuição</td> 
                                        <td>-</td> 
                                        <td>4470 - MAIA</td>
                                        <td>-</td>  
                                    </tr>   

                                    <tr>

                                        <td>08:29</td>   
                                        <td>Receção no local de entrega</td> 
                                        <td>-</td> 
                                        <td>4470 - MAIA</td>
                                        <td>-</td>  
                                    </tr>   

                                    <tr>

                                        <td>08:29</td>   
                                        <td>Receção nacional</td> 
                                        <td>-</td> 
                                        <td>4470 - MAIA</td>
                                        <td>-</td>  
                                    </tr>   

                                    <tr>

                                        <td>00:17</td>   
                                        <td>Envio</td> 
                                        <td>-</td> 
                                        <td>C. O. PERAFITA</td>
                                        <td>-</td>  
                                    </tr>   

                                    <tr>

                                            </tr><tr class="group">                         
                                                <td colspan="5">quarta-feira, 30 Março 2016</td>
                                            </tr>

                                        <tr><td>23:40</td>   
                                        <td>Expedição nacional</td> 
                                        <td>-</td> 
                                        <td>C.O. PERAFITA (OPE)</td>
                                        <td>-</td>  
                                    </tr>   

                                    <tr>

                                        <td>20:39</td>   
                                        <td>Receção no local de entrega</td> 
                                        <td>-</td> 
                                        <td>C. O. PERAFITA</td>
                                        <td>-</td>  
                                    </tr>   

                                    <tr>

                                        <td>20:39</td>   
                                        <td>Receção nacional</td> 
                                        <td>-</td> 
                                        <td>C. O. PERAFITA</td>
                                        <td>-</td>  
                                    </tr>   

                                    <tr>

                                        <td>20:39</td>   
                                        <td>Aceitação</td> 
                                        <td>-</td> 
                                        <td>C. O. PERAFITA</td>
                                        <td>-</td>  
                                    </tr>   

                            </tbody></table></div>





                        </td>
                </tr>       

        </tbody></table>

1 个答案:

答案 0 :(得分:1)

这匹配主表输出:

from bs4 import BeautifulSoup

html = requests.get("http://www.cttexpresso.pt/feapl_2/app/open/cttexpresso/objectSearch/objectSearch.jspx?lang=def&objects=EA746813946PT&showResults=true").content

soup = BeautifulSoup(html)

# get table using id
rows = soup.select("#details_0")[0]

# get the header names and strip whitespace
cols = [th.text.strip() for th in rows.select("th")]

# extract all td's from each table row, the list comp will data grouped row wise.
data = [[td.text.strip() for td in tr.select("td")] for tr in rows.select("tr")]

print(" ".join(cols))
for row in data:
    print(", ".join(row))

输出:

Hora Estado Motivo Local Recetor


terça-feira, 5 Abril 2016
07:58, Em distribuição, -, 4000 - PORTO, -
00:35, Envio, -, C. O. PERAFITA, -
00:20, Expedição nacional, -, C.O. PERAFITA (OPE), -

segunda-feira, 4 Abril 2016
21:45, Receção nacional, -, C. O. PERAFITA, -
21:45, Aceitação, -, C. O. PERAFITA, -

网站:

enter image description here

这是解析器,我以为我尝试了所有的坚果唯一有效的是 html5 使用soup = BeautifulSoup(html,"html5")输出:

Hora Estado Motivo Local Recetor


terça-feira, 5 Abril 2016
11:02, Entrega conseguida, -, 4000 - PORTO, CANDIDA VIEGAS
07:58, Em distribuição, -, 4000 - PORTO, -
00:35, Envio, -, C. O. PERAFITA, -
00:20, Expedição nacional, -, C.O. PERAFITA (OPE), -

segunda-feira, 4 Abril 2016
21:45, Receção no local de entrega, -, C. O. PERAFITA, -
21:45, Receção nacional, -, C. O. PERAFITA, -
21:45, Aceitação, -, C. O. PERAFITA, -