Extracting tables with BeautifulSoup

Time: 2018-07-17 12:46:34

Tags: python beautifulsoup

I want to use BeautifulSoup to extract all the tables from an HTML file like the one below and write them to a CSV file.

The HTML looks like this:

        <h4>Site Name : Aria</h4>   
            <table style="width: 100%">
                <tbody><tr>
                    <th style="width: 25%"><strong>Dn Name:</strong></th>
                    <td style="width: 25%"><strong>Aria</strong></td>

                        <th style="width: 25%"><strong>WL:</strong></th>
                        <td style="width: 25%"><strong> Meters (m)</strong></td>

                </tr>
                <tr>
                    <th><strong>River Name:</strong></th>
                    <td><strong>Ben</strong></td>

                        <th><strong>DL:</strong></th>
                        <td><strong> Meters (m)</strong></td>

                </tr>
                <tr>
                    <th><strong>Basin Name:</strong></th>
                    <td><strong>GAN<strong></strong></strong></td>

                    <th><strong>HFL:</strong></th>
                    <td><strong>49.4 Meters (m)<strong></strong></strong></td>

                </tr>
                <tr>
                    <th><strong>Div Name:</strong></th>

                    <td><a target="_blank" href="http://imd.gov.in/ onclick="window.open(this.href, this.target, &#39;width=1000, height=600, toolbar=no, resizable=no&#39;); return false;">LGD-I</a></td>

                    <th><strong>HFL date:</strong></th>
                    <td>14-08-2017</td>

                </tr>
            </tbody></table>
            <p>&nbsp;</p>
            <table>
                <tbody><tr>
                    <th colspan="3" style="text-align: center;"><strong>PRESENT WL</strong></th>
                </tr>

                <tr>                            

                    <td class="" style="width:33%; height:18px;">Date: 17-07-2018 12:00</td>
                    <td class="" style="width:33%;">Value: 45.43 Meters (m)</td>
                    <td class="" style="width:33%;">Trend: Steady</td>
                </tr>
                <tr>
                    <th colspan="3" style="text-align: center;"><strong>CUMULATIVE DAILY RF</strong></th>
                </tr>
                <tr>

                        <td style="width:33%; height:18px;">Date: 17-07-2018 08:30</td>
                        <td style="width:33%;">Value: 0.0 Milimiters (mm)</td>
                        <td style="width:33%;"></td>

                </tr>
            </tbody></table>                            
                <p>&nbsp;</p>                       



                            <table style="width: 100%">
                                <tbody><tr>
                                    <th colspan="4" style="text-align: center;"><strong>NO FORECAST</strong></th>
                                </tr>
                            </tbody></table>




</div>

I tried to extract the text from these three tables, but I can't write it out in the format I need.

My code:

now = datetime.datetime.now()
date = now.strftime("%d-%m-%Y")
os.chdir(r'D:\shared')


soup = BeautifulSoup(response.text,"html5lib")

tables = soup.find_all("tr")
test =[]
for table in tables:
    test.append(table.get_text())

filename = 'Water'+'-'+str(date)+'.csv'
out = open(filename, mode='ab')
writer = csv.writer(out)
writer.writerow(data)
out.close()

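In the CSV output, the first table is written to the first column, the second table to the second column, and the third table to the third column.

I want the data in the following format: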

Site Name:  Aria
Dn Name:    Aria    
WL:         Meters (m)
River Name: Ben 
DL:         Meters (m)
Basin Name: GAN
HFL:        49.4 Meters (m)
Div Name:   LGD-I
HFL date:   14-08-2017

PRESENT WL
Date:       17-07-2018 12:00    
Value:      45.43 Meters (m)    
Trend:      Steady
CUMULATIVE 
DAILY RF
Date:       17-07-2018 08:30    
Value:      0.0 Milimiters (mm) 
NO FORECAST

1 answer:

Answer 0 (score: 2)

My attempt at this problem (the script prints every extracted line and then writes the CSV file):

data = """
        <h4>Site Name : Aria</h4>
            <table style="width: 100%">
                <tbody><tr>
                    <th style="width: 25%"><strong>Dn Name:</strong></th>
                    <td style="width: 25%"><strong>Aria</strong></td>

                        <th style="width: 25%"><strong>WL:</strong></th>
                        <td style="width: 25%"><strong> Meters (m)</strong></td>

                </tr>
                <tr>
                    <th><strong>River Name:</strong></th>
                    <td><strong>Ben</strong></td>

                        <th><strong>DL:</strong></th>
                        <td><strong> Meters (m)</strong></td>

                </tr>
                <tr>
                    <th><strong>Basin Name:</strong></th>
                    <td><strong>GAN<strong></strong></strong></td>

                    <th><strong>HFL:</strong></th>
                    <td><strong>49.4 Meters (m)<strong></strong></strong></td>

                </tr>
                <tr>
                    <th><strong>Div Name:</strong></th>

                    <td><a target="_blank" href="http://imd.gov.in/ onclick="window.open(this.href, this.target, &#39;width=1000, height=600, toolbar=no, resizable=no&#39;); return false;">LGD-I</a></td>

                    <th><strong>HFL date:</strong></th>
                    <td>14-08-2017</td>

                </tr>
            </tbody></table>
            <p>&nbsp;</p>
            <table>
                <tbody><tr>
                    <th colspan="3" style="text-align: center;"><strong>PRESENT WL</strong></th>
                </tr>

                <tr>

                    <td class="" style="width:33%; height:18px;">Date: 17-07-2018 12:00</td>
                    <td class="" style="width:33%;">Value: 45.43 Meters (m)</td>
                    <td class="" style="width:33%;">Trend: Steady</td>
                </tr>
                <tr>
                    <th colspan="3" style="text-align: center;"><strong>CUMULATIVE DAILY RF</strong></th>
                </tr>
                <tr>

                        <td style="width:33%; height:18px;">Date: 17-07-2018 08:30</td>
                        <td style="width:33%;">Value: 0.0 Milimiters (mm)</td>
                        <td style="width:33%;"></td>

                </tr>
            </tbody></table>
                <p>&nbsp;</p>



                            <table style="width: 100%">
                                <tbody><tr>
                                    <th colspan="4" style="text-align: center;"><strong>NO FORECAST</strong></th>
                                </tr>
                            </tbody></table>
</div>"""

import os
import datetime
from bs4 import BeautifulSoup
from pprint import pprint
# For Python 2.7 the next line should be "from itertools import izip_longest"
from itertools import zip_longest
import csv

now = datetime.datetime.now()
date = now.strftime("%d-%m-%Y")
# os.chdir(r'D:\shared')

soup = BeautifulSoup(data, "lxml")

tables = []
for table in soup.find_all('table'):
    current_table = []
    tables.append(current_table)
    for row in table.find_all("tr"):
        for (th, td) in zip_longest(row.find_all('th'), row.find_all('td')):
            s = ("%s %s" % (th.text.strip() if th else '', td.text.strip() if td else '')).strip()
            if s:
                current_table.append(s)

tables[0].insert(0, ': '.join(w.strip() for w in soup.find('h4').text.split(':')))

for table in tables:
    for i in table:
        print(i)

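filename = 'CWC-Water'+'-'+str(date)+'.csv'
# newline='' keeps the csv module from writing blank lines between rows on Windows
# (Python 3; on Python 2.7 open the file with mode='wb' instead)
with open(filename, mode='w', newline='') as out:
    writer = csv.writer(out)
    # zip_longest transposes the lists, so each table becomes one CSV column
    for table in zip_longest(*tables):
        writer.writerow(table)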

It also writes a .csv file in this format (3 columns, one per table; screenshot from LibreOffice):

[Image: screenshot of the CSV file opened in LibreOffice]
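To make the three-column layout easier to follow, here is a small sketch of what `zip_longest(*tables)` does with lists of different lengths. The sample strings are shortened stand-ins for the real `tables` contents, and the in-memory `buffer` and `fillvalue=""` are my own choices for the illustration (the script above leaves the default `None`, which `csv.writer` also writes as an empty cell).

```python
import csv
import io
from itertools import zip_longest

# Hypothetical, shortened stand-ins for the three per-table lists.
tables = [
    ["Site Name: Aria", "Dn Name: Aria", "WL: Meters (m)"],
    ["PRESENT WL", "Date: 17-07-2018 12:00"],
    ["NO FORECAST"],
]

# Each tuple is one CSV row; shorter tables are padded with fillvalue
# so that every row has exactly three cells (one per table).
rows = list(zip_longest(*tables, fillvalue=""))

buffer = io.StringIO()  # in-memory file, nothing is written to disk
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because the lists are transposed, the n-th item of every table ends up on the n-th CSV row, which is why each table fills its own column.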

Edit: corrected the picture.
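If you would rather have one label/value pair per row, closer to the format asked for in the question, something along these lines might work. It is only a sketch: `split_label` is a helper name I made up, the sample strings stand in for the `tables` lists built by the script above, and it assumes the first colon in each string separates the label from the value.

```python
import csv
import io

# Hypothetical sample of the strings collected in `tables`.
tables = [
    ["Site Name: Aria", "Dn Name: Aria", "HFL date: 14-08-2017"],
    ["PRESENT WL", "Date: 17-07-2018 12:00", "Trend: Steady"],
    ["NO FORECAST"],
]

def split_label(text):
    """Split 'Label: value' at the first colon; headings get an empty value."""
    label, sep, value = text.partition(":")
    return (label.strip() + sep, value.strip())

# Flatten the tables in order, producing one (label, value) row per item.
rows = [split_label(item) for table in tables for item in table]

buffer = io.StringIO()  # swap for open(filename, 'w', newline='') to write a file
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Splitting at the first colon only keeps timestamps such as `17-07-2018 12:00` intact, and headings like `PRESENT WL` simply get an empty second cell.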