我正在尝试将HTML中的表转换为Python中的csv。我要提取的表就是这个表:
<table class="tblperiode">
<caption>Dades de període</caption>
<tr>
<th class="sortable"><span class="tooltip" title="Període (Temps Universal)">Període</span><br/>TU</th>
<th><span class="tooltip" title="Temperatura mitjana (°C)">TM</span><br/>°C</th>
<th><span class="tooltip" title="Temperatura màxima (°C)">TX</span><br/>°C</th>
<th><span class="tooltip" title="Temperatura mínima (°C)">TN</span><br/>°C</th>
<th><span class="tooltip" title="Humitat relativa mitjana (%)">HRM</span><br/>%</th>
<th><span class="tooltip" title="Precipitació (mm)">PPT</span><br/>mm</th>
<th><span class="tooltip" title="Velocitat mitjana del vent (km/h)">VVM (10 m)</span><br/>km/h</th>
<th><span class="tooltip" title="Direcció mitjana del vent (graus)">DVM (10 m)</span><br/>graus</th>
<th><span class="tooltip" title="Ratxa màxima del vent (km/h)">VVX (10 m)</span><br/>km/h</th>
<th><span class="tooltip" title="Irradiància solar global mitjana (W/m2)">RS</span><br/>W/m<sup>2</sup></th>
</tr>
<tr>
<th>
00:00 - 00:30
</th>
<td>16.2</td>
<td>16.5</td>
<td>15.4</td>
<td>93</td>
<td>0.0</td>
<td>6.5</td>
<td>293</td>
<td>10.4</td>
<td>0</td>
</tr>
<tr>
<th>
00:30 - 01:00
</th>
<td>16.4</td>
<td>16.5</td>
<td>16.1</td>
<td>90</td>
<td>0.0</td>
<td>5.8</td>
<td>288</td>
<td>8.6</td>
<td>0</td>
</tr>
我希望它看起来像这样:
要实现这一点,我试图解析html,并且设法通过以下操作正确地建立了数据的数据框:
from bs4 import BeautifulSoup
import csv
html = open("table.html").read()
soup = BeautifulSoup(html)
table = soup.select_one("table.tblperiode")
output_rows = []
for table_row in table.findAll('tr'):
columns = table_row.findAll('td')
output_row = []
for column in columns:
output_row.append(column.text)
output_rows.append(output_row)
df = pd.DataFrame(output_rows)
print(df)
但是,我想使用列名和一个指示时间间隔的列,在上面的html示例中,其中只有两个出现在00:00-00:30和00:30 1:00。因此,我的表应具有两行,其中一行对应于00:00-00:30的观察值,另一行对应于00:30和1:00的观察值。
如何从HTML中获取此信息?
答案 0 :(得分:0)
这是一种实现方法,它可能不是最好的方法,但它可行!您可以通读注释以弄清楚代码在做什么!
from bs4 import BeautifulSoup
import csv
#read the html
html = open("table.html").read()
soup = BeautifulSoup(html, 'html.parser')
# get the table from html
table = soup.select_one("table.tblperiode")
# find all rows
rows = table.findAll('tr')
# strip the header from rows
headers = rows[0]
header_text = []
# add the header text to array
for th in headers.findAll('th'):
header_text.append(th.text)
# init row text array
row_text_array = []
# loop through rows and add row text to array
for row in rows[1:]:
row_text = []
# loop through the elements
for row_element in row.findAll(['th', 'td']):
# append the array with the elements inner text
row_text.append(row_element.text.replace('\n', '').strip())
# append the text array to the row text array
row_text_array.append(row_text)
# output csv
with open("out.csv", "w") as f:
wr = csv.writer(f)
wr.writerow(header_text)
# loop through each row array
for row_text_single in row_text_array:
wr.writerow(row_text_single)
答案 1 :(得分:0)
使用此脚本:
import csv
from bs4 import BeautifulSoup
html = open('table.html').read()
soup = BeautifulSoup(html, features='lxml')
table = soup.select_one('table.tblperiode')
rows = []
for i, table_row in enumerate(table.findAll('tr')):
if i > 0:
periode = [' '.join(table_row.findAll('th')[0].text.split())]
data = [x.text for x in table_row.findAll('td')]
rows.append(periode + data)
header = ['Periode', 'TM', 'TX', 'TN', 'HRM', 'PPT', 'VVM', 'DVM', 'VVX', 'PM', 'RS']
with open('result.csv', 'w', newline='') as f:
w = csv.writer(f)
w.writerow(header)
w.writerows(rows)
我设法在输出中生成以下CSV文件:
Periode,TM,TX,TN,HRM,PPT,VVM,DVM,VVX,PM,RS
00:00 - 00:30,16.2,16.5,15.4,93,0.0,6.5,293,10.4,0
00:30 - 01:00,16.4,16.5,16.1,90,0.0,5.8,288,8.6,0
答案 2 :(得分:0)
import csv
from bs4 import BeautifulSoup
import pandas as pd
html = open('test.html').read()
soup = BeautifulSoup(html, features='lxml')
#Specify table name which you want to read.
#Example: <table class="queryResults" border="0" cellspacing="1">
table = soup.select_one('table.queryResults')
def get_all_tables(soup):
return soup.find_all("table")
tbls = get_all_tables(soup)
for i, tablen in enumerate(tbls, start=1):
print(i)
print(tablen)
def get_table_headers(table):
headers = []
for th in table.find("tr").find_all("th"):
headers.append(th.text.strip())
return headers
head = get_table_headers(table)
#print(head)
def get_table_rows(table):
rows = []
for tr in table.find_all("tr")[1:]:
cells = []
# grab all td tags in this table row
tds = tr.find_all("td")
if len(tds) == 0:
# if no td tags, search for th tags
# can be found especially in wikipedia tables below the table
ths = tr.find_all("th")
for th in ths:
cells.append(th.text.strip())
else:
# use regular td tags
for td in tds:
cells.append(td.text.strip())
rows.append(cells)
return rows
table_rows = get_table_rows(table)
#print(table_rows)
def save_as_csv(table_name, headers, rows):
pd.DataFrame(rows, columns=headers).to_csv(f"{table_name}.csv")
save_as_csv("Test_table", head, table_rows)