I am trying to get values from an HTML table using Python. The HTML looks like this:
<table border=1 width=900>
<tr><td width=50%>
<table>
<tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
<tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
<tr><td align=right>Name:</td><td><input type=text value="John Dot" size=40></td></tr>
<tr><td align=right>Address:</td><td><input type=text value="Posie Row, Moscow Road" size=40></td></tr>
<tr><td align=right>City:</td><td><input type=text value="Co. Dubllin" size=40></td></tr>
<tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
<tr><td align=right>Postal Code:</td><td><input type=text value="" size=40></td></tr>
<tr><td align=right>Country:</td><td><input type=text value="IRELAND" size=40></td></tr>
<tr><td align=right>Date:</td><td><input type=text value="24.4.18" size=12></td></tr>
<tr><td align=right>Sub Total:</td><td><input type=text value="93,24" size=40></td></tr>
<tr><td align=right>Combined Weight:</td><td><input type=text value="1,24" size=40></td></tr>
</table>
My code so far is:
from __future__ import print_function
import requests
import re
from bs4 import BeautifulSoup as bs
request = requests.get('url')
content = request.content
soup = bs(content, 'html.parser')
table = soup.findChildren('table')[1]
rows = table.findChildren('tr')
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        cell_content = cell.getText()
        print(cell_content)
The output is:
Invoice #
Company
Name:
Address:
City:
Province
Postal Code:
Country:
Date:
Sub Total:
Combined Weight:
I would like the final output to look like this:
Invoice:1624140
Company:NZone
Name:John Dot
Address:Posie Row, Moscow Road
City:Co. Dubllin
Province:
Postal Code:
Country:IRELAND
Date:24.4.18
Sub Total:93,24
Combined Weight:1,24
Answer 0 (score: 2)
data = """
<table border=1 width=900>
<tr><td width=50%>
<table>
<tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
<tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
<tr><td align=right>Name:</td><td><input type=text value="John Dot" size=40></td></tr>
<tr><td align=right>Address:</td><td><input type=text value="Posie Row, Moscow Road" size=40></td></tr>
<tr><td align=right>City:</td><td><input type=text value="Co. Dubllin" size=40></td></tr>
<tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
<tr><td align=right>Postal Code:</td><td><input type=text value="" size=40></td></tr>
<tr><td align=right>Country:</td><td><input type=text value="IRELAND" size=40></td></tr>
<tr><td align=right>Date:</td><td><input type=text value="24.4.18" size=12></td></tr>
<tr><td align=right>Sub Total:</td><td><input type=text value="93,24" size=40></td></tr>
<tr><td align=right>Combined Weight:</td><td><input type=text value="1,24" size=40></td></tr>
</table>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for (td, inp) in zip(soup.find_all('td', align="right"), soup.find_all('input')):
    print(td.text, inp['value'])
Output:
Invoice # 1624140
Company NZone
Name: John Dot
Address: Posie Row, Moscow Road
City: Co. Dubllin
Province
Postal Code:
Country: IRELAND
Date: 24.4.18
Sub Total: 93,24
Combined Weight: 1,24
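The output above still differs slightly from the `Label:value` format requested in the question, and `zip` pairs labels with inputs purely by position. A minimal sketch of one way to close that gap, assuming each `<input>` sits in a row whose first cell holds its label: `label_value_lines` is a hypothetical helper name, `HTML` is a shortened copy of the question's markup, and `html.parser` is used here only to avoid the extra `lxml` dependency.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (nested table, unclosed <b>).
HTML = """
<table border=1 width=900>
<tr><td width=50%>
<table>
<tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
<tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
<tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
<tr><td align=right>Sub Total:</td><td><input type=text value="93,24" size=40></td></tr>
</table>
"""


def label_value_lines(html):
    """Return one 'Label:value' string per <input> found in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for field in soup.find_all("input"):
        row = field.find_parent("tr")        # closest enclosing row
        label = row.find("td").get_text()    # first cell of that row is the label
        label = label.strip().rstrip(":# ")  # drop a trailing ':' or ' #'
        lines.append("{}:{}".format(label, field.get("value", "")))
    return lines


for line in label_value_lines(HTML):
    print(line)
```

Starting from each `<input>` and walking up to its row keeps a label and its value together even if some row has no input, which positional pairing with `zip` would not guarantee.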
Answer 1 (score: 1)
How about a dictionary comprehension?
d = {k.findChild('td').getText().strip():k.findChild('input')['value'] for k in rows}
The result is a dictionary like this:
{'Address:': 'Posie Row, Moscow Road',
'City:': 'Co. Dubllin',
'Combined Weight:': '1,24',
'Company': 'NZone',
'Country:': 'IRELAND',
'Date:': '24.4.18',
'Invoice #': '1624140',
'Name:': 'John Dot',
'Postal Code:': '',
'Province': '',
'Sub Total:': '93,24'}
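The comprehension relies on the `rows` variable from the question. A self-contained sketch of the same idea, assuming the inner table is the second `<table>` in the document as in the question's code: `HTML` and `fields` are illustrative names, the markup is a shortened copy of the original, and the `if` clause is an added guard that skips any row without an `<input>`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
HTML = """
<table border=1 width=900>
<tr><td width=50%>
<table>
<tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
<tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
<tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
<tr><td align=right>Sub Total:</td><td><input type=text value="93,24" size=40></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Index 1 is the inner table, mirroring the question's code.
rows = soup.find_all("table")[1].find_all("tr")

# Map each label to its input value; rows without an <input> are skipped.
fields = {
    row.find("td").get_text().strip(): row.find("input")["value"]
    for row in rows
    if row.find("input") is not None
}

print(fields["Company"])
print(fields["Sub Total:"])
```

A dictionary makes individual fields easy to look up by label. On Python 3.7 and later it also preserves the order of the rows in the table.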
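Answer 2 (score: 0)
Perhaps this is what you meant to write after assigning the rows object? Your current code has some indentation errors. Please check whether this solves your problem.
rows = table.findChildren('tr')
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        cell_content = cell.getText()
        print(cell_content)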
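Answer 3 (score: 0)
Replace your bottom loop with this:
for row in rows:
    [row_title, row_val] = row.findChildren('td')
    print(row_title.getText(), row_val.input['value'])
This code unpacks the two cells in each row. It then takes the text directly inside the left td as the row title, and drills down into the right td for the value.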
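Unpacking into exactly two names raises `ValueError` for any row that does not contain exactly two cells, such as the outer wrapper row in this markup. A hedged sketch of a guarded variant, assuming only rows with two direct `<td>` children are of interest: `title_value_pairs` is a hypothetical helper name and `HTML` is a shortened copy of the question's markup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
HTML = """
<table border=1 width=900>
<tr><td width=50%>
<table>
<tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
<tr><td align=right>Name:</td><td><input type=text value="John Dot" size=40></td></tr>
<tr><td align=right>Postal Code:</td><td><input type=text value="" size=40></td></tr>
</table>
"""


def title_value_pairs(html):
    """Return (title, value) tuples for rows made of a label cell and an input cell."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for row in soup.find_all("tr"):
        # Only direct child cells, so cells of nested tables are not counted.
        cells = row.find_all("td", recursive=False)
        if len(cells) != 2:
            continue  # e.g. the outer wrapper row, which has a single cell
        title_cell, value_cell = cells
        field = value_cell.find("input")
        if field is None:
            continue  # second cell holds no input; nothing to read
        pairs.append((title_cell.get_text().strip(), field.get("value", "")))
    return pairs


for title, value in title_value_pairs(HTML):
    print(title, value)
```

Because the guard skips rows of any other shape, the loop can run over every `<tr>` in the document without first selecting the inner table by index.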