Getting table values with Python

Time: 2018-07-18 12:07:39

Tags: python beautifulsoup

I am trying to get values from an HTML table using Python. The HTML looks like this:

<table border=1 width=900>
 <tr><td width=50%>
<table>
    <tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
    <tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
    <tr><td align=right>Name:</td><td><input type=text value="John Dot" size=40></td></tr>
    <tr><td align=right>Address:</td><td><input type=text value="Posie Row, Moscow Road" size=40></td></tr>
    <tr><td align=right>City:</td><td><input type=text value="Co. Dubllin" size=40></td></tr>
    <tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
    <tr><td align=right>Postal Code:</td><td><input type=text value="" size=40></td></tr>
    <tr><td align=right>Country:</td><td><input type=text value="IRELAND" size=40></td></tr>
    <tr><td align=right>Date:</td><td><input type=text value="24.4.18" size=12></td></tr>
    <tr><td align=right>Sub Total:</td><td><input type=text value="93,24" size=40></td></tr>
    <tr><td align=right>Combined Weight:</td><td><input type=text value="1,24" size=40></td></tr>
</table>

So far, my code is:

from __future__ import print_function
import requests
import re

from bs4 import BeautifulSoup as bs

request = requests.get('url')

content = request.content

soup = bs(content, 'html.parser')  

table = soup.findChildren('table')[1]

rows = table.findChildren('tr')

for row in rows:
cells = row.findChildren('td')
for cell in cells:
    cell_content = cell.getText()

 print(cell_content)

The output is:

Invoice #
Company
Name:
Address:
City:
Province
Postal Code:
Country:
Date:
Sub Total:
Combined Weight:

I would like the final output to look like this:

Invoice:1624140
Company:NZone
Name:John Dot
Address:Possie Row, Moscow Road
City:Co. Dublin
Province:
Postal Code:
Country:IRELAND
Date:24.4.18
Sub Total:93,24
Combined Weight:1,24

4 answers:

Answer 0: (score: 2)

data = """
<table border=1 width=900>
 <tr><td width=50%>
<table>
    <tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
    <tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
    <tr><td align=right>Name:</td><td><input type=text value="John Dot" size=40></td></tr>
    <tr><td align=right>Address:</td><td><input type=text value="Posie Row, Moscow Road" size=40></td></tr>
    <tr><td align=right>City:</td><td><input type=text value="Co. Dubllin" size=40></td></tr>
    <tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
    <tr><td align=right>Postal Code:</td><td><input type=text value="" size=40></td></tr>
    <tr><td align=right>Country:</td><td><input type=text value="IRELAND" size=40></td></tr>
    <tr><td align=right>Date:</td><td><input type=text value="24.4.18" size=12></td></tr>
    <tr><td align=right>Sub Total:</td><td><input type=text value="93,24" size=40></td></tr>
    <tr><td align=right>Combined Weight:</td><td><input type=text value="1,24" size=40></td></tr>
</table>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

for (td, inp) in zip(soup.find_all('td', align="right"), soup.find_all('input')):
    print(td.text, inp['value'])

The output is:

Invoice # 1624140
Company NZone
Name: John Dot
Address: Posie Row, Moscow Road
City: Co. Dubllin
Province 
Postal Code: 
Country: IRELAND
Date: 24.4.18
Sub Total: 93,24
Combined Weight: 1,24
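
The output above is close to, but not exactly, the `Label:value` format asked for in the question (the labels keep their trailing `:` or ` #`, and a space separates the pair). Here is a minimal sketch of one way to normalize it, assuming the labels and inputs pair up one-to-one in document order, as the `zip` above also assumes. The function name `format_pairs` and the shortened `html` sample are made up for illustration, and `html.parser` is used only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (illustrative sample).
html = """
<table>
    <tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
    <tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
    <tr><td align=right>Name:</td><td><input type=text value="John Dot" size=40></td></tr>
    <tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
</table>
"""


def format_pairs(markup):
    """Return 'Label:value' strings, one per label/input pair."""
    soup = BeautifulSoup(markup, "html.parser")
    labels = soup.find_all("td", align="right")
    inputs = soup.find_all("input")
    lines = []
    for td, inp in zip(labels, inputs):
        # Drop a trailing ':' or '#' from the label, then any leftover spaces.
        label = td.get_text(strip=True).rstrip(":#").strip()
        # An input without a value attribute is treated as empty.
        lines.append("{}:{}".format(label, inp.get("value", "")))
    return lines


for line in format_pairs(html):
    print(line)
```

Pairing by position is fragile: if the page contains an `input` outside the table, or a right-aligned `td` without an input, the two lists drift out of step, so pairing within each row may be the safer choice on a real page.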

Answer 1: (score: 1)

How about a dictionary comprehension?

d = {k.findChild('td').getText().strip():k.findChild('input')['value'] for k in rows}

The result is a dictionary like this:

{'Address:': 'Posie Row, Moscow Road',
 'City:': 'Co. Dubllin',
 'Combined Weight:': '1,24',
 'Company': 'NZone',
 'Country:': 'IRELAND',
 'Date:': '24.4.18',
 'Invoice #': '1624140',
 'Name:': 'John Dot',
 'Postal Code:': '',
 'Province': '',
 'Sub Total:': '93,24'}
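
The one-liner above relies on the `rows` variable from the question. A self-contained sketch of the same idea follows, assuming a single table whose rows each hold a label cell followed by an input. The names `sample` and `fields` are made up for illustration, and the `if` clause is an added guard that skips rows missing either part.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (illustrative sample).
sample = """
<table>
    <tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
    <tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
    <tr><td align=right>City:</td><td><input type=text value="Co. Dubllin" size=40></td></tr>
    <tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
</table>
"""

soup = BeautifulSoup(sample, "html.parser")
rows = soup.find("table").find_all("tr")

# Map each row's label text to the value attribute of its input.
fields = {
    row.find("td").get_text(strip=True): row.find("input")["value"]
    for row in rows
    if row.find("td") is not None and row.find("input") is not None
}

print(fields)
```

On the full page from the question, which nests this table inside an outer one, `rows` would need to come from the inner table (as the question does with `soup.findChildren('table')[1]`), otherwise the outer row would also be picked up.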

Answer 2: (score: 0)

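Perhaps this is what you meant to write after assigning the rows object? Your current code has some indentation errors. Please check whether this solves your problem.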

rows = table.findChildren('tr')

for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        cell_content = cell.getText()
        print(cell_content)

Answer 3: (score: 0)

Replace your bottom loop with this:

for row in rows:
    [row_title, row_val] = row.findChildren('td')

    print(row_title.getText(), row_val.input['value'])

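This code unpacks the two cells in each row. It then takes the text of the left td as the row title, and drills down into the right td to read the value attribute of its input.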
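
Note that the unpacking raises a ValueError for any row that does not contain exactly two td cells. A minimal sketch of a more defensive variant follows, assuming such rows should simply be skipped. The generator name `iter_fields`, the `page` sample, and its single-cell "Billing details" row are hypothetical and were added only to show the guard.

```python
from bs4 import BeautifulSoup

# Illustrative sample: the first row is a hypothetical single-cell heading.
page = """
<table>
    <tr><td colspan=2>Billing details</td></tr>
    <tr><td align=right>Date:</td><td><input type=text value="24.4.18" size=12></td></tr>
    <tr><td align=right>Sub Total:</td><td><input type=text value="93,24" size=40></td></tr>
</table>
"""


def iter_fields(table):
    """Yield (title, value) for every row that has exactly two cells and an input."""
    for row in table.find_all("tr"):
        # Look only at the row's own cells, not cells of any nested table.
        cells = row.find_all("td", recursive=False)
        if len(cells) != 2:
            continue
        title_cell, value_cell = cells
        field = value_cell.find("input")
        if field is None:
            continue
        yield title_cell.get_text(strip=True), field.get("value", "")


soup = BeautifulSoup(page, "html.parser")
for title, value in iter_fields(soup.find("table")):
    print(title, value)
```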