Scraping an HTML table with BeautifulSoup

Time: 2018-03-28 19:31:24

Tags: python beautifulsoup

I'm trying to scrape the upstream and downstream values for each channel from my ISP cable modem.

I can't get the data to display correctly. I'd like the output in a normalized CSV format for logging.

Here is my code, but it is quite far off the mark.

import requests
from bs4 import BeautifulSoup

# Collect and parse first page
page = requests.get('http://192.168.100.1/Docsis_system.asp')
soup = BeautifulSoup(page.text, 'html.parser')

# Pull all text from the proper section in the page
#signal_value_list = soup.find('tbody')
signal_value_list = soup.find('table', {'summary':'Downstream Channels'})

# Pull text from all instances of <td> tag within align div
signal_value_list_items = signal_value_list.find_all('td')


# Create for loop to print out all values
for signal_value in signal_value_list_items:
    sigval = signal_value.contents[0]
    print(sigval)

The page I'm trying to parse is available at this link as a TXT file:

Download it here

[Screenshot of the modem page with tables]

I'm open to taking a different approach to get this data, but I was hoping it would be easier than I'm making it out to be.

Does anyone have any ideas?

1 Answer:

Answer 0 (score: 0)

I'm not sure I fully understand exactly what data you're trying to parse, but assuming we're talking about (for example) the downstream channels' power levels and SNR, one possible solution is as follows:

import requests
from bs4 import BeautifulSoup
import csv
# Collect and parse first page
# page = requests.get('http://192.168.100.1/Docsis_system.asp')
with open('files/modem-page.html') as f:
    text = f.read()
soup = BeautifulSoup(text, 'html.parser')

# Pull all text from the proper section in the page
# signal_value_list = soup.find('tbody')
signal_value_list = soup.find('table', {'summary': 'Downstream Channels'})

# Pull all <td> cells from that table
signal_value_list_items = signal_value_list.find_all('td')
res = {}

# Create for loop to print out all values
for signal_value in signal_value_list_items:
    try:
        channel_string = signal_value.attrs.get('headers')[0]   # get name of channel
        property_string = signal_value.attrs.get('headers')[1]  # get name of property
        value_string = signal_value.text    # get actual value
        if channel_string not in res:
            res[channel_string] = {}
        res[channel_string][property_string] = value_string
    except TypeError as e:  # for all irrelevant elements
        continue

with open('res.csv', 'w', newline='') as csv_file:    # add column names to your liking before this
    writer = csv.writer(csv_file)
    for key, value in res.items():
        writer.writerow([key] + [value[prop] for prop in value])
print(res)
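
For reference, the loop above relies on each <td> carrying a headers attribute that names its channel and its property; BeautifulSoup treats td headers as a multi-valued attribute and exposes it as a list. A minimal illustration with hypothetical markup (the real attribute values and cell text depend on your modem's page):

from bs4 import BeautifulSoup

# Hypothetical cell markup; the actual channel/property names come from the modem page.
sample = '<td headers="channel_1 power_level">7.3 dBmV</td>'
cell = BeautifulSoup(sample, 'html.parser').td
print(cell.attrs.get('headers'))  # ['channel_1', 'power_level'] -- parsed as a list
print(cell.text)                  # '7.3 dBmV'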

Similar code can be written for the upstream data (I'll leave that to you). Possible improvements would be sorting the channel names (I used a dictionary, so the order is arbitrary) and adding column names; a sketch of both is shown below. Hope this helps.
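
For example, the CSV step could be refined along those lines. This is just a sketch that assumes the res dict built above; the column ordering and the empty-string default are arbitrary choices:

import csv

# Collect every property name seen across all channels to use as columns.
columns = sorted({prop for props in res.values() for prop in props})

with open('res.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['channel'] + columns)      # header row
    for channel in sorted(res):                 # sort channels for a stable row order
        writer.writerow([channel] + [res[channel].get(col, '') for col in columns])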