Extracting all tables from a div with BeautifulSoup

Time: 2016-05-27 15:28:51

Tags: python html web-scraping beautifulsoup

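I need to extract the <tr> rows from all the tables inside <div id="specs-list">. However, my code only picks up the first six tables. This is the page, and this is my code: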

def getPhoneStats(url):
    urls={}
    try:
        request= requests.get(url)
        if request.status_code == 200:
            sourceCode = BeautifulSoup(request.content,"html.parser")
            tables = sourceCode.select('#specs-list table')
            for table in tables:
                tag = table.find('tr')
                print(tag.get_text())
        else:
            print('no table or row found ')
    except requests.HTTPError as e:
        print('Unable to open url',e)

It only prints up to the sixth table in the div:

Network
Technology
GSM / HSPA / LTE


Launch
Announced
2015, March


Body
Dimensions
152.6 x 76.2 x 8 mm (6.01 x 3.00 x 0.31 in)


Display
Type
IPS capacitive touchscreen, 16M colors


Platform
OS
Android OS, v5.0.2 (Lollipop), upgradable to v6.0 (Marshmallow)


Memory
Card slot
microSD, up to 32 GB (dedicated slot)

Process finished with exit code 0

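2 answers:

Answer 0 (score: 3)

The HTML is malformed. The "Memory" table has too many closing </td> and </tr> tags at the end, and I think that confuses the parser. Skipping the div and selecting the tables directly, I had better luck: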

['Network', 'Technology', 'GSM / HSPA / LTE']
['Network', '2G bands', 'GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2']
['Network', '\xa0', 'GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2 - India']
['Network', '3G bands', 'HSDPA 850 / 900 / 1900 / 2100 ']
['Network', '\xa0', 'HSDPA 2100 - India']
['Network', '4G bands', 'LTE band 1(2100), 3(1800), 7(2600), 38(2600), 39(1900), 40(2300), 41(2500)']
['Network', 'Speed', 'HSPA, TD-SCDMA, LTE, TD-LTE']
['Network', 'GPRS', 'Yes']
['Network', 'EDGE', 'Yes']
['Launch', 'Announced', '2015, March']
['Launch', 'Status', 'Available. Released 2015, March']
['Body', 'Dimensions', '152.6 x 76.2 x 8 mm (6.01 x 3.00 x 0.31 in)']
['Body', 'Weight', '150 g (5.29 oz)']
['Body', 'SIM', 'Dual SIM (Micro-SIM, dual stand-by)']
['Display', 'Type', 'IPS capacitive touchscreen, 16M colors']
['Display', 'Size', '5.5 inches (~71.7% screen-to-body ratio)']
['Display', 'Resolution', '1080 x 1920 pixels (~401 ppi pixel density)']
['Display', 'Multitouch', 'Yes, up to 5 fingers']
['Display', '\xa0', '- Lenovo Vibe 2.0']
['Platform', 'OS', 'Android OS, v5.0.2 (Lollipop), upgradable to v6.0 (Marshmallow)']
['Platform', 'Chipset', 'Mediatek MT6752']
['Platform', 'CPU', 'Octa-core 1.7 GHz Cortex-A53']
['Platform', 'GPU', 'Mali-T760MP2']
['Memory', 'Card slot', 'microSD, up to 32 GB (dedicated slot)']
['Memory', 'Internal', '16 GB, 2 GB RAM']
['Camera', 'Primary', '13 MP, f/2.0, autofocus, dual-LED flash, check quality']
['Camera', 'Features', 'Geo-tagging, touch focus, face detection, HDR, panorama']
['Camera', 'Video', '1080p@30fps, check quality']
['Camera', 'Secondary', '5 MP, f/2.4']
['Sound', 'Alert types', 'Vibration; MP3, WAV ringtones']
['Sound', 'Loudspeaker ', 'Yes']
['Sound', '3.5mm jack ', 'Yes']
['Sound', '\xa0', '- Dolby Atmos']
['Comms', 'WLAN', 'Wi-Fi 802.11 b/g/n, hotspot']
['Comms', 'Bluetooth', 'v4.1, A2DP, LE']
['Comms', 'GPS', 'Yes, with A-GPS, GLONASS']
['Comms', 'Radio', 'FM radio']
['Comms', 'USB', 'microUSB v2.0, USB Host']
['Features', 'Sensors', 'Accelerometer, gyro, proximity, compass']
['Features', 'Messaging', 'SMS(threaded view), MMS, Email, Push Mail, IM']
['Features', 'Browser', 'HTML5']
['Features', 'Java', 'No']
['Features', '\xa0', '- Active noise cancellation with dedicated mic\r\n- MP4/H.264 player\r\n- MP3/WAV/eAAC+/FLAC player\r\n- Photo/video editor\r\n- Document viewer']
['Battery', '\xa0', 'Removable Li-Ion 3000 mAh battery']
['Battery', 'Stand-by', 'Up to 750 h (3G)']
['Battery', 'Talk time', 'Up to 36 h (3G)']
['Misc', 'Colors', 'Onyx Black, Pearl White, Laser Yellow']
['Misc', 'Price group', '3/10 (About 150 EUR)']
['Tests', 'Performance', '\nBasemark OS II: 1053 / Basemark OS II 2.0: 984Basemark X: 5656']
['Tests', 'Display', '\nContrast ratio: 1793:1 (nominal)']
['Tests', 'Camera', '\nPhoto / Video']
['Tests', 'Loudspeaker', '\nVoice 65dB / Noise 66dB / Ring 76dB\n']
['Tests', 'Battery life', '\n\nEndurance rating 53h\n\n']
['Tests']

The rows listed above are the result.

Next time, please post code that I can run (like my example).
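A minimal sketch of the approach described in this answer: select every table directly rather than going through #specs-list, and carry each table's category down to its rows. This is not the answerer's original code. The function name spec_rows, the sample markup, and the assumption that the category sits in a <th> on the first row of each table are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed sample in the same shape as the spec tables.
SAMPLE = """
<table>
  <tr><th>Memory</th><td>Card slot</td><td>microSD</td></tr>
  <tr><td>Internal</td><td>16 GB</td></tr>
</table>
<table>
  <tr><th>Sound</th><td>Loudspeaker</td><td>Yes</td></tr>
</table>
"""


def spec_rows(html, parser="html.parser"):
    """Return one [category, cell, cell, ...] list per table row."""
    soup = BeautifulSoup(html, parser)
    rows = []
    # Select tables anywhere in the document, not only under #specs-list.
    for table in soup.find_all("table"):
        category = None
        for tr in table.find_all("tr"):
            th = tr.find("th")
            if th is not None:
                # Assumption: a <th> starts a new category for the rows below it.
                category = th.get_text(strip=True)
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            rows.append([category] + cells)
    return rows


for row in spec_rows(SAMPLE):
    print(row)
```

On a real page the HTML string would come from requests.get(url).content. Nested tables, if a page has any, would be visited more than once by this sketch.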

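Answer 1 (score: 2)

This is a problem with the HTML parser. I prefer html5lib, but it is slower, so if speed matters, one of the C-based parsers may be a better choice (read more here).

I just changed sourceCode = BeautifulSoup(request.content,"html.parser") to sourceCode = BeautifulSoup(request.content,"html5lib") and it worked (full updated code below).

Also, I'm not sure whether you noticed, but with the line tag = table.find('tr') you only get the first row of each table group. If you want the whole table, just use print(table.get_text()) in the for loop.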
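To check how the parsers installed on a machine treat the same markup, one option is to count the tables each of them finds under #specs-list. The sketch below is only illustrative: the helper name count_spec_tables and the sample snippet with stray closing tags are hypothetical, and the counts depend on the parser and on the Beautiful Soup version, so no expected output is shown.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical snippet: stray </td></tr> tags follow the first table.
SAMPLE = """
<div id="specs-list">
  <table><tr><th>Memory</th><td>Internal</td><td>16 GB</td></tr></table>
  </td></tr>
  <table><tr><th>Camera</th><td>Primary</td><td>13 MP</td></tr></table>
</div>
"""


def count_spec_tables(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the number of tables found under #specs-list.

    A parser that is not installed maps to None.
    """
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # Raised by Beautiful Soup when the requested parser is missing.
            counts[name] = None
            continue
        counts[name] = len(soup.select("#specs-list table"))
    return counts


print(count_spec_tables(SAMPLE))
```

If one parser reports fewer tables than another for the same page, the markup is probably malformed at the point where the smaller count stops.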

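from bs4 import BeautifulSoup
import requests  # html5lib must be installed; BeautifulSoup loads it by name


def getPhoneStats(url):
    try:
        request = requests.get(url)
        if request.status_code == 200:
            # html5lib repairs the stray closing tags that tripped up html.parser
            sourceCode = BeautifulSoup(request.content, 'html5lib')
            tables = sourceCode.select('#specs-list table')
            for table in tables:
                # table.find('tr') would return only the first row of each table
                print(table.get_text())
        else:
            print('no table or row found ')
    except requests.RequestException as e:
        # requests.get() raises RequestException subclasses on connection errors;
        # HTTPError is raised only when raise_for_status() is called
        print('Unable to open url', e)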
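The difference between find('tr') and find_all('tr') noted in this answer can be seen on a small example. The table below is a hypothetical two-row sample, and row_text is a helper introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table in the same shape as the spec tables.
HTML = """
<table>
  <tr><th>Memory</th><td>Card slot</td><td>microSD</td></tr>
  <tr><td>Internal</td><td>16 GB</td></tr>
</table>
"""


def row_text(tr):
    """Return the stripped text of every header and data cell in a row."""
    return [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]


table = BeautifulSoup(HTML, "html.parser").find("table")

# find() stops at the first match, so only the first row is returned.
first_row = row_text(table.find("tr"))

# find_all() returns every match, so all rows are covered.
all_rows = [row_text(tr) for tr in table.find_all("tr")]

print(first_row)  # ['Memory', 'Card slot', 'microSD']
print(all_rows)   # [['Memory', 'Card slot', 'microSD'], ['Internal', '16 GB']]
```

Looping over find_all('tr') keeps the cells separate, whereas table.get_text() joins the whole table into one string.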