Incomplete table parsing with BeautifulSoup in Python

Posted: 2017-02-08 17:30:50

Tags: python-2.7 xml-parsing beautifulsoup

I am trying to scrape a web page using BeautifulSoup and Python 2.7.

The request works fine, but the parsing is incomplete: it seems to stop at around 1668 table cells, regardless of the table's actual length.

Here is the code:

import requests
from bs4 import BeautifulSoup

url = 'http://fse.vdkruijssen.eu/ferrylist.php'

params = {'selectplane': 'Cessna 208 Caravan', 'submit': ''}
response = requests.post(url, data=params)

soup = BeautifulSoup(response.text, "lxml")
table = soup.find(id="ferryplane")
for tr in table.find_all('tr', class_=True):  # only rows with a class attribute contain data
    row = [cell.text for cell in tr.find_all('td')]
    print(row)

How can I retrieve all the cells?

I am new to web scraping, so any help would be greatly appreciated.

Thanks!

EDIT: Apparently the code itself is fine. As shown in the screenshot, I am still getting a truncated response (the last row is cut off). If you have any idea what could be causing this, please let me know!

(screenshot of the truncated output)
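Before blaming the parser, it helps to check whether the rows are even present in the raw HTTP response. Below is a minimal diagnostic sketch, using a stand-in HTML string rather than the real page's markup: count occurrences of `<tr` in the raw text and compare with what a parser actually recovers. (It uses the standard-library parser so it runs without third-party packages; with BeautifulSoup you would compare against `len(table.find_all('tr'))` instead.)

```python
from html.parser import HTMLParser  # on Python 2.7: from HTMLParser import HTMLParser

# Stand-in for response.text; the real page's markup will differ.
sample_html = (
    "<table id='ferryplane'>"
    + "".join("<tr class='data'><td>cell %d</td></tr>" % i for i in range(2000))
    + "</table>"
)

# 1) Row count in the raw text. If this is already lower than expected,
#    the HTTP response itself is truncated, not the parsing.
raw_count = sample_html.count("<tr")

# 2) Row count recovered by an actual parser.
class RowCounter(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1

counter = RowCounter()
counter.feed(sample_html)

print(raw_count, counter.rows)  # equal counts mean the parse is complete
```

On the real page you would substitute `response.text` for `sample_html`. If the raw count matches the expected table size but the parsed count is lower, the parser backend is dropping content.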

1 Answer:

Answer 0 (score: 0)

import requests
from bs4 import BeautifulSoup

url = 'http://fse.vdkruijssen.eu/ferrylist.php'

params = {'selectplane': 'Cessna 208 Caravan', 'submit': ''}
response = requests.post(url, data=params)

soup = BeautifulSoup(response.text, "lxml")
table = soup.find(id="ferryplane")
for tr in table.find_all('tr', class_=True):  # only rows with a class attribute contain data
    row = [cell.text for cell in tr.find_all('td')]
    print(row)

Out:

['HB-TCK', 'Badenflug (carbonex)', 'LSZS', 'LSMU', '67', '1000', '670', '348', '419']
['RPC-3255', 'Bank of FSE', 'WAMR', 'RPLV', '910', '110', '1001', '-3374', '-2405']
['I-FGTY', 'Bank of FSE', 'LGEL', 'LIBN', '284', '110', '312', '-1428', '-925']
['ZT-YMC', 'Bank of FSE', 'FLEB', 'FAUT', '1230', '110', '1353', '-4560', '-3251']
['CS-PRB', 'PRA Rentals (Matt74)', 'LZKZ', 'EDDK', '561', '175', '982', '-1908', '-1180']
['ZU-YTU', 'Bank of FSE', 'FABE', 'FAJS', '409', '110', '450', '-2008', '-1300']
['ZS-FXN', 'cckohrs', 'FYML', 'FALA', '548', '200', '1096', '-2668', '-1377']
['HL-7227', 'Bank of FSE', 'RJOB', 'RKSO', '360', '110', '396', '-1483', '-971']

I have confirmed that no rows are missing: (screenshot of the full output)
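If the raw response does contain all the rows, the usual suspect is the parser backend: lxml can silently stop at badly malformed markup, while html5lib and the built-in "html.parser" are more forgiving. A sketch of comparing backends on a deliberately malformed snippet follows; the snippet is made up for illustration, and on the real page you would feed each backend `response.text` instead.

```python
from bs4 import BeautifulSoup

# Made-up markup with unclosed <tr>/<td> tags, standing in for a messy page.
malformed = (
    "<table id='ferryplane'>"
    "<tr class='data'><td>row 1"
    "<tr class='data'><td>row 2"
    "</table>"
)

results = {}
for backend in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(malformed, backend)
        results[backend] = len(soup.find_all("tr"))
    except Exception:
        results[backend] = None  # backend not installed

print(results)
```

If one backend reports fewer rows than the others on the real page, switching the second argument of `BeautifulSoup(...)` to a more lenient parser (html5lib is the most tolerant, at the cost of speed) is the fix.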