bs4 returns [] when scraping data

Time: 2020-06-20 10:06:58

Tags: python selenium web-scraping beautifulsoup scrapy

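I am trying to scrape data from a website, but so far without success. I have tried several approaches, and the one below is the most promising. I want to get the yearBuild value from the page. Can anyone help? Any pointers would be greatly appreciated.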

import bs4 as bs
from selenium import webdriver

wd = webdriver.Chrome()
url = "https://www.marinetraffic.com/en/ais/details/ships/mmsi:255805792"
wd.get(url)
html_source = wd.page_source  # page source at the moment get() returns
wd.quit()

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = bs.BeautifulSoup(html_source, "html.parser")
elems = soup.select('#yearBuild > b')
print(elems)
print(soup.prettify())

Here elems comes back as an empty list.
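
An empty list from soup.select() means that no element in the parsed HTML matched the selector. The sketch below illustrates that with two made-up HTML snippets; the markup is an assumption for illustration, not the real MarineTraffic page. If the value is filled in by JavaScript after the page source was captured, the first case is what BeautifulSoup would see.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: what the HTML might look like before and after
# the page's JavaScript has filled in the vessel details.
before_render = '<div id="app"></div>'
after_render = '<div id="app"><span id="yearBuild">Year Built: <b>2008</b></span></div>'


def year_build(html):
    """Return the text of every <b> that is a direct child of #yearBuild."""
    soup = BeautifulSoup(html, "html.parser")
    return [b.get_text(strip=True) for b in soup.select("#yearBuild > b")]


print(year_build(before_render))  # []
print(year_build(after_render))   # ['2008']
```

Checking whether the string "yearBuild" appears at all in html_source is a quick way to tell which of the two cases applies.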

2 Answers:

Answer 0: (score: 1)

You can use their API to get the information about the ship.

For example:

import re
import json
import requests


url = 'https://www.marinetraffic.com/en/ais/details/ships/mmsi:255805792'

ship_info_url = 'https://www.marinetraffic.com/en/vesselDetails/vesselInfo/shipid:{ship_id}'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

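# r.url is the final URL after any redirects; the ship id is read from it
r = requests.get(url, headers=headers)
ship_id = re.search(r'shipid:(\d+)', r.url)[1]
# Fetch the vessel details for that ship id as JSON
data = requests.get(ship_info_url.format(ship_id=ship_id), headers=headers).json()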

print(json.dumps(data, indent=4))
print('Year Built = ', data['yearBuilt'])

Prints:

{
    "name": "LAILA",
    "nameAis": "LAILA",
    "imo": 9377559,
    "eni": null,
    "mmsi": 255805792,
    "callsign": "CQDP",
    "country": "Portugal",
    "countryCode": "PT",
    "type": "Cargo - Hazard A (Major)",
    "typeSpecific": "Container Ship",
    "typeColor": "7",
    "grossTonnage": 28048,
    "deadweight": 38080,
    "teu": 2700,
    "liquidGas": null,
    "length": 215.5,
    "breadth": 29.87,
    "yearBuilt": 2008,
    "status": "Active",
    "isNavigationalAid": false,
    "correspondingRoamingStationId": null,
    "homePort": null
}
Year Built =  2008
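
Note that re.search returns None when the pattern is not found, so indexing the result with [1] raises a TypeError if the final URL has no shipid part. A minimal sketch of a guarded version follows; the helper name extract_ship_id and both sample URLs (including the ship id 123456) are made up for illustration.

```python
import re
from typing import Optional

SHIP_ID_PATTERN = re.compile(r"shipid:(\d+)")


def extract_ship_id(url: str) -> Optional[str]:
    """Return the digits following 'shipid:' in url, or None if absent."""
    match = SHIP_ID_PATTERN.search(url)
    return match.group(1) if match else None


# Hypothetical URLs for illustration only; the ship id 123456 is made up.
with_id = "https://www.marinetraffic.com/en/ais/details/ships/shipid:123456/mmsi:255805792"
without_id = "https://www.marinetraffic.com/en/ais/details/ships/mmsi:255805792"

print(extract_ship_id(with_id))     # 123456
print(extract_ship_id(without_id))  # None
```

With a helper like this, the script can report a clear error when the id is missing instead of failing on the subscript.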

Answer 1: (score: 0)

May I suggest using VesselFinder instead of MarineTraffic? The data is the same, but MarineTraffic is hard to scrape because it is all JavaScript, whereas VesselFinder can be scraped with just BeautifulSoup.

VesselFinder also displays the data in tables, so it is easy to parse with pandas.

Here is the code:

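from io import StringIO

import pandas as pd
import requests

r = requests.get('https://www.vesselfinder.com/vessels/LAILA-IMO-9377559-MMSI-255805792', headers={'User-Agent': 'iPhone'})

# read_html returns a list with one DataFrame per <table> on the page
df = pd.read_html(StringIO(r.text))
# Tables 2 and 3 hold label/value rows; merge them into a {label: value} dict
ship = pd.concat([df[2], df[3]], ignore_index=True).set_index(0).to_dict()[1]

print(ship['Year of Built'])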
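
To see what the concat / set_index / to_dict chain does without hitting the network, here is a small sketch on two hand-built DataFrames. They are stand-ins for the label/value tables that pd.read_html would return; the row labels "Vessel Name" and "Flag" are assumptions for illustration, not taken from the VesselFinder page.

```python
import pandas as pd

# Made-up stand-ins for two of the tables pd.read_html would return:
# column 0 holds the label, column 1 holds the value.
table_a = pd.DataFrame([["Vessel Name", "LAILA"], ["Flag", "Portugal"]])
table_b = pd.DataFrame([["Year of Built", "2008"]])

# Stack the tables, index the rows by the label column, then take the
# {label: value} mapping for column 1.
ship = pd.concat([table_a, table_b], ignore_index=True).set_index(0).to_dict()[1]

print(ship["Year of Built"])  # 2008
print(ship)
```

Because to_dict() is keyed by column first, the trailing [1] selects the mapping for the value column.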