https://shipandbunker.com/prices/emea/nwe/nl-rtm-rotterdam#_IFO380
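I want to scrape the Rotterdam IFO380 price from the dynamic chart on the site above, for example the price for 4 July 2019 ($380.50).
I'm not sure whether BeautifulSoup is the best approach if I also want to store the data in a local database.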
from bs4 import BeautifulSoup
import requests
import pymongo
# URL of page to be scraped
url = 'https://shipandbunker.com/prices/emea/nwe/nl-rtm-rotterdam#IFO380'
# Retrieve page with the requests module
response = requests.get(url)
# Create BeautifulSoup object; parse with 'lxml'
soup = BeautifulSoup(response.text, 'lxml')
Answer 0 (score: 3)
You can use Scrapy.
Here is a simple spider that scrapes the page:
URL: https://shipandbunker.com/prices/emea/nwe/nl-rtm-rotterdam#_IFO380
import scrapy
class ShippingSpider(scrapy.Spider):
    name = 'shipping_spider'
    start_urls = [
        'https://shipandbunker.com/prices/emea/nwe/nl-rtm-rotterdam#_IFO380',
    ]

    def parse(self, response):
        xpath = '//*[@id="block_284"]/div/div/div/table/tbody/tr[2]/td[1]/text()'
        rotterdam = response.xpath(xpath).extract()
        print(rotterdam)
        # output: ['315.00']
Setting up and running a spider is straightforward; see the Scrapy documentation.
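Answer 1 (score: 0)
This page uses JavaScript to fetch all the data and draw the graph.
The JavaScript sends a POST request to https://shipandbunker.com/a/.json and receives the data as JSON, which converts easily into a Python dictionary, so BeautifulSoup is not needed to scrape the HTML.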
import requests
import datetime
day = datetime.date(2019, 7, 4)
payload = {
'api-method': 'pricesForAllSeriesGet',
'resource': 'MarketPriceGraph_Block',
'mc0': 'NL RTM',
'mc1': 'AV G20',
}
url = 'https://shipandbunker.com/a/.json'
r = requests.post(url, data=payload)
#print(r.content)
data = r.json()
for number, value in data['api']['NL RTM']['data']['prices']['IFO380']['dayprice']:
    # convert day number to date object
    timestamp = data['api']['NL RTM']['data']['day_list']['IFO380'][str(number)]
    date = datetime.date.fromtimestamp(timestamp/1000)
    if date == day:
        print(day, value)
        break
It prints:
2019-07-04 380.5
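The lookup above can also be wrapped in a small function so it can be tried without hitting the site. This is only a sketch: `find_price` and the `sample` dictionary are hypothetical, the sample values are made up to mirror the response shape used above, and the timestamps are assumed to be UTC milliseconds (converting in UTC avoids a result that depends on the local timezone).

```python
from datetime import date, datetime, timezone


def find_price(data, port, grade, day):
    """Return the price for `day`, or None if that day is not in the series."""
    series = data['api'][port]['data']
    day_list = series['day_list'][grade]
    for number, value in series['prices'][grade]['dayprice']:
        # day_list maps the day number (as a string) to a millisecond timestamp
        timestamp_ms = day_list[str(number)]
        # assumption: timestamps are UTC, so convert in UTC rather than local time
        found = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).date()
        if found == day:
            return value
    return None


# Made-up sample in the same shape as the response above (illustrative values only)
sample = {
    'api': {
        'NL RTM': {
            'data': {
                'day_list': {'IFO380': {'0': 1562112000000, '1': 1562198400000}},
                'prices': {'IFO380': {'dayprice': [[0, 379.0], [1, 380.5]]}},
            }
        }
    }
}

print(find_price(sample, 'NL RTM', 'IFO380', date(2019, 7, 4)))  # 380.5
```

Returning `None` for a missing day, rather than raising, keeps the caller free to decide how to handle gaps such as weekends.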
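Answer 2 (score: 0)
Here is a way to search for a specific date in the JSON that is returned dynamically by the API call visible in the browser's network tab. The implementation differs slightly from the other answer.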
import requests
from datetime import datetime, date
import calendar
def get_timestamp(date_var):
    return calendar.timegm(date_var.timetuple()) * 1000
data = [
('api-method', 'pricesForAllSeriesGet'),
('resource', 'MarketPriceGraph_Block'),
('mc0', 'NL RTM'),
('mc1', 'AV G20')]
r = requests.post('https://shipandbunker.com/a/.json', data=data).json()
date_var = get_timestamp(date(2019, 7, 4))
d = r['api']['NL RTM']['data']['day_list']['IFO380']
keys = list(d.keys())
prices = r['api']['NL RTM']['data']['prices']['IFO380']['dayprice']
found = [i for i in range(len(d.keys())) if d[keys[i]] == date_var][0]
print(prices[found][1])
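Since the question also asks about storing the data locally, here is a minimal sketch of one way to persist the (date, price) pairs once they have been extracted. It uses the standard-library `sqlite3` module rather than the `pymongo` import from the question, purely to stay self-contained; the `save_prices` helper and the `bunker_prices` table and its columns are hypothetical names.

```python
import sqlite3
from datetime import date


def save_prices(conn, port, grade, rows):
    """Insert or update (day, price) rows; `rows` is an iterable of (date, float)."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS bunker_prices ('
        'port TEXT, grade TEXT, day TEXT, price REAL, '
        'PRIMARY KEY (port, grade, day))'
    )
    conn.executemany(
        'INSERT OR REPLACE INTO bunker_prices VALUES (?, ?, ?, ?)',
        [(port, grade, day.isoformat(), price) for day, price in rows],
    )
    conn.commit()


conn = sqlite3.connect(':memory:')  # pass a file path instead to persist the data
save_prices(conn, 'NL RTM', 'IFO380', [(date(2019, 7, 4), 380.5)])
# saving the same day again replaces the row instead of duplicating it
save_prices(conn, 'NL RTM', 'IFO380', [(date(2019, 7, 4), 380.5)])
print(conn.execute('SELECT day, price FROM bunker_prices').fetchall())
# [('2019-07-04', 380.5)]
```

The composite primary key makes repeated scraping runs idempotent, which matters if the script is scheduled daily.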