I am trying to scrape all departures and arrivals for a single day, from every airport in every country, using Python and Scrapy.
The JSON database used by this well-known site (Flightradar24) has to be queried page by page for departures or arrivals, 100 entries per page and per airport. I also compute a timestamp based on the actual UTC day of the query.
I am trying to build the database with this hierarchy:
country 1
 - airport 1
    - departures
       - page 1
       - page ...
    - arrivals
       - page 1
       - page ...
 - airport 2
    - departures
       - page 1
       - page ...
    - arrivals
       - page 1
       - page ...
...
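For illustration, here is a rough sketch (my assumption; the names and the flight dicts are placeholders, not the real API records) of what one country could look like once everything is collected and serialized:

country = {
    "name": "Israel",
    "airports": [
        {
            "name": "airport 1",
            "departures": [  # one inner list per API page
                [{"flight": "..."}, {"flight": "..."}],
            ],
            "arrivals": [
                [{"flight": "..."}, {"flight": "..."}],
            ],
        },
        # airport 2, ...
    ],
}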
I use two methods to compute the timestamp and to build the URL query for each page:
def compute_timestamp(self):
    from datetime import date
    import calendar
    # +/- 24 hours
    d = date(2017, 4, 27)
    timestamp = calendar.timegm(d.timetuple())
    return timestamp

def build_api_call(self, code, page, timestamp):
    return 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]={timestamp}&page={page}&limit=100&token='.format(
        code=code, page=page, timestamp=timestamp)
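The snippet above hardcodes April 27, 2017; since the question says the timestamp is computed from the actual UTC day of the query, a dynamic version could look like the sketch below (whether the API expects midnight UTC of the current day is my assumption):

import calendar
from datetime import datetime, timezone

def compute_timestamp():
    # Midnight UTC of the current day as a Unix timestamp
    # (assumed to be what the schedule API expects).
    today = datetime.now(timezone.utc).date()
    return calendar.timegm(today.timetuple())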
I store the results into a CountryItem, which contains many AirportItem. My items.py is:
class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    num_airports = scrapy.Field()
    airports = scrapy.Field()
    other_url = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    departures = scrapy.Field()
    arrivals = scrapy.Field()
My main parse builds a country item for every country (limited to Israel here, as an example). Next, I yield one scrapy.Request per country to scrape its airports.
###################################
# MAIN PARSE
###################################
def parse(self, response):
    count_country = 0
    countries = []
    for country in response.xpath('//a[@data-country]'):
        item = CountryItem()
        url = country.xpath('./@href').extract()
        name = country.xpath('./@title').extract()
        item['link'] = url[0]
        item['name'] = name[0]
        item['airports'] = []
        count_country += 1
        if name[0] == "Israel":
            countries.append(item)
            self.logger.info("Country name : %s with link %s", item['name'], item['link'])
            yield scrapy.Request(url[0], meta={'my_country_item': item}, callback=self.parse_airports)
This method extracts the information for each airport, and also yields a scrapy.Request per airport, using the airport URL, to scrape departures and arrivals:
###################################
# PARSE EACH AIRPORT
###################################
def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()

        iAirport = AirportItem()
        iAirport['name'] = self.clean_html(name)
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]
        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = self.build_api_call(airport['code_little'], 1, self.compute_timestamp())
        urls.append(json_url)
    if not urls:
        return item

    # start with first url
    next_url = urls.pop()
    return scrapy.Request(next_url, self.parse_schedule,
                          meta={'airport_item': item, 'airport_urls': urls, 'i': 0})
With the recursive method parse_schedule I add each airport to the country item. On this point, SO members have already helped me.
###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(self, response):
    """we want to loop this continuously to build every departure and arrivals requests"""
    item = response.meta['airport_item']
    i = response.meta['i']
    urls = response.meta['airport_urls']

    urls_departures, urls_arrivals = self.compute_urls_by_page(
        response, item['airports'][i]['name'], item['airports'][i]['code_little'])
    print("urls_departures = ", len(urls_departures))
    print("urls_arrivals = ", len(urls_arrivals))

    ## YIELD NOT CALLED
    yield scrapy.Request(response.url, self.parse_departures_page,
                         meta={'airport_item': item, 'page_urls': urls_departures, 'i': 0, 'p': 0},
                         dont_filter=True)

    # now do next schedule items
    if not urls:
        yield item
        return
    url = urls.pop()
    yield scrapy.Request(url, self.parse_schedule,
                         meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})
The method self.compute_urls_by_page computes the correct URLs for retrieving all departures and arrivals of a single airport.
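That method is not shown in the question; a plausible sketch of it, assuming it reads the 'total' page counts out of the JSON response and reuses build_api_call (the JSON path matches the one used elsewhere in this code), would be:

import json

def compute_urls_by_page(self, response, name, code):
    # Sketch: derive one URL per departures/arrivals page from the
    # page totals reported in the JSON response (assumed layout).
    data = json.loads(response.body_as_unicode())
    schedule = data['result']['response']['airport']['pluginData']['schedule']
    ts = self.compute_timestamp()
    total_dep = schedule['departures']['page']['total']
    total_arr = schedule['arrivals']['page']['total']
    urls_departures = [self.build_api_call(code, p, ts) for p in range(1, total_dep + 1)]
    urls_arrivals = [self.build_api_call(code, p, ts) for p in range(1, total_arr + 1)]
    return urls_departures, urls_arrivals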
###################################
# PARSE EACH DEPARTURES / ARRIVALS
###################################
def parse_departures_page(self, response):
    item = response.meta['airport_item']
    p = response.meta['p']
    i = response.meta['i']
    page_urls = response.meta['page_urls']
    print("PAGE URL = ", page_urls)

    if not page_urls:
        yield item
        return
    page_url = page_urls.pop()

    print("GET PAGE FOR ", item['airports'][i]['name'], ">> ", p)
    jsonload = json.loads(response.body_as_unicode())
    json_expression = jmespath.compile("result.response.airport.pluginData.schedule.departures.data")
    item['airports'][i]['departures'] = json_expression.search(jsonload)

    yield scrapy.Request(page_url, self.parse_departures_page,
                         meta={'airport_item': item, 'page_urls': page_urls, 'i': i, 'p': p + 1})
Next, the first yield in parse_schedule, which normally calls the recursive method self.parse_departures_page, produces strange results. Scrapy does call this method, but I only collect the departures pages of one airport and I don't understand why... There is probably an ordering error in my requests or in my yield source code, so maybe you can help me find it.
The complete code is on GitHub: https://github.com/IDEES-Rouen/Flight-Scrapping/tree/master/flight/flight_project
You can run it with the command scrapy crawl airports.
UPDATE 1:
I tried to answer the question on my own using yield from, without success, as you can see at the bottom of the answer below... so if you have an idea?
Answer 0 (score: 8)
Yes, I finally found the answer here on SO.
When you use a recursive yield, you need to use yield from. Here is a simplified example:
airport_list = ["airport1", "airport2", "airport3", "airport4"]

def parse_page_departure(airport, next_url, page_urls):
    print(airport, " / ", next_url)
    if not page_urls:
        return
    next_url = page_urls.pop()
    yield from parse_page_departure(airport, next_url, page_urls)

###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(next_airport, airport_list):
    ## GET EACH DEPARTURE PAGE
    departures_list = ["p1", "p2", "p3", "p4"]
    next_departure_url = departures_list.pop()
    yield parse_page_departure(next_airport, next_departure_url, departures_list)

    if not airport_list:
        print("no new airport")
        return
    next_airport_url = airport_list.pop()
    yield from parse_schedule(next_airport_url, airport_list)

next_airport_url = airport_list.pop()
result = parse_schedule(next_airport_url, airport_list)
for i in result:
    print(i)
    for d in i:
        print(d)
UPDATE, doesn't work with the real program:
I tried to reproduce the same yield from pattern with the real program here, but using it on scrapy.Request I get an error that I don't understand... Here is the Python traceback:
Answer 1 (score: 4)
Comment: ... not entirely clear ... you call AirportData(response, 1) ... and there is also a small bug here: self.pprint(schedule)

I used class AirportData for the implementation (limited to 2 pages and 2 flights).
I have updated my code: class AirportData is removed and class Page added. All dependencies should be satisfied now.

It's not a bug: self.pprint(... is an AirportsSpider method used for pretty-printing an object, like the output shown at the end. I have enhanced class Schedule to show basic usage.
Comment: What is AirportData in your answer?
EDIT: class AirportData has been removed.
As noted at # ENDPOINT below, a Page object holds the flight data, split into page.arrivals and page.departures.
(Limited to 2 pages and 2 flights)

Page = [Flight 1, Flight 2, ... Flight n]
schedule.airport['arrivals'] == [Page 1, Page 2, ..., Page n]
schedule.airport['departures'] == [Page 1, Page 2, ..., Page n]
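Since Page implements __iter__ (see the class below), reading the collected flights back out is a plain nested loop; a small usage sketch over the finished schedule object:

# Sketch: walking the finished Schedule object from # ENDPOINT.
for page in schedule.airport['departures']:
    print(page.header)
    for flight in page:  # Page.__iter__ yields one flight dict at a time
        print(flight['airline']['name'])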
Comment: ... we have multiple pages, each containing multiple departures/arrivals.

Yes, at the time of my first answer I didn't get any api json response.
Now I do receive the api json response, but it doesn't reflect the given timestamp: it returns data from the current date.
The api params look unusual, is there a link to a description?
Nevertheless, consider this simplified approach:
# Page object holding one page of arrivals/departures flight data
class Page(object):
    def __init__(self, title, schedule):
        # schedule includes ['arrivals'] or ['departures']
        self.current = schedule['page']['current']
        self.total = schedule['page']['total']
        self.header = '{}:page:{} item:{}'.format(title, schedule['page'], schedule['item'])
        self.flight = []
        for data in schedule['data']:
            self.flight.append(data['flight'])

    def __iter__(self):
        yield from self.flight
# Schedule object holding all pages of one airport
class Schedule(object):
    def __init__(self):
        self.country = None
        self.airport = None

    def __str__(self):
        arrivals = self.airport['arrivals'][0]
        departures = self.airport['departures'][0]
        return '{}\n\t{}\n\t\t{}\n\t\t\t{}\n\t\t{}\n\t\t\t{}'. \
            format(self.country['name'],
                   self.airport['name'],
                   arrivals.header,
                   arrivals.flight[0]['airline']['name'],
                   departures.header,
                   departures.flight[0]['airline']['name'], )
# PARSE EACH AIRPORT OF COUNTRY
def parse_schedule(self, response):
    meta = response.meta
    if 'airport' in meta:
        # First call from parse_airports
        schedule = Schedule()
        schedule.country = response.meta['country']
        schedule.airport = response.meta['airport']
    else:
        schedule = response.meta['schedule']

    data = json.loads(response.body_as_unicode())
    airport = data['result']['response']['airport']
    schedule.airport['arrivals'].append(Page('Arrivals', airport['pluginData']['schedule']['arrivals']))
    schedule.airport['departures'].append(Page('Departures', airport['pluginData']['schedule']['departures']))

    page = schedule.airport['departures'][-1]
    if page.current < page.total:
        json_url = self.build_api_call(schedule.airport['code_little'], page.current + 1, self.compute_timestamp())
        yield scrapy.Request(json_url, meta={'schedule': schedule}, callback=self.parse_schedule)
    else:
        # ENDPOINT Schedule object holding one Airport.
        # schedule.airport['arrivals'] and schedule.airport['departures'] ==
        #   List of Page with List of Flight Data
        print(schedule)
# PARSE EACH AIRPORT
def parse_airports(self, response):
    country = response.meta['country']
    for airport in response.xpath('//a[@data-iata]'):
        name = ''.join(airport.xpath('./text()').extract()[0]).strip()
        if 'Charles' in name:
            meta = response.meta
            meta['airport'] = AirportItem()
            meta['airport']['name'] = name
            meta['airport']['link'] = airport.xpath('./@href').extract()[0]
            meta['airport']['lat'] = airport.xpath("./@data-lat").extract()[0]
            meta['airport']['lon'] = airport.xpath("./@data-lon").extract()[0]
            meta['airport']['code_little'] = airport.xpath('./@data-iata').extract()[0]
            meta['airport']['code_total'] = airport.xpath('./small/text()').extract()[0]
            # initialize the page lists so parse_schedule can append Page objects
            meta['airport']['arrivals'] = []
            meta['airport']['departures'] = []

            json_url = self.build_api_call(meta['airport']['code_little'], 1, self.compute_timestamp())
            yield scrapy.Request(json_url, meta=meta, callback=self.parse_schedule)
# MAIN PARSE
Note: response.xpath('//a[@data-country]') returns all countries twice!
def parse(self, response):
    for a_country in response.xpath('//a[@data-country]'):
        name = a_country.xpath('./@title').extract()[0]
        if name == "France":
            country = CountryItem()
            country['name'] = name
            country['link'] = a_country.xpath('./@href').extract()[0]

            yield scrapy.Request(country['link'],
                                 meta={'country': country},
                                 callback=self.parse_airports)
Output (shortened to 2 pages and 2 flights per page):
France
    Paris Charles de Gaulle Airport
        Departures:(page=(1, 1, 7)) 2017-07-02 21:28:00
            page:{'current': 1, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 696}
            21:30 PM AF1558 Newcastle Airport (NCL) Air France ARJ Estimated dep 21:30
            21:30 PM VY8833 Seville San Pablo Airport (SVQ) Vueling 320 Estimated dep 21:30
            ... (omitted for brevity)
        Departures:(page=(2, 2, 7)) 2017-07-02 21:28:00
            page:{'current': 2, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 696}
            07:30 AM AF1680 London Heathrow Airport (LHR) Air France 789 Scheduled
            07:30 AM SN3628 Brussels Airport (BRU) Brussels Airlines 733 Scheduled
            ... (omitted for brevity)
        Arrivals:(page=(1, 1, 7)) 2017-07-02 21:28:00
            page:{'current': 1, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 693}
            16:30 PM LY325 Tel Aviv Ben Gurion International Airport (TLV) El Al Israel Airlines B739 Estimated 21:29
            18:30 PM AY877 Helsinki Vantaa Airport (HEL) Finnair E190 Landed 21:21
            ... (omitted for brevity)
        Arrivals:(page=(2, 2, 7)) 2017-07-02 21:28:00
            page:{'current': 2, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 693}
            00:15 AM AF982 Douala International Airport (DLA) Air France 772 Scheduled
            23:15 PM AA44 New York John F. Kennedy International Airport (JFK) American Airlines B763 Scheduled
            ... (omitted for brevity)
Tested with Python 3.4.2 and Scrapy 1.4.0.
Answer 2 (score: 0)
I tried cloning it locally to investigate further, but when it gets to parsing the departures I ran into some ConnectionRefused errors, so I'm not sure the answer I propose will solve it. Anyway:
But basically these are your errors:

- in your parse_schedule and your parse_departures_page, you have conditions preventing you from yielding the final item;
- you are passing the wrong URL to parse_departures_page;
- you need dont_filter=True on the requests for parse_departures_page;
- you are trying to keep a lot of loops alive to parse more information into the same object.
The change I suggest tracks all the urls_departures on each airport, so you can iterate over them in parse_departures_page; this should solve your problem (a sketch of the idea follows below).
Even if this solves your problem, I really recommend you change your data structure, so you can have multiple items per departure and extract this information far more efficiently.
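The concrete diff is not included in this answer; a sketch of what those fixes could look like, under the assumptions above (keep the per-airport page URLs in meta, hand the callback a real page URL, and only yield the country item once everything is drained; arrivals would be handled the same way and are omitted here):

import json
import jmespath
import scrapy

def parse_schedule(self, response):
    item = response.meta['airport_item']
    i = response.meta['i']
    airport_urls = response.meta['airport_urls']
    urls_departures, _ = self.compute_urls_by_page(
        response, item['airports'][i]['name'], item['airports'][i]['code_little'])
    item['airports'][i]['departures'] = []
    # Pass a real page URL to the callback (not response.url) and keep the
    # remaining page URLs in meta so the callback can chain through them.
    first_page = urls_departures.pop()
    yield scrapy.Request(first_page, self.parse_departures_page, dont_filter=True,
                         meta={'airport_item': item, 'page_urls': urls_departures,
                               'airport_urls': airport_urls, 'i': i})

def parse_departures_page(self, response):
    item = response.meta['airport_item']
    i = response.meta['i']
    page_urls = response.meta['page_urls']
    data = json.loads(response.body_as_unicode())
    item['airports'][i]['departures'].extend(
        jmespath.search("result.response.airport.pluginData.schedule.departures.data", data))
    if page_urls:
        # More departure pages left for this airport.
        yield scrapy.Request(page_urls.pop(), self.parse_departures_page,
                             dont_filter=True, meta=response.meta)
    elif response.meta['airport_urls']:
        # Move on to the next airport's schedule.
        next_url = response.meta['airport_urls'].pop()
        yield scrapy.Request(next_url, self.parse_schedule,
                             meta={'airport_item': item,
                                   'airport_urls': response.meta['airport_urls'],
                                   'i': i + 1})
    else:
        yield item  # every airport and every page has been drained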