Yield request calls using Scrapy

Date: 2017-04-27 20:46:41

Tags: python recursion web-scraping scrapy yield

I am trying to scrape all departures and arrivals for one day, from all airports in all countries, using Python and Scrapy.

The JSON database used by this famous site (Flightradar24) has to be queried page by page for departures or arrivals, 100 results per page for one airport. I also compute a timestamp based on the actual UTC day of the query.
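(For orientation, here is roughly the shape of that JSON, pieced together from the paths used later in this post; any field not on those paths is an assumption:)

# Rough shape of the airport.json response, inferred from the jmespath /
# dict accesses appearing later in this post (everything else is assumed).
response_shape = {
    "result": {"response": {"airport": {"pluginData": {"schedule": {
        "departures": {
            "page": {"current": 1, "total": 7},                     # paginated
            "item": {"current": 100, "limit": 100, "total": 696},   # 100/page
            "data": [{"flight": {"airline": {"name": "Air France"}}}],
        },
        "arrivals": {
            "page": {"current": 1, "total": 7},
            "item": {"current": 100, "limit": 100, "total": 693},
            "data": [{"flight": {"airline": {"name": "Finnair"}}}],
        },
    }}}}}
}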

I am trying to build a dataset with this hierarchy:

country 1
 - airport 1
    - departures
      - page 1
      - page ...
    - arrivals
      - page 1
      - page ...
 - airport 2
    - departures
      - page 1
      - page ...
    - arrivals
      - page 1
      - page ...
...

I use two methods to compute the timestamp and to build the URL query for each page:

def compute_timestamp(self):
    from datetime import date
    import calendar
    # +/- 24 hours: midnight UTC of the day being queried
    d = date(2017, 4, 27)
    timestamp = calendar.timegm(d.timetuple())
    return timestamp

def build_api_call(self, code, page, timestamp):
    return 'https://api.flightradar24.com/common/v1/airport.json?code={code}&plugin[]=&plugin-setting[schedule][mode]=&plugin-setting[schedule][timestamp]={timestamp}&page={page}&limit=100&token='.format(
        code=code, page=page, timestamp=timestamp)
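For example, the two methods combine into one per-page query URL like this (the IATA code here is purely illustrative):

# Example call (illustrative IATA code): page 1 of the schedule for 'tlv'.
url = self.build_api_call('tlv', 1, self.compute_timestamp())
# -> https://api.flightradar24.com/common/v1/airport.json?code=tlv&...&page=1&limit=100&token=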

I store the results in a CountryItem, which contains many AirportItem airports. My items.py is:

class CountryItem(scrapy.Item):
    name = scrapy.Field()
    link = scrapy.Field()
    num_airports = scrapy.Field()
    airports = scrapy.Field()
    other_url= scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

class AirportItem(scrapy.Item):
    name = scrapy.Field()
    code_little = scrapy.Field()
    code_total = scrapy.Field()
    lat = scrapy.Field()
    lon = scrapy.Field()
    link = scrapy.Field()
    departures = scrapy.Field()
    arrivals = scrapy.Field()

My main parse builds a country item for every country (here I limit it to Israel, for example). Then, for each country, I yield a scrapy.Request to scrape the airports.

###################################
# MAIN PARSE
####################################
def parse(self, response):
    count_country = 0
    countries = []
    for country in response.xpath('//a[@data-country]'):
        item = CountryItem()
        url =  country.xpath('./@href').extract()
        name = country.xpath('./@title').extract()
        item['link'] = url[0]
        item['name'] = name[0]
        item['airports'] = []
        count_country += 1
        if name[0] == "Israel":
            countries.append(item)
            self.logger.info("Country name : %s with link %s" , item['name'] , item['link'])
            yield scrapy.Request(url[0],meta={'my_country_item':item}, callback=self.parse_airports)

This method extracts the information for each airport, and also yields a scrapy.Request with the airport URL to scrape departures and arrivals:

###################################
# PARSE EACH AIRPORT
####################################
def parse_airports(self, response):
    item = response.meta['my_country_item']
    item['airports'] = []

    for airport in response.xpath('//a[@data-iata]'):
        url = airport.xpath('./@href').extract()
        iata = airport.xpath('./@data-iata').extract()
        iatabis = airport.xpath('./small/text()').extract()
        name = ''.join(airport.xpath('./text()').extract()).strip()
        lat = airport.xpath("./@data-lat").extract()
        lon = airport.xpath("./@data-lon").extract()
        iAirport = AirportItem()
        iAirport['name'] = self.clean_html(name)
        iAirport['link'] = url[0]
        iAirport['lat'] = lat[0]
        iAirport['lon'] = lon[0]
        iAirport['code_little'] = iata[0]
        iAirport['code_total'] = iatabis[0]

        item['airports'].append(iAirport)

    urls = []
    for airport in item['airports']:
        json_url = self.build_api_call(airport['code_little'], 1, self.compute_timestamp())
        urls.append(json_url)
    if not urls:
        return item

    # start with first url
    next_url = urls.pop()
    return scrapy.Request(next_url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': 0})

Using the recursive method parse_schedule, I add each airport to the country item. For this part, SO members have already helped me: help me

###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(self, response):
        """we want to loop this continuously to build every departure and arrivals requests"""
        item = response.meta['airport_item']
        i = response.meta['i']
        urls = response.meta['airport_urls']

        urls_departures, urls_arrivals = self.compute_urls_by_page(response, item['airports'][i]['name'], item['airports'][i]['code_little'])

        print("urls_departures = ", len(urls_departures))
        print("urls_arrivals = ", len(urls_arrivals))

        ## YIELD NOT CALLED
        yield scrapy.Request(response.url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': urls_departures, 'i':0 , 'p': 0}, dont_filter=True)

        # now do next schedule items
        if not urls:
            yield item
            return
        url = urls.pop()

        yield scrapy.Request(url, self.parse_schedule, meta={'airport_item': item, 'airport_urls': urls, 'i': i + 1})

The self.compute_urls_by_page method computes the correct URLs to retrieve all the departures and arrivals of one airport.
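(That method is not shown in the post; it lives in the GitHub repo linked at the end of the question. Here is a minimal sketch of what it plausibly does, assuming the page totals sit under result.response.airport.pluginData.schedule.<kind>.page.total, consistent with the JSON paths used elsewhere in this post:)

import json  # already imported in the spider module

def compute_urls_by_page(self, response, name, code_little):
    # Hypothetical sketch -- the real implementation is in the GitHub repo.
    # Assumes the page totals sit under
    # result.response.airport.pluginData.schedule.<kind>.page.total.
    jsonload = json.loads(response.body_as_unicode())
    schedule = jsonload['result']['response']['airport']['pluginData']['schedule']
    timestamp = self.compute_timestamp()
    urls_departures = [self.build_api_call(code_little, page, timestamp)
                       for page in range(1, schedule['departures']['page']['total'] + 1)]
    urls_arrivals = [self.build_api_call(code_little, page, timestamp)
                     for page in range(1, schedule['arrivals']['page']['total'] + 1)]
    return urls_departures, urls_arrivals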

###################################
# PARSE EACH DEPARTURES / ARRIVALS
###################################
def parse_departures_page(self, response):
    item = response.meta['airport_item']
    p = response.meta['p']
    i = response.meta['i']
    page_urls = response.meta['page_urls']

    print("PAGE URL = ", page_urls)

    if not page_urls:
        yield item
        return
    page_url = page_urls.pop()

    print("GET PAGE FOR  ", item['airports'][i]['name'], ">> ", p)

    jsonload = json.loads(response.body_as_unicode())
    json_expression = jmespath.compile("result.response.airport.pluginData.schedule.departures.data")
    item['airports'][i]['departures'] = json_expression.search(jsonload)

    yield scrapy.Request(page_url, self.parse_departures_page, meta={'airport_item': item, 'page_urls': page_urls, 'i': i, 'p': p + 1})

Here comes the strange result: the first yield in parse_schedule, which should call self.parse_departures_page before the recursion continues, misbehaves. Scrapy does call the method, but I only collect the departures pages of one airport, and I don't understand why... There is probably an ordering error in my requests or in my yields, so maybe you can help me find it.

The complete code is on GitHub: https://github.com/IDEES-Rouen/Flight-Scrapping/tree/master/flight/flight_project

You can run it with the scrapy crawl airports command.

Update 1:

I tried to adapt the yield from approach from the standalone answer, but without success, as you can see at the bottom of that answer... so if you have an idea?

3 Answers:

Answer 0 (score: 8)

Yes, I finally found the answer here on SO.

When yielding recursively, you need to use yield from. Here is a simplified example:

airport_list = ["airport1", "airport2", "airport3", "airport4"]

def parse_page_departure(airport, next_url, page_urls):
    print(airport, " / ", next_url)
    if not page_urls:
        return
    next_url = page_urls.pop()
    yield from parse_page_departure(airport, next_url, page_urls)

###################################
# PARSE EACH AIRPORT OF COUNTRY
###################################
def parse_schedule(next_airport, airport_list):
    ## GET EACH DEPARTURE PAGE
    departures_list = ["p1", "p2", "p3", "p4"]
    next_departure_url = departures_list.pop()
    yield parse_page_departure(next_airport, next_departure_url, departures_list)

    if not airport_list:
        print("no new airport")
        return
    next_airport_url = airport_list.pop()
    yield from parse_schedule(next_airport_url, airport_list)

next_airport_url = airport_list.pop()
result = parse_schedule(next_airport_url, airport_list)
for i in result:
    print(i)
    for d in i:
        print(d)
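Running the toy version shows the asymmetry this relies on: the plain yield parse_page_departure(...) hands back a generator object (which is why the outer loop prints it and then has to iterate it itself), while yield from flattens the recursion and drives the nested generator directly.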

UPDATE, it does not work with the real program:

I tried to reproduce the same yield from pattern with the real program here, but using it with scrapy.Request raises an error, and I don't understand why...

Here is the Python traceback:


Answer 1 (score: 4)

Comment: ... not entirely clear ... you call AirportData(response, 1) ... and there is also a small error here: self.pprint(schedule)

I implemented it using class AirportData (limited to 2 pages and 2 flights). I have since updated my code, removed class AirportData, and added class Page. All dependencies should now be satisfied.

Not an error: self.pprint(... is a class AirportsSpider method used to pretty-print an object, as in the output shown at the end. I have enhanced class Schedule to show basic usage.

Comment: What is AirportData in your answer?

Edit: class AirportData has been removed. As noted at # ENDPOINT, the flight data is split into Page objects, page.arrivals and page.departures. (Limited to 2 pages and 2 flights.)

Page = [Flight 1, Flight 2, ... Flight n]
schedule.airport['arrivals'] == [Page 1, Page 2, ..., Page n]
schedule.airport['departures'] == [Page 1, Page 2, ..., Page n]
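(As an aside, here is a minimal sketch of how such a filled Schedule could be walked; it relies on Page.__iter__ as defined further below, and the airline-name path is taken from the __str__ method shown there:)

# Walking a filled Schedule (sketch). Page.__iter__ (defined below) yields the
# per-flight dicts, whose 'airline' field is used in Schedule.__str__.
for page in schedule.airport['departures']:
    for flight in page:
        print(flight['airline']['name'])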
  

Comment: ... we have multiple pages, each containing multiple departures/arrivals.

Yes, at the time of my first answer I did not have any API JSON response. Now I do get the API JSON response, but it does not honor the given timestamp and returns data for the current date. The API params look uncommon; is there a link to a description?

Nevertheless, consider this simplified approach:

# A Page object holds one page of arrivals/departures flight data

class Page(object):
    def __init__(self, title, schedule):
        # schedule includes ['arrivals'] or ['departures']
        self.current = schedule['page']['current']
        self.total = schedule['page']['total']

        self.header = '{}:page:{} item:{}'.format(title, schedule['page'], schedule['item'])
        self.flight = []
        for data in schedule['data']:
            self.flight.append(data['flight'])

    def __iter__(self):
        yield from self.flight

# A Schedule object holds all the pages of one airport

class Schedule(object):
    def __init__(self):
        self.country = None
        self.airport = None

    def __str__(self):
        arrivals = self.airport['arrivals'][0]
        departures = self.airport['departures'][0]
        return '{}\n\t{}\n\t\t{}\n\t\t\t{}\n\t\t{}\n\t\t\t{}'. \
            format(self.country['name'],
                   self.airport['name'],
                   arrivals.header,
                   arrivals.flight[0]['airline']['name'],
                   departures.header,
                   departures.flight[0]['airline']['name'], )

# PARSE EACH AIRPORT OF COUNTRY

def parse_schedule(self, response):
    meta = response.meta

    if 'airport' in meta:
        # First call from parse_airports
        schedule = Schedule()
        schedule.country = response.meta['country']
        schedule.airport = response.meta['airport']
    else:
        schedule = response.meta['schedule']

    data = json.loads(response.body_as_unicode())
    airport = data['result']['response']['airport']

    schedule.airport['arrivals'].append(Page('Arrivals', airport['pluginData']['schedule']['arrivals']))
    schedule.airport['departures'].append(Page('Departures', airport['pluginData']['schedule']['departures']))

    page = schedule.airport['departures'][-1]
    if page.current < page.total:
        json_url = self.build_api_call(schedule.airport['code_little'], page.current + 1, self.compute_timestamp())
        yield scrapy.Request(json_url, meta={'schedule': schedule}, callback=self.parse_schedule)
    else:
        # ENDPOINT Schedule object holding one Airport.
        # schedule.airport['arrivals'] and schedule.airport['departures'] ==
        #   List of Page with List of Flight Data
        print(schedule)

# PARSE EACH AIRPORT

def parse_airports(self, response):
    country = response.meta['country']

    for airport in response.xpath('//a[@data-iata]'):
        name = ''.join(airport.xpath('./text()').extract()[0]).strip()

        if 'Charles' in name:
            meta = response.meta
            meta['airport'] = AirportItem()
            meta['airport']['name'] = name
            meta['airport']['link'] = airport.xpath('./@href').extract()[0]
            meta['airport']['lat'] = airport.xpath("./@data-lat").extract()[0]
            meta['airport']['lon'] = airport.xpath("./@data-lon").extract()[0]
            meta['airport']['code_little'] = airport.xpath('./@data-iata').extract()[0]
            meta['airport']['code_total'] = airport.xpath('./small/text()').extract()[0]

            json_url = self.build_api_call(meta['airport']['code_little'], 1, self.compute_timestamp())
            yield scrapy.Request(json_url, meta=meta, callback=self.parse_schedule)

# MAIN PARSE

Note: response.xpath('//a[@data-country]') returns all countries twice (see the dedup sketch at the end of this answer).

def parse(self, response):
    for a_country in response.xpath('//a[@data-country]'):
            name = a_country.xpath('./@title').extract()[0]
            if name == "France":
                country = CountryItem()
                country['name'] = name
                country['link'] = a_country.xpath('./@href').extract()[0]

                yield scrapy.Request(country['link'],
                                     meta={'country': country},
                                     callback=self.parse_airports)
  

Output: shortened to 2 pages and 2 flights per page

France
    Paris Charles de Gaulle Airport
        Departures:(page=(1, 1, 7)) 2017-07-02 21:28:00 page:{'current': 1, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 696}
            21:30 PM    AF1558  Newcastle Airport (NCL) Air France ARJ  Estimated dep 21:30
            21:30 PM    VY8833  Seville San Pablo Airport (SVQ) Vueling 320 Estimated dep 21:30
            ... (omitted for brevity)
        Departures:(page=(2, 2, 7)) 2017-07-02 21:28:00 page:{'current': 2, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 696}
            07:30 AM    AF1680  London Heathrow Airport (LHR)   Air France 789  Scheduled
            07:30 AM    SN3628  Brussels Airport (BRU)  Brussels Airlines 733   Scheduled
            ... (omitted for brevity)
        Arrivals:(page=(1, 1, 7)) 2017-07-02 21:28:00 page:{'current': 1, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 693}
            16:30 PM    LY325   Tel Aviv Ben Gurion International Airport (TLV) El Al Israel Airlines B739  Estimated 21:29
            18:30 PM    AY877   Helsinki Vantaa Airport (HEL)   Finnair E190    Landed 21:21
            ... (omitted for brevity)
        Arrivals:(page=(2, 2, 7)) 2017-07-02 21:28:00 page:{'current': 2, 'total': 7} item:{'current': 100, 'limit': 100, 'total': 693}
            00:15 AM    AF982   Douala International Airport (DLA)  Air France 772  Scheduled
            23:15 PM    AA44    New York John F. Kennedy International Airport (JFK)    American Airlines B763  Scheduled
            ... (omitted for brevity)

Tested with Python 3.4.2 - Scrapy 1.4.0
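On the duplicate-countries note above: a minimal guard against it (my own suggestion; the code above simply sidesteps the issue by filtering on a single country name) could look like this:

# Minimal guard against the page listing every country twice (hypothetical
# addition; the answer's parse simply filters on one country name instead).
def parse(self, response):
    seen = set()
    for a_country in response.xpath('//a[@data-country]'):
        name = a_country.xpath('./@title').extract()[0]
        if name in seen:
            continue        # skip the second occurrence of each country
        seen.add(name)
        country = CountryItem()
        country['name'] = name
        country['link'] = a_country.xpath('./@href').extract()[0]
        yield scrapy.Request(country['link'],
                             meta={'country': country},
                             callback=self.parse_airports)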

Answer 2 (score: 0)

I tried cloning it locally to investigate further, but when it reached the departures parsing I ran into some ConnectionRefused errors, so I am not sure the answer I propose here will solve it. Anyway:


But basically, these are your errors:

  1. In your parse_schedule and in your parse_departures_page you have conditions that yield the final item;

  2. You are passing the wrong URL to parse_departures_page;

  3. You need dont_filter=True on the requests for parse_departures_page;

  4. You are trying to keep too many loops alive to parse more information into the same object;

  5. The change I suggest will track all the urls_departures of this airport, so you can iterate over them in parse_departures_page, which should solve your problem (a rough sketch follows below).

    Even if this solves your problem, I recommend changing your data structure so that you can have multiple departure items and be able to extract this information more efficiently.
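A rough sketch of what point 5 could look like in code (my reading, not the answerer's exact patch), reusing the question's own helpers:

# Hypothetical sketch of point 5 -- not the answerer's exact patch: keep the
# full list of departure-page URLs in meta and let parse_departures_page pop
# its way through them, appending each page's data instead of overwriting.
def parse_schedule(self, response):
    item = response.meta['airport_item']
    i = response.meta['i']

    urls_departures, urls_arrivals = self.compute_urls_by_page(
        response, item['airports'][i]['name'], item['airports'][i]['code_little'])

    item['airports'][i]['departures'] = []          # accumulate across pages
    yield scrapy.Request(urls_departures.pop(),
                         self.parse_departures_page,
                         meta={'airport_item': item,
                               'page_urls': urls_departures,   # remaining pages
                               'i': i},
                         dont_filter=True)

def parse_departures_page(self, response):
    item = response.meta['airport_item']
    i = response.meta['i']
    page_urls = response.meta['page_urls']

    jsonload = json.loads(response.body_as_unicode())
    data = jmespath.search(
        "result.response.airport.pluginData.schedule.departures.data", jsonload)
    item['airports'][i]['departures'].extend(data or [])   # append, don't overwrite

    if not page_urls:
        yield item
        return
    yield scrapy.Request(page_urls.pop(),
                         self.parse_departures_page,
                         meta={'airport_item': item, 'page_urls': page_urls, 'i': i},
                         dont_filter=True)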