Question

I am trying to get the latitude and longitude coordinates of cities from the following website: https://www.latlong.net/. My code is:

# -*- coding: utf-8 -*-
import re
import json

import scrapy

class geo_spider(scrapy.Spider):
    name = "geo"
    allowed_domains = ["www.latlong.net"]
    start_urls = ['https://www.latlong.net/']

    custom_settings = {
        'COOKIES_ENABLED': True,
        'DOWNLOAD_DELAY' : 1,
    }

    # Raw string so the regex escapes (e.g. \() are not treated as string escapes.
    LAT_LONG_REGEX = r'sm\((?P<lat>.+),(?P<long>.+),'

    def start_requests(self):
        FILE_PATH = 'C:/Users/coppe/tutorial/cities.json'
        with open(FILE_PATH) as json_file:
            cities_data = json.load(json_file)
        for d in cities_data:
            yield scrapy.Request(
                url='https://www.latlong.net/',
                callback=self.gen_csrftoken,
                meta={'city': d['city']},
                dont_filter=True, 
            )

    def gen_csrftoken(self, response):
        # Submit the search form from the fetched page so its hidden fields are included.
        city = response.meta['city']
        yield scrapy.FormRequest.from_response(
            response,
            formid='frmPlace',
            formdata={'place': city},
            callback=self.get_geo,
            meta={'city': city}
        )

    def get_geo(self, response):
        lat_long_search = re.search(self.LAT_LONG_REGEX, response.body.decode('utf-8'))
        if lat_long_search:
            yield {
                'coord': (lat_long_search.group('lat'), lat_long_search.group('long')),
                'city': response.meta['city']
            }
        else:
            # No coordinates found: open a Scrapy shell to inspect the response.
            from scrapy.shell import inspect_response
            inspect_response(response, self)

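For the 589 cities contained in the JSON file, I should get coordinates like (50, 5). Everything runs, except that I get (0, 0) for every city. I thought it was a JavaScript problem, but it is not: when I reduce the JSON file to, say, 6 cities, I get the correct coordinates for each of them. I tried setting DOWNLOAD_DELAY to different values (1, 2 and 3), but it still does not work. Is my JSON file too heavy? Does anyone have a clue about this?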

Answer 1

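The website appears to be using an API, such as the Google Maps Geocoding API, which is documented at https://developers.google.com/maps/documentation/geocoding/intro
That documentation (which does not cover sending several requests at once; aren't you using the actual API?) says that the maximum size of a URL for this API is 8192 characters, including the URL itself and all of the places you are trying to look up.
So yes, besides possibly being rate limited, there is a maximum number of characters that your city names can take up!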
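
If rate limiting is the cause, one thing to try is slowing the spider down further than DOWNLOAD_DELAY alone does. The following is only a sketch: the setting names are standard Scrapy settings, but the values and the THROTTLED_SETTINGS name are assumptions, and nothing in the question confirms that they would fix the (0, 0) results.

```python
# Hypothetical replacement for the spider's custom_settings dict.
# The values are illustrative guesses, not tuned for latlong.net.
THROTTLED_SETTINGS = {
    'COOKIES_ENABLED': True,
    'DOWNLOAD_DELAY': 3,                  # base delay between requests, in seconds
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # never hit the site with parallel requests
    'AUTOTHROTTLE_ENABLED': True,         # adapt the delay to observed latency
    'AUTOTHROTTLE_START_DELAY': 5,
    'AUTOTHROTTLE_MAX_DELAY': 60,
    'RETRY_TIMES': 3,                     # retry failed requests a few times
}
```

In the spider, this dict would take the place of the existing custom_settings value.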

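A Geocoding API request takes the following form: https://maps.googleapis.com/maps/api/geocode/outputFormat?parameters ...
Note: URLs must be properly encoded to be valid and are limited to 8192 characters for all web services. Be aware of this limit when constructing your URLs. Note that different browsers, proxies, and servers may have different URL character limits as well.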
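
To make the encoding and length rule concrete, here is a minimal sketch that builds one geocoding URL per city and rejects any URL over the quoted limit. The helper name build_geocode_url, the placeholder key, and the choice of json as the output format are assumptions for illustration; a real call needs a valid API key.

```python
from urllib.parse import urlencode

# Endpoint form quoted in the documentation, with "json" assumed as outputFormat.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"
MAX_URL_LENGTH = 8192  # limit stated in the quoted documentation


def build_geocode_url(address, api_key):
    """Return a percent-encoded geocoding URL, or raise if it is too long."""
    query = urlencode({"address": address, "key": api_key})
    url = f"{GEOCODE_ENDPOINT}?{query}"
    if len(url) > MAX_URL_LENGTH:
        raise ValueError(f"URL is {len(url)} characters, over {MAX_URL_LENGTH}")
    return url


# "YOUR_API_KEY" is a placeholder, not a working key.
url = build_geocode_url("Liège, Belgium", "YOUR_API_KEY")
print(url)
```

Sending one city per request keeps each URL far below the limit, so in that setup the remaining concern would be rate limiting rather than URL length.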
