I am using this site to get the latitude and longitude of different cities: https://www.latlong.net/.
Here is my code:
import scrapy
import json

with open('C:/Users/coppe/tutorial/cities.json') as json_file:
    cities = json.load(json_file)

class communes_spider(scrapy.Spider):
    name = "geo"
    start_urls = ['https://www.latlong.net/']

    def parse(self, response):
        for city in cities:
            yield scrapy.FormRequest.from_response(response, formid='place', formdata={'place': city['city']}, callback=self.get_geo)

    def get_geo(self, response):
        yield {'coord': response.css('input::text').get()}
The code runs fine, but the output I get is incorrect. The default output value is (0,0), and after submitting the form it should be (50.643909, 5.571560). However, the crawler still collects (0,0) as the answer. My guess is that the problem lies with the website, but I can't pin it down.
JSON sample:
[{"city": "Anvers, BE"},
{"city": "Gand, BE"},
{"city": "Charleroi, BE"},
{"city": "Li\u00e8ge, BE"},
{"city": "Ville de Bruxelles, BE"},
{"city": "Schaerbeek, BE"},
{"city": "Anderlecht, BE"},
{"city": "Bruges, BE"},
{"city": "Namur, BE"},
{"city": "Louvain, BE"},
{"city": "Molenbeek-Saint-Jean, BE"}]
Answer (score: 1):
You can try the code below; it works on my side:
# -*- coding: utf-8 -*-
import re
import json

import scrapy


class communes_spider(scrapy.Spider):
    name = "geo"
    allowed_domains = ["www.latlong.net"]
    start_urls = ['https://www.latlong.net/']
    custom_settings = {
        'COOKIES_ENABLED': True,
    }

    # This regex is not perfect and can be improved
    LAT_LONG_REGEX = r'sm\((?P<lat>-?\d+\.?\d+),(?P<long>-?\d+\.?\d+)'

    def start_requests(self):
        FILE_PATH = 'C:/Users/coppe/tutorial/cities.json'
        with open(FILE_PATH) as json_file:
            cities_data = json.load(json_file)

        for d in cities_data:
            yield scrapy.Request(
                url='https://www.latlong.net/',
                callback=self.gen_csrftoken,
                meta={'city': d['city']},
                dont_filter=True,  # Allow requesting the same URL multiple times
            )

    def gen_csrftoken(self, response):
        city = response.meta['city']
        yield scrapy.FormRequest.from_response(
            response,
            formid='frmPlace',
            formdata={'place': city},
            callback=self.get_geo,
            meta={'city': city}
        )

    def get_geo(self, response):
        lat_long_search = re.search(self.LAT_LONG_REGEX, response.body.decode('utf-8'))
        if lat_long_search:
            yield {
                'coord': (lat_long_search.group('lat'), lat_long_search.group('long')),
                'city': response.meta['city']
            }
        else:
            # Something is wrong, you can investigate with `inspect_response`
            from scrapy.shell import inspect_response
            inspect_response(response, self)
The reason you get (0,0) is that the latitude/longitude coordinates are rendered by JavaScript (they are populated from the backend inside the page template). Without Splash, Scrapy cannot execute JavaScript.
So what we are essentially doing is parsing the inline JS with a regex to find the latitude/longitude values.
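For illustration, here is a minimal sketch of how that regex behaves, assuming the page's inline script contains a call like sm(50.643909,5.571560,...) (the exact JavaScript on the live page may differ):

import re

LAT_LONG_REGEX = r'sm\((?P<lat>-?\d+\.?\d+),(?P<long>-?\d+\.?\d+)'

# Hypothetical fragment of the page's inline JavaScript, for illustration only
sample_js = "function load() { sm(50.643909,5.571560,13); }"

match = re.search(LAT_LONG_REGEX, sample_js)
if match:
    # Named groups give the latitude and longitude as strings
    print(match.group('lat'), match.group('long'))  # -> 50.643909 5.571560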
(If you find this answer helpful, please don't forget to mark it as accepted.)
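As a usage note, assuming the spider file lives inside an existing Scrapy project and the JSON path is valid on your machine, you would run it with the standard Scrapy CLI and export the yielded items, for example:

# Run the "geo" spider and write the scraped coord/city items to a JSON feed
scrapy crawl geo -o coordinates.json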