2019-03-17 17:21:06 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://www.google.com/www.distancesto.com/coordinates/de/jugenheim-in-rheinhessen-latitude-longitude/history/401814.html> (referer: http://www.google.com/search?q=Rheinhessen+Germany+coordinates+longitude+latitude+distancesto)
2019-03-17 17:21:06 [scrapy.core.scraper] DEBUG: Scraped from <404 http://www.google.com/www.distancesto.com/coordinates/de/jugenheim-in-rheinhessen-latitude-longitude/history/401814.html>
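So instead of following "www.distancesto.com/coordinates/de/jugenheim-in-rheinhessen-latitude-longitude/history/401814.html", it prepends "http://www.google.com/" and obviously returns a broken link. This is beyond me; I can't understand why. The response itself doesn't contain that prefix. I even tried returning everything after the first 22 characters (the length of the unwanted prefix), but that just cut off part of the real link.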
class Googlelocs(Spider):
    name = 'googlelocs'
    start_urls = []
    for i in appellation_list:
        baseurl = i.replace(',', '').replace(' ', '+')
        cleaned_href = f'http://www.google.com/search?q={baseurl}+coordinates+longitude+latitude+distancesto'
        start_urls.append(cleaned_href)

    def parse(self, response):
        cleaned_href = response.xpath('//*[@id="ires"]/ol/div[1]/h3/a').get().split('https://')[1].split('&')[0]
        yield response.follow(cleaned_href, self.parse_distancesto)

    def parse_distancesto(self, response):
        items = GooglelocItem()
        items['appellation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[2]/p/strong)').get()
        items['latitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[1]/td)').get()
        items['longitude'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[2]/td)').get()
        items['elevation'] = response.xpath('string(/html/body/div[3]/div/div[2]/div[3]/div[3]/table/tbody/tr[10]/td)').get()
        yield items
That is the spider.
Answer 0 (score: 0)
I found the answer.
href = response.xpath('//*[@id="ires"]/ol/div[1]/h3/a/@href').get()
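This is the correct path for getting the href from Google. I also had to accept the link as Google masks it and follow it without trying to modify it. Splitting on 'https://' had stripped the scheme from the URL, so response.follow treated what was left as a path relative to the Google page, which is where the "http://www.google.com/" prefix came from.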
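
For anyone wondering where the prefix came from, here is a minimal sketch using only the standard library. As far as I understand, response.follow resolves its argument against the URL of the current page in the same way urljoin does, so a link without a scheme ends up as a path on google.com. The masked href below is a made-up example, assuming Google's result links look like /url?q=<target>&..., and unmask is a hypothetical helper, not part of Scrapy.

```python
from urllib.parse import parse_qs, urljoin, urlparse

# URL of the search page the spider was on (taken from the log above).
page_url = ("http://www.google.com/search"
            "?q=Rheinhessen+Germany+coordinates+longitude+latitude+distancesto")

# A target without a scheme is treated as a relative path and joined to the page URL.
schemeless = ("www.distancesto.com/coordinates/de/"
              "jugenheim-in-rheinhessen-latitude-longitude/history/401814.html")
broken = urljoin(page_url, schemeless)

# Hypothetical masked href, assumed to follow the /url?q=<target>&... pattern.
masked_href = ("/url?q=https://www.distancesto.com/coordinates/de/"
               "jugenheim-in-rheinhessen-latitude-longitude/history/401814.html&sa=U")
# Following the masked href untouched yields a valid Google redirect URL.
followed = urljoin(page_url, masked_href)


def unmask(href):
    """Return the target stored in the q parameter, or None if there is none."""
    values = parse_qs(urlparse(href).query).get("q")
    return values[0] if values else None


# Alternative: pull out the absolute target, scheme included, and follow that.
target = unmask(masked_href)

print(broken)
print(followed)
print(target)
```

The first printed URL matches the 404 in the log. Either of the other two should be safe to pass to response.follow: the masked one relies on Google redirecting, while the unmasked one keeps its "https://" and is therefore not treated as relative.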