Question

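As a Scrapy newbie, I can't figure out why this spider doesn't scrape any data from the site. I searched Stack Overflow for possible answers, but I did not find one that fully addresses this. I am trying to scrape a list of restaurants in a small town from the site. I haven't looked into the site's security features in detail. Is this a problem with how the XPath expressions select elements? The spider runs fine, except that it doesn't scrape anything. Could you explain why it isn't scraping and how to fix it? The spider has the following code: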

try:
    from scrapy.spiders import Spider
    from urllib.parse import urljoin
    from scrapy.selector import Selector
    from scrapy.http import Request

except ImportError:
    print("\nERROR IMPORTING THE NECESSARY LIBRARIES\n")

#scrapy.optional_features.remove('boto')


class YelpSpider(Spider):
    name = 'yelp_spider'
    allowed_domains=["yelp.com"]
    headers=['venuename','services','address','phone','location']

    def __init__(self):
        self.start_urls = ['https://www.yelp.com/springfield-il-us']

    def start_requests(self):
        requests = []
        for item in self.start_urls:
            requests.append(Request(url=item, headers={'Referer':'http://www.google.com/'}))
            return requests

    def parse(self, response):
        requests=[] 
        sel=Selector(response)
        restaurants=sel.xpath('//*[@id="wrap"]/div[4]/div/div[1]/div/div[3]/div[1]/div[1]/h1')
        items=[]
        for restaurant in restaurants:
            item=YelpRestaurantItem()
            item['venuename']=sel.xpath('//*[@id="wrap"]/div[4]/div/div[1]/div/div[3]/div[1]/div[1]/h1')
            item['services']=sel.xpath('//*[@id="wrap"]/div[4]/div/div[1]/div/div[3]/div[1]/div[2]/div[2]/span[2]/a[1]')
            item['address']=sel.xpath('//*[@id="wrap"]/div[4]/div/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[1]/div/strong/address')
            item['phone']=sel.xpath('//*[@id="wrap"]/div[4]/div/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[3]/span[3]')
            item['location']=sel.xpath('//*[@id="dropperText_Mast"]')
            item['url']=response.url
            items.append(item)
            yield item

My items.py contains the following code:

import scrapy

class YelpRestaurantItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url=scrapy.Field()
    venuename = scrapy.Field()
    services = scrapy.Field()
    address = scrapy.Field()
    phone = scrapy.Field()
    location=scrapy.Field()

Answer 1

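Your imports didn't work well for me, but that may be a configuration issue on my side. I think the scraper below does what you are looking for: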

import scrapy

class YelpSpider(scrapy.Spider):
    name = 'yelp_spider'
    allowed_domains=["yelp.com"]
    headers=['venuename','services','address','phone','location']

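    def __init__(self, *args, **kwargs):
        # Let the base Spider finish its own setup before setting start_urls.
        super().__init__(*args, **kwargs)
        self.start_urls = ['https://www.yelp.com/search?find_desc=&find_loc=Springfield%2C+IL&ns=1']

    def start_requests(self):
        # Yield one request per start URL; a return inside the loop
        # would stop after the first URL.
        for url in self.start_urls:
            yield scrapy.Request(url=url, headers={'Referer': 'http://www.google.com/'})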

    def parse(self, response):
        for restaurant in response.xpath('//div[@class="biz-listing-large"]'):
            item={}
            item['venuename']=restaurant.xpath('.//h3[@class="search-result-title"]/span/a/span/text()').extract_first()
            item['services']=u",".join(line.strip() for line in restaurant.xpath('.//span[@class="category-str-list"]/a/text()').extract())
            item['address']=restaurant.xpath('.//address/text()').extract_first()
            item['phone']=restaurant.xpath('.//span[@class="biz-phone"]/text()').extract_first()
            item['location']=response.xpath('.//input[@id="dropperText_Mast"]/@value').extract_first()
            item['url']=response.url
            yield item

Some explanation:

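I changed the start URL. This URL actually gives an overview of all the restaurants, while the other one does not (or at least not when viewed from my location).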
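As a side note, the query string in that start URL does not have to be typed by hand. Below is a minimal sketch that builds it with the standard library; the helper name `build_search_url` and the constant `SEARCH_ENDPOINT` are made up for illustration, and the parameter names are simply the ones visible in the URL above.

```python
from urllib.parse import urlencode

# Assumption: the endpoint and parameter names are copied from the start URL
# in the spider; nothing here is taken from an official API description.
SEARCH_ENDPOINT = "https://www.yelp.com/search"


def build_search_url(location, description=""):
    """Return a search URL for the given location (hypothetical helper)."""
    params = {"find_desc": description, "find_loc": location, "ns": 1}
    # urlencode percent-encodes the comma and turns the space into '+'.
    return SEARCH_ENDPOINT + "?" + urlencode(params)


url = build_search_url("Springfield, IL")
print(url)
```

The result could then be assigned to `start_urls`, which makes it easy to run the same spider for another town.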

I removed the pipeline, because it isn't defined on my system and I can't test with a pipeline that doesn't exist in the code.

The parse function is the one I really changed. The XPaths you defined were not very clear. The code now loops over each listed restaurant.

response.xpath('//div[@class="biz-listing-large"]')

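This code captures the data for all restaurants. I used it in a for loop so we can perform actions for each restaurant. The data for one restaurant is available in the variable restaurant.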

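So if I want to extract data from a restaurant, I use this variable. In addition, we need to start the XPath with a ., because otherwise the query will start from the beginning of the web page (which is the same as using response).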
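To make the difference between querying the whole page and querying one restaurant concrete, here is a small sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors so that it runs without any installation, and the markup in `PAGE` is invented for the demonstration (only the class name `biz-listing-large` comes from the spider above).

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup; the real page is far more complex.
PAGE = """
<html><body>
  <div class="biz-listing-large"><h3>Cafe One</h3></div>
  <div class="biz-listing-large"><h3>Diner Two</h3></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
listings = root.findall(".//div[@class='biz-listing-large']")

# Searching from the page root inside the loop ignores the current listing,
# so every iteration finds the first <h3> on the page.
from_root = [root.find(".//h3").text for _ in listings]

# Searching relative to each listing returns that listing's own <h3>.
from_listing = [listing.find(".//h3").text for listing in listings]

print(from_root)     # ['Cafe One', 'Cafe One']
print(from_listing)  # ['Cafe One', 'Diner Two']
```

In the spider the same idea shows up as `restaurant.xpath('.//...')` versus `response.xpath('//...')`.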

I could explain the XPaths in my answer to you, but there is a lot of documentation available, and it probably explains them better than I can.

Some documentation

And some more

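Note that I used restaurant for most of the item values. The values for location and url are not really restaurant data; they live elsewhere on the page. That is why those values use response instead of restaurant.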
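The split between page-level and per-restaurant values can be sketched as follows. Again this uses `xml.etree.ElementTree` rather than Scrapy so it runs on its own; the function `build_items`, the sample markup, and the example URL are assumptions for illustration, not part of the real page or of the spider.

```python
import xml.etree.ElementTree as ET

# Invented sample markup that reuses the id and class names from the spider.
PAGE = """<html><body>
  <input id="dropperText_Mast" value="Springfield, IL"/>
  <div class="biz-listing-large">
    <h3>Cafe One</h3>
    <span class="category-str-list"><a> Coffee </a><a> Bakeries </a></span>
  </div>
  <div class="biz-listing-large">
    <h3>Diner Two</h3>
    <span class="category-str-list"><a> Diners </a></span>
  </div>
</body></html>"""


def build_items(page_source, page_url):
    """Return one dict per listing (hypothetical helper)."""
    root = ET.fromstring(page_source)
    # Page-level value: looked up once, from the document root.
    location = root.find(".//input[@id='dropperText_Mast']").get("value")
    items = []
    for restaurant in root.findall(".//div[@class='biz-listing-large']"):
        # Per-restaurant values: looked up relative to the current listing.
        links = restaurant.findall(".//span[@class='category-str-list']/a")
        items.append({
            "venuename": restaurant.find(".//h3").text,
            # Strip the padding around each category, then join with commas.
            "services": ",".join(a.text.strip() for a in links),
            "location": location,
            "url": page_url,
        })
    return items


items = build_items(PAGE, "https://example.com/search")
for item in items:
    print(item)
```

Every item ends up with the same location and url, while venuename and services differ per restaurant.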

Scrapy / Python spider crawls but does not scrape data
