KeyError: 'link' when trying to scrape email addresses from TripAdvisor

Time: 2019-10-17 14:52:11

Tags: python web-scraping

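This is my code so far. It is supposed to scrape the link, the restaurant name, and the restaurant's email address. Everything worked until I added the email part. Since then the spider raises KeyError: 'link', even though the email XPath itself does return an email address.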

import scrapy
from scrapy import Request


class RestaurantSpider(scrapy.Spider):
    name = 'restaurant'
    start_urls = [
        'https://www.tripadvisor.com.my/Restaurants-g298570-Kuala_Lumpur_Wilayah_Persekutuan.html#EATERY_OVERVIEW_BOX']

def parse is where I collect all the listings from the main page, then follow the pagination and visit every restaurant page.

    def parse(self, response):
        listings = response.xpath(
            '//div[@class="restaurants-list-ListCell__cellContainer--2mpJS"]')

        for listing in listings:
            link = listing.xpath(
                './/a[@class="restaurants-list-ListCell__restaurantName--2aSdo"]/@href').extract_first()
            text = listing.xpath(
                './/a[@class="restaurants-list-ListCell__restaurantName--2aSdo"]/text()').extract_first()
            yield scrapy.Request(
                url=response.urljoin(link),
                callback=self.parse_listing,
                meta={'Link': link, 'Text': text},
            )

        next_urls = response.xpath(
            '//*[@class="nav next rndBtn ui_button primary taLnk"]/@href').extract()
        for next_url in next_urls:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

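def parse_listing is where I visit a specific restaurant page to get its email address, then yield the data I need, which will later be stored in a .csv file.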

    def parse_listing(self, response):
        link = response.meta['link']
        text = response.meta['text']

        email = response.xpath(
            '//a[contains(@href, "mailto")]/@href').extract_first()

        yield {
            'Link': link,
            'Text': text,
            'Email': email
        }

2 answers:

Answer 0 (score: 0)

Use 'href' instead of 'link'.

I could not reproduce your code, but there does not seem to be a link attribute, so capture 'href' instead:

<a href="/Restaurant_Review-g298570-d15211507-Reviews-Vintage_1988_Cafe-Kuala_Lumpur_Wilayah_Persekutuan.html" class="restaurants-list-ListCell__restaurantName--2aSdo" target="_blank">Vintage 1988 Cafe</a>


link = response.meta['href']

Answer 1 (score: 0)

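You defined meta={'Link': link, 'Text': text} in the parse() method, but in the parse_listing() method you read the values back with the wrong keys, 'link' and 'text'. Dictionary keys are case-sensitive, so response.meta['link'] does not match the stored key 'Link', and that lookup raises the KeyError. Your XPaths are also fragile, because they match the full auto-generated class names exactly.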
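The mismatch can be reproduced without Scrapy, because the lookup fails the same way on an ordinary dict. This is a minimal sketch: the meta dict below only stands in for response.meta, and its values are made-up placeholders.

```python
# Stand-in for response.meta as built in the question's parse() method.
# The values are placeholders, not real scraped data.
meta = {'Link': '/Restaurant_Review-example.html', 'Text': 'Example Cafe'}

# Keys are case-sensitive: only 'Link' was stored, never 'link'.
try:
    meta['link']
    error = None
except KeyError as exc:
    error = exc

# The exception names the key that was not found.
print(repr(error))

# Looking the values up with the keys exactly as stored works.
link = meta['Link']
text = meta['Text']
```

Either side can be changed to fix it, as long as the keys written in parse() and the keys read in parse_listing() are spelled identically.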

Try the following to make it work:

import scrapy


class RestaurantSpider(scrapy.Spider):
    name = 'restaurant'

    start_urls = [
        'https://www.tripadvisor.com.my/Restaurants-g298570-Kuala_Lumpur_Wilayah_Persekutuan.html#EATERY_OVERVIEW_BOX'
    ]

    def parse(self, response):
        for listing in response.xpath('//div[contains(@class,"__cellContainer--")]'):
            link = listing.xpath('.//a[contains(@class,"__restaurantName--")]/@href').get()
            text = listing.xpath('.//a[contains(@class,"__restaurantName--")]/text()').get()
            complete_url = response.urljoin(link)
            yield scrapy.Request(
                url=complete_url,
                callback=self.parse_listing,
                meta={'link': complete_url,'text': text}
            )

        next_url = response.xpath('//*[contains(@class,"pagination")]/*[contains(@class,"next")]/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

    def parse_listing(self, response):
        link = response.meta['link']
        text = response.meta['text']
        email = response.xpath('//a[contains(@href, "mailto:")]/@href').get()
        yield {'Link': link,'Text': text,'Email': email}
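Note that the value extracted from @href still starts with the mailto: scheme and may carry a query string such as ?subject=... If only the bare address is wanted in the output, one possible approach is sketched below using the standard library. The helper name email_from_mailto and the sample address in the usage note are hypothetical and not part of the original answer.

```python
from urllib.parse import unquote, urlsplit


def email_from_mailto(href):
    """Return the address part of a mailto: link, or None if there is none."""
    if not href:
        # .get() returns None when the page has no mailto link.
        return None
    parts = urlsplit(href.strip())
    if parts.scheme != 'mailto':
        return None
    # The address is the path component; anything after '?' (subject, body)
    # ends up in parts.query and is dropped here.
    address = unquote(parts.path).strip()
    return address or None
```

In parse_listing() this could be applied as 'Email': email_from_mailto(email), so that a value like 'mailto:info@example.com?subject=Hello' is stored as 'info@example.com'.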