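Here is my code so far. It should scrape each restaurant's link, name, and email address. Everything worked until I added the email step, even though it does return the email address.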
import scrapy
from scrapy import Request


class RestaurantSpider(scrapy.Spider):
    name = 'restaurant'
    start_urls = [
        'https://www.tripadvisor.com.my/Restaurants-g298570-Kuala_Lumpur_Wilayah_Persekutuan.html#EATERY_OVERVIEW_BOX']
def parse is where I collect all the listings from the main page, then go through each results page and visit every restaurant's page.
    def parse(self, response):
        listings = response.xpath(
            '//div[@class="restaurants-list-ListCell__cellContainer--2mpJS"]')
        for listing in listings:
            link = listing.xpath(
                './/a[@class="restaurants-list-ListCell__restaurantName--2aSdo"]/@href').extract_first()
            text = listing.xpath(
                './/a[@class="restaurants-list-ListCell__restaurantName--2aSdo"]/text()').extract_first()
            yield scrapy.Request(url=response.urljoin(link),
                                 callback=self.parse_listing,
                                 meta={
                                     'Link': link,
                                     'Text': text
                                 }
                                 )
        next_urls = response.xpath(
            '//*[@class="nav next rndBtn ui_button primary taLnk"]/@href').extract()
        for next_url in next_urls:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)
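def parse_listing is where I read the email address from a specific restaurant's page and then yield the required data, which is later stored in a .csv file.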
    def parse_listing(self, response):
        link = response.meta['link']
        text = response.meta['text']
        email = response.xpath(
            '//a[contains(@href, "mailto")]/@href').extract_first()
        yield {
            'Link': link,
            'Text': text,
            'Email': email
        }
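Answer 0 (score: 0)
Use 'href' instead of 'link'.
I can't reproduce your code, but 'link' doesn't appear to be an attribute of the anchor element, so capture 'href' instead: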
<a href="/Restaurant_Review-g298570-d15211507-Reviews-Vintage_1988_Cafe-Kuala_Lumpur_Wilayah_Persekutuan.html" class="restaurants-list-ListCell__restaurantName--2aSdo" target="_blank">Vintage 1988 Cafe</a>
link = response.meta['href']
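Answer 1 (score: 0)
You defined meta={'Link':link,'Text':text} in the parse() method, but in the parse_listing() method you look the values up with the wrong keys ('link' and 'text', in lowercase), which causes the error. Your XPath expressions are also error-prone.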
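The key mismatch can be reproduced without Scrapy, because the meta attached to a request is an ordinary dict and its keys are case-sensitive. A minimal sketch, assuming a made-up meta dict and a hypothetical helper read_link (neither comes from the code above):

```python
# Hypothetical stand-in for the dict that parse() attaches to each request.
meta = {'Link': '/Restaurant_Review-example.html', 'Text': 'Vintage 1988 Cafe'}


def read_link(meta):
    """Look up the lowercase key, the way the question's parse_listing() does."""
    try:
        return meta['link']
    except KeyError as exc:
        # The stored key is 'Link', so the lowercase lookup fails.
        return 'KeyError: {}'.format(exc)


result = read_link(meta)   # describes the failed lookup
fixed = meta['Link']       # the capitalized key is the one that exists
print(result)
print(fixed)
```

Either spelling works as long as the key written in parse() and the key read in parse_listing() are identical.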
Try the following approach to make it work:
import scrapy


class RestaurantSpider(scrapy.Spider):
    name = 'restaurant'
    start_urls = [
        'https://www.tripadvisor.com.my/Restaurants-g298570-Kuala_Lumpur_Wilayah_Persekutuan.html#EATERY_OVERVIEW_BOX'
    ]

    def parse(self, response):
        # Match on stable parts of the class names; the hashed suffixes can change.
        for listing in response.xpath('//div[contains(@class,"__cellContainer--")]'):
            link = listing.xpath('.//a[contains(@class,"__restaurantName--")]/@href').get()
            text = listing.xpath('.//a[contains(@class,"__restaurantName--")]/text()').get()
            complete_url = response.urljoin(link)
            yield scrapy.Request(
                url=complete_url,
                callback=self.parse_listing,
                meta={'link': complete_url, 'text': text}
            )

        # Follow the pagination link, if there is one.
        next_url = response.xpath('//*[contains(@class,"pagination")]/*[contains(@class,"next")]/@href').get()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

    def parse_listing(self, response):
        # Same lowercase keys that parse() stored in meta.
        link = response.meta['link']
        text = response.meta['text']
        email = response.xpath('//a[contains(@href, "mailto:")]/@href').get()
        yield {'Link': link, 'Text': text, 'Email': email}
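Note that the XPath above returns the whole href, so the Email field will still carry the mailto: prefix and possibly a query string. If only the bare address should end up in the .csv file, something along these lines could be applied before yielding. This is a sketch only: email_from_mailto is a hypothetical helper, not part of Scrapy, and the sample addresses are made up.

```python
from urllib.parse import unquote, urlsplit


def email_from_mailto(href):
    """Return the bare address from a mailto: href, or None if there is none."""
    if not href:
        return None
    parts = urlsplit(href)
    if parts.scheme != 'mailto':
        # Not a mailto link (for example, a relative page URL).
        return None
    # parts.path holds the address; any ?subject=... part lands in parts.query.
    return unquote(parts.path).strip() or None


print(email_from_mailto('mailto:info@example.com?subject=Hi'))
```

In parse_listing() this would be used as email = email_from_mailto(response.xpath(...).get()), which also handles pages that have no mailto link at all.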