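I am trying to learn Scrapy, and I am practicing on the Yelp website (this LINK). When Scrapy runs, it scrapes the same phone number and address over and over instead of scraping the different listings. The selector I use matches all the "li" tags of a specific class, one per restaurant on the page; each li tag contains one restaurant's information, which I extract with the appropriate selectors. Yet Scrapy gives me repeated results for only 2 or 3 restaurants. For some reason Scrapy keeps using the same elements again and again, when it should move past them once they have been handled in the for loop. Here is the code: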
try:
    import scrapy
    from urlparse import urljoin
except ImportError:
    print "\nERROR IMPORTING THE NESSASARY LIBRARIES\n"

#scrapy.optional_features.remove('boto')
url = raw_input('ENTER THE SITE URL : ')

class YelpSpider(scrapy.Spider):
    name = 'yelp spider'
    start_urls = [url]

    def parse(self, response):
        SET_SELECTOR = '.regular-search-result'
        # Go over each li tag of this class; each one holds a single restaurant
        for yelp in response.css(SET_SELECTOR):
            # Selector for the link to the restaurant's own page
            selector = '.indexed-biz-name a ::attr(href)'
            # Build the complete URL by joining the extracted part
            momo = urljoin(response.url, yelp.css(selector).extract_first())
            # All the selectors
            name = '.indexed-biz-name a span ::text'
            services = '.category-str-list a ::text'
            address1 = '.neighborhood-str-list ::text'
            address2 = 'address ::text'
            phone = '.biz-phone ::text'
            # Extract the values and add them to a dict
            try:
                add1 = response.css(address1).extract_first().replace('\n','').replace('\n','')
                add2 = response.css(address2).extract_first().replace('\n','').replace('\n','')
                ADDRESS = add1 + ' ' + add2
                pookiebanana = {
                    "PHONE": response.css(phone).extract_first().replace('\n','').replace('\t',''),
                    "NAME": response.css(name).extract_first().replace('\n','').replace('\t',''),
                    "SERVICES": response.css(services).extract_first().replace('\n','').replace('\t',''),
                    "ADDRESS": ADDRESS,
                }
            except:
                pass
            # Open the restaurant's page, passing along the dict built so far
            Post = scrapy.Request(momo, callback=self.parse_yelp, meta={'item': pookiebanana})
            # Yield the request; parse_yelp adds the scraped website to the dict
            yield Post

        # Follow the "next" button and call this same function on the next page
        NEXT_PAGE_SELECTOR = '.u-decoration-none.next.pagination-links_anchor ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_yelp(self, response):
        # Website selector, applied to the page opened from the extracted link
        WEBSITE_SELECTOR = '.biz-website.js-add-url-tagging a ::text'
        item = response.meta['item']
        # Inside the try block, extract the website info and return the modified dict
        try:
            item['WEBSITE'] = ' '.join(response.css(WEBSITE_SELECTOR).extract_first().split(' '))
        except:
            pass
        return item
I have commented the code extensively to explain what I did. What am I doing wrong?
Answer 0 (score: 2)
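I can't test this, but inside the for yelp loop you should use yelp.css(), and you are using response.css(). Calling response.css() searches the whole page, so extract_first() returns the first match on the page on every iteration, whereas yelp.css() searches only inside the current li element.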
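To make the difference concrete, here is a minimal sketch that reproduces the effect without Scrapy. It is an analogy, not the spider itself: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and the markup, restaurant names, and phone numbers are invented for illustration. Querying from the document root on each iteration plays the role of response.css(), and querying from the current item plays the role of yelp.css().

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a search-results page (not real Yelp HTML).
PAGE = """
<ul>
  <li class="regular-search-result"><span class="biz-name">Alpha</span><span class="biz-phone">111</span></li>
  <li class="regular-search-result"><span class="biz-name">Beta</span><span class="biz-phone">222</span></li>
  <li class="regular-search-result"><span class="biz-name">Gamma</span><span class="biz-phone">333</span></li>
</ul>
"""

root = ET.fromstring(PAGE)
results = root.findall(".//li[@class='regular-search-result']")

PHONE = ".//span[@class='biz-phone']"

# Page-wide query inside the loop (analogous to response.css(...)):
# the first match in the whole document is returned every time.
page_wide = [root.findtext(PHONE) for _ in results]

# Query relative to the current item (analogous to yelp.css(...)):
# each iteration sees only its own li element.
per_item = [item.findtext(PHONE) for item in results]

print(page_wide)
print(per_item)
```

In the spider from the question, the corresponding change would be to replace response.css(phone), response.css(name), response.css(services), response.css(address1) and response.css(address2) inside the loop with the same calls on yelp.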