Python Scrapy keeps scraping the same elements over and over

Time: 2017-03-29 11:14:19

Tags: python html css web-scraping scrapy

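I'm trying to learn Scrapy, and I'm practicing on the Yelp website (this LINK). When Scrapy runs, though, it scrapes the same phone number and address over and over instead of scraping the different sections. The selector I use matches all the "li" tags of a specific class, one per restaurant on the page. Each li tag contains one restaurant's information, which I extract with the appropriate selectors, but Scrapy gives me repeated results from only 2 or 3 restaurants. For some reason Scrapy keeps using the same sections again and again, when it should move past them once they have been handled in the for loop. Here is the code: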

try:
    import scrapy
    from urlparse import urljoin
except ImportError:
    print "\nERROR IMPORTING THE NECESSARY LIBRARIES\n"

#scrapy.optional_features.remove('boto')

url = raw_input('ENTER THE SITE URL : ')

class YelpSpider(scrapy.Spider):
    name = 'yelp spider'
    start_urls = [url]

    def parse(self, response):
        SET_SELECTOR = '.regular-search-result'

        #Going over each li tag of this class; each one contains a restaurant

        for yelp in response.css(SET_SELECTOR):

            #getting a selector for the link to scrape website info from another page
            selector = '.indexed-biz-name a ::attr(href)'

            #getting the complete url joining the extracted part
            momo = urljoin(response.url, yelp.css(selector).extract_first())

            #All the selectors
            name = '.indexed-biz-name a span ::text'
            services = '.category-str-list a ::text'
            address1 = '.neighborhood-str-list ::text'
            address2 = 'address ::text'
            phone = '.biz-phone ::text'

            #extracting them and adding them to a dict
            try:
                add1 = response.css(address1).extract_first().replace('\n','').replace('\n','')
                add2 = response.css(address2).extract_first().replace('\n','').replace('\n','')
                ADDRESS = add1 + ' ' + add2

                pookiebanana = {

                    "PHONE": response.css(phone).extract_first().replace('\n','').replace('\t',''),
                    "NAME": response.css(name).extract_first().replace('\n','').replace('\t',''),
                    "SERVICES": response.css(services).extract_first().replace('\n','').replace('\t',''),
                    "ADDRESS": ADDRESS,
                }
            except:
                pass

            #Opening another page passing the old dict
            Post = scrapy.Request(momo, callback=self.parse_yelp, meta={'item': pookiebanana})

            #yielding the dict with the website scraped
            yield Post

        #Clicking the next button and recursively calling the same function with the same link
        NEXT_PAGE_SELECTOR = '.u-decoration-none.next.pagination-links_anchor  ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

    def parse_yelp(self, response):
        #Website selector opening a new page from the link we extracted
        WEBSITE_SELECTOR = '.biz-website.js-add-url-tagging a ::text'

        item = response.meta['item']

        #inside the try block extracting the website info and returning the modified dict
        try:
            item['WEBSITE'] = ' '.join(response.css(WEBSITE_SELECTOR).extract_first().split(' '))
        except:
            pass
        return item

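I've commented the code extensively to explain what I'm doing. What am I doing wrong?

Here is a screenshot of the output CSV showing the duplicates: PICTURE

HERE is the Scrapy output; as you can see, it scrapes the same thing over and over: PIC. What is happening, and what am I doing wrong?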

1 Answer:

Answer 0 (score: 2)

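I can't test it, but inside the for yelp loop you should use yelp.css(), whereas you are using response.css(). response.css() searches the whole page on every iteration, so extract_first() always returns the first match on the page rather than the match inside the current li.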