Does using Selenium with Scrapy disable the pipeline? How do I re-enable it?

Date: 2015-05-22 16:33:41

Tags: mongodb selenium twitter scrapy

I am currently writing a Twitter scraper with Scrapy to collect and process the data, using Selenium as an automation tool, since Twitter itself is an interactive page: Selenium lets me "scroll down" through the tweets and gather more data in a single pass.

With the MongoDB pipeline I set up, the processed data should in theory be sent to the configured database, but for some reason the pipeline never runs: I don't see any of its debug log output.

Spider code:

import time

import html2text
from scrapy import Spider, Selector
from selenium import webdriver

from twittermongo.items import TwitterItem


class TwitterScraper(Spider):
    query = "nike"
    # Name of the spider, used for "scrapy crawl twitter"
    name = "twitter"
    # Base domains the spider is allowed to crawl
    allowed_domains = ["twitter.com"]
    # URLs the spider starts crawling from
    start_urls = ["https://twitter.com/search?q=" + query + "&src=typd&vertical=default"]

    def parse(self, response):
        # Initialize the headless PhantomJS driver
        self.driver = webdriver.PhantomJS()
        # Set the PhantomJS window size
        self.driver.set_window_size(1120, 550)
        # Load the URL whose data is to be parsed
        self.driver.get(response.url)
        # Sleep to let the Twitter page fully load
        time.sleep(1)
        # Scroll down scroll_down times to load more tweets
        count = 0
        scroll_down = 2
        while count < scroll_down:
            # JavaScript executed by the Selenium webdriver scrolls to the bottom
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            # Allow the newly loaded tweets to render
            time.sleep(2)
            count += 1
            print("aggregating tweets! step: " + str(count) + " of " + str(scroll_down))
        # Build a Scrapy selector over the rendered page source
        hxs = Selector(text=self.driver.page_source)
        # html2text converts tweet HTML into plain text
        h = html2text.HTML2Text()
        raw_tweets = hxs.xpath("//p[contains(@class,'tweet-text')]").extract()
        raw_names = hxs.xpath("//span[contains(@class, 'username')]/b/text()").extract()
        # Reset counter and pair each username with its tweet
        count = 0
        for tweets in raw_tweets:
            item = TwitterItem()
            item['user'] = raw_names[count]
            item['tweet'] = h.handle(raw_tweets[count])
            count += 1
            yield item
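The manual counter in the final loop can also be replaced by `zip`, which pairs each username with its tweet directly. A minimal sketch, using hypothetical stand-in lists in place of the XPath extraction results (and omitting the html2text conversion):

```python
# Hypothetical stand-ins for the raw_names / raw_tweets XPath results
raw_names = ["alice", "bob"]
raw_tweets = ["<p>hello world</p>", "<p>just do it</p>"]

# zip() pairs each username with its tweet, replacing the manual counter
items = [{"user": name, "tweet": tweet}
         for name, tweet in zip(raw_names, raw_tweets)]
```

This also stops silently when one list is shorter than the other, instead of raising an IndexError the way `raw_names[count]` would.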

Settings code:

# -*- coding: utf-8 -*-

# Scrapy settings for twittermongo project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#

BOT_NAME = 'twittermongo'

SPIDER_MODULES = ['twittermongo.spiders']
NEWSPIDER_MODULE = 'twittermongo.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'twittermongo (+http://www.yourdomain.com)'

#MongoDB settings
ITEM_PIPELINES = {'twittermongo.pipelines.MongoDBPipeline': 100,}
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "scraper"
MONGODB_COLLECTION = "tweets"

Pipeline code:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log

class TwittermongoPipeline(object):
    def process_item(self, item, spider):
        return item

class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.Connection(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_items(self, item, spider):
        print("YOYOYOYOYOYOYOYO")
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
                log.msg("Test added to MongoDB", level=log.DEBUG, spider=spider)
            if valid:
                self.collection.insert(dict(item))
                log.msg("Tweet added to MongoDB", level=log.DEBUG, spider=spider)
            return item

1 Answer:

Answer 0 (score: 3):

The problem has nothing to do with Selenium disabling the pipeline.

Your processing method is misnamed: it is process_items, but it should be process_item (singular).
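Scrapy looks up that exact method name on every enabled pipeline, so a method called process_items is simply never invoked and items pass through untouched. After the rename, the pipeline might look like the sketch below. FakeCollection is a hypothetical in-memory stand-in so the example runs without pymongo, and ValueError replaces scrapy's DropItem; the method name is the only actual fix.

```python
class FakeCollection:
    """In-memory stand-in for a pymongo collection (illustration only)."""
    def __init__(self):
        self.docs = []

    def insert(self, doc):
        self.docs.append(doc)


class MongoDBPipeline(object):
    def __init__(self):
        # The real pipeline builds this from pymongo and the Scrapy settings
        self.collection = FakeCollection()

    # Scrapy calls this exact method name for every item the spider yields;
    # a method named process_items is silently ignored.
    def process_item(self, item, spider):
        for field, value in dict(item).items():
            if not value:
                # scrapy.exceptions.DropItem in the real pipeline
                raise ValueError("Missing {0}!".format(field))
        self.collection.insert(dict(item))
        return item
```

With the rename in place, the "Tweet added to MongoDB" debug log from the original pipeline should start appearing in the crawl output.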