I'm currently writing a Twitter scraper that uses Scrapy to crawl and process the data, with Selenium as the automation layer: since Twitter itself is an interactive page, I can "scroll down" through the tweets and collect more data in a single pass.
With the MongoDB pipeline I set up, the processed data should in theory be sent to the preset database, but for some reason the pipeline is never invoked; I can't see any of its debug logging when the spider runs.
Spider code:
from scrapy import Spider
from scrapy.selector import Selector
from selenium import webdriver
import time
import html2text

from twittermongo.items import TwitterItem


class TwitterScraper(Spider):
    query = "nike"
    #Using BaseSpider to define rules
    ##name of spider for "scrapy crawl ____"
    name = "twitter"
    ##allowed_domains contains base-URLs for spider to crawl
    allowed_domains = ["twitter.com"]
    ##start_urls defines the list of urls for the spider to start crawling from
    start_urls = ["https://twitter.com/search?q=" + query + "&src=typd&vertical=default"]

    #Using PhantomJS to drive the interactive page
    def parse(self, response):
        #Init PhantomJS
        self.driver = webdriver.PhantomJS()
        #Set PhantomJS window size
        self.driver.set_window_size(1120, 550)
        #response.url is the url whose data will be parsed
        self.driver.get(response.url)
        #sleep so the Twitter page loads fully
        time.sleep(1)
        #counter scrolls down for the given number of pages
        count = 0
        scroll_down = 2
        while count < scroll_down:
            #javascript executed by the selenium webdriver scrolls down
            self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            #allow the page to fully load its contents
            time.sleep(2)
            count += 1
            print("aggregating tweets! step: " + str(count) + " of " + str(scroll_down))
        #init html xpath selection source
        hxs = Selector(text=self.driver.page_source)
        #initialize Html2Text
        h = html2text.HTML2Text()
        raw_tweets = hxs.xpath("//p[contains(@class,'tweet-text')]").extract()
        raw_names = hxs.xpath("//span[contains(@class, 'username')]/b/text()").extract()
        #reset counter
        count = 0
        for tweets in raw_tweets:
            #init Item
            item = TwitterItem()
            item['user'] = raw_names[count]
            item['tweet'] = h.handle(raw_tweets[count])
            count += 1
            yield item
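(The TwitterItem class isn't shown in the question. A minimal items.py sketch, inferring only the two fields the spider actually populates, might look like this:)

# items.py -- hypothetical sketch; only 'user' and 'tweet' are inferred from the spider above
from scrapy import Item, Field

class TwitterItem(Item):
    user = Field()
    tweet = Field()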
Settings code:
# -*- coding: utf-8 -*-
# Scrapy settings for twittermongo project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'twittermongo'
SPIDER_MODULES = ['twittermongo.spiders']
NEWSPIDER_MODULE = 'twittermongo.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'twittermongo (+http://www.yourdomain.com)'
#MongoDB settings
ITEM_PIPELINES = {'twittermongo.pipelines.MongoDBPipeline': 100,}
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "scraper"
MONGODB_COLLECTION = "tweets"
Pipeline code:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings
from scrapy.exceptions import DropItem
from scrapy import log
class TwittermongoPipeline(object):
    def process_item(self, item, spider):
        return item


class MongoDBPipeline(object):
    def __init__(self):
        connection = pymongo.Connection(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_items(self, item, spider):
        print("YOYOYOYOYOYOYOYO")
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        log.msg("Test added to MongoDB", level=log.DEBUG, spider=spider)
        if valid:
            self.collection.insert(dict(item))
            log.msg("Tweet added to MongoDB", level=log.DEBUG, spider=spider)
        return item
Answer 0 (score: 3):
The problem has nothing to do with Selenium disabling the pipeline. Your processing method has the wrong name: it is process_items, but it should be process_item (singular).
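For reference, a minimal sketch of the renamed method, reusing the body of the pipeline posted in the question (same class, same imports): Scrapy only calls a pipeline method named process_item, so once it is renamed the MongoDB inserts and the debug log lines should start to appear.

    # inside the MongoDBPipeline class from the question
    def process_item(self, item, spider):
        # Scrapy invokes this exact method name on every enabled item pipeline
        valid = True
        for data in item:
            if not data:
                valid = False
                raise DropItem("Missing {0}!".format(data))
        if valid:
            self.collection.insert(dict(item))
            log.msg("Tweet added to MongoDB", level=log.DEBUG, spider=spider)
        return item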