I need to run a crawler and store the scraped data in a database. I have collected my data, but I'm having trouble inserting it into the database.
My files are:
topcrawlerspider.py (my crawler, which works):
from scrapy import Spider, Item, Field, Request
from ..items import TopcrawlerItem
from ..pipelines import TopcrawlerPipeline
import time

class TopSpider(Spider):
    name = 'topcrawler'
    start_urls = ['...']

    def __init__(self, page=0, *args, **kwargs):
        super(TopSpider, self).__init__(*args, **kwargs)
        self.search_result_url_tpl = 'http://.../%s'
        ...
settings.py:
BOT_NAME = 'topcrawler'
SPIDER_MODULES = ['topcrawler.spiders']
NEWSPIDER_MODULE = 'topcrawler.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'topcrawler (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'topcrawler.pipelines.TopcrawlerPipeline': 300,
    # 'topcrawler.pipelines.JsonWriterPipeline': 800,
}
MONGODB_URI = 'mongodb://root:root@127.0.0.1:8889/mtdbdd'
MONGO_DATABASE = 'mtdbdd'
pipelines.py:
import pymongo
from settings import *

class TopcrawlerPipeline(object):
    collection_name = 'land'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert(dict(item))
        return item
I get this error:
ServerSelectionTimeoutError: localhost:27017: [Errno 8] nodename nor servname provided, or not known
It doesn't seem to be connecting to port 8889 as I intended, and I don't know why...
Thanks for your help!
Answer 0 (score: 0)
In the TopcrawlerPipeline class, in the open_spider method (in the pipelines.py file), you create the client twice:

self.client = pymongo.MongoClient(connect=False)
self.client = pymongo.MongoClient('mongodb://root:root@127.0.0.1:8889/mtdbdd')

I bet the error comes from the first one (which I assume is unintentional). Remove the first line and keep only the second.
Just a side note on where the error might come from. If you don't pass a connection string to MongoClient, it tries to connect to localhost on the default port 27017. Check your /etc/hosts file to see how localhost is defined (I'm assuming you're on Linux). On some systems localhost is assigned only an IPv6 address, and by default MongoDB does not listen on IPv6.
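One more observation, offered as a guess rather than a confirmed fix: the posted settings.py defines the setting as MONGODB_URI, while from_crawler looks up 'MONGO_URI'. A minimal sketch of why that mismatch alone would send pymongo to localhost:27017 (a plain dict stands in for Scrapy's settings object here):

```python
# Settings as posted in the question's settings.py
settings = {
    'MONGODB_URI': 'mongodb://root:root@127.0.0.1:8889/mtdbdd',
    'MONGO_DATABASE': 'mtdbdd',
}

# The pipeline's from_crawler reads 'MONGO_URI' -- a key that was never
# defined -- so the lookup silently returns None:
mongo_uri = settings.get('MONGO_URI')
print(mongo_uri)  # → None

# pymongo.MongoClient(None) then falls back to its default host and port,
# localhost:27017, which matches the ServerSelectionTimeoutError above.
# Making the two names agree restores the intended URI on port 8889:
mongo_uri = settings.get('MONGODB_URI')
print(mongo_uri)  # → mongodb://root:root@127.0.0.1:8889/mtdbdd
```

Either rename the setting to MONGO_URI in settings.py or change the lookup in from_crawler to 'MONGODB_URI'; the two just have to match.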