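In Django Dynamic Scraper, setting the type of a scraped object class's scraped_obj_attr to STANDARD (UPDATE) appears to make that element mandatory: if the spider cannot get data for the element, it drops the result. When I change the type to STANDARD, the spider no longer drops results that lack the element. This is undesirable, because I want optional elements to be updated whenever they happen to exist.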
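I have confirmed that the database holds 'f' in the "mandatory" column of the corresponding row in the dynamic_scraper_scraperelem table. The form in the Django admin does not show the element as mandatory either. I have also read all the relevant documentation and Q&A threads on the topic. Everything else about DDS and my Django installation seems to work fine.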
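I dug into the code a bit and found the following in dynamic_scraper.pipelines, in the ValidationPipeline class. Line 5 looks significant: perhaps this "if" condition is being satisfied unexpectedly?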
class ValidationPipeline(object):
    def process_item(self, item, spider):
        url_elem = spider.scraper.get_detail_page_url_elem()
        url_name = url_elem.scraped_obj_attr.name
        if url_name in item and item[url_name][0:6] == 'DOUBLE':
            mandatory_elems = spider.scraper.get_standard_update_elems()
        else:
            mandatory_elems = spider.scraper.get_mandatory_scrape_elems()
        for elem in mandatory_elems:
            if not elem.scraped_obj_attr.name in item or\
                (elem.scraped_obj_attr.name in item and not item[elem.scraped_obj_attr.name]):
                spider.log("Mandatory elem " + elem.scraped_obj_attr.name + " missing!", log.ERROR)
                raise DropItem()
        ...
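To make the suspicion concrete, here is a minimal sketch that reproduces only the branch from the excerpt, using stand-in objects instead of the real DDS models. Attr, Elem, FakeScraper, missing_required and the sample items are all hypothetical names of mine, not part of DDS. I am also assuming that the URL value starts with 'DOUBLE' when the object is already stored; that is a guess from the string itself, not something I have confirmed.

```python
from collections import namedtuple

# Hypothetical stand-ins for the DDS model objects; only the attributes
# touched by the excerpt are modelled here.
Attr = namedtuple("Attr", "name")
Elem = namedtuple("Elem", "scraped_obj_attr mandatory")


class FakeScraper(object):
    """Hypothetical replacement for spider.scraper, exposing the three getters used."""

    def __init__(self, url_elem, standard_update_elems, mandatory_elems):
        self._url_elem = url_elem
        self._standard_update_elems = standard_update_elems
        self._mandatory_elems = mandatory_elems

    def get_detail_page_url_elem(self):
        return self._url_elem

    def get_standard_update_elems(self):
        return self._standard_update_elems

    def get_mandatory_scrape_elems(self):
        return self._mandatory_elems


def missing_required(item, scraper):
    """Return the names that the excerpt's loop would report as missing."""
    url_name = scraper.get_detail_page_url_elem().scraped_obj_attr.name
    if url_name in item and item[url_name][0:6] == 'DOUBLE':
        # Same branch as line 5 of the excerpt: STANDARD (UPDATE) elems are checked.
        required = scraper.get_standard_update_elems()
    else:
        required = scraper.get_mandatory_scrape_elems()
    # An absent key and an empty value are both treated as missing.
    return [e.scraped_obj_attr.name for e in required
            if not item.get(e.scraped_obj_attr.name)]


url = Elem(Attr("url"), True)
title = Elem(Attr("title"), True)
price = Elem(Attr("price"), False)  # STANDARD (UPDATE) with mandatory = False
scraper = FakeScraper(url, [price], [url, title])

new_item = {"url": "http://example.com/1", "title": "A"}
seen_item = {"url": "DOUBLEhttp://example.com/1", "title": "A"}

print(missing_required(new_item, scraper))   # nothing missing
print(missing_required(seen_item, scraper))  # "price" is reported as missing
```

In this sketch the "mandatory" flag of price is never consulted once the 'DOUBLE' branch is taken, which matches what I am seeing: the item is dropped even though the element is not marked mandatory.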
Is this the expected behavior?
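Environment: Ubuntu 12.04.1 32-bit, using the system packages for Python 2.7.3, PostgreSQL 9.1 and Supervisor. Nginx is the latest stable version from the PPA. I am running the spider manually for testing: "scrapy crawl my_spider -a id=1 -a do_action=yes".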
my_scraper.scraper.settings.py:
import sys
import os.path
PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))
sys.path = sys.path + [os.path.join(PROJECT_ROOT, '../../..'), os.path.join(PROJECT_ROOT, '../..')]
from django.core.management import setup_environ
import mysite.settings
setup_environ(mysite.settings)
BOT_NAME = 'mybot'
BOT_VERSION = '1.0'
SPIDER_MODULES = ['dynamic_scraper.spiders', 'my_scraper.scraper',]
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)
ITEM_PIPELINES = [
    'dynamic_scraper.pipelines.DjangoImagesPipeline',
    'dynamic_scraper.pipelines.ValidationPipeline',
    'my_scraper.scraper.pipelines.DjangoWriterPipeline',
]
IMAGES_STORE = os.path.join(PROJECT_ROOT, '../thumbnails')
DSCRAPER_LOG_ENABLED = True
DSCRAPER_LOG_LEVEL = 'INFO'
DSCRAPER_LOG_LIMIT = 5
DOWNLOAD_DELAY = 2
CONCURRENT_SPIDERS = 1
virtualenv:
$ pip freeze
Django==1.4.3
Fabric==1.5.1
PIL==1.1.7
Pillow==1.7.8
Scrapy==0.14.4
South==0.7.6
Twisted==12.3.0
amqp==1.0.6
amqplib==1.0.2
anyjson==0.3.3
argparse==1.2.1
billiard==2.7.3.19
celery==2.5.3
django-appconf==0.5
django-celery==2.5.5
django-dynamic-scraper==0.2.3
django-forms-bootstrap==2.0.3.post1
django-kombu==0.9.4
django-picklefield==0.3.0
django-user-accounts==1.0b7
gevent==0.13.8
greenlet==0.4.0
gunicorn==0.17.1
httplib2==0.7.7
kombu==2.1.8
lxml==3.1beta1
metron==1.0
numpy==1.6.2
oauth2==1.5.211
paramiko==1.9.0
pinax-theme-bootstrap==2.2.2
pinax-theme-bootstrap-account==1.0b2
pinax-utils==1.0b1.dev3
psycopg2==2.4.6
pyOpenSSL==0.13
pycrypto==2.6
python-dateutil==1.5
pytz==2012d
six==1.2.0
w3lib==1.2
wsgiref==0.1.2
zope.interface==4.0.3