Django Dynamic Scraper: does STANDARD(UPDATE) imply a mandatory element?

Asked: 2013-01-10 13:47:31

Tags: python django scrapy

Question:

In Django Dynamic Scraper, setting a scraped object class's scraped_obj_attr type to STANDARD(UPDATE) appears to make that element mandatory: if the spider cannot obtain data for it, the whole result is dropped. When I change the type back to STANDARD, the spider no longer drops results that are missing the element. This is undesirable, because I want optional elements to be updated when they happen to be present.
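For context, the policy I am after is "update if present, leave alone if absent". A minimal pure-Python sketch of that policy (the function and attribute names are illustrative, not part of the DDS API):

```python
# Hedged sketch of the desired behavior: overwrite an optional attribute
# only when the scraper actually returned a non-empty value for it.
def apply_update(existing, scraped, optional_attrs):
    for attr in optional_attrs:
        value = scraped.get(attr)
        if value:  # only overwrite when something was scraped
            existing[attr] = value
    return existing

obj = {'title': 'Old', 'price': '9.99'}
# 'title' was scraped, 'price' came back empty -> keep the old price.
print(apply_update(obj, {'title': 'New', 'price': ''}, ['title', 'price']))
# → {'title': 'New', 'price': '9.99'}
```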

What I've checked:

I have confirmed that the database contains 'f' in the "mandatory" column of the corresponding row in the dynamic_scraper_scraperelem table. The form in the Django admin likewise shows no mandatory flag for that element. I have also read all the relevant documentation and Q&A threads on the subject. Everything else about DDS and my Django installation appears to be working.
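The verification query is straightforward SQL; here is an illustrative sketch against an in-memory SQLite database using the table and column names from above (the real backend is PostgreSQL, where the boolean displays as 'f', so this is only a stand-in):

```python
import sqlite3

# Illustrative only: mimics checking the "mandatory" column of
# dynamic_scraper_scraperelem, as described in the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dynamic_scraper_scraperelem "
    "(id INTEGER PRIMARY KEY, mandatory BOOLEAN)"
)
conn.execute(
    "INSERT INTO dynamic_scraper_scraperelem (id, mandatory) VALUES (1, 0)"
)  # 0 corresponds to Postgres 'f' (false)
row = conn.execute(
    "SELECT mandatory FROM dynamic_scraper_scraperelem WHERE id = 1"
).fetchone()
print(bool(row[0]))  # → False: the element is NOT flagged mandatory
```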

I dug into the code a bit and found this in the ValidationPipeline class in dynamic_scraper.pipelines. Line 5 looks significant: perhaps that "if" condition is being satisfied unexpectedly?

class ValidationPipeline(object):

    def process_item(self, item, spider):
        url_elem = spider.scraper.get_detail_page_url_elem()
        url_name = url_elem.scraped_obj_attr.name
        if url_name in item and item[url_name][0:6] == 'DOUBLE':
            mandatory_elems = spider.scraper.get_standard_update_elems()
        else:
            mandatory_elems = spider.scraper.get_mandatory_scrape_elems()
        for elem in mandatory_elems:
            if not elem.scraped_obj_attr.name in item or\
                (elem.scraped_obj_attr.name in item and not item[elem.scraped_obj_attr.name]):
                spider.log("Mandatory elem " + elem.scraped_obj_attr.name + " missing!", log.ERROR)
                raise DropItem()
        ...
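The branch above can be isolated and exercised with plain dicts to see why STANDARD(UPDATE) elements end up treated as mandatory. A standalone sketch of the same logic (function names and the 'price'/'title' attributes are illustrative, not DDS objects):

```python
# Standalone reproduction of the ValidationPipeline branch logic:
# if the detail-page URL value starts with 'DOUBLE' (DDS's marker for
# an already-scraped page), validation runs against the
# STANDARD(UPDATE) elements instead of the elements flagged mandatory.
def select_validation_elems(item, url_name, standard_update_elems, mandatory_elems):
    if url_name in item and item[url_name][0:6] == 'DOUBLE':
        return standard_update_elems
    return mandatory_elems

def is_dropped(item, elems):
    # An item is dropped if any selected element is absent or empty.
    return any(name not in item or not item[name] for name in elems)

# A 'DOUBLE' URL makes the STANDARD(UPDATE) attribute 'price' required:
item = {'url': 'DOUBLE:http://example.com/1', 'title': 'Widget'}
elems = select_validation_elems(item, 'url', ['price'], ['title'])
print(is_dropped(item, elems))  # → True: 'price' is missing, item is dropped
```

So on a repeat visit to a page, every STANDARD(UPDATE) element is validated as if mandatory, which would match the behavior described in the question.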

Question:

Is this the expected behavior?

Configuration:

Ubuntu 12.04.1 32-bit, with system packages for Python 2.7.3, PostgreSQL 9.1, and Supervisor. Nginx is the latest stable release from a PPA. I am running the spider manually for testing: "scrapy crawl my_spider -a id=1 -a do_action=yes".

my_scraper.scraper.settings.py:

import sys
import os.path

PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))
sys.path = sys.path + [os.path.join(PROJECT_ROOT, '../../..'), os.path.join(PROJECT_ROOT, '../..')]

from django.core.management import setup_environ
import mysite.settings
setup_environ(mysite.settings)

BOT_NAME = 'mybot'
BOT_VERSION = '1.0'

SPIDER_MODULES = ['dynamic_scraper.spiders', 'my_scraper.scraper',]
USER_AGENT = '%s/%s' % (BOT_NAME, BOT_VERSION)

ITEM_PIPELINES = [
    'dynamic_scraper.pipelines.DjangoImagesPipeline',
    'dynamic_scraper.pipelines.ValidationPipeline',
    'my_scraper.scraper.pipelines.DjangoWriterPipeline',
]

IMAGES_STORE = os.path.join(PROJECT_ROOT, '../thumbnails')

DSCRAPER_LOG_ENABLED = True
DSCRAPER_LOG_LEVEL = 'INFO'
DSCRAPER_LOG_LIMIT = 5

DOWNLOAD_DELAY = 2
CONCURRENT_SPIDERS = 1

virtualenv:

$ pip freeze
Django==1.4.3
Fabric==1.5.1
PIL==1.1.7
Pillow==1.7.8
Scrapy==0.14.4
South==0.7.6
Twisted==12.3.0
amqp==1.0.6
amqplib==1.0.2
anyjson==0.3.3
argparse==1.2.1
billiard==2.7.3.19
celery==2.5.3
django-appconf==0.5
django-celery==2.5.5
django-dynamic-scraper==0.2.3
django-forms-bootstrap==2.0.3.post1
django-kombu==0.9.4
django-picklefield==0.3.0
django-user-accounts==1.0b7
gevent==0.13.8
greenlet==0.4.0
gunicorn==0.17.1
httplib2==0.7.7
kombu==2.1.8
lxml==3.1beta1
metron==1.0
numpy==1.6.2
oauth2==1.5.211
paramiko==1.9.0
pinax-theme-bootstrap==2.2.2
pinax-theme-bootstrap-account==1.0b2
pinax-utils==1.0b1.dev3
psycopg2==2.4.6
pyOpenSSL==0.13
pycrypto==2.6
python-dateutil==1.5
pytz==2012d
six==1.2.0
w3lib==1.2
wsgiref==0.1.2
zope.interface==4.0.3

0 Answers:

There are no answers.