How do I access file resources from a Python egg when deploying a crawler with scrapyd?

Asked: 2016-11-28 13:52:51

Tags: python scrapy scrapy-spider scrapyd

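I want to load a JSON file from a spider running under scrapyd, but the file is never found, no matter how I reference it.

Normally I call open with the file name. I have tried placing the file in the egg root as well as in the project folder inside the egg, but in neither case is the file found.

If anyone knows how to do this and can show a complete example, I would be very grateful.

My setup.py looks like this: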

from setuptools import setup, find_packages

import os, sys
directory, filename = os.path.split(os.path.realpath(__file__))
sys.path.append(directory)

setup(
    name='myscraper',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = myscraper.local_settings']},
    install_requires=['selenium', 'scrapy', 'pyyaml', 'yamlordereddictloader', 'pyvirtualdisplay'],
    package_data={'mypackage': ['myscraper/configuration/seeds.json', 'myscraper/configuration/*.yml'],
                  },
    data_files=[("mydata", ["myscraper/configuration/seeds.json"])],
    include_package_data=True,
    zip_safe=False
)

Project structure:

- my_crawler
--- setup.py
--- myscraper
------- configuration
-------------seeds.json
------- myspider.py
------- ...

How do I read the JSON file in myspider.py? And how do I read all the YAML files in the configuration folder?

I would like to use code along these lines:

# how to get the content from seeds.json ?

content = pkgutil.get_data('mypackage', filename)

# how to walk the configuration directory from the egg?

for root, dirs, files in os.walk(config_dir):
        for config_file in files:
            config_file = open(os.path.join(root, config_file))
            config_dict = yaml.load(config_file, Loader=yamlordereddictloader.Loader)
            visit = config_dict.get("visit")
            self.configuration[visit] = config_dict
Related topics:

Egg caveats

https://github.com/scrapy/scrapyd-client

https://groups.google.com/forum/#!msg/scrapy-users/B70eq1_N3Fk/vR7aDeizj_sJ

https://support.scrapinghub.com/topics/1717-deploying-projects-with-resource-non-code-files/

https://support.scrapinghub.com/topics/725-including-additional-files-with-a-spider/

1 Answer:

Answer 0: (score: 0)

from pkg_resources import resource_string
...

    file_string = resource_string(
        __name__.split('.')[0], 
        'myscraper/configuration/seeds.json',
    )
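
To make the idea concrete, here is a minimal sketch of reading packaged data without touching the filesystem directly, which is what breaks when the project runs from a zipped egg. It is not taken from the answer above: it uses the standard-library `pkgutil.get_data` for the JSON file and `importlib.resources.files` (Python 3.9+) to list the YAML files in place of `os.walk`. It assumes the data files are actually packaged inside `myscraper`, for example with a `package_data` key that matches the package name and paths relative to that package, so the resource name is `configuration/seeds.json` rather than `myscraper/configuration/seeds.json`. The helper names `load_seeds`, `iter_yaml_configs` and `build_demo_egg`, the file `site_a.yml`, and the demo contents are hypothetical; `build_demo_egg` only simulates a deployed egg so the example can run on its own.

```python
import importlib
import importlib.resources
import json
import os
import pkgutil
import sys
import tempfile
import zipfile


def load_seeds(package="myscraper"):
    """Read configuration/seeds.json through the package loader.

    The resource name is relative to the package directory, so this also
    works when the package lives inside a zipped egg.
    """
    raw = pkgutil.get_data(package, "configuration/seeds.json")
    return json.loads(raw.decode("utf-8"))


def iter_yaml_configs(package="myscraper"):
    """Yield (file name, text) for each .yml file in <package>/configuration."""
    config_dir = importlib.resources.files(package).joinpath("configuration")
    for entry in sorted(config_dir.iterdir(), key=lambda e: e.name):
        if entry.name.endswith(".yml"):
            yield entry.name, entry.read_text(encoding="utf-8")


def build_demo_egg(directory):
    """Hypothetical stand-in for the deployed egg: a zip with package and data."""
    egg_path = os.path.join(directory, "myscraper-1.0.egg")
    with zipfile.ZipFile(egg_path, "w") as egg:
        egg.writestr("myscraper/__init__.py", "")
        egg.writestr(
            "myscraper/configuration/seeds.json",
            '{"seeds": ["http://example.com"]}',
        )
        egg.writestr("myscraper/configuration/site_a.yml", "visit: site_a\n")
    return egg_path


# Simulate deployment: put the zipped egg on sys.path, as a zipped install would.
demo_dir = tempfile.mkdtemp()
sys.path.insert(0, build_demo_egg(demo_dir))
importlib.invalidate_caches()

seeds = load_seeds()
yaml_configs = dict(iter_yaml_configs())
```

In a real spider, each text returned by `iter_yaml_configs` could then be passed to `yaml.load` with the loader from the question instead of opening files by path.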