Running a Scrapy spider from a script

Date: 2014-02-09 17:53:13

Tags: python python-2.7 scrapy

I want to run my spider from a script rather than via scrapy crawl.

I found this page:

http://doc.scrapy.org/en/latest/topics/practices.html

But it doesn't actually say where to put the script.

Any help?

4 answers:

Answer 0 (score: 22)

Simple and straightforward :)

Just check the official documentation. I would make one small change there, so you can control the spider to run only when you do python myscript.py, and not every time you merely import from it. Just add an if __name__ == "__main__":

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    pass

if __name__ == "__main__":
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider)
    process.start()  # the script will block here until the crawling is finished

Now save the file as myscript.py and run python myscript.py.
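Since this version defines the spider in the same file, the script does not need to live inside a Scrapy project; any location works as long as Scrapy itself is installed.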

Enjoy!

Answer 1 (score: 5)

Fortunately the Scrapy source is open, so you can look at how the crawl command works and do the same thing in your own code:

...
# inside a ScrapyCommand, self.crawler_process is already available;
# spname is the spider name from the command line, and opts.spargs holds its -a arguments
crawler = self.crawler_process.create_crawler()
spider = crawler.spiders.create(spname, **opts.spargs)
crawler.crawl(spider)
self.crawler_process.start()
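The snippet above only works inside a Scrapy command, where self.crawler_process is set up for you. As a rough standalone equivalent, here is a minimal sketch using the current CrawlerProcess API (the spider name 'myspider' and the category argument are placeholders, and the script is assumed to be run from inside the project directory):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project settings so the spider can be looked up by name
process = CrawlerProcess(get_project_settings())
# a spider name works like on the command line; keyword arguments behave like -a options
process.crawl('myspider', category='electronics')
process.start()  # blocks until the crawl is finished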

Answer 2 (score: 2)

You can write a plain Python script and then use Scrapy's runspider command-line option, which lets you run a spider without having to create a project.

For example, you could create a single file stackoverflow_spider.py with the following content:

import scrapy
import scrapy.contrib.loader  # imported explicitly so scrapy.contrib.loader is guaranteed to be available

class QuestionItem(scrapy.item.Item):
    idx = scrapy.item.Field()
    title = scrapy.item.Field()

class StackoverflowSpider(scrapy.spider.Spider):
    name = 'SO'
    start_urls = ['http://stackoverflow.com']

    def parse(self, response):
        sel = scrapy.selector.Selector(response)
        questions = sel.css('#question-mini-list .question-summary')
        for i, elem in enumerate(questions):
            # load one item per question summary on the front page
            l = scrapy.contrib.loader.ItemLoader(QuestionItem(), elem)
            l.add_value('idx', i)
            l.add_xpath('title', ".//h3/a/text()")
            yield l.load_item()

Then, provided Scrapy is installed correctly, you can run it with:

scrapy runspider stackoverflow_spider.py -t json -o questions-items.json
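Here -o writes the scraped items to questions-items.json and -t selects the export format; in newer Scrapy versions the format is inferred from the -o file extension, so -t can usually be omitted.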

Answer 3 (score: 0)

Why not just do this?

from scrapy import cmdline

cmdline.execute("scrapy crawl myspider".split())

Put that script in the same directory as your scrapy.cfg file.
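One thing to keep in mind with this approach: cmdline.execute ends by calling sys.exit, so any code placed after that call in your script will not run.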