Scrapy: crawling extracted links

Time: 2015-10-24 12:29:13

Tags: python hyperlink scrapy extractor

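I need to crawl a website and follow every URL found under a specific XPath on that site. For example, I need to crawl "http://someurl.com/world/", which has 10 links inside a container (xpath("//div[@class='pane-content']")). I need to follow all 10 of those links and extract the images from each page, but the links on "http://someurl.com/world/" look like "http://someurl.com/node/xxxx".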

Here is what I have so far:

    --------- beginning of system
    10-24 17:40:01.419    1309-1675/? I/ActivityManager﹕ START u0 {act=android.intent.action.MAIN cat=[android.intent.category.LAUNCHER]  flg=0x10200000 cmp=com.apsdevelopers.mr.meteout/.mottoscreen (has extras)} from uid 10008 on display 0
    10-24 17:40:01.665    1917-1917/? I/Choreographer﹕ Skipped 120 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:01.950    2026-2037/? I/art﹕ CollectorTransition marksweep + semispace GC freed 1303(40KB) AllocSpace objects, 0(0B) LOS objects, 42% free, 697KB/1209KB, paused 149.640ms total 149.640ms
    10-24 17:40:01.961    1917-1917/? I/Choreographer﹕ Skipped 73 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:02.284    1917-1917/? I/Choreographer﹕ Skipped 32 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:02.444    1917-1917/? I/Choreographer﹕ Skipped 37 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:02.595    1917-1917/? I/Choreographer﹕ Skipped 30 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:02.733    1917-1917/? I/Choreographer﹕ Skipped 34 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:02.873    1917-1917/? I/Choreographer﹕ Skipped 30 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:03.152    1917-1917/? I/Choreographer﹕ Skipped 37 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:03.257    1309-1328/? I/Choreographer﹕ Skipped 411 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:03.445    1917-1917/? I/Choreographer﹕ Skipped 46 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:03.537    1309-1328/? I/Choreographer﹕ Skipped 70 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:03.606    1917-1917/? I/Choreographer﹕ Skipped 39 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:03.892    1917-1917/? I/Choreographer﹕ Skipped 34 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:03.888    1309-1328/? I/Choreographer﹕ Skipped 52 frames!  The application may be doing too much work on its main thread.
    10-24 17:40:04.597    1309-1328/? I/ActivityManager﹕ Displayed com.apsdevelopers.mr.meteout/.mottoscreen: +2s813ms
    10-24 17:40:04.814    1309-1328/? I/Choreographer﹕ Skipped 30 frames!  The application may be doing too much work on its main thread.

1 answer:

Answer 0: (score: 2)

You can rewrite your `rules` so that a single rule covers all of your requirements:

    rules = [Rule(LinkExtractor(allow=('/node/.*',), restrict_xpaths=('//div[@class="pane-content"]',)), callback='parse_imgur', follow=True)]
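To see what that rule selects, here is a rough standard-library sketch of the filtering idea. It is not Scrapy's implementation: it only keeps links that sit inside `<div class="pane-content">` and whose absolute URL matches the `allow` pattern. The names `ContainerLinkParser`, `node_links` and `SAMPLE_HTML` are made up for this illustration, and details such as de-duplication that `LinkExtractor` handles are left out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class ContainerLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <div class="pane-content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside the target div (counts nested divs)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1          # nested div inside the container
            elif attrs.get("class") == "pane-content":
                self.depth = 1           # entering the container
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def node_links(html, base_url, allow=r"/node/.*"):
    """Return absolute URLs from the container that match the allow pattern."""
    parser = ContainerLinkParser()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.hrefs)
    return [url for url in absolute if re.search(allow, url)]


# Hypothetical page fragment, only for trying the function out.
SAMPLE_HTML = """
<div class="sidebar"><a href="/node/1">outside the container</a></div>
<div class="pane-content">
  <div><a href="/node/1234">story</a></div>
  <a href="/about">about</a>
  <a href="http://someurl.com/node/5678">another story</a>
</div>
<a href="/node/9">after the container</a>
"""

links = node_links(SAMPLE_HTML, "http://someurl.com/world/")
```

With the sample above, only the two `/node/` links inside the container survive; the link in the sidebar and the `/about` link are dropped, which is the combined effect `allow` and `restrict_xpaths` are meant to have in the rule.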

To download the images from the extracted image links, you can use the `ImagesPipeline` that ships with Scrapy.
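A minimal sketch of how that could be wired up, assuming Scrapy 1.0 or later with Pillow installed. The first two assignments belong in the project's `settings.py`; the directory name `downloaded_images`, the helper name `build_image_item`, the `page_url` field and the `//img/@src` query are assumptions for illustration, so adjust them to the real page markup.

```python
# settings.py fragment: enable the bundled pipeline and tell it where to store files.
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "downloaded_images"  # hypothetical directory, pick your own


def build_image_item(response):
    """Build an item for ImagesPipeline from a Scrapy response.

    The pipeline reads the URLs to download from the "image_urls" field
    and stores its results in an "images" field on the same item.
    """
    # Assumption: every <img> on the page is wanted; narrow the XPath if not.
    srcs = response.xpath("//img/@src").getall()
    return {
        "page_url": response.url,
        # Relative src values are resolved against the page URL.
        "image_urls": [response.urljoin(src) for src in srcs],
    }
```

In the spider, the callback named in the rule (`parse_imgur`) would then only need to `yield build_image_item(response)`.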