Recursive Scrapy crawling problem

Date: 2014-06-21 16:31:03

Tags: python web-scraping scrapy web-crawler scrapy-spider

I am trying to use a recursive spider to extract content from a site (e.g. web.com) whose links follow a specific structure. For example:

http://web.com/location/profile/12345678?qid=1403226397.5971&source=location&rank=21

http://web.com/location/profile/98765432?qid=1403366850.3991&source=locaton&rank=1

As you can see, only the numeric parts of the URL change. I need to crawl all links that follow this URL structure and extract itemX, itemY and itemZ.

I have translated the link structure into a regular expression: '\d+?qid=\d+.\d+&source=location&rank=\d+'. My Python Scrapy code is below; however, when I run the spider, it does not extract anything:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from web.items import webItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy import log
import re
import urllib

class web_RecursiveSpider(CrawlSpider):
    name = "web_RecursiveSpider"
    allowed_domains = ["web.com"]
    start_urls = ["http://web.com/location/profile",]

    rules = (Rule (SgmlLinkExtractor(allow=('\d+?qid=\d+.\d+&source=location&rank=\d+', ),) 
    , callback="parse_item", follow= True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//*')
        items = []

        for site in sites:
            item = webItem()
            item["itemX"] = site.select("//span[@itemprop='X']/text()").extract()
            item["itemY"] = site.select("//span[@itemprop='Y']/text()").extract()
            item["itemZ"] = site.select("//span[@itemprop='Z']/text()").extract()
            items.append(item)
        return items

1 Answer:

Answer 0 (score: 1)

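You need to escape the ? in the regular expression. Unescaped, it is not a literal question mark; it makes the preceding \d+ non-greedy, so the pattern expects "qid" to follow the digits directly: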

'\d+\?qid=\d+.\d+&source=location&rank=\d+'
    ^

Demo:

>>> import re
>>> url = "http://web.com/location/profile/12345678?qid=1403226397.5971&source=location&rank=21"
>>> print re.search('\d+?qid=\d+.\d+&source=location&rank=\d+', url)
None
>>> print re.search('\d+\?qid=\d+.\d+&source=location&rank=\d+', url)
<_sre.SRE_Match object at 0x10be538b8>
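
The same check can be sketched for Python 3, using raw strings so the backslashes reach the regex engine unchanged. This is only an illustration: the helper name `matches` and the third URL (digits followed directly by "qid") are assumptions made up for the example, not part of the original question.

```python
import re

URL = "http://web.com/location/profile/12345678?qid=1403226397.5971&source=location&rank=21"

# Unescaped "?" makes the preceding "\d+" lazy; it does not match a literal "?".
UNESCAPED = r"\d+?qid=\d+.\d+&source=location&rank=\d+"
# Escaped "\?" matches the literal question mark that starts the query string.
ESCAPED = r"\d+\?qid=\d+.\d+&source=location&rank=\d+"


def matches(pattern, url=URL):
    """Return True if the pattern is found anywhere in the URL (hypothetical helper)."""
    return re.search(pattern, url) is not None


print(matches(UNESCAPED))  # False: there is a "?" between the digits and "qid"
print(matches(ESCAPED))    # True

# Made-up URL without the "?": only here does the unescaped pattern match.
no_question_mark = "http://web.com/location/profile/12345678qid=1.5&source=location&rank=2"
print(matches(UNESCAPED, no_question_mark))  # True
```

Because the link extractor's allow patterns are regular expressions that are searched against each URL, the escaped pattern is the one that should go into the spider's rule.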

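Note that you should also escape the dot. An unescaped dot matches any character, so it does not affect the example you provided: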

'\d+\?qid=\d+\.\d+&source=location&rank=\d+'
             ^
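
To see what the unescaped dot lets through, here is a small Python 3 sketch. The URL `odd_url` is a made-up example in which the qid contains an "x" where the decimal point should be; the names `LOOSE` and `STRICT` are likewise only for illustration.

```python
import re

# "." matches any character; "\." matches only a literal dot.
LOOSE = r"\d+\?qid=\d+.\d+&source=location&rank=\d+"
STRICT = r"\d+\?qid=\d+\.\d+&source=location&rank=\d+"

# Hypothetical URL: "x" sits where the decimal point of the qid should be.
odd_url = "http://web.com/location/profile/12345678?qid=1403226397x5971&source=location&rank=21"

loose_hit = re.search(LOOSE, odd_url) is not None
strict_hit = re.search(STRICT, odd_url) is not None

print(loose_hit)   # True: the unescaped dot accepts the "x"
print(strict_hit)  # False: a literal dot is required
```

For the URLs shown in the question both patterns match, so the stricter form mainly guards against following links that only resemble the intended structure.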