如何使用scrapy爬行到每个href

时间:2013-07-10 14:00:17

标签: xml screen-scraping scrapy

如何使用scrapy对每个href进行爬行?我只知道如何显示它,但我希望能够进入每个链接。这是我们的Intranet数据,因此您无法访问这些链接。另外,如何在数据显示在文件中时格式化日期?我是否需要在start_url中添加网址列表?我是否需要将initSpider更改为crawlSpider?

<row>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">14256238845</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">353918053831794</cell>
<cell type="href" href="/dis/packages.jsp?view=list&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100">310260548400764</cell>
<cell type="href" href="/dis/packages.jsp?view=timeline&show=perdevice&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&subscrbid=310260548400764&mdn=14256238845&maxlength=100&date=20130423T020032243">2013-04-23 02:00:32.243</cell>
<cell type="plain">2013-04-23 02:00:32.243</cell>
<cell type="plain">3 - PackageCreation</cell>
<cell type="href" href="/dis/profile_download?profileId=400006">400006</cell>
<cell type="href" href="/dis/sessions.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view sessions</cell>
<cell type="href" href="/dis/errors_agg.jsp?view=list&device_gid=6F5941585835587177572B3465656A61496B76747A673D3D54766B47446C376A77555A72624237756330506755673D3D&hwdid=353918053831794&mdn=14256238845&subscrbid=310260548400764&maxlength=100">view errors</cell>
</row>

这是我到目前为止打印的所有内容

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from scrapy.selector import XmlXPathSelector

from carrier.items import CarrierItem

class CarrierSpider(InitSpider):
    name = 'dis'
    allowed_domains = ['qvpweb01.ciq.labs.att.com']
    login_page = 'https://qvpweb01.ciq.labs.att.com:8080/dis/login.jsp'
    start_urls = ["https://qvpweb01.ciq.labs.att.com:8080/dis/"]

    def init_request(self):
    #"""This function is called before crawling starts."""
    return Request(url=self.login_page, callback=self.login)

    def login(self, response):
    #"""Generate a login request."""
    return FormRequest.from_response(response,
            formdata={'txtUserName': 'myuser', 'txtPassword': 'xxxx'},
            callback=self.check_login_response)

    def check_login_response(self, response):
    #"""Check the response returned by a login request to see if we aresuccessfully logged in."""
    if "logout" in response.body:
        self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
        # Now the crawling can begin..

        return self.initialized() 

    else:
        self.log("\n\n\nFailed, Bad password :(\n\n\n")
        # Something went wrong, we couldn't log in, so nothing happens.


    def parse(self, response):
    xhs = XmlXPathSelector(response)
    columns = xhs.select('//table[3]/row/cell')
    for column in columns:
        item = CarrierItem()
        item['title'] = column.select('.//text()').extract()
        item['link'] = column.select('.//@href').extract()
        yield item

输出我从下面的csv文件获得:

14256238845
3.53918E+14
3.10261E+14
00:32.2
00:32.2
3 - PackageCreation
400006
view sessions
view errors

希望从csv输出我希望得到的信息:

14256238845
353918053831794
310260548400764
2013-04-23 02:00:32.243
2013-04-23 02:00:32.243
3 - PackageCreation
400006
view sessions
view errors

1 个答案:

答案 0 :(得分:0)

每当您想要关注URL时,您都可以产生一个Request对象 例如:yield Request(extracted_url_link, callback=your_parse_function)

请查看以下链接中的第二个示例 http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example

指定抓取网址的另一种方法是使用SgmlLinkExtractor。你可以写规则。如果规则匹配,Spider将抓取任何页面中的所有网址。请参阅以下网址中的示例 http://doc.scrapy.org/en/latest/topics/spiders.html#crawlspider

日期只是抓取后的一个字符串,您可以将其转换为python datetime对象,然后使用日期时间渲染函数(如strftime)以您希望的方式显示它。

希望我回答你的问题。