Scraping with a loop in Scrapy

Date: 2013-10-20 01:50:51

Tags: loops web-scraping scrapy

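I want to scrape information from http://www.stfrancismedical.org/asp/job-summary.asp?cat=4, but I don't know how, because all I know is recursive scraping. Is there a way to do it with a loop? Any other idea for scraping all the information for each job would be great. Thanks in advance!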

1 answer:

Answer 0 (score: 1)

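The structure of that page is a bit odd: it is a single table whose rows all sit at the same depth. That makes it hard to write one XPath expression that extracts all the data for a job at once. My approach is to use the modulo operator on the row index and fill one item object per group of rows as the loop runs.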
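
The modulo idea can be tried on its own, without Scrapy. The following is only a minimal sketch: the function name `group_rows`, the three-field `FIELDS` tuple and the sample values are made up for illustration, and it assumes every record occupies the same number of consecutive rows.

```python
# Hypothetical field order; the real one depends on the page's table layout.
FIELDS = ('department', 'employment', 'shift')


def group_rows(values, fields=FIELDS):
    """Turn a flat list of cell values into one dict per record."""
    records = []
    record = None
    for i, value in enumerate(values):
        idx = i % len(fields)  # position of this row inside its record
        if idx == 0:
            # A new record starts here; store the previous one first.
            if record is not None:
                records.append(record)
            record = {}
        record[fields[idx]] = value
    # Store the last record, if there was any input at all.
    if record is not None:
        records.append(record)
    return records


# Made-up sample: two records of three rows each.
rows = ['Cardiac Care', 'Pool', 'Day',
        'Critical Care Unit', 'Full-Time', 'Evening']
jobs = group_rows(rows)
print(jobs)
```

If the number of rows is not a multiple of the number of fields, the last record simply ends up with fewer keys, so it may be worth checking for that case on real data.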

In any case, the page contains no links to follow, so the spider is very straightforward.

Step one, create the project:

scrapy startproject stfrancismedical
cd stfrancismedical

Step two, create the spider:

scrapy genspider -t basic stfrancismedical_spider 'stfrancismedical.org'

Step three, create the item with all the fields of a job:

vim stfrancismedical/items.py

with the following new content:

from scrapy.item import Item, Field

class StfrancismedicalItem(Item):
    department = Field()
    employment = Field()
    shift = Field()
    weekends_holidays = Field()
    biweekly_hours = Field()
    description = Field()
    requirements = Field()

Step four, edit the spider:

vim stfrancismedical/spiders/stfrancismedical_spider.py

Content:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from stfrancismedical.items import StfrancismedicalItem

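# Item field names, in the order the rows appear for each job in the table.
rn = ('department', 'employment', 'shift', 'weekends_holidays',
      'biweekly_hours', 'description', 'requirements')

class StfrancismedicalSpiderSpider(BaseSpider):
    name = "stfrancismedical_spider"
    allowed_domains = ["stfrancismedical.org"]
    start_urls = (
        'http://www.stfrancismedical.org/asp/job-summary.asp?cat=4',
    )

    def parse(self, response):
        items = []
        item = None
        hxs = HtmlXPathSelector(response)
        # Keep only the rows that have exactly two cells (label and value).
        for i, tr in enumerate(hxs.select('/html/body/div/table//tr[count(./td)=2]')):
            idx = i % 7
            if idx == 0:
                # Every 7th row starts a new job: store the previous item first.
                if item is not None:
                    items.append(item)
                item = StfrancismedicalItem()
            # The value is the first text node found in the second cell.
            item[rn[idx]] = tr.select('./td[2]//text()').extract()[0]
        # Store the last job, if the table had any matching rows at all.
        if item is not None:
            items.append(item)
        return items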
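
The row filter `tr[count(./td)=2]` can be imitated with the standard library to see what it keeps. This is only a sketch under stated assumptions: the `SAMPLE` markup is a made-up, well-formed stand-in for the real page, `two_cell_values` is a hypothetical helper, and `xml.etree.ElementTree` is swapped in for Scrapy's selectors. Unlike the spider, which takes only the first text node, it joins all text inside the second cell.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for the real page's table.
SAMPLE = """<table>
  <tr><td colspan="2">Registered Nurse</td></tr>
  <tr><td>Department</td><td>Cardiac Care</td></tr>
  <tr><td>Employment</td><td>Pool</td></tr>
  <tr><td></td></tr>
  <tr><td>Shift</td><td><b>Day</b> - Evening</td></tr>
</table>"""


def two_cell_values(markup):
    """Return the text of the second cell of every row with exactly two cells."""
    root = ET.fromstring(markup)
    values = []
    for tr in root.iter('tr'):
        cells = tr.findall('td')  # direct <td> children only, like ./td
        if len(cells) == 2:
            # itertext() also walks nested tags, similar to //text() in XPath.
            values.append(''.join(cells[1].itertext()).strip())
    return values


values = two_cell_values(SAMPLE)
print(values)  # ['Cardiac Care', 'Pool', 'Day - Evening']
```

Rows with a single cell, such as a title or spacer row, are skipped, which is what lets the remaining rows be counted off in fixed-size groups.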

Then run it:

scrapy crawl stfrancismedical_spider -o stfrancismedical.json -t json

This creates a new file, stfrancismedical.json, containing the data:

[{"requirements": "Skilled in Cath Lab nursing, 2 years experience and patient recovery experience. A Current valid NJ RN license with a current ACLS certification.", "description": "Responsible for the delivery of individualized patient care to assigned patients utilizing the nursing process of assessment, planning, implementation and evaluation.", "shift": "Day - Evening - Night", "biweekly_hours": "Varied", "weekends_holidays": "No", "department": "Cardiac Care", "employment": "Pool"},
{"requirements": "Requirements: A Current valid NJ RN license with a current ACLS & BLS certification.", "description": "Responsible for the delivery of individualized patient care to assigned critical care patients utilizing the nursing process of assessment, planning, implementation and evaluation. ", "shift": "Evening", "biweekly_hours": "72", "weekends_holidays": "Yes", "department": "Critical Care Unit", "employment": "Full-Time"},
{"requirements": "ACLS, NJ License required.\u00a0 Balloon pump certification preferred.", "description": "Provide comprehensive Nursing care to critically ill patients.\u00a0 ", "shift": "Day", "biweekly_hours": "72 - 11am - 11pm", "weekends_holidays": "Yes", "department": "Critical Care Unit", "employment": "Full-Time"},
{"requirements": "ACLS, NJ License required.\u00a0 Balloon pump certification preferred.", "description": "Provide comprehensive Nursing care to critically ill patients. ", "shift": "Evening - Night", "biweekly_hours": "72 - 7pm - 7am", "weekends_holidays": "No", "department": "Critical Care Unit", "employment": "Full-Time"},
{"requirements": "Associates Degree in Nursing, Healthcare, or equivalent experience: BSN preferred.", "description": "Must be detail oriented and able to follow detailed procedures to ensure accuracy.\u00a0 Must demonstrate excellent follow up skills.\u00a0 Ability to coordinate and priortize multiple duties.\u00a0 Understands interactions amongst clinical areas and their roles within hospital.\u00a0 Advanced knowledge in computer skills, including knowledge of Microsoft Word, Excel and PowerPoint.\u00a0", "shift": "Day", "biweekly_hours": "80", "weekends_holidays": "No", "department": "Nursing Education", "employment": "Full-Time"},
...
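
The exported file can then be read back with Python's `json` module. A minimal sketch, assuming the file holds a single JSON array of objects like the sample above; `count_by_department` is a hypothetical helper, not part of Scrapy.

```python
import json
from collections import Counter


def count_by_department(path):
    """Count job postings per department in a JSON export of the items."""
    with open(path, encoding='utf-8') as f:
        jobs = json.load(f)  # expected: one JSON array of job objects
    return Counter(job['department'] for job in jobs)
```

Calling `count_by_department('stfrancismedical.json')` returns a `Counter` keyed by department name.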