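Scraping HTML into CSV format

Date: 2014-03-08 10:36:02

Tags: python scrapy

I want to extract content such as side effects, warnings, and dosage from the site given in the start URLs. My code is below. The CSV file is created, but it is empty. The output is: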

before for
[] # it is displaying empty list
after for
Here is my code:
from scrapy.selector import Selector
from medicinelist_sample.items import MedicinelistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MedSpider(CrawlSpider):
    name = "med"
    allowed_domains = ["medindia.net"]
    start_urls = ["http://www.medindia.net/doctors/drug_information/home.asp?alpha=z"]
    rules = [Rule(SgmlLinkExtractor(allow=('Zafirlukast.htm',)), callback="parse", follow = True),]

    global Selector

    def parse(self, response):
        hxs = Selector(response)
        fullDesc = hxs.xpath('//div[@class="report-content"]//b/text()')
        final = fullDesc.extract()

        print "before for" # this is just to see if it was printing

        print final
        print "after for"  # this is just to see if it was printing

3 Answers:

Answer 0 (score: 0)

The parse method of your Scrapy spider class should return item(s). With your current code, I don't see any items being returned. For example:

def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)

    sel = Selector(response)
    item = Item()  # stands for your own Item subclass that declares these fields
    item['id'] = sel.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
    item['name'] = sel.xpath('//td[@id="item_name"]/text()').extract()
    item['description'] = sel.xpath('//td[@id="item_description"]/text()').extract()
    return item

For more details, see the CrawlSpider example in the official Scrapy docs.
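The link between "no items returned" and "empty CSV" can be sketched with the standard csv module. This is only a stand-in written for illustration, not Scrapy's own exporter: the helper name items_to_csv and the rule that the header is written only when at least one item exists are assumptions. It is written for Python 3, unlike the Python 2 code in the question.

```python
import csv
import io


def items_to_csv(items, fieldnames):
    """Hypothetical helper: serialize a list of item dicts to CSV text."""
    buffer = io.StringIO()
    if items:
        # Assumption for this sketch: the header row is emitted only
        # when there is at least one item to export.
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(items)
    return buffer.getvalue()


# A callback that returns nothing leaves the exporter with no rows.
empty_output = items_to_csv([], ["name", "dosage"])

# A callback that returns an item produces a header and one row.
filled_output = items_to_csv(
    [{"name": "Zafirlukast", "dosage": "20 mg twice daily"}],
    ["name", "dosage"],
)
```

With an empty item list the resulting text is empty, which matches the symptom described in the question.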

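Answer 1 (score: 0)

Another problem in your code is that you override CrawlSpider's parse method to implement your callback logic. You must not do that with a CrawlSpider, because it uses the parse method in its own logic.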

Ashish Nitin Patil already pointed this out implicitly by naming his example function parse_item.

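The default implementation of CrawlSpider's parse method basically calls the callbacks you specified in the rule definitions; so if you override it, I think your callback won't be called at all. See the Scrapy docs on crawling rules.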
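The effect of overriding parse can be illustrated with a toy model. ToyCrawlSpider below is a hypothetical class written for this sketch and is not Scrapy's real CrawlSpider: it matches rules against the URL it is given, whereas Scrapy applies rules to links extracted from responses. It only shows the dispatch idea, in Python 3.

```python
class ToyCrawlSpider:
    """Toy stand-in for a rule-driven spider; not Scrapy's implementation."""

    # Each rule is (substring the URL must contain, callback method name).
    rules = []

    def parse(self, url):
        # The base class needs parse() for itself: it dispatches to the
        # callbacks named in the rules.
        results = []
        for pattern, callback_name in self.rules:
            if pattern in url:
                results.append(getattr(self, callback_name)(url))
        return results


class GoodSpider(ToyCrawlSpider):
    rules = [("Zafirlukast.htm", "parse_item")]

    def parse_item(self, url):
        return {"url": url}


class BrokenSpider(ToyCrawlSpider):
    rules = [("Zafirlukast.htm", "parse_item")]

    def parse(self, url):
        # Overriding parse() replaces the dispatch logic of the base class,
        # so parse_item is never reached.
        return []

    def parse_item(self, url):
        return {"url": url}


url = "http://www.medindia.net/doctors/drug_information/Zafirlukast.htm"
good_items = GoodSpider().parse(url)
broken_items = BrokenSpider().parse(url)
```

In this model GoodSpider yields one item for the matching URL, while BrokenSpider yields none because its own parse shadows the dispatching one.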

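Answer 2 (score: 0)

I just experimented a bit with the site you are scraping. Since you want to extract data about drugs (such as name, indications, contraindications, etc.) from different pages in this domain: would the following or similar XPath expressions fit your needs? I think your current query only gives you the "headings", but the actual information on this site sits in the text nodes that follow those bold headings.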

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

class Medicine(Item):
    name = Field()
    dosage = Field()
    indications = Field()
    contraindications = Field()
    warnings = Field()

class TestmedSpider(CrawlSpider):
    name = 'testmed'
    allowed_domains = ['medindia.net']  # domain names only, not full URLs
    start_urls = ['http://www.medindia.net/doctors/drug_information/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Zafirlukast.htm'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        drug_info = Medicine()

        selector = Selector(response)
        name = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Generic Name')]//..//following-sibling::text()[1])''')
        dosage = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Dosage')]//..//following-sibling::text()[1])''')
        indications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Why it is prescribed (Indications)')]//..//following-sibling::text()[1])''')
        contraindications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Contraindications')]//..//following-sibling::text()[1])''')
        warnings = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Warnings and Precautions')]//..//following-sibling::text()[1])''')

        drug_info['name'] = name.extract()
        drug_info['dosage'] = dosage.extract()
        drug_info['indications'] = indications.extract()
        drug_info['contraindications'] = contraindications.extract()
        drug_info['warnings'] = warnings.extract()

        return drug_info

This gives you the following information:

>scrapy parse --spider=testmed --verbose -d 2 -c parse_item --logfile C:\Python27\Scripts\Test\Test\spiders\test.log http://www.medindia.net/doctors/drug_information/Zafirlukast.htm
>>> DEPTH LEVEL: 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'contraindications': [u'Hypersensitivity.'],
  'dosage': [u'Adult- The recommended dose is 20 mg twice daily.'],
  'indications': [u'This medication is an oral leukotriene receptor antagonist (LTRA), prescribed for asthma. \xa0It blocks the action of certain natural substances that cause swelling and tightening of the airways.'],
  'name': [u'\xa0Zafirlukast'],
  'warnings': [u'Caution should be exercised in patients with history of liver disease, mental problems, suicidal thoughts, any allergy, elderly, during pregnancy and breastfeeding.']}]
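
The idea that the values live in the text node right after each bold heading can also be tried without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree, where the text following an element is its tail. The FRAGMENT markup and the extract_fields helper are assumptions made up for illustration; the real page may be structured differently and is unlikely to be well-formed XML. It targets Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment; the real page markup may differ.
FRAGMENT = """
<div class="report-content">
  <b>Generic Name :</b> Zafirlukast<br/>
  <b>Dosage :</b> Adult- The recommended dose is 20 mg twice daily.<br/>
</div>
"""


def extract_fields(markup):
    """Map each bold heading to the text node that directly follows it."""
    root = ET.fromstring(markup.strip())
    fields = {}
    for bold in root.iter("b"):
        # The heading is the element's own text, minus the trailing colon.
        label = (bold.text or "").strip().rstrip(":").strip()
        # The value is the element's tail, with whitespace collapsed
        # (similar in spirit to XPath's normalize-space).
        value = " ".join((bold.tail or "").split())
        fields[label] = value
    return fields


fields = extract_fields(FRAGMENT)
```

Selecting only the b elements, as the question's query does, would return the headings alone; reading the tail is what recovers the values.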