我想从开始网址中提到的网站中提取副作用,警告,剂量等内容。以下是我的代码。正在创建csv文件,但不显示任何内容。输出是:
before for
[] # it is displaying empty list
after for
这是我的代码:
from scrapy.selector import Selector
from medicinelist_sample.items import MedicinelistSampleItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
class MedSpider(CrawlSpider):
name = "med"
allowed_domains = ["medindia.net"]
start_urls = ["http://www.medindia.net/doctors/drug_information/home.asp?alpha=z"]
rules = [Rule(SgmlLinkExtractor(allow=('Zafirlukast.htm',)), callback="parse", follow = True),]
global Selector
def parse(self, response):
hxs = Selector(response)
fullDesc = hxs.xpath('//div[@class="report-content"]//b/text()')
final = fullDesc.extract()
print "before for" # this is just to see if it was printing
print final
print "after for" # this is just to see if it was printing
答案 0 :(得分:0)
您的scrapy
蜘蛛类的parse
方法应return item(s)
。使用当前代码,我看不到任何项目被返回。一个例子是,
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
sel = Selector(response)
item = Item()
item['id'] = sel.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
item['name'] = sel.xpath('//td[@id="item_name"]/text()').extract()
item['description'] = sel.xpath('//td[@id="item_description"]/text()').extract()
return item
有关详细信息,请查看CrawlSpider example中的official scrapy docs。
答案 1 :(得分:0)
代码中的另一个问题是您要覆盖CrawlSpider的解析方法来实现回调逻辑。由于在其逻辑中使用了 parse 方法,所以不能使用CrawlSpiders。
Ashish Nitin Patil隐含地指出已经通过命名他的示例函数* parse_item *。
Crawl Spider的解析方法的默认实现基本上是调用您在规则定义中指定的回调;因此,如果你覆盖它,我认为你的回调根本不会被调用。见Scrapy Doc - crawling rules
答案 2 :(得分:0)
我刚刚对您正在抓取的网站进行了一些实验。由于您希望从该域中的不同站点中提取有关药物的一些数据(如名称,适应症,禁忌症等):以下或类似的XPath表达式是否符合您的需求?我认为您当前的查询只会为您提供“标题”,但此网站上的实际信息位于跟随这些粗体呈现标题的文本节点中。
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Test.items import TestItem
from scrapy.item import Item, Field
class Medicine(Item):
name = Field()
dosage = Field()
indications = Field()
contraindications = Field()
warnings = Field()
class TestmedSpider(CrawlSpider):
name = 'testmed'
allowed_domains = ['http://www.medindia.net/doctors/drug_information/']
start_urls = ['http://www.http://www.medindia.net/doctors/drug_information/']
rules = (
Rule(SgmlLinkExtractor(allow=r'Zafirlukast.htm'), callback='parse_item', follow=True),
)
def parse_item(self, response):
drug_info = Medicine()
selector = Selector(response)
name = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Generic Name')]//..//following-sibling::text()[1])''')
dosage = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Dosage')]//..//following-sibling::text()[1])''')
indications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Why it is prescribed (Indications)')]//..//following-sibling::text()[1])''')
contraindications = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Contraindications')]//..//following-sibling::text()[1])''')
warnings = selector.xpath(r'''normalize-space(//div[@class="report-content"]//b/text()[contains(., 'Warnings and Precautions')]//..//following-sibling::text()[1])''')
drug_info['name'] = name.extract()
drug_info['dosage'] = dosage.extract()
drug_info['indications'] = indications.extract()
drug_info['contraindications'] = contraindications.extract()
drug_info['warnings'] = warnings.extract()
return drug_info
这会给你以下信息:
>scrapy parse --spider=testmed --verbose -d 2 -c parse_item --logfile C:\Python27\Scripts\Test\Test\spiders\test.log http://www.medindia.net/doctors/drug_information/Zafirlukast.htm
>>> DEPTH LEVEL: 1 <<<
# Scraped Items ------------------------------------------------------------
[{'contraindications': [u'Hypersensitivity.'],
'dosage': [u'Adult- The recommended dose is 20 mg twice daily.'],
'indications': [u'This medication is an oral leukotriene receptor antagonist (
LTRA), prescribed for asthma. \xa0It blocks the action of certain natural substa
nces that cause swelling and tightening of the airways.'],
'name': [u'\xa0Zafirlukast'],
'warnings': [u'Caution should be exercised in patients with history of liver d
isease, mental problems, suicidal thoughts, any allergy, elderly, during pregnan
cy and breastfeeding.']}]