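I'm having trouble with the data I collect using Scrapy. When I run this code in the terminal, all of the collected information is appended to a single item that looks like this: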
{"fax": ["Fax: 617-638-4905", "Fax: 925-969-1795", "Fax: 913-327-1491", "Fax: 507-281-0291", "Fax: 509-547-1265", "Fax: 310-437-0585"],
"title": ["Challenges in Musculoskeletal Rehabilitation", "17th Annual Spring Conference on Pediatric Emergencies", "19th Annual Association of Professors of Human & Medical Genetics (APHMG) Workshop & Special Interest Groups Meetings", "2013 AMSSM 22nd Annual Meeting", "61st Annual Meeting of Pacific Coast Reproductive Society (PCRS)", "Contraceptive Technology Conference 25th Anniversary", "Mid-America Orthopaedic Association 2013 Meeting", "Pain Management", "Peripheral Vascular Access Ultrasound", "SAGES 2013 / ISLCRS 8th International Congress"], ... ...
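... and so on.
The problem is that all of the scraped values for each field end up in one item. I need them to come out as separate items. In other words, each title should be associated with one fax number (if there is one), one location, and so on.
I don't want all of the information lumped together, because each piece of collected information is related to the others. This is how I ultimately want it to go into the database:
"MedEconItem" 1: [title: "insert title 1 here", fax: "insert fax #1 here", location: "location 1" ...]
"MedEconItem" 2: [title: "title 2", fax: "fax #2", location: "location 2" ...]
"MedEconItem" 3: [... and so on
Any ideas on how to solve this? Does anyone know an easy way to separate this information? This is my first time using Scrapy, so any advice is welcome. I have searched everywhere and can't seem to find an answer.
Here is my code so far: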
import scrapy
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class MedEconItem(Item):
    title = Field()
    date = Field()
    location = Field()
    specialty = Field()
    contact = Field()
    phone = Field()
    fax = Field()
    email = Field()
    url = Field()

class autoupdate(BaseSpider):
    name = "medecon"
    allowed_domains = ["www.doctorsreview.com"]
    start_urls = [
        "http://www.doctorsreview.com/meetings/search/?region=united-states&destination=all&specialty=all&start=YYYY-MM-DD&end=YYYY-MM-DD",
    ]

    def serialize_field(self, field, name, value):
        if field == '':
            return super(MedEconItem, self).serialize_field(field, name, value)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html/body/div[@id="c"]/div[@id="meeting_results"]')
        items = []
        for site in sites:
            item = MedEconItem()
            item['title'] = site.select('//h3/a/text()').extract()
            item['date'] = site.select('//p[@class = "dls"]/span[@class = "date"]/text()').extract()
            item['location'] = site.select('//p[@class = "dls"]/span[@class = "location"]/a/text()').extract()
            item['specialty'] = site.select('//p[@class = "dls"]/span[@class = "specialties"]/text()').extract()
            item['contact'] = site.select('//p[@class = "contact"]/text()').extract()
            item['phone'] = site.select('//p[@class = "phone"]/text()').extract()
            item['fax'] = site.select('//p[@class = "fax"]/text()').extract()
            item['email'] = site.select('//p[@class = "email"]/text()').extract()
            item['url'] = site.select('//p[@class = "website"]/a/@href').extract()
            items.append(item)
        return item
Answer 0 (score: 0)
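Well, the code below seems to work, but unfortunately it involves some obvious hacks, because I'm bad at XPath. Someone more fluent in XPath may post a better solution later.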
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select the individual meeting links instead of the one results container.
        sites = hxs.select('//html/body/div[@id="c"]/div[@id="meeting_results"]//a[contains(@href,"meetings")]')
        items = []
        # Hack: drop the first and last matched links.
        for site in sites[1:-1]:
            item = MedEconItem()
            item['title'] = site.select('./text()').extract()
            # Hack: `following::` matches any later <p> in the document, and [0]
            # takes the nearest one; [0] raises IndexError if nothing matches.
            item['date'] = site.select('./following::p[@class = "dls"]/span[@class="date"]/text()').extract()[0]
            item['location'] = site.select('./following::p[@class = "dls"]/span[@class = "location"]/a/text()').extract()[0]
            item['specialty'] = site.select('./following::p[@class = "dls"]/span[@class = "specialties"]/text()').extract()[0]
            item['contact'] = site.select('./following::p[@class = "contact"]/text()').extract()[0]
            item['phone'] = site.select('./following::p[@class = "phone"]/text()').extract()[0]
            item['fax'] = site.select('./following::p[@class = "fax"]/text()').extract()[0]
            item['email'] = site.select('./following::p[@class = "email"]/text()').extract()[0]
            item['url'] = site.select('./following::p[@class = "website"]/a/@href').extract()[0]
            items.append(item)
        # Return the whole list, one item per meeting.
        return items
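For what it's worth, here is a minimal sketch of the general idea behind the fix: loop over one container per meeting, look up each field relative to that container, and fall back to None when a field such as the fax is missing. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors so it runs on its own. The markup, the "meeting" class name, and the helpers first_text and parse_meetings are all assumptions made up for illustration; the real page structure isn't shown in the question and may well differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: one <div class="meeting"> per listing.
# This structure is an assumption; the real page may be laid out differently.
SAMPLE = """
<div id="meeting_results">
  <div class="meeting">
    <h3><a href="/meetings/1">Pain Management</a></h3>
    <p class="dls"><span class="location"><a href="/l/1">Location 1</a></span></p>
    <p class="fax">Fax: 617-638-4905</p>
  </div>
  <div class="meeting">
    <h3><a href="/meetings/2">Peripheral Vascular Access Ultrasound</a></h3>
    <p class="dls"><span class="location"><a href="/l/2">Location 2</a></span></p>
  </div>
</div>
"""


def first_text(node, path):
    """Return the stripped text of the first match below `node`, or None."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return found.text.strip()


def parse_meetings(markup):
    """Build one dict per meeting container, using only relative paths."""
    root = ET.fromstring(markup)
    items = []
    for meeting in root.findall(".//div[@class='meeting']"):
        items.append({
            # Every path starts with "./", so it is resolved inside this meeting.
            "title": first_text(meeting, "./h3/a"),
            "location": first_text(
                meeting, "./p[@class='dls']/span[@class='location']/a"),
            "fax": first_text(meeting, "./p[@class='fax']"),
        })
    return items


items = parse_meetings(SAMPLE)
for item in items:
    print(item)
```

In Scrapy itself the same idea applies: an XPath that begins with `//` is evaluated from the document root even when it is called on a sub-selector, so inside the loop each expression should start with `./` or `.//` to stay within the current node. The second meeting in the sample has no fax, and it ends up with None rather than borrowing the next meeting's number.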