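I have the following structure (a sample). I am using Scrapy to extract the details. I need to extract the 'href' attribute and the link text, such as 'Accounting'. I am using the code below. I am new to XPath, so any help with extracting these specific fields would be appreciated.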
<div class = 'something'>
<ul>
<li><a href="http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="1">Accounting</a></li>
<li><a href="http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="2">Administrative</a></li>
<li><a href="http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="3">Advertising</a></li>
<li><a href="http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="4">Airline</a></li>
</ul>
</div>
My code is:
from scrapy.spider import BaseSpider
from jobfetch.items import JobfetchItem
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose


class JobFetchSpider(BaseSpider):
    """Spider for regularly updated livingsocial.com site, San Francisco Page"""
    name = "Jobsearch"
    allowed_domains = ["jobsearch.about.com/"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']

    def parse(self, response):
        count = 0
        for sel in response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[2]/article/div[2]/ul[1]'):
            item = JobfetchItem()
            item['title'] = sel.extract()
            item['link'] = sel.extract()
            count = count+1
            print item
        yield item
Answer 0 (score: 2)
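The problems in your code:
yield item should be inside the loop, because that is where you instantiate the items.
The loop should iterate over the a elements in the li items inside the ul element, not over the ul itself.
sel.extract() only gives you the whole ul element extracted.
For the sake of an example, use a CSS selector here to get to the links inside the li elements.
Code: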
import scrapy
from jobfetch.items import JobfetchItem


class JobFetchSpider(scrapy.Spider):
    name = "Jobsearch"
    # allowed_domains takes bare domain names, so no trailing slash
    allowed_domains = ["jobsearch.about.com"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']

    def parse(self, response):
        # Iterate over each link, not over the enclosing ul
        for sel in response.css('article[itemprop="articleBody"] div.expert-content-text > ul > li > a'):
            item = JobfetchItem()
            item['title'] = sel.xpath('text()').extract()[0]
            item['link'] = sel.xpath('@href').extract()[0]
            # Yield inside the loop so every link produces an item
            yield item
Running the spider produces:
{'link': u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm', 'title': u'Accounting'}
{'link': u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm', 'title': u'Administrative'}
...
{'link': u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm', 'title': u'Yacht Jobs'}
FYI, you could also use xpath() here:
//article[@itemprop="articleBody"]//div[@class="expert-content-text"]/ul/li/a
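To try the per-link iteration without running a spider, here is a minimal sketch that uses Python's standard-library xml.etree.ElementTree as a stand-in for Scrapy's selectors. That substitution is an assumption made so the snippet runs without Scrapy installed; ElementTree supports only a subset of XPath and can parse this sample only because it is well-formed. The markup is a shortened copy of the sample from the question, and extract_links is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample markup from the question (data-* attributes omitted).
SAMPLE = """
<div class="something">
<ul>
<li><a href="http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm">Accounting</a></li>
<li><a href="http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm">Administrative</a></li>
</ul>
</div>
"""


def extract_links(markup):
    """Return one dict per <a> element, mirroring the item fields used above."""
    root = ET.fromstring(markup.strip())
    items = []
    # Iterate over the <a> elements themselves, not over the enclosing <ul>.
    for a in root.findall("./ul/li/a"):
        items.append({"title": a.text, "link": a.get("href")})
    return items


if __name__ == "__main__":
    for item in extract_links(SAMPLE):
        print(item)
```

Each dict is built inside the loop, one per link, which is the same shape as the item yielded by the spider above.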
Answer 1 (score: 1)
Use the following XPath expressions to extract the data you want to scrape.
In [1]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/text()').extract()
Out[1]:
[u'Accounting',
u'Administrative',
u'Advertising',
u'Airline',
u'Animal',
u'Alternative Energy',
u'Auction House',
u'Banking',
u'Biotechnology',
u'Business',
u'Business Intelligence',
u'Chef',
u'College Admissions',
u'College Alumni Relations and Development ',
u'College Student Services',
u'Construction',
u'Consulting',
u'Corporate',
u'Cruise Ship',
u'Customer Service',
u'Data Science',
u'Engineering',
u'Entry Level Jobs',
u'Environmental',
u'Event Planning',
u'Fashion',
u'Film',
u'First Job',
u'Fundraiser',
u'Healthcare/Medical',
u'Health/Safety',
u'Hospitality',
u'Human Resources',
u'Human Services / Social Work',
u'Information Technology (IT)',
u'Insurance',
u'International Affairs / Development',
u'International Business',
u'Investment Banking',
u'Law Enforcement',
u'Legal',
u'Maintenance',
u'Management',
u'Manufacturing',
u'Marketing',
u'Media',
u'Museum',
u'Music',
u'Non Profit',
u'Nursing',
u'Outdoor ',
u'Public Administration',
u'Public Relations',
u'Purchasing',
u'Radio',
u'Real Estate ',
u'Restaurant',
u'Retail',
u'Sales',
u'School',
u'Science',
u'Ski and Snow Jobs',
u'Social Media',
u'Social Work',
u'Sports',
u'Television',
u'Trades',
u'Transportation',
u'Travel',
u'Yacht Jobs']
In [2]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/@href').extract()
Out[2]:
[u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm',
u'http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/animal-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/alternative-energy-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/auction-house-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/banking-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/biotechnology-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/business-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/business-intelligence-job-titles.htm',
u'http://culinaryarts.about.com/od/culinaryfundamentals/a/whatisachef.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/college-admissions-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/college-alumni-relations-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/college-student-service-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/construction-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/consulting-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/c-level-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/cruise-ship-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/customer-service-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/data-science-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/engineering-job-titles.htm',
u'http://jobsearch.about.com/od/best-jobs/a/best-entry-level-jobs.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/environmental-job-titles.htm',
u'http://eventplanning.about.com/od/eventcareers/tp/corporateevents.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/fashion-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/film-job-titles.htm',
u'http://jobsearch.about.com/od/justforstudents/a/first-job-list.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/fundraiser-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/health-care-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/health-safety-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/hospitality-job-titles.htm',
u'http://humanresources.about.com/od/HR-Roles-And-Responsibilities/fl/human-resources-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/human-services-social-work-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/it-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/insurance-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/international-affairs-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/international-business-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/investment-banking-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/law-enforcement-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/legal-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/maintenance-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/management-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/manufacturing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/marketing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/media-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/museum-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/music-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/nonprofit-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/nursing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/outdoor-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/public-administration-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/public-relations-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/purchasing-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/radio-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/real-estate-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/restaurant-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/retail-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/sales-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/high-school-middle-school-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/science-job-titles.htm',
u'http://jobsearch.about.com/od/skiandsnowjobs/a/skijob2_2.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/social-media-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/social-work-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/sports-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/television-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/trades-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/a/transportation-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/travel-job-titles.htm',
u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm']
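The two calls above return parallel lists, so the titles and links still have to be paired up. Below is a minimal sketch, assuming both lists have the same length and order (true when every a element has both text and an href). pair_titles_and_links is a hypothetical helper, and the two short lists stand in for the full output shown above.

```python
def pair_titles_and_links(titles, links):
    """Combine parallel title/link lists into one dict per link."""
    if len(titles) != len(links):
        raise ValueError("titles and links differ in length; cannot pair safely")
    # strip() removes the trailing spaces seen in entries such as 'Outdoor '.
    return [
        {"title": title.strip(), "link": link}
        for title, link in zip(titles, links)
    ]


# Two entries taken from the output above, standing in for the full lists.
titles = ["Accounting", "Outdoor "]
links = [
    "http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm",
    "http://jobsearch.about.com/od/job-title-samples/fl/outdoor-job-titles.htm",
]

if __name__ == "__main__":
    for item in pair_titles_and_links(titles, links):
        print(item)
```

Iterating over each a element, as in the first answer, avoids this pairing step and cannot fall out of sync if a link has no text.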