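Python Scrapy: extracting specific XPath fields

Date: 2015-01-09 05:39:59

Tags: python web-scraping scrapy scrapy-spider

I have the following structure (sample). I am using Scrapy to extract the details. I need to extract the 'href' attribute and the link text, such as 'Accounting'. I am using the code below. I am new to XPath, so any help with extracting these specific fields would be appreciated.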

<div class = 'something'>
    <ul>
        <li><a href="http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="1">Accounting</a></li> 

        <li><a href="http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="2">Administrative</a></li> 

        <li><a href="http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="3">Advertising</a></li> 

        <li><a href="http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="4">Airline</a></li> 
    </ul>
</div>

My code is:

from scrapy.spider import BaseSpider

from jobfetch.items import JobfetchItem

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose


class JobFetchSpider(BaseSpider):
    """Spider for regularly updated livingsocial.com site, San Francisco Page"""
    name = "Jobsearch"
    allowed_domains = ["jobsearch.about.com/"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']

    def parse(self, response):
        count = 0
        for sel in response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[2]/article/div[2]/ul[1]'):
            item = JobfetchItem()
            item['title'] = sel.extract()
            item['link'] = sel.extract()
            count = count+1
            print item

        yield item

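2 answers:

Answer 0 (score: 2)

The problems with your code:

  • yield item should be inside the loop, since that is where you instantiate the item
  • the xpath you have is messy and not very reliable, because it depends heavily on element positions inside their parent tags and starts almost from the top-level parent of the document
  • your xpath is incorrect: it should go down to the a elements inside the li elements inside the ul
  • sel.extract() only gives you the extracted ul element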
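The last two points can be checked offline. The sketch below is not Scrapy: it uses the standard library's xml.etree.ElementTree, which understands only a small subset of XPath, on a trimmed and tidied copy of the sample markup from the question. The names SAMPLE, lists, anchors and items are illustrative only.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, trimmed to two entries and kept
# well-formed so that the XML parser accepts it.
SAMPLE = """
<div class="something">
    <ul>
        <li><a href="http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm">Accounting</a></li>
        <li><a href="http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm">Administrative</a></li>
    </ul>
</div>
"""

root = ET.fromstring(SAMPLE)  # root is the outer div

# Stopping at the ul yields a single element that holds the whole list,
# so a loop over it runs only once.
lists = root.findall("./ul")

# Going down to ul/li/a yields one element per link.
anchors = root.findall("./ul/li/a")

# ElementTree has no text() or @href steps; .text and .get("href")
# stand in for them here.
items = [{"title": a.text, "link": a.get("href")} for a in anchors]

for item in items:
    print(item)
```

In Scrapy itself the same idea applies: select the a elements first, then read text() and @href relative to each one.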

For the sake of an example, here is a spider that uses a CSS selector to reach the a elements inside the li tags:

import scrapy

from jobfetch.items import JobfetchItem


class JobFetchSpider(scrapy.Spider):
    name = "Jobsearch"
    allowed_domains = ["jobsearch.about.com"]  # domain only, without a trailing slash
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']

    def parse(self, response):
        for sel in response.css('article[itemprop="articleBody"] div.expert-content-text > ul > li > a'):
            item = JobfetchItem()
            item['title'] = sel.xpath('text()').extract()[0]
            item['link'] = sel.xpath('@href').extract()[0]
            yield item

Running the spider produces:

{'link': u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm', 'title': u'Accounting'}
{'link': u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm', 'title': u'Administrative'}
...
{'link': u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm', 'title': u'Yacht Jobs'}

FYI, the same elements can also be selected with xpath():

//article[@itemprop="articleBody"]//div[@class="expert-content-text"]/ul/li/a

Answer 1 (score: 1)

The following expressions, shown as an interactive session, extract the data you want to scrape.

In [1]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/text()').extract()
Out[1]: 

[u'Accounting',
 u'Administrative',
 u'Advertising',
 u'Airline',
 u'Animal',
 u'Alternative Energy',
 u'Auction House',
 u'Banking',
 u'Biotechnology',
 u'Business',
 u'Business Intelligence',
 u'Chef',
 u'College Admissions',
 u'College Alumni Relations and Development ',
 u'College Student Services',
 u'Construction',
 u'Consulting',
 u'Corporate',
 u'Cruise Ship',
 u'Customer Service',
 u'Data Science',
 u'Engineering',
 u'Entry Level Jobs',
 u'Environmental',
 u'Event Planning',
 u'Fashion',
 u'Film',
 u'First Job',
 u'Fundraiser',
 u'Healthcare/Medical',
 u'Health/Safety',
 u'Hospitality',
 u'Human Resources',
 u'Human Services / Social Work',
 u'Information Technology (IT)',
 u'Insurance',
 u'International Affairs / Development',
 u'International Business',
 u'Investment Banking',
 u'Law Enforcement',
 u'Legal',
 u'Maintenance',
 u'Management',
 u'Manufacturing',
 u'Marketing',
 u'Media',
 u'Museum',
 u'Music',
 u'Non Profit',
 u'Nursing',
 u'Outdoor ',
 u'Public Administration',
 u'Public Relations',
 u'Purchasing',
 u'Radio',
 u'Real Estate ',
 u'Restaurant',
 u'Retail',
 u'Sales',
 u'School',
 u'Science',
 u'Ski and Snow Jobs',
 u'Social Media',
 u'Social Work',
 u'Sports',
 u'Television',
 u'Trades',
 u'Transportation',
 u'Travel',
 u'Yacht Jobs']


In [2]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/@href').extract()

Out[2]: 
[u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm',
 u'http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/animal-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/alternative-energy-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/auction-house-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/banking-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/biotechnology-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/business-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/business-intelligence-job-titles.htm',
 u'http://culinaryarts.about.com/od/culinaryfundamentals/a/whatisachef.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/college-admissions-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/college-alumni-relations-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/college-student-service-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/construction-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/consulting-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/c-level-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/cruise-ship-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/customer-service-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/data-science-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/engineering-job-titles.htm',
 u'http://jobsearch.about.com/od/best-jobs/a/best-entry-level-jobs.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/environmental-job-titles.htm',
 u'http://eventplanning.about.com/od/eventcareers/tp/corporateevents.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/fashion-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/film-job-titles.htm',
 u'http://jobsearch.about.com/od/justforstudents/a/first-job-list.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/fundraiser-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/health-care-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/health-safety-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/hospitality-job-titles.htm',
 u'http://humanresources.about.com/od/HR-Roles-And-Responsibilities/fl/human-resources-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/human-services-social-work-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/it-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/insurance-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/international-affairs-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/international-business-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/investment-banking-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/law-enforcement-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/legal-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/maintenance-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/management-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/manufacturing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/marketing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/media-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/museum-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/music-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/nonprofit-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/nursing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/outdoor-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/public-administration-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/public-relations-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/purchasing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/radio-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/real-estate-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/restaurant-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/retail-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/sales-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/high-school-middle-school-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/science-job-titles.htm',
 u'http://jobsearch.about.com/od/skiandsnowjobs/a/skijob2_2.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/social-media-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/social-work-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/sports-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/television-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/trades-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/transportation-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/travel-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm']
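
The two calls return parallel lists, so the titles and links still have to be matched up. A minimal sketch of one way to do that, assuming both lists have the same length and order (which holds only if every a element has both text and an href). The helper name pair_up is hypothetical, and the values are a few entries copied from the outputs above. Stripping the titles removes the trailing spaces visible in entries such as u'Outdoor '.

```python
# A few values copied from the two outputs above; in a spider these would
# be the lists returned by the two .extract() calls.
titles = [u'Accounting', u'Outdoor ', u'Real Estate ']
links = [
    u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm',
    u'http://jobsearch.about.com/od/job-title-samples/fl/outdoor-job-titles.htm',
    u'http://jobsearch.about.com/od/job-title-samples/a/real-estate-job-titles.htm',
]


def pair_up(titles, links):
    """Pair titles with links by position, stripping stray whitespace."""
    if len(titles) != len(links):
        # Positional pairing is meaningless if the lists are out of step.
        raise ValueError("titles and links differ in length")
    return [{"title": t.strip(), "link": l} for t, l in zip(titles, links)]


items = pair_up(titles, links)
for item in items:
    print(item)
```

Selecting each a element once and reading both fields from it, as in the spider in the first answer, avoids the pairing step altogether.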