This is the link from which I have to extract the content under Job details: https://www.google.com/about/careers/search#!t=jo&jid=34154&
Job details
Team or role: Software Engineering // How do I write the XPath?
Job type: Full-time // How do I write the XPath?
Last updated: Oct 17, 2014 // How do I write the XPath?
Job location(s): Seattle, WA, USA; Kirkland, WA, USA // How do I write a regex to extract city, state, and country separately for each job? I also need to filter USA, Canada, and UK jobs separately.
Here is the HTML from which I want to extract the content above:
<div class="detail-content">
<div>
<div class="greytext info" style="display: inline-block;">Team or role:</div>
<div class="info-text" style="display: inline-block;">Software Engineering</div> // How to write xpath for this one
</div>
<div>
<div class="greytext info" style="display: inline-block;">Job type:</div>
<div class="info-text" style="display: inline-block;" itemprop="employmentType">Full-time</div>// How to write xpath for job type this one
</div>
<div style="display: none;" aria-hidden="true">
<div class="greytext info" style="display: inline-block;">Job level:</div>
<div class="info-text" style="display: inline-block;"></div>
</div>
<div style="display: none;" aria-hidden="true">
<div class="greytext info" style="display: inline-block;">Salary:</div>
<div class="info-text" style="display: inline-block;"></div>
</div>
<div>
<div class="greytext info" style="display: inline-block;">Last updated:</div>
<div class="info-text" style="display: inline-block;" itemprop="datePosted"> Oct 17, 2014</div> // How to write xpath for posted date this one
</div>
<div>
<div class="greytext info" style="display: inline-block;">Job location(s):</div>
<div class="info-text" style="display: inline-block;">Seattle, WA, USA; Kirkland, WA, USA</div> // How to write rejax for to extract city, state and country seprately
</div>
</div>
</div>
Here is the spider code:
def parse_listing_page(self, response):
    selector = Selector(response)
    item = googleSpiderItem()
    item['CompanyName'] = "Google"
    item['JobDetailUrl'] = response.url
    item['Title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
    item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.)\,.')
    item['State'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('\,(.)')
    item['Jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()
    Description = selector.xpath("string(//div[@itemprop='description'])").extract()
    item['Description'] = [d.encode('UTF-8') for d in Description]
    print "Done!"
    yield item
The output is:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
    taskObj._oneWorkUnit()
  File "/usr/lib64/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
    result = next(self._iterator)
  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
    yield next(it)
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 26, in process_spider_output
    for x in result:
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib64/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/sureshp/Downloads/wwwgooglecom/wwwgooglecom/spiders/googlepage.py", line 49, in parse_listing_page
    item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract().re('(.*)\,.')
exceptions.AttributeError: 'list' object has no attribute 're'
Answer (score: 1)
I noticed a few mistakes in your parsing code. The main one is that .re() is a method of the selector returned by xpath(), not of the plain Python list returned by .extract(), which is exactly what the AttributeError above complains about. After fixing them, the output is now:
{'City': [u'Seattle, WA, USA', u'Kirkland, WA, USA'],
'CompanyName': 'Google',
'Description': [u"Google's software engineers develop the next-generation technologies that change how millions of users connect, explore, and interact with information and one another. Our ambitions reach far beyond just Search. Our products need to handle information at the the scale of the web. We're looking for ideas from every area of computer science, including information retrieval, artificial intelligence, natural language processing, distributed computing, large-scale system design, networking, security, data compression, and user interface design; the list goes on and is growing every day. As a software engineer, you work on a small team and can switch teams and projects as our fast-paced business grows and evolves. We need our engineers to be versatile and passionate to tackle new problems as we continue to push technology forward.?\nWith your technical expertise you manage individual projects priorities, deadlines and deliverables. You design, develop, test, deploy, maintain, and enhance software solutions.\n\nSeattle/Kirkland engineering teams are involved in the development of several of Google?s most popular products: Cloud Platform, Hangouts/Google+, Maps/Geo, Advertising, Chrome OS/Browser, Android, Machine Intelligence. Our engineers need to be versatile and willing to tackle new problems as we continue to push technology forward."],
'JobDetailUrl': 'https://www.google.com/about/careers/search?_escaped_fragment_=t%3Djo%26jid%3D34154%26',
'Jobtype': [],
'State': [u'Seattle, WA, USA', u'Kirkland, WA, USA'],
'Title': [u'Software Engineer']}
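If you still want the regex approach from your original code, call .re() directly on the result of xpath() instead of on the extracted list. A minimal sketch; the patterns are my assumption that every location string reads "City, State, Country":

# Sketch: .re() is available on the selector, so drop the .extract() call.
# Patterns assume each location string looks like "City, State, Country".
item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").re(r'^([^,]+),')
item['State'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").re(r',\s*([^,]+),')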
Here is the modified code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from Google.items import GoogleItem


class DmozSpider(Spider):
    name = "google"
    allowed_domains = ["google.com"]
    start_urls = [
        "https://www.google.com/about/careers/search#!t=jo&jid=34154&",
    ]

    def parse(self, response):
        selector = Selector(response)
        item = GoogleItem()
        item['Description'] = selector.xpath("string(//div[@itemprop='description'])").extract()
        item['CompanyName'] = "Google"
        item['JobDetailUrl'] = response.url
        item['Title'] = selector.xpath("//a[@class='heading detail-title']/span[@itemprop='name title']/text()").extract()
        item['City'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract()
        item['State'] = selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract()
        item['Jobtype'] = selector.xpath(".//*[@id='75015001']/div[2]/div[7]/div[2]/div[5]/div[2]/text()").extract()
        yield item
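Note that 'Jobtype' still comes back empty in the output above, because the positional XPath with the hard-coded id is brittle. Here is a sketch of label-based expressions written against the HTML posted in the question; the PostedDate field is hypothetical, so define it in GoogleItem before using it:

# Sketch: anchor on the visible label text instead of positional indexes.
item['Jobtype'] = selector.xpath("//div[contains(@class, 'greytext') and contains(., 'Job type:')]/following-sibling::div[@class='info-text']/text()").extract()
# PostedDate is a hypothetical field name; add it to GoogleItem first.
item['PostedDate'] = selector.xpath("//div[contains(@class, 'greytext') and contains(., 'Last updated:')]/following-sibling::div[@class='info-text']/text()").extract()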
To get the city, state, and country separately, you can loop over the extracted location strings:
for p in selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract():
    # strip() removes the spaces left over after splitting "Seattle, WA, USA"
    city, state, nation = [part.strip() for part in p.split(',')]
    item['City'] = city
    item['State'] = state
    item['Nation'] = nation
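To also filter USA, Canada, and UK jobs separately, as asked in the question, check the nation before yielding. A minimal sketch, assuming the country is always the last comma-separated token and is spelled exactly 'USA', 'Canada', or 'UK':

WANTED_NATIONS = {'USA', 'Canada', 'UK'}

for p in selector.xpath("//a[@class='source sr-filter']/span[@itemprop='name']/text()").extract():
    city, state, nation = [part.strip() for part in p.split(',')]
    if nation not in WANTED_NATIONS:
        continue  # skip locations outside USA, Canada and UK
    location_item = GoogleItem()  # a fresh item per location, so yields don't overwrite each other
    location_item['City'] = city
    location_item['State'] = state
    location_item['Nation'] = nation
    yield location_item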