Text containing hyperlinks: order of elements in XPath

Time: 2014-04-18 09:33:23

Tags: xpath hyperlink scrapy

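I am using scrapy with lxml-3.2.4 to scrape some newspaper articles. The articles sometimes contain hyperlinks, and the link text sits in different nodes of the page than the rest of the article text. Here is a link to one such article: http://www.business-standard.com/article/companies/wipro-on-a-major-recruitment-drive-113122300827_1.html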

I want to extract the article content, and I wrote this code to do it:

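import unicodedata
from scrapy.selector import Selector

hxs = Selector(response)  # response: the Response object passed to the spider callback
detailsPath = hxs.xpath('//*[@class="articleContentBox"]')
# Note: a path starting with // searches the whole document, not just detailsPath
textall = detailsPath.xpath('//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*')
data = []
for text in textall:
    # text() returns only the direct child text nodes of each selected element
    contents = text.xpath('text()').extract()
    for content in contents:
        data.append(unicodedata.normalize('NFKD', content).encode('ascii', 'ignore'))
finaltext = "\n".join(data)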

I want the article content to look like this:

Bangalore-based information technology (IT) services firm Wipro is on a major recruitment drive. The company would evaluate 50,000-60,000 students from 350 colleges in FY15.
Last year, Wipro had offered letters to 3,000-4,000 students with science background, and the company plans to hire more in FY15, said Rajiv Kumar, global campus head, Wipro. Kumar did not reveal the exact number of recruitments.
“Campus hiring has always been strategic to Wipro’s hiring strategy. But other than hiring engineers, we have been hiring students from science background in good numbers through two of our programmes -- Wipro Academy of Software Engineers and Wipro Software Technology Academy,” added Kumar.
According to sources, about 5,000 students were inducted and another 11,000 are in the process of joining the company through these programmes. The programmes had been launched in partnership with BITS Pilani and Vellore Institute of Technology. During the traning, the company takes care of the fee, books and accomodation. Besides, students are given a stipend of about Rs 12,000 in the first year and goes up to 20,000 in the fourth year.
“These are true earn-as-you-learn programmes. After four years, their career paths are similar to any engineer. They can start as developer, project manager etc. More importantly, we do not sign any bond with the student. So after the fourth year, if a candidate wishes to leave Wipro, they can. The only condition is that they have to complete the four-year tenure,” said Kumar.
Kumar said candidates who have completed the programmes would draw more salary than that of an engineer. “They get paid more than an entry level engineering candidate. It is generally in the range of Rs 4,00,000–6,00,000 per annum,” he said. The average salary an entry level engineer draws is around Rs 3,00,000–Rs  3,50,000 per annum.
“Our experience tells us that the attrition rate in this group is in single digits: Much lower than the company average. Also, we do not hire these students for our BPO operations,” said Kumar.

But instead the article content comes out like this (the text inside the hyperlinks ends up at the end):

Bangalore-based information technology (IT) services firm 
 is on a major recruitment drive. The company would evaluate 50,000-60,000 students from 350 colleges in FY15.
Last year, Wipro had offered letters to 3,000-4,000 students with science background, and the company plans to hire more in FY15, said Rajiv Kumar, global campus head, Wipro. Kumar did not reveal the exact number of recruitments.
Campus hiring has always been strategic to Wipros hiring strategy. But other than hiring engineers, we have been hiring students from science background in good numbers through two of our programmes -- Wipro Academy of Software Engineers and Wipro Software Technology Academy, added Kumar.
According to sources, about 5,000 students were inducted and another 11,000 are in the process of joining the company through these programmes. The programmes had been launched in partnership with 
 and 
. During the traning, the company takes care of the fee, books and accomodation. Besides, students are given a stipend of about Rs 12,000 in the first year and goes up to 20,000 in the fourth year.
These are true earn-as-you-learn programmes. After four years, their career paths are similar to any engineer. They can start as developer, project manager etc. More importantly, we do not sign any bond with the student. So after the fourth year, if a candidate wishes to leave Wipro, they can. The only condition is that they have to complete the four-year tenure, said Kumar.
Kumar said candidates who have completed the programmes would draw more salary than that of an engineer. They get paid more than an entry level engineering candidate. It is generally in the range of Rs 4,00,0006,00,000 per annum, he said. The average salary an entry level engineer draws is around Rs 3,00,000Rs  3,50,000 per annum.
Our experience tells us that the attrition rate in this group is in single digits: Much lower than the company average. Also, we do not hire these students for our BPO operations, said Kumar.
Wipro
BITS Pilani
Vellore Institute of Technology

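Please tell me a way (preferably in Python) to extract the elements in the order in which they appear, so that this problem goes away. Thanks in advance.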

1 answer:

Answer 0: (score: 0)

If you inspect the output of the XPath //*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*, you will notice that descendant-or-self::* selects:

  • <div itemscope itemtype="http://schema.org/Article"> (because of -or-self)
  • <p itemprop="articleBody"> (a descendant of the div above)
  • 3 <a class="storyTags" href="..."> elements (descendants of p, and therefore of div)
  • all br elements
Using scrapy shell http://www.business-standard.com/article/companies/wipro-on-a-major-recruitment-drive-113122300827_1.html:

>>> import pprint
>>> pprint.pprint(sel.xpath('//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*'))
[<Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<div itemscope itemtype="http://schema.o'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<p itemprop="articleBody">\r\n         \r\n '>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<a class="storyTags" href="/search?type='>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<a class="storyTags" href="/search?type='>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<a class="storyTags" href="/search?type='>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>,
 <Selector xpath='//*[@class="colL_MktColumn2"]/div/div/descendant-or-self::*' data=u'<br>'>]
>>> 

Applying .xpath('text()') to each of these elements then extracts only that element's child text nodes.

The div has only whitespace text nodes:

>>> sel.xpath('//*[@class="colL_MktColumn2"]/div/div/self::*/text()').extract()
[u'\r\n         ', u'\r\n       ']
>>> 

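The p contains most of what you want, but note that the text inside the links is missing (link text is a child text node of a, not a child text node of p):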

>>> import pprint
>>> pprint.pprint(sel.xpath('//*[@class="colL_MktColumn2"]/div/div/p/text()').extract())
[u'\r\n         \r\n    \r\nBangalore-based information technology (IT) services firm ',
 u' is on a major recruitment drive. The company would evaluate 50,000-60,000 students from 350 colleges in FY15.',
 u'\r\n',
 u'\r\nLast year, Wipro had offered letters to 3,000-4,000 students with science background, and the company plans to hire more in FY15, said Rajiv Kumar, global campus head, Wipro. Kumar did not reveal the exact number of recruitments.',
 u'\r\n',
 u'\r\n\u201cCampus hiring has always been strategic to Wipro\u2019s hiring strategy. But other than hiring engineers, we have been hiring students from science background in good numbers through two of our programmes -- Wipro Academy of Software Engineers and Wipro Software Technology Academy,\u201d added Kumar.',
 u'\r\n',
 u'\r\nAccording to sources, about 5,000 students were inducted and another 11,000 are in the process of joining the company through these programmes. The programmes had been launched in partnership with ',
 u' and ',
 u'. During the traning, the company takes care of the fee, books and accomodation. Besides, students are given a stipend of about Rs 12,000 in the first year and goes up to 20,000 in the fourth year.',
 u'\r\n',
 u'\r\n\u201cThese are true earn-as-you-learn programmes. After four years, their career paths are similar to any engineer. They can start as developer, project manager etc. More importantly, we do not sign any bond with the student. So after the fourth year, if a candidate wishes to leave Wipro, they can. The only condition is that they have to complete the four-year tenure,\u201d said Kumar.',
 u'\r\n',
 u'\r\nKumar said candidates who have completed the programmes would draw more salary than that of an engineer. \u201cThey get paid more than an entry level engineering candidate. It is generally in the range of Rs 4,00,000\u20136,00,000 per annum,\u201d he said. The average salary an entry level engineer draws is around Rs 3,00,000\u2013Rs\xa0 3,50,000 per annum.',
 u'\r\n',
 u'\r\n\u201cOur experience tells us that the attrition rate in this group is in single digits: Much lower than the company average. Also, we do not hire these students for our BPO operations,\u201d said Kumar.']
>>> 

Finally, the text nodes of the a elements:

>>> pprint.pprint(sel.xpath('//*[@class="colL_MktColumn2"]/div/div//a/text()').extract())
[u'Wipro', u'BITS Pilani', u'Vellore Institute of Technology']
>>> 

The br elements have no child text nodes:

>>> sel.xpath('//*[@class="colL_MktColumn2"]/div/div//br/text()').extract()
[]
>>> 
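
To see the reordering in isolation, here is a minimal sketch using only the standard library's xml.etree.ElementTree rather than Scrapy. The SAMPLE markup is a simplified stand-in for the article (an assumption, not the real page), and child_text_nodes is a hypothetical helper that roughly mimics what XPath text() returns for one element.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the article markup (an assumption, not the real page).
SAMPLE = (
    '<div><p>services firm <a href="#">Wipro</a> is on a major recruitment '
    'drive.<br/>In partnership with <a href="#">BITS Pilani</a>.</p></div>'
)


def child_text_nodes(elem):
    """Rough equivalent of XPath text(): direct child text nodes only."""
    nodes = [elem.text] + [child.tail for child in elem]
    return [n for n in nodes if n]


root = ET.fromstring(SAMPLE)

# Per-element collection, like looping over descendant-or-self::* and
# calling text() on each element: p's text comes first, link text later.
per_element = []
for elem in root.iter():  # document order: div, p, a, br, a
    per_element.extend(child_text_nodes(elem))

# Collecting all descendant text nodes in one pass keeps document order.
in_order = list(root.itertext())

print(per_element)
print("".join(in_order))
```

The first list ends with the link texts, which is the same symptom as in the question; the second join keeps each link text in its place inside the sentence.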

One solution is to use string() to extract the text representation of <p itemprop="articleBody">:
>>> print(sel.xpath('string(//*[@class="colL_MktColumn2"]/div/div/p)').extract()[0])



Bangalore-based information technology (IT) services firm Wipro is on a major recruitment drive. The company would evaluate 50,000-60,000 students from 350 colleges in FY15.

Last year, Wipro had offered letters to 3,000-4,000 students with science background, and the company plans to hire more in FY15, said Rajiv Kumar, global campus head, Wipro. Kumar did not reveal the exact number of recruitments.

“Campus hiring has always been strategic to Wipro’s hiring strategy. But other than hiring engineers, we have been hiring students from science background in good numbers through two of our programmes -- Wipro Academy of Software Engineers and Wipro Software Technology Academy,” added Kumar.

According to sources, about 5,000 students were inducted and another 11,000 are in the process of joining the company through these programmes. The programmes had been launched in partnership with BITS Pilani and Vellore Institute of Technology. During the traning, the company takes care of the fee, books and accomodation. Besides, students are given a stipend of about Rs 12,000 in the first year and goes up to 20,000 in the fourth year.

“These are true earn-as-you-learn programmes. After four years, their career paths are similar to any engineer. They can start as developer, project manager etc. More importantly, we do not sign any bond with the student. So after the fourth year, if a candidate wishes to leave Wipro, they can. The only condition is that they have to complete the four-year tenure,” said Kumar.

Kumar said candidates who have completed the programmes would draw more salary than that of an engineer. “They get paid more than an entry level engineering candidate. It is generally in the range of Rs 4,00,000–6,00,000 per annum,” he said. The average salary an entry level engineer draws is around Rs 3,00,000–Rs  3,50,000 per annum.

“Our experience tells us that the attrition rate in this group is in single digits: Much lower than the company average. Also, we do not hire these students for our BPO operations,” said Kumar.
>>>
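
The string() result still starts with blank lines and carries the page's \r\n line breaks and a non-breaking space. A possible cleanup step is sketched below; clean_article_text is a hypothetical helper name and the raw value is a shortened, made-up sample rather than the real extraction. It keeps the text as Unicode, since the encode('ascii', 'ignore') step in the question is what silently drops the curly quotes and dashes.

```python
import re


def clean_article_text(raw):
    """Split raw extracted text into paragraphs and drop blank lines."""
    paragraphs = []
    for line in raw.splitlines():
        # Collapse runs of whitespace (including non-breaking spaces) to one space
        line = re.sub(r"\s+", " ", line, flags=re.UNICODE).strip()
        if line:
            paragraphs.append(line)
    return "\n".join(paragraphs)


# Shortened, made-up sample shaped like the string() output above
raw = u"\r\n         \r\n    \r\nFirst paragraph with a link.\r\n\r\nRs\xa0 3,50,000 per annum."
print(clean_article_text(raw))
```

In a spider, the value passed in would be the first item of the extract() call on the string(...) expression shown above.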