Extracting text from http://web.unep.org/inquiry/news: I want to get the headlines. According to the Firefox XPath Checker add-on, the XPath is //div[@class='highlighter']/a
(see http://i.stack.imgur.com/DeuG5.png)
But the code below gives me empty rows:
import scrapy
from unepinquiry.items import unepinquiryitem


class unepInquirySpider(scrapy.Spider):
    name = "unepinquiry"
    allowed_domains = ["web.unep.org"]
    start_urls = ["http://web.unep.org/inquiry/news"]

    def parse(self, response):
        for sel in response.xpath('//div[@class="highlighter"]'):
            item = unepinquiryitem()
            item['title'] = sel.xpath('/a/text()').extract()
            yield item
Answer 0 (score: 0)
The website's HTML markup is incorrect for the links inside these <div class="highlighter"> elements:
<div class="highlighter">
<a target="_blank" href="http://unep.org/newscentre/Default.aspx?DocumentID=27071&ArticleID=36155&l=en" />
New Report Shows How India Can Scale up Sustainable Finance</a>
<p><strong>Mumbai, 29 April 2016</strong> - India has set ambitious goals for inclusive and sustainable development, which require the mobilization of additional low-cost, long-term capital. A new report launched today by the United Nations Environment Programme (UNEP) and the Federation of Indian Chambers of Commerce and Industry (FICCI) shows how the country is already introducing innovative approaches to attract private capital for green assets - and outlines a number of key steps to deepen this process in India.</p>
<p>The report, entitled <a href="http://unepinquiry.org/wp-content/uploads/2016/04/Delivering_a_Sustainable_Financial_System_in_India.pdf">Delivering a Sustainable Financial System in India</a> profiles the actions that have been taken to advance environmental and social factors as a core part of India's banking, capital markets, investment and insurance sectors. It was jointly produced by FICCI and the UNEP Inquiry into the Design of a Sustainable Financial System, backed by a high-level India advisory council.</p>
<div class="clear"></div>
</div>
Look at how the link tag is closed (scroll right):
<a target="_blank"
href="http://unep.org/newscentre/Default.aspx?DocumentID=27071&ArticleID=36155&l=en" />
                                                                                     ^
                                                                                     |
                                                                                     here
And the title is followed by a stray closing </a> tag:
New Report Shows How India Can Scale up Sustainable Finance</a>
This causes trouble for the lxml parser (which Scrapy uses under the hood).
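As a rough illustration of what is wrong with the markup, the tag events can be listed with Python's standard-library html.parser. This is only a tokenizer view, not lxml, and the snippet below is a shortened, hypothetical copy of the page's markup with a placeholder URL and title. The sketch shows that the <a> is reported as already closed before the title text, and that a lone </a> follows it.

```python
from html.parser import HTMLParser

# Shortened, hypothetical copy of the page's markup (placeholder URL and title).
SNIPPET = """<div class="highlighter">
<a target="_blank" href="http://example.com/article" />
Example headline</a>
</div>"""


class TagLogger(HTMLParser):
    """Record tag and text events in the order the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_startendtag(self, tag, attrs):
        # Called for XML-style self-closed tags such as <a ... />
        self.events.append(("startend", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.events.append(("data", data.strip()))


logger = TagLogger()
logger.feed(SNIPPET)
logger.close()
for event in logger.events:
    print(event)
```

The ("startend", "a") event comes before the headline text, and the later ("end", "a") has no matching open tag, which is the situation the parser has to recover from.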
You can check how Scrapy "sees" the HTML by printing each <div> (calling .extract() on it to get the HTML serialization):
>>> for div in response.xpath('//div[@class="highlighter"]'):
...     print("-------------")
...     print(div.extract())
...
-------------
<div class="highlighter">
<a target="_blank" href="http://unep.org/newscentre/Default.aspx?DocumentID=27071&ArticleID=36155&l=en"></a>
New Report Shows How India Can Scale up Sustainable Finance
<p><strong>Mumbai, 29 April 2016</strong> - India has set ambitious goals for inclusive and sustainable development, which require the mobilization of additional low-cost, long-term capital. A new report launched today by the United Nations Environment Programme (UNEP) and the Federation of Indian Chambers of Commerce and Industry (FICCI) shows how the country is already introducing innovative approaches to attract private capital for green assets - and outlines a number of key steps to deepen this process in India.</p>
<p>The report, entitled <a href="http://unepinquiry.org/wp-content/uploads/2016/04/Delivering_a_Sustainable_Financial_System_in_India.pdf">Delivering a Sustainable Financial System in India</a> profiles the actions that have been taken to advance environmental and social factors as a core part of India's banking, capital markets, investment and insurance sectors. It was jointly produced by FICCI and the UNEP Inquiry into the Design of a Sustainable Financial System, backed by a high-level India advisory council.</p>
<div class="clear"></div>
</div>
-------------
<div class="highlighter">
<a target="_blank" href="http://unep.org/newscentre/default.aspx?DocumentID=27071&ArticleID=36139"></a>
Green Finance Symposium Explores Financial Mechanisms to Promote Low-Carbon Global Economic Growth
<p><strong>Washington, D.C., 16 April 2016</strong> – The Paulson Institute and the Green Finance Committee of China Society for Finance and Banking convened a half-day symposium of global finance leaders and experts to discuss recommendations for the development of robust global green finance mechanisms and markets. The recommendations coming out of the meetings will be provided to the G20 Green Finance Study Group, which is chaired by the People’s Bank of China and the Bank of England. The study group will finalize a synthesized report for the G20. SIFMA, Bloomberg Philanthropies and United Nations Environment Programme also co-hosted the event.</p>
<div class="clear"></div>
</div>
(...)
So you can see that the title you are after has become a text node that comes after the <a>.
In XPath, you can reach it with the following-sibling axis. So, for each <a>, following-sibling::text() selects the text nodes (text() being the "node test") that come after it at the same level in the HTML tree ("siblings"):
>>> for div in response.xpath('//div[@class="highlighter"]'):
...     item = {}
...     item['title'] = div.xpath('./a/following-sibling::text()').extract()
...     print(item)
...
{'title': ['\nNew Report Shows How India Can Scale up Sustainable Finance\n', '\n', '\n\n', '\n']}
{'title': ['\nGreen Finance Symposium Explores Financial Mechanisms to Promote Low-Carbon Global Economic Growth\n', '\n\n', '\n']}
(...)
{'title': ['\nThe Inquiry Speaks at PRI in Person in London, UK\n', '\n\n', '\n']}
{'title': ['\nReshaping Finance for Sustainability\n', '\n\n', '\n']}
>>>
You can see that following-sibling::text() also matches some other text nodes: '\n\n', '\n'.
You can get rid of them with a [1] predicate at the end of the XPath expression, which selects only the first match:
>>> for div in response.xpath('//div[@class="highlighter"]'):
...     item = {}
...     item['title'] = div.xpath('./a/following-sibling::text()[1]').extract()
...     print(item)
...
{'title': ['\nNew Report Shows How India Can Scale up Sustainable Finance\n']}
{'title': ['\nGreen Finance Symposium Explores Financial Mechanisms to Promote Low-Carbon Global Economic Growth\n']}
(...)
{'title': ['\nThe Inquiry Speaks at PRI in Person in London, UK\n']}
{'title': ['\nReshaping Finance for Sustainability\n']}
>>>
You can also use .extract_first() to get just the first element, instead of a list:
>>> for div in response.xpath('//div[@class="highlighter"]'):
...     item = {}
...     item['title'] = div.xpath('./a/following-sibling::text()').extract_first()
...     print(item)
...
{'title': '\nNew Report Shows How India Can Scale up Sustainable Finance\n'}
{'title': '\nGreen Finance Symposium Explores Financial Mechanisms to Promote Low-Carbon Global Economic Growth\n'}
(...)
{'title': '\nThe Inquiry Speaks at PRI in Person in London, UK\n'}
{'title': '\nReshaping Finance for Sustainability\n'}
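The extracted titles still carry the surrounding '\n' characters. A minimal cleanup sketch in plain Python follows; clean_title is a hypothetical helper name, not part of Scrapy. It accepts either the list returned by .extract() or the single string returned by .extract_first().

```python
def clean_title(texts):
    """Return the first non-blank string with whitespace collapsed, or None."""
    if isinstance(texts, str):
        texts = [texts]
    for text in texts or []:
        cleaned = " ".join(text.split())  # strips ends and collapses inner runs
        if cleaned:
            return cleaned
    return None


# Sample values copied from the shell output above.
raw_list = ['\nNew Report Shows How India Can Scale up Sustainable Finance\n', '\n', '\n\n', '\n']
raw_single = '\nReshaping Finance for Sustainability\n'

print(clean_title(raw_list))
print(clean_title(raw_single))
```

In the spider, this could be applied as item['title'] = clean_title(sel.xpath('./a/following-sibling::text()').extract()), assuming the helper is defined in or imported into the spider module.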