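I am scraping some content, but I want to exclude certain elements. For example, from the main div id="Introduction" I want to scrape only the h2 and the 2 paragraphs, excluding the span class="section_edit_link" and the div class="photo_container". I could of course extract the elements I want and join them, but since every section contains these 2 elements that I want to exclude, is there a way to exclude them in the XPath itself?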
<div id="Introduction"><span class="section_edit_link"><a href="/wiki_edit.cfm?title=Seoul&section=Introduction" title="Edit section: Introduction" rel="nofollow">edit</a> </span>
<h2>Introduction</h2>
<div class="photo_container">
<a href="https://www.travellerspoint.com/photos/stream/photoID/80/features/countries/South Korea/"><img src="https://photos.travellerspoint.com/8818/thumb_dhessel_seoul.jpg" width="200" height="146" alt="Night time traffic in Seoul" class="photo"></a>
<h4>Night time traffic in Seoul</h4>
<p>© All Rights Reserved <a href="https://www.travellerspoint.com/users/Hessell/">Hessell</a></p>
</div>
<p><strong>Seoul</strong> (서울) is the heart of <a href="http://www.travellerspoint.com/guide/South_Korea/">South Korea</a>, hosting about a quarter of the country's population of nearly 50 million. Seoul was also the historic capital of Korea from the 14th century until the nation's partition into <a href="http://www.travellerspoint.com/guide/North_Korea/">North</a> and <a href="http://www.travellerspoint.com/guide/South_Korea/">South</a> in 1948. Located just 50 kilometres south of the North Korean border, Seoul symbolises the division of North and South Korea. </p>
<p>Seoul enjoys a lively nightlife, which has earned it comparisons with <a href="http://www.travellerspoint.com/guide/Tokyo/">Tokyo</a>. Thankfully though, Seoul is much cheaper than the <a href="http://www.travellerspoint.com/guide/Japan/">Japanese</a> capital.</p>
Answer 0 (score: 0)
If your Introduction div contains only the elements shown in the question above, the following should give you the result you want:
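yield {
    # Outer HTML of the section heading
    'heading': response.css('#Introduction > h2').extract_first(),
    # First <p> that is a direct child of the div
    'para 1': response.css('#Introduction > p').extract_first(),
    # Second paragraph; matches only when that <p> is the div's last child
    'para 2': response.css('#Introduction > p:last-child').extract_first(),
}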