I want to extract the content of a <div> from the source of this page:
http://www.ireland.com/en-gb/destinations/republic-of-ireland/sligo/articles/mullaghmore
Using Inspect Element, I can see that the information I want to extract from the HTML is in
<div class="content-panel">
and the XPath of this container is:
//*[@id="PageContent_page_container"]/section/div[2]
This is the relevant part of the page:
<body>
...
<div id="PageContent_page_container" class="page-container">
<section class="page editorial" ... >
<div id="PageContent_content__1149779ca5de748_HeroPanel" class="hero-panel">...</div>
<div class="content-panel"> <!-- THIS IS THE DIV SELECTED -->
<div class="content">
<p id="PageContent_content__1149779ca5de748_Introduction" class="intro">This is Yeats Country. This is surfer country. This is a country of castles, seaside, and Ireland's very own Table Mountain. This is Mullaghmore, County Sligo</p>
<p>Jutting out of<a href="/en-gb/destinations/republic-of-ireland/sligo/articles/sligo"> Sligo&rsquo;</a>s northern edge, close to the county&rsquo;s border with Donegal, the small peninsula of <a href="/en-gb/destinations/republic-of-ireland/sligo/articles/mullaghmore">Mullaghmore</a> sits dramatically out into the North Atlantic.&nbsp;</p>
<p>The waters here are not simply photogenic. They have become known for some of the most sought-after waves in surfing. Mullahgmore is notably championed for one big break in particular which <em>Surfing magazine</em> has dubbed &ldquo;a mutant Irish left&rdquo;. Surfing is in the blood here. The famous Irish pro-surfer and local Sligo legend, Easkey Britton, was even named after a beach called Easkey, just an hour&rsquo;s drive further south of Mullaghmore. While you&rsquo;re in the neighbourhood, why not head down to Strandhill and indulge in a indulge in the traditional Irish therapy/detox treatment of a warm seaweed bath courtesy of Voya Seaweed Baths.&nbsp;</p>
<h3>Anyone for golf?</h3>
<p>Curving around to create a natural bay, the peninsula&rsquo;s eastern coast stretches into an elegant sweep. From here, you're looking up along <a href="/en-gb/destinations/republic-of-ireland/donegal/articles/donegal">Donegal</a>&rsquo;s southern borders at <a href="/en-gb/what-is-available/golf/golf-courses/destinations/republic-of-ireland/donegal/bundoran/all/1-365" target="_self">Bundoran Golf Club</a> and the point where the River Erne flows into the Atlantic. </p>
<h3>By the mountains and sea</h3>
<p>Also on this eastern side, sits the tiny village of Mullaghmore overlooked by two of Sligo&rsquo;s icons. The first is <a href="/en-gb/what-is-available/natural-landscapes-and-sights/natural-landscapes/destinations/republic-of-ireland/sligo/sligo-town/all/1-87761" target="_self">Ben Bulben mountain</a>, part of the Dartry Mountains, a range shared by both Sligo and its neighbour <a href="/en-gb/destinations/republic-of-ireland/leitrim/articles/leitrim">Leitrim</a>. Ben Bulben sits on Sligo&rsquo;s coast surging out towards the North Atlantic and shadowing the village of Mullaghmore. </p>
<h3>A poet&rsquo;s land</h3>
<p>For many, Sligo is considered Yeats Country. For a poet so concerned with his home county and especially its landscape, there was no escaping Ben Bulben. The mountain&rsquo;s most noted reference in Yeats&rsquo;s poetry is in the work <em>Under Ben Bulben</em>, in which he describes horsemen who &ldquo;ride the wintry dawn/Where Ben Bulben sets the scene".</p>
<h3>Walk in the wild</h3>
<p>For those wishing to become more intimately acquainted with the mountain, the Ben Bulben (Gortarowey) Looped Walk is a 4km (2.5 mile) route of easy-going terrain and some minor ascents. For a more thorough on-foot exploration of Mullaghmore, set off on the 8km (5 mile) beach and pier walk along Bunduff Strand. </p>
<div class="in-page-carousel">
<div class="content">
<ul>
<li>
<figure>
<img id="Page..._CarouselImage_0" src="http://.../sdp4_mullaghmore_car-1.jpg" alt="Ben Bulben Mountain" style="height:323px;width:571px;" />
<figcaption id="Page..._CarrouselCaptionSection_0">Ben Bulben Mountain</figcaption>
</figure>
</li>
<li>
<figure>
<img id="Page..._CarouselImage_1" src="http://.../sdp4_mullaghmore_car-2.jpg" alt="..." style="height:323px;width:571px;" />
<figcaption id="Page..._CarrouselCaptionSection_1">Classiebawn Castle provided by <a href="http://..." >Patryk Kosmider</a> </figcaption>
</figure>
</li>
<li>
<figure>
<img id="Page..._CarouselImage_2" src="http://.../sdp4_mullaghmore_car-3.jpg" style="height:323px;width:571px;" />
<figcaption id="Page..._CarrouselCaptionSection_2"></figcaption>
</figure>
</li>
</ul>
</div>
</div>
<aside class="related-items">
<h2>Related providers</h2>
<ul>
<li><a id="Page..._Link_0" href="/.../1-62182">Strandhill Surf School</a></li>
<li><a id="Page..._Link_1" href="/.../1-82855">Offshore Watersports</a></li>
<li><a id="Page..._Link_2" href="/.../1-91302">Mullaghmore Head - Wild Atlantic Way</a></li>
<li><a id="Page..._Link_3" href="/.../1-9850">Mullaghmore Sailing Club and Centre Ltd</a></li>
<li><a id="Page..._Link_4" href="/.../1-87095">Yeats Country Hotel, Spa and Leisure Club</a></li>
</ul>
</aside>
<p>Few images of Mullaghmore, and for that matter Sligo, will fail to include Classiebawn Castle. Sitting in a modest rise in an evergreen spread of field about a hundred metres from the sea, there&rsquo;s an air of Disney whimsy about Classiebawn. Instantly recognisable by its conical turret, building of the castle was begun by the British statesman. Classiebawn is privately owned, but well worth a visit before you set back on your Wild Atlantic Way journey.</p>
<p><strong>Geographical coordinates:</strong> Latitude: 54.465546; Longitude: -8.449455</p>
</div>
</div>
</section>
</div>
...
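I have tried many different XPath combinations in Google Docs to return the text, but I am not getting all of the text content inside <div class="content-panel">.
If I use the expression //div/p/text()
all of the text is returned as separate rows, but I need the text from every <p>
tag in a single field.
If anyone can advise, that would be great. Thanks. Ali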
Answer 0 (score: 0)
If you just want all of the text content, you can wrap your expression in string():
string(//*[@id="PageContent_page_container"]/section/div[2])
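or in normalize-space() if you also want to strip leading and trailing whitespace and collapse runs of whitespace into single spaces: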
normalize-space(//*[@id="PageContent_page_container"]/section/div[2])
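To make it concrete what these two functions return, here is a minimal sketch in Python using only the standard library. Python and ElementTree are my own choice for illustration (the question uses Google Docs), and SAMPLE is a heavily simplified, hypothetical stand-in for the real page. ElementTree has no string() or normalize-space(), so the two helpers below imitate them by hand.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (not the real page).
SAMPLE = """
<div id="PageContent_page_container">
  <section>
    <div class="hero-panel">Hero</div>
    <div class="content-panel">
      <div class="content">
        <p class="intro">This is Yeats Country.</p>
        <p>Jutting out of <a href="#">Sligo</a>'s northern edge.</p>
        <h3>Anyone for golf?</h3>
      </div>
    </div>
  </section>
</div>
""".strip()


def string_value(element):
    """Concatenate every descendant text node, as XPath string() does."""
    return "".join(element.itertext())


def normalize_space(text):
    """Trim and collapse space/tab/CR/LF runs, as XPath normalize-space() does."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" \t\r\n")


root = ET.fromstring(SAMPLE)
# ElementTree supports a subset of XPath; this picks the second <div>
# child of <section>, i.e. the content panel.
panel = root.find("./section/div[2]")

raw_text = string_value(panel)            # one string, original whitespace kept
single_field = normalize_space(raw_text)  # one string, whitespace collapsed
print(single_field)
```

Either way you end up with one string for the whole container rather than one row per text node, which is what //div/p/text() gives you.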
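If you want to select individual elements that contain text, you can append further location steps to the context node to get a node-set (from which you can then extract individual nodes), for example:
1) The text of the second paragraph (string() applied to a node-set returns the string-value of the first node in document order):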
string(//*[@id="PageContent_page_container"]/section/div[2]//p[2])
2) A node-set containing all <h3> and <h2> headings (you can loop over it and extract the text of each heading):
//*[@id="PageContent_page_container"]/section/div[2]//*[name() = 'h3' or name() = 'h2']
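The two selections above could be sketched like this, again in standard-library Python as an illustration only. SAMPLE is a hypothetical, cut-down version of the content panel, and text_of is a helper name I made up. ElementTree does not support name() tests, so the heading selection is written as a filter over iter(), which visits elements in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for <div class="content-panel">.
SAMPLE = """
<div class="content-panel">
  <div class="content">
    <p class="intro">This is Yeats Country.</p>
    <p>Jutting out of <a href="#">Sligo</a>'s northern edge.</p>
    <h3>Anyone for golf?</h3>
    <p>Curving around to create a natural bay.</p>
    <h3>By the mountains and sea</h3>
    <aside class="related-items">
      <h2>Related providers</h2>
    </aside>
  </div>
</div>
""".strip()


def text_of(element):
    """All descendant text of the element, with whitespace collapsed.

    Roughly normalize-space(string(.)); str.split() also treats other
    Unicode whitespace as separators, which XPath does not.
    """
    return " ".join("".join(element.itertext()).split())


panel = ET.fromstring(SAMPLE)

# 1) Counterpart of string(...//p[2]): find() returns the first <p>
#    that is the second <p> child of its parent.
second_paragraph = text_of(panel.find(".//p[2]"))

# 2) Counterpart of ...//*[name() = 'h3' or name() = 'h2']:
#    loop over every element and keep the headings.
headings = [text_of(el) for el in panel.iter() if el.tag in ("h2", "h3")]

print(second_paragraph)
print(headings)
```

Note that the text inside the nested <a> is included in the paragraph's text, which is the same behaviour string() has.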