XPath expression to return <div class="content-panel"></div>

Time: 2014-05-22 20:40:42

Tags: html xpath

I want to extract the content <div> from the HTML of this URL:

http://www.ireland.com/en-gb/destinations/republic-of-ireland/sligo/articles/mullaghmore

Using Inspect Element, I found that the information I want to extract is in <div class="content-panel">, and the XPath of this container is:

//*[@id="PageContent_page_container"]/section/div[2]

This is the relevant part of the page:

<body>
    ...
    <div id="PageContent_page_container" class="page-container">
        <section class="page editorial" ... >

            <div id="PageContent_content__1149779ca5de748_HeroPanel" class="hero-panel">...</div>

            <div class="content-panel"> <!-- THIS IS THE DIV SELECTED -->
                <div class="content">
                    <p id="PageContent_content__1149779ca5de748_Introduction" class="intro">This is Yeats Country. This is surfer country. This is a country of castles, seaside, and Ireland's very own Table Mountain. This is Mullaghmore, County Sligo</p>
                    <p>Jutting out of<a href="/en-gb/destinations/republic-of-ireland/sligo/articles/sligo"> Sligo&amp;rsquo;</a>s northern edge, close to the county&amp;rsquo;s border with Donegal, the small peninsula of <a href="/en-gb/destinations/republic-of-ireland/sligo/articles/mullaghmore">Mullaghmore</a> sits dramatically out into the North Atlantic.&amp;nbsp;</p>
                    <p>The waters here are not simply photogenic. They have become known for some of the most sought-after waves in surfing. Mullahgmore is notably championed for one big break in particular which <em>Surfing magazine</em> has dubbed &amp;ldquo;a mutant Irish left&amp;rdquo;. Surfing is in the blood here. The famous Irish pro-surfer and local Sligo legend, Easkey Britton, was even named after a beach called Easkey, just an hour&amp;rsquo;s drive further south of Mullaghmore. While you&amp;rsquo;re in the neighbourhood, why not head down to Strandhill and indulge in a indulge in the traditional Irish therapy/detox treatment of a warm seaweed bath courtesy of Voya Seaweed Baths.&amp;nbsp;</p>
                    <h3>Anyone for golf?</h3>
                    <p>Curving around to create a natural bay, the peninsula&amp;rsquo;s eastern coast stretches into an elegant sweep. From here, you're looking up along <a href="/en-gb/destinations/republic-of-ireland/donegal/articles/donegal">Donegal</a>&amp;rsquo;s southern borders at <a href="/en-gb/what-is-available/golf/golf-courses/destinations/republic-of-ireland/donegal/bundoran/all/1-365" target="_self">Bundoran Golf Club</a> and the point where the River Erne flows into the Atlantic. </p>
                    <h3>By the mountains and sea</h3>
                    <p>Also on this eastern side, sits the tiny village of Mullaghmore overlooked by two of Sligo&amp;rsquo;s icons. The first is <a href="/en-gb/what-is-available/natural-landscapes-and-sights/natural-landscapes/destinations/republic-of-ireland/sligo/sligo-town/all/1-87761" target="_self">Ben Bulben mountain</a>, part of the Dartry Mountains, a range shared by both Sligo and its neighbour <a href="/en-gb/destinations/republic-of-ireland/leitrim/articles/leitrim">Leitrim</a>. Ben Bulben sits on Sligo&amp;rsquo;s coast surging out towards the North Atlantic and shadowing the village of Mullaghmore. </p>
                    <h3>A poet&amp;rsquo;s land</h3>
                    <p>For many, Sligo is considered Yeats Country. For a poet so concerned with his home county and especially its landscape, there was no escaping Ben Bulben. The mountain&amp;rsquo;s most noted reference in Yeats&amp;rsquo;s poetry is in the work <em>Under Ben Bulben</em>, in which he describes horsemen who &amp;ldquo;ride the wintry dawn/Where Ben Bulben sets the scene".</p>
                    <h3>Walk in the wild</h3>
                    <p>For those wishing to become more intimately acquainted with the mountain, the Ben Bulben (Gortarowey) Looped Walk is a 4km (2.5 mile) route of easy-going terrain and some minor ascents. For a more thorough on-foot exploration of Mullaghmore, set off on the 8km (5 mile) beach and pier walk along Bunduff Strand. </p>
                    <div class="in-page-carousel">
                        <div class="content">
                            <ul>
                                <li>
                                    <figure>
                                        <img id="Page..._CarouselImage_0" src="http://.../sdp4_mullaghmore_car-1.jpg" alt="Ben Bulben Mountain" style="height:323px;width:571px;" />
                                        <figcaption id="Page..._CarrouselCaptionSection_0">Ben Bulben Mountain</figcaption>
                                    </figure>
                                </li>
                                <li>
                                    <figure>
                                        <img id="Page..._CarouselImage_1" src="http://.../sdp4_mullaghmore_car-2.jpg" alt="..." style="height:323px;width:571px;" />
                                        <figcaption id="Page..._CarrouselCaptionSection_1">Classiebawn Castle provided by <a href="http://..." >Patryk Kosmider</a> </figcaption>
                                    </figure>
                                </li>
                                <li>
                                    <figure>
                                        <img id="Page..._CarouselImage_2" src="http://.../sdp4_mullaghmore_car-3.jpg" style="height:323px;width:571px;" />
                                        <figcaption id="Page..._CarrouselCaptionSection_2"></figcaption>
                                    </figure>
                                </li>
                            </ul>
                        </div>
                    </div>
                    <aside class="related-items">
                        <h2>Related providers</h2>
                        <ul>
                            <li><a id="Page..._Link_0" href="/.../1-62182">Strandhill Surf School</a></li>
                            <li><a id="Page..._Link_1" href="/.../1-82855">Offshore Watersports</a></li>
                            <li><a id="Page..._Link_2" href="/.../1-91302">Mullaghmore Head - Wild Atlantic Way</a></li>
                            <li><a id="Page..._Link_3" href="/.../1-9850">Mullaghmore Sailing Club and Centre Ltd</a></li>
                            <li><a id="Page..._Link_4" href="/.../1-87095">Yeats Country Hotel, Spa and Leisure Club</a></li>
                        </ul>
                    </aside>
                    <p>Few images of Mullaghmore, and for that matter Sligo, will fail to include Classiebawn Castle. Sitting in a modest rise in an evergreen spread of field about a hundred metres from the sea, there&amp;rsquo;s an air of Disney whimsy about Classiebawn. Instantly recognisable by its conical turret, building of the castle was begun by the British statesman. Classiebawn is privately owned, but well worth a visit before you set back on your Wild Atlantic Way journey.</p>
                    <p><strong>Geographical coordinates:</strong> Latitude: 54.465546; Longitude: -8.449455</p>
                </div>
            </div>
        </section>
    </div>
    ...

I have tried many different XPath combinations in Google Docs to return the text, but none of them returns all of the text content inside <div class="content-panel">.

If I use the expression //div/p/text(), all of the text is returned as separate rows, but I need the text from every <p> tag together in a single field.

If anyone could advise, that would be great. Thanks, Ali

1 Answer:

Answer 0 (score: 0)

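If you just want all of the text content, you can wrap the expression in string(). Given a node-set, string() returns the string value of its first node in document order, that is, the concatenated text of that element and all of its descendants: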

string(//*[@id="PageContent_page_container"]/section/div[2])

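Use normalize-space() instead if you also want to get rid of the extra whitespace (it strips leading and trailing whitespace and collapses each run of whitespace into a single space):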

normalize-space(//*[@id="PageContent_page_container"]/section/div[2])
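
If you want to check what these two functions should return outside of a spreadsheet, here is a minimal sketch using only Python's standard library. It assumes a reduced, well-formed stand-in for the page (the SAMPLE markup and the helper names string_value and normalize_space are made up for illustration; real-world HTML often is not well-formed XML and would need an HTML parser instead). ElementTree has no string() or normalize-space() functions, so the sketch emulates them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the page in the question.
SAMPLE = """<body>
  <div id="PageContent_page_container" class="page-container">
    <section class="page editorial">
      <div class="hero-panel">Hero</div>
      <div class="content-panel">
        <div class="content">
          <p class="intro">This is Yeats Country.</p>
          <p>Jutting out of <a href="#">Sligo</a>'s northern edge.</p>
          <h3>Anyone for golf?</h3>
        </div>
      </div>
    </section>
  </div>
</body>"""


def string_value(node):
    """Emulate XPath string(): concatenate all descendant text of the node."""
    return "".join(node.itertext())


def normalize_space(text):
    """Emulate XPath normalize-space(): trim, then collapse whitespace runs."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" \t\r\n")


root = ET.fromstring(SAMPLE)
# Same location path as in the question, written relative to the root element.
panel = root.find(".//*[@id='PageContent_page_container']/section/div[2]")

full_text = string_value(panel)         # still contains the source indentation
clean_text = normalize_space(full_text)
print(clean_text)
```

The raw string value keeps every newline and indent from the markup, which is why normalize-space() is usually the version you want when the result has to fit in a single field.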

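If you want to select individual elements that contain text, you can append extra location steps to the context node to get a node-set (from which you can extract individual nodes), for example: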

1) The text in the second paragraph:

string(//*[@id="PageContent_page_container"]/section/div[2]//p[2])
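
Note that //p[2] matches every <p> that is the second <p> child of its parent, and string() then takes the first of those matches. A rough sketch of the same lookup with Python's standard library, again assuming a made-up, well-formed SAMPLE in place of the real page:

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the page in the question.
SAMPLE = """<body>
  <div id="PageContent_page_container">
    <section>
      <div class="hero-panel">Hero</div>
      <div class="content-panel">
        <div class="content">
          <p class="intro">This is Yeats Country.</p>
          <p>Jutting out of <a href="#">Sligo</a>'s northern edge.</p>
          <p>The waters here are not simply photogenic.</p>
        </div>
      </div>
    </section>
  </div>
</body>"""

root = ET.fromstring(SAMPLE)
panel = root.find(".//*[@id='PageContent_page_container']/section/div[2]")

# ".//p[2]" matches each <p> that is the second <p> child of its parent;
# find() returns the first match in document order.
second_p = panel.find(".//p[2]")

# Joining itertext() includes the text of the nested <a> element as well.
second_text = "".join(second_p.itertext())
print(second_text)
```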

2) A node-set containing all of the <h3> and <h2> headings (which you can loop over to extract the text of each):

//*[@id="PageContent_page_container"]/section/div[2]//*[name() = 'h3' or name() = 'h2']
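
The "loop over the node-set" step could look roughly like this in Python. This is only a sketch under the same assumptions as before (a hypothetical, well-formed SAMPLE). ElementTree's XPath subset does not support name() or the or operator, so the tag filter is done in Python instead of in the expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the page in the question.
SAMPLE = """<body>
  <div id="PageContent_page_container">
    <section>
      <div class="hero-panel">Hero</div>
      <div class="content-panel">
        <div class="content">
          <h3>Anyone for golf?</h3>
          <p>Curving around to create a natural bay.</p>
          <h3>By the mountains and sea</h3>
          <aside class="related-items">
            <h2>Related providers</h2>
          </aside>
        </div>
      </div>
    </section>
  </div>
</body>"""

root = ET.fromstring(SAMPLE)
panel = root.find(".//*[@id='PageContent_page_container']/section/div[2]")

# iter() walks the panel and its descendants in document order;
# keep only the heading elements.
headings = [el for el in panel.iter() if el.tag in ("h2", "h3")]

# Extract the text of each heading in turn.
heading_texts = ["".join(el.itertext()).strip() for el in headings]
for text in heading_texts:
    print(text)
```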