XPath for the HTML5 <article> tag

Time: 2016-01-16 22:38:10

Tags: xml html5 xpath web-scraping google-sheets

I am using the =importXML function in Google Spreadsheets to pull information from different websites. I am trying to use XPath to get the text inside the <article> tag.

This is the source data:

<div id="blog-post-body-ad" class="ad">
    </div>

    <article class="blog-post-body">
        <p>Fox&#39;s <em>X-Men </em>drama <em>Hellfire </em>is making a change at the top.</p>
<p>Writers Evan Katz and Manny Coto, who co-created the drama, are exiting, <em>The Hollywood Reporter </em>has learned. Also out are Patrick McKay and John D. Payne, who came up the the story for the drama alongside Katz and Coto and were set to pen the script. A search is under way for a new writer.</p>
<p>The changes come as <em>Hellfire </em>is on a slower development track, insiders say. <em>Hellfire, </em>which previously was&nbsp;<a href="http://www.hollywoodreporter.com/live-feed/fox-nears-deal-x-men-813542">considered a live-action&nbsp;<em>X-Men</em></a>, follows a young special agent who learns that a power-hungry woman with extraordinary abilities is working with a clandestine society of millionaires &mdash; known as &quot;The Hellfire Club&quot; &mdash; to take over the world.</p>
<p>
    <div class="embedded-content" data-nid="832221" data-nodetype="blog" data-template="readmore">
      <script type="application/json">
        {
          "nid": 832221,
          "type": "blog",
          "title": "Marvel Sets &#039;Legion&#039; Pilot With Noah Hawley at FX, Readying &#039;Hellfire&#039; for Fox",
          "path": "http://www.hollywoodreporter.com/live-feed/marvel-legion-noah-hawley-fx-832221",
          "relative-path": "/live-feed/marvel-legion-noah-hawley-fx-832221"
        }
      </script>
    </div></p>
<p>Sources say the <em>X-Men </em>drama is not likely to go to pilot this season as it remains on a slower track. The change comes as Katz and Coto are shifting their focus to Fox&#39;s <em><a href="http://www.hollywoodreporter.com/live-feed/fox-greenlights-prison-break-event-856203" target="_blank">24: Legacy</a>, </em>which received a formal pilot order Friday during Fox&#39;s time in front of the press at the Television Critics Association&#39;s winter press tour. The new take on 24 will feature an entirely new cast with a diverse lead as Fox has high hopes to reboot the franchise for a new era.</p>
<p>The change at the top should not worry diehard fans of the <em>X-Men </em>franchise. Sources say Fox remains committed to <em>Hellfire </em>and wants to get it completely right as the <em>X-Men </em>franchise remains a valuable asset for the company. Should <em>Hellfire</em> go to series and the network renew Batman prequel <em>Gotham, </em>the network would have dramas from both comic book powerhouses DC Comics and Marvel &mdash; a first for a broadcast network and something insiders would love to see on their schedule.</p>
<p>&nbsp;</p>

        <footer class="blog-post-tags">
                            <a href="/topic/tv-development" data-tracklabel="Story Well - Bottom Tags TV Development">TV Development</a>
                    </footer>
    </article>

    <div class="blog-post-footer-ad">

Using Google Chrome > Inspect > Copy XPath, I got:

//*[@id="page-content"]/div[1]/article

I tried it, but Google Sheets gave me a parse error.
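
A likely cause of the parse error, assuming the expression was pasted into the formula unchanged: the XPath copied from Chrome wraps page-content in double quotes, and those end the formula's string argument early. Using single quotes inside the XPath keeps the formula well-formed. The sketch below assumes cell C2 holds the page URL, as in the formula further down. Whether the path then matches anything depends on the HTML that Google's fetcher receives, which can differ from what Chrome shows.

```
=IMPORTXML(C2, "//*[@id='page-content']/div[1]/article")
```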

I also tried a solution from another Stack Overflow question, but it did not work for me:

=importXML(C2,"//article[contains(concat('', normalize-space(@class), ''), '')//div[@class='blog-post-body']]")

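What I want to achieve is to get all the text inside the <article> tag. A big plus would be getting the <article> text without the <div class="embedded-content"> block that sits in the middle of the article.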

1 answer:

Answer 0 (score: 1)

This works for that article:

=concatenate(IMPORTXML("http://www.hollywoodreporter.com/live-feed/foxs-x-men-spinoff-showrunners-856338","//p[3] | //p[4] | //p[5] | //p[6] "))
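
The positional selectors above depend on where each <p> happens to fall on the page. A less position-dependent variant, offered as an untested sketch, selects the paragraphs by their parent <article> class and skips any paragraph that contains the embedded-content block. It assumes cell C2 holds the article URL, as in the question, and that the page served to Google Sheets matches the source shown there. The predicate is written so that it does no harm if the parser places the embedded <div> beside the <p> instead of inside it.

```
=concatenate(IMPORTXML(C2, "//article[contains(@class,'blog-post-body')]/p[not(.//div[contains(@class,'embedded-content')])]"))
```

Because the path takes only <p> children of the <article>, the <footer class="blog-post-tags"> links are left out as well. The single quotes inside the XPath avoid clashing with the double quotes that delimit the formula string.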