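Using XPath with the XML package to scrape nodes whose attribute values vary

Asked: 2014-02-08 10:24:56

Tags: xml r xpath

I am using the very useful XML package in R to scrape a web page. I am a beginner with XPath and learned the basics from the w3schools website. I want to select nodes that have an attribute whose value varies, and I currently cannot do this efficiently. My code and the problem I ran into are shown below: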

require(XML)
myUrl <- "http://www.expatforum.com/expats/uae-expat-forum-expats-living-uae/336985-visa-overstay.html"
extracted <- htmlParse(myUrl)
# This parses the HTML data; a snippet from it is shown below

<td class="alt1" id="td_post_3081025">

<!-- message, attachments, sig -->




        <!-- icon and title -->
        <div class="smallfont">
            <img class="inlineimg" src="http://www.expatforum.com/expats/images/icons/icon8.gif" alt="Angry" border="0" />
            <strong>Visa Overstay</strong>
        </div>
        <hr size="1" style="color:#068200; background-color:#068200" />
        <!-- / icon and title -->


    <!-- message -->
    <div id="post_message_3081025">

        Looking for advice for a complicated situation. I am currently in the UAE working as a teacher with a valid visa. My boyfriend has been living here for 10 years with a valid visa until 2013. There was a discrepency between him and his sponsor ($$$) and his visa was canceled without his knowledge.He was called into the police stattion without even knowing that there was an issue with his visa. He went willingly because he had nothing to hide.  He was arrested and jailed for about a month then told he had 3 months to &quot;fix&quot; his problem. The issue has been in the labor courts since then and he is currently living here without a visa (for over a year now). He has called his sponsor and gone to the ministry of labor countless times and no one gives him a direct answer about what he can do to get the block off of his name but no one has arrested him since the initial incident. His sponsor says that he no longer cares and that he would take the block off his name but it is already in the labor courts so there's technically nothing they can do. He wants to turn himself in so that he can pay the overstay charges or do jail  time and either reapply for another visa or go somewhere else but is country of origin is Syria and he is scared that they will send him there without any other safe options. Would someone be able to choose where they fly after facing overstay jail time? Is this criteria for deportation and the inability to reapply for another visa here in the UAE? Does anyone know how this process works? It's a scary situation and he needs it to be resolved so that he can begin living his life again.
    </div>
    <!-- / message -->

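Now, I want to extract the text of the post, which is contained in the <div id="post_message_3081025"> tag. It might seem that this can be done easily with //div[@id]. However, the complete file contains many other div nodes that also have an id attribute, so that expression matches far too much.

I figure the only solution is to select on the value of the id attribute somehow. But the numeric part of that value differs from post to post. I tried //div[@id='post_message_*'], but it did not work.

For now I have resorted to a longer and less efficient approach: converting the data with as(x,"character"), filtering with grepl("^div id='post_message'",x), and then using gsub() to remove the unnecessary bits.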

Is there a better way to do this?

Thank you for your time.

2 Answers:

Answer 0: (score: 1)

You can use starts-with:

//div[starts-with(@id, "post_message")]
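
The attempt in the question fails because XPath 1.0 does not treat * as a wildcard inside a string literal, so @id='post_message_*' is compared literally. As a rough sketch of how the starts-with expression might be used from R, here is a small self-contained example. The inline HTML and the names html, doc and posts are made up for illustration and only stand in for the parsed forum page:

```r
library(XML)

# Made-up HTML standing in for the forum page (an assumption for illustration)
html <- '<html><body>
  <div id="post_message_1"> First post </div>
  <div id="edit_1">Not a post</div>
  <div id="post_message_2"> Second post </div>
</body></html>'

# Parse the string rather than a URL so the example runs offline
doc <- htmlParse(html, asText = TRUE)

# Select every div whose id begins with "post_message" and take its text
posts <- xpathSApply(doc, '//div[starts-with(@id, "post_message")]', xmlValue)

# Strip the surrounding whitespace left over from the markup
posts <- trimws(posts)
print(posts)
```

With the real page, the same xpathSApply call should work on the extracted object from the question, assuming every post body sits in a div whose id starts with post_message.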

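Answer 1: (score: 1)

I am adding another answer based on the selectr package, which lets you query HTML documents with CSS selectors instead of XPath. I find CSS selectors easier to grok.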

library(selectr)
querySelector(extracted, 'div[id^=post_message]')

We are looking for divs whose id starts with post_message.
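
Note that querySelector returns only the first matching node. If you want every post, querySelectorAll is probably what you need. A minimal sketch, again using made-up inline HTML (html, doc, nodes and texts are illustrative names, not part of the original page or code):

```r
library(XML)
library(selectr)

# Made-up HTML standing in for the forum page (an assumption for illustration)
html <- '<html><body>
  <div id="post_message_1">First post</div>
  <div id="edit_1">Not a post</div>
  <div id="post_message_2">Second post</div>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# [id^="..."] matches elements whose id attribute starts with the given prefix;
# querySelectorAll returns all matching nodes, not just the first one
nodes <- querySelectorAll(doc, 'div[id^="post_message"]')

# Pull the text out of each matched node
texts <- trimws(sapply(nodes, xmlValue))
print(texts)
```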