Scraping HTML with R and xpathSApply

Date: 2015-02-05 17:44:53

Tags: html r xpath screen-scraping

I would like to extract information from the HTML below with xpathSApply, selecting elements by their class attributes.

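I want to capture the different pieces of information as a table, for example:

  • Effectiveness as a column filled with 5
  • the full comment, rather than the truncated one, as a column filled with the complete comment text.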

<div class="userPost">
<div class="postHeading clearfix">
  <div class="conditionInfo">
                Condition: Condition in which Stomach Acid is Pushed Into the Esophagus</div>
  <div class="date">8/12/2014 12:27:53 PM</div>
</div>
<p class="reviewerInfo">Reviewer: Believer, 35-44 Female  on Treatment for 2 to less than 5 years (Patient) </p>
<div id="ctnStars">
  <div class="catRatings firstEl clearfix">
    <p class="category">Effectiveness</p>
    <p class="inlineRating starRating"><span class="current-rating" style="width: 100%">
        Current Rating: 5</span></p>
  </div>
  <div class="catRatings clearfix">
    <p class="category">Ease of Use</p>
    <p class="inlineRating starRating"><span class="current-rating" style="width: 100%">
        Current Rating: 5</span></p>
  </div>
  <div class="catRatings lastEl clearfix">
    <p class="category">Satisfaction</p>
    <p class="inlineRating starRating"><span class="current-rating" style="width: 100%">
        Current Rating: 5</span></p>
  </div>
</div>
<p id="comTrunc1" class="comment"><strong>Comment: </strong><br>Most excellent! I tried several different rx&#39;s to help with my acid problem and none were as effective as Nexium. After being on it for 3 months I stopped because that was how long my doc thought it would take to heal me.  I stopped taking it and boom, the pain was back.  Got back on Nexium and am staying on it. Such relief was unexpected.</p>
<p id="comFull1" class="comment" style="display:none"><strong>Comment:</strong><br>Most excellent! I tried several different rx&#39;s to help with my acid problem and none were as effective as Nexium. After being on it for 3 months I stopped because that was how long my doc thought it would take to heal me.  I stopped taking it and boom, the pain was back.  Got back on Nexium and am staying on it. Such relief was unexpected.<br><a onclick="toggle('comTrunc1'); toggle('comFull1');return false;" href="#">Hide Full Comment</a></p>
<div class="actionLinks clearfix">
  <p class="helpful">4
                        people

                found this review helpful.<br>
                Was this review helpful?  <span id="513102_Vote"><a href="#" onclick="return FoundHelpFul('8cbc5bf1-4f86-48e4-ac0f-5b3085949a2a', 513102, true)">Yes</a> | <a href="#" onclick="return FoundHelpFul('8cbc5bf1-4f86-48e4-ac0f-5b3085949a2a', 513102, false)">No</a></span></p><a class="reportAbuse" href="#" onclick="showPopWin('ReportAbuse.aspx?reviewid=513102&amp;userid=8cbc5bf1-4f86-48e4-ac0f-5b3085949a2a',400,160,null, false); return false;">Report This Post</a></div>

1 Answer:

Answer 0 (score: 0)

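It is not clear to me exactly what you are after, but here is a start. If this is not the direction you want, please edit your question after trying something along these lines (and include your code). Assuming "url" is the URL of the site the HTML came from, try something like the following: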

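library(XML)  # the package name is case-sensitive
# parse the page at url; useInternalNodes = TRUE returns a document that XPath functions can query
doc <- htmlTreeParse(url, useInternalNodes = TRUE)

# extract the text of each category node ("Effectiveness", "Ease of Use", "Satisfaction")
data <- xpathSApply(doc, "//div[@id = 'ctnStars']//p[@class = 'category']", xmlValue, trim = TRUE)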
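
To get closer to the table described in the question, one possible approach is to loop over each userPost block and pull the category names, the ratings, and the full comment from it. The following is only a sketch under some assumptions: the html string is a shortened copy of the markup from the question (the comment text is abbreviated so the example runs offline), extract_review is a hypothetical helper name, and it is assumed that every review has the same categories and a comFull paragraph.

```r
library(XML)

# Shortened copy of the question's HTML (comment text abbreviated for the example).
html <- '<div class="userPost">
<div id="ctnStars">
  <div class="catRatings firstEl clearfix">
    <p class="category">Effectiveness</p>
    <p class="inlineRating starRating"><span class="current-rating" style="width: 100%">
        Current Rating: 5</span></p>
  </div>
  <div class="catRatings clearfix">
    <p class="category">Ease of Use</p>
    <p class="inlineRating starRating"><span class="current-rating" style="width: 100%">
        Current Rating: 5</span></p>
  </div>
  <div class="catRatings lastEl clearfix">
    <p class="category">Satisfaction</p>
    <p class="inlineRating starRating"><span class="current-rating" style="width: 100%">
        Current Rating: 5</span></p>
  </div>
</div>
<p id="comTrunc1" class="comment"><strong>Comment: </strong><br>Most excellent!</p>
<p id="comFull1" class="comment" style="display:none"><strong>Comment:</strong><br>Most excellent! Such relief was unexpected.<br><a href="#">Hide Full Comment</a></p>
</div>'

doc <- htmlParse(html, asText = TRUE)

# Hypothetical helper: turn one userPost node into a one-row data frame.
extract_review <- function(post) {
  # The leading "." keeps each query relative to this post.
  categories <- xpathSApply(post, ".//div[@id='ctnStars']//p[@class='category']",
                            xmlValue, trim = TRUE)
  ratings <- xpathSApply(post, ".//div[@id='ctnStars']//span[@class='current-rating']",
                         xmlValue)
  # "Current Rating: 5" -> 5
  ratings <- as.integer(gsub("[^0-9]", "", ratings))
  # Direct text() children skip the "Comment:" label and the "Hide Full Comment" link.
  comment <- xpathSApply(post, ".//p[starts-with(@id, 'comFull')]/text()", xmlValue)
  comment <- paste(trimws(comment), collapse = " ")

  row <- data.frame(as.list(ratings), comment, stringsAsFactors = FALSE)
  names(row) <- c(categories, "Comment")
  row
}

posts <- getNodeSet(doc, "//div[@class='userPost']")
reviews <- do.call(rbind, lapply(posts, extract_review))
print(reviews)
```

For a live page, the same function could be applied to a document parsed from the URL instead of the html string. The result has one row per review, with one column per rating category plus a Comment column holding the full comment.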