Extracting text from a web page

Time: 2013-11-27 13:31:25

Tags: xml r parsing xpath

Suppose I want to extract customer reviews from a site such as bestbuy.com or walmart.com, and that a fragment of the review page looks like this:

<div class="BVRRReviewTitleContainer"><span class="BVRRLabel BVRRReviewTitlePrefix"></span> <h2>
<span itemprop="name" class="BVRRValue BVRRReviewTitle">Perfect size for the kids and durable</span> </h2>
<span class="BVRRLabel BVRRReviewTitleSuffix">, </span></div>
<div class="BVRRReviewDateContainer"><span class="BVRRLabel BVRRReviewDatePrefix"></span><span class="BVRRValue BVRRReviewDate">11/22/2013<meta itemprop="datePublished" content="2013-11-22"/></span><span class="BVRRLabel BVRRReviewDateSuffix"></span></div>
<div class="RRBeforeUserContainerSpacer"></div>
<div class="BVRRUserNicknameContainer"><span class="BVRRLabel BVRRUserNicknamePrefix">By </span><span class="BVRRValue BVRRUserNickname"><span itemprop="author" class="BVRRNickname">wilbuh </span></span> <span class="BVRRLabel BVRRUserNicknameSuffix">,</span>
<div class="BVRRUserLocationContainer"><span class="BVRRLabel BVRRUserLocationPrefix"></span><span class="BVRRValue BVRRUserLocation">Oakland, ME</span><span class="BVRRLabel BVRRUserLocationSuffix"></span></div></div>
<div class="BVRROverallRatingContainer" >
<div class="BVRRRatingContainerStar"><div class="BVRRRatingEntry BVRROdd"><div id="BVRRRatingOverall_Review_Display" class="BVRRRating BVRRRatingNormal BVRRRatingOverall"><div class="BVRRLabel BVRRRatingNormalLabel"></div><div class="BVRRRatingNormalImage">
<div class="BVImgOrSprite" style="width:75px;height:15px;overflow:hidden"><img src="http://walmart.ugc.bazaarvoice.com/1336/5_0/9/rating.png" alt="5 out of 5" title="5 out of 5" width="135" height="15" />
</div></div>
<div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating" class="BVRRRatingNormalOutOf"> <span itemprop="ratingValue" class="BVRRNumber BVRRRatingNumber">5</span>
<span class="BVRRSeparatorText">out of</span>
<span itemprop="bestRating" class="BVRRNumber BVRRRatingRangeNumber">5</span>
</div></div></div></div> </div>
<div class="RRReviewDisplayStyle2BeforeContentContainerSpacer"></div>
<div class="BVRRReviewDisplayStyle2ContentContainer">
<div class="BVRRReviewTextContainer"><div class="BVRRReviewTextParagraph BVRRReviewTextFirstParagraph BVRRReviewTextLastParagraph"><span itemprop="description" class="BVRRReviewText">Bought this tablet for my kids after I purchased a no name brand and it did not perform well at all. I have the 10.1, and absolutely love it and so this 7&quot; was the perfect compliment to it. Its an amazing tablet, easy to use, and durable for my 5 and 7 year old kids.</span>

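Is it possible to extract the review title ("Perfect size for the kids and durable") and the review description ("Bought this tablet for my kids after I purchased a no name brand and it did not perform well at all. I have the 10.1, and absolutely love it and so this 7" was the perfect compliment to it. Its an amazing tablet, easy to use, and durable for my 5 and 7 year old kids.")? I would like to automate the process of extracting all review titles and descriptions.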

1 answer:

Answer 0 (score: 3)

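This is a straightforward XPath exercise, but your XML file is malformed: it is missing some closing "div" tags. I corrected it, and you can find the new version in this gist.
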
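library(XML)
# Parse the corrected file into an internal XML document
doc <- xmlParse(file = 'test.xml')

# Review title: select the element by its exact class attribute
xpathSApply(doc, '//*[@class="BVRRValue BVRRReviewTitle"]', xmlValue)
[1] "Perfect size for the kids and durable"

# Review text: take the text content of the review text container
xpathSApply(doc, '//*[@class="BVRRReviewTextContainer"]', xmlValue)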
[1] "Bought this tablet for my kids after I purchased a no name brand and it 
     did not perform well at all. I have the 10.1, and absolutely 
     love it and so this 7\" was the perfect compliment to it. 
     Its an amazing tablet, easy to use, and durable for my 5 and 7 year old kids."
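
As a possible alternative to repairing the file by hand, the sketch below uses htmlParse() from the same XML package, which goes through a lenient HTML parser and can usually recover from missing closing tags. It also selects on the itemprop attributes visible in the question's snippet rather than on an exact class string. The inline `html` string is a shortened, hypothetical stand-in for the real page, and the `reviews` data frame is only an illustration; it assumes every review has exactly one title and one description, in the same document order.

```r
library(XML)

# Hypothetical, shortened stand-in for the review page.
# The second <div> is deliberately left unclosed, as in the question.
html <- '
<div class="BVRRReviewTitleContainer"><h2>
<span itemprop="name" class="BVRRValue BVRRReviewTitle">Perfect size for the kids and durable</span></h2></div>
<div class="BVRRReviewTextContainer">
<span itemprop="description" class="BVRRReviewText">Its an amazing tablet, easy to use.</span>
'

# htmlParse() tolerates the missing closing tag; asText = TRUE tells it
# that the argument is markup rather than a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Match on itemprop plus a class substring, so the Prefix/Suffix label
# spans (which carry no itemprop) are not picked up.
titles <- trimws(xpathSApply(
  doc, '//span[@itemprop="name" and contains(@class, "BVRRReviewTitle")]',
  xmlValue))
texts <- trimws(xpathSApply(
  doc, '//span[@itemprop="description" and contains(@class, "BVRRReviewText")]',
  xmlValue))

# Assumption: titles and texts have the same length and line up one-to-one.
reviews <- data.frame(title = titles, text = texts, stringsAsFactors = FALSE)
print(reviews)
```

For a saved page, the same calls should work with a file name in place of the string, for example htmlParse("reviews.html") with asText left at its default; that file name is hypothetical.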