Web scraping with R

Date: 2014-10-09 07:24:24

Tags: html xml r web-scraping

<div data-projects-path="/pt/projects" id="explore_results">
  <div class="results">
    <div class="project-box" itemscope="" itemtype="http://schema.org/CreativeWork">
      <meta content="2014-08-30" itemprop="dateCreated">
      <div class="image">
        <a href="/pt/ospassosdabia" target="" title="Os passos da Bia">
          <img alt="Project thumb bia" height="172" src="http://s3.amazonaws.com/cdn.catarse/uploads/project/uploaded_image/7229/project_thumb_Bia.png" width="220">
        </a>
    <div class="project-box" itemscope="" itemtype="http://schema.org/CreativeWork">
      <meta content="2014-09-19" itemprop="dateCreated">
      <div class="image">
        <a href="/pt/livrepartida" target="" title="Livre Partida">
          <img alt="Project thumb logo colorido" height="172" src="http://s3.amazonaws.com/cdn.catarse/uploads/project/uploaded_image/7613/project_thumb_logo_colorido.jpg" width="220">
        </a>

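This is a sample of the HTML I want to scrape with R. I only need the /pt/.... paths, such as /pt/livrepartida and /pt/ospassosdabia.

When I scroll down the page, more markup like this appears, with more of these "/pt/...." entries.

I want to get all of these "/pt/...." paths from the site. How can I do this?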

2 answers:

Answer 0 (score: 3)

You should provide better-formatted HTML than this truncated snippet. Fortunately, htmlParse can parse this kind of malformed markup.

library(XML)

dd <- htmlParse(your_text,asText=TRUE)

Then extract the href attributes:

xpathSApply(dd,'//a',xmlGetAttr,'href')
[1] "/pt/ospassosdabia"
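
Since the question asks only for the "/pt/...." paths, the same approach can be narrowed with an XPath predicate. The following is a minimal sketch, not part of the original answer: `html_text` is a hypothetical variable holding a shortened copy of the fragment from the question, and the extra `example.com` link is added purely to show that non-matching links are dropped.

```r
library(XML)

# Assumption: `html_text` holds the page source. Here it is a shortened,
# closed-off copy of the fragment from the question, plus one unrelated link.
html_text <- '<div data-projects-path="/pt/projects" id="explore_results">
  <div class="results">
    <div class="project-box">
      <div class="image">
        <a href="/pt/ospassosdabia" title="Os passos da Bia"></a>
      </div>
    </div>
    <div class="project-box">
      <div class="image">
        <a href="/pt/livrepartida" title="Livre Partida"></a>
      </div>
    </div>
    <a href="http://example.com/elsewhere">unrelated link</a>
  </div>
</div>'

doc <- htmlParse(html_text, asText = TRUE)

# Keep only <a> nodes whose href starts with "/pt/"
pt_links <- xpathSApply(doc, "//a[starts-with(@href, '/pt/')]",
                        xmlGetAttr, "href")

# Drop duplicates, since a project may be linked more than once on a page
pt_links <- unique(pt_links)
print(pt_links)
```

The predicate does the filtering inside the XPath query, so no separate grep step is needed afterwards. Note that this only sees links present in the HTML that was parsed; entries that the site loads while scrolling would have to be fetched separately.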

Answer 1 (score: 2)

Try

library(XML)

# Sample data: the HTML fragment from the question, read in as a character vector
lines <- readLines(textConnection('<div data-projects-path="/pt/projects"    id="explore_results">
 <div class="results">
 <div class="project-box" itemscope="" itemtype="http://schema.org/CreativeWork">
 <meta content="2014-08-30" itemprop="dateCreated">
 <div class="image">
 <a href="/pt/ospassosdabia" target="" title="Os passos da Bia">
<img alt="Project thumb bia" height="172" 
  src="http://s3.amazonaws.com/cdn.catarse/uploads/project/uploaded_image/7229/project_thumb_Bia.png"
  width="220">
  </a>'))

# Parse the truncated HTML and extract the href attribute of every <a> node
doc1 <- htmlParse(lines)
unname(xpathSApply(doc1, "//a/@href"))
#[1] "/pt/ospassosdabia"