<div data-projects-path="/pt/projects" id="explore_results">
<div class="results">
<div class="project-box" itemscope="" itemtype="http://schema.org/CreativeWork">
<meta content="2014-08-30" itemprop="dateCreated">
<div class="image">
<a href="/pt/ospassosdabia" target="" title="Os passos da Bia">
<img alt="Project thumb bia" height="172" src="http://s3.amazonaws.com/cdn.catarse/uploads/project/uploaded_image/7229/project_thumb_Bia.png" width="220">
</a>
<div class="project-box" itemscope="" itemtype="http://schema.org/CreativeWork">
<meta content="2014-09-19" itemprop="dateCreated">
<div class="image">
<a href="/pt/livrepartida" target="" title="Livre Partida">
<img alt="Project thumb logo colorido" height="172" src="http://s3.amazonaws.com/cdn.catarse/uploads/project/uploaded_image/7613/project_thumb_logo_colorido.jpg" width="220">
</a>
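This is sample HTML from the page I want to scrape with R. I only need the /pt/... paths, such as /pt/livrepartida and /pt/ospassosdabia.
As I scroll down the page, more blocks like this one are loaded, each containing another "/pt/..." path.
I want to collect all of these "/pt/..." paths from the site. How can I do that?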
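Answer 0 (score: 3)
You should provide better-formatted HTML than this truncated snippet. Fortunately, htmlParse can parse this kind of broken markup (here your_text holds the HTML as a single string).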
library(XML)
dd <- htmlParse(your_text,asText=TRUE)
Then extract the href attributes:
xpathSApply(dd,'//a',xmlGetAttr,'href')
[1] "/pt/ospassosdabia"
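If the page also contains links that are not projects, the same idea can be narrowed with a more specific XPath. The following is only a minimal sketch under stated assumptions: it assumes every project link sits inside a div with class "project-box" (as in the sample) and that project paths start with "/pt/". The variable name project_links is illustrative, and the markup is a shortened copy of the question's snippet, truncated in the same way.

```r
library(XML)

# Shortened copy of the sample markup from the question.
# The <div> tags are left unclosed on purpose, as in the original snippet.
your_text <- '<div data-projects-path="/pt/projects" id="explore_results">
<div class="results">
<div class="project-box">
<div class="image">
<a href="/pt/ospassosdabia" target="" title="Os passos da Bia"></a>
<div class="project-box">
<div class="image">
<a href="/pt/livrepartida" target="" title="Livre Partida"></a>'

# htmlParse recovers from the missing closing tags.
dd <- htmlParse(your_text, asText = TRUE)

# Assumption: project links live inside div.project-box and start with "/pt/".
project_links <- xpathSApply(
  dd,
  "//div[@class='project-box']//a[starts-with(@href, '/pt/')]",
  xmlGetAttr, "href"
)

# Drop duplicates in case the same project is linked more than once.
project_links <- unique(project_links)
print(project_links)
```

This only covers markup that is already present in the parsed text; blocks that the site loads while scrolling would have to be fetched separately before parsing.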
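Answer 1 (score: 2)
Try the following, where lines holds the HTML as a character vector (defined in the data block at the end):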
library(XML)
doc1 <- htmlParse(lines)
unname(xpathSApply(doc1, "//a/@href"))
#[1] "/pt/ospassosdabia"
# Data: the sample HTML read into a character vector, one element per line
lines <- readLines(textConnection('<div data-projects-path="/pt/projects" id="explore_results">
<div class="results">
<div class="project-box" itemscope="" itemtype="http://schema.org/CreativeWork">
<meta content="2014-08-30" itemprop="dateCreated">
<div class="image">
<a href="/pt/ospassosdabia" target="" title="Os passos da Bia">
<img alt="Project thumb bia" height="172"
src="http://s3.amazonaws.com/cdn.catarse/uploads/project/uploaded_image/7229/project_thumb_Bia.png"
width="220">
</a>'))