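I want to create a data.frame in R from the table found at http://netflixcanadavsusa.blogspot.ca/2013/11/alphabetical-list-k-4-am-fri-nov-22-2013.html#more
It consists of three columns. Each of the first two columns may or may not contain a flag image; the third column is text. An excerpt: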
<span class="listings">
<table>
<tr>
<td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td>
<td></td>
<td><b><a target="_blank" href="http://movies.netflix.com/WiMovie/70187567">1000 Ways to Die - Season 3</a> (2010)</b> <i style="font-size:small"> 3.6 stars, 1 Season <a target="_blank" href="http://www.imdb.com/search/title?title=1000 Ways to Die - Season 3">imdb</a></i>
</td>
</tr>
<tr>
<td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td>
<td><img class="flag" src="http://bit.ly/WXvnLp" /></td>
<td><b><a target="_blank" href="http://movies.netflix.com/WiMovie/100_Below_Zero/70273426?trkid=1889703">100 Below Zero</a> (2013)</b> <i style="font-size:small"> 2.8 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=100 Below Zero">imdb</a></i></td>
</tr>
</table>
</span>
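Here the first row has an image only in the first column, while the second row has an image in both. I can extract the text and the image URLs, but I cannot match them up in a way that accounts for the missing data. This is what I have done so far. The URL refers to the site above; I am only showing the results for the excerpt.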
library(XML)
myURL <- "http://netflixcanadavsusa.blogspot.ca/2013/11/alphabetical-list-k-4-am-fri-nov-22-2013.html#more"
basicInfo <- htmlParse(myURL, isURL = TRUE)
### text
df <- readHTMLTable(myURL,header=c("flag1","flag2","movie"), stringsAsFactors = FALSE)[[1]]
head(df,2)
# V1 V2 V3
# 1 1000 Ways to Die - Season 3 (2010) 3.6 stars, 1 Season imdb
# 2 100 Below Zero (2013) 2.8 stars, 1hr 28m imdb
### images
xpathSApply(basicInfo, "//*/span[@class='listings']/table/tr/td/img/@src")
# src src src
#"http://bit.ly/Y9CbVZ" "http://bit.ly/Y9CbVZ" "http://bit.ly/WXvnLp"
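So I have the images, but I do not know which row/column each one belongs to. In this problem each column can contain only one specific image, so knowing whether it occurs would be enough. A more general case might have different images from row to row.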
TIA
Answer 0 (score: 1)
Here is how I would do it. It is a bit long, but it works.
library(XML)
## myURL is the page address defined in the question
basicInfo <- htmlParse(myURL, isURL = TRUE, encoding = 'UTF-8')
## for some reason the data is split across two HTML elements
rows1 <- xpathSApply(basicInfo, "//*/span[@class='listings']/table/tr")
rows2 <- xpathSApply(basicInfo, "//*/span[@id='listings']/*/tr")
## for each row, create a small XML document and collect
## all of its td nodes
ll <- lapply(c(rows1, rows2), function(x) xpathSApply(xmlDoc(x), '//*/td'))
ull <- unlist(ll)
## function to extract the src attribute of the img tag inside a td;
## if the td does not contain an img it returns NA
parse.img <- function(x){
  res <- xpathSApply(xmlDoc(x), '//img', xmlGetAttr, 'src')
  if (length(res) == 0) NA_character_ else res[[1]]
}
## each row has three td nodes, so a recycled logical mask selects one column
col1 <- unlist(lapply(ull[c(TRUE, FALSE, FALSE)], parse.img))
col2 <- unlist(lapply(ull[c(FALSE, TRUE, FALSE)], parse.img))
## the third column contains text, so I use xmlValue to extract it
col3 <- unlist(lapply(ull[c(FALSE, FALSE, TRUE)],
                      function(x) xpathSApply(xmlDoc(x), '//td', xmlValue)))
res <- data.frame(col1,col2,col3)
head(res)
col1 col2 col3
1 http://bit.ly/Y9CbVZ <NA> 1000 Ways to Die - Season 3 (2010)  3.6 stars, 1 Season  imdb
2 http://bit.ly/Y9CbVZ <NA> 1000 Ways to Die - Season 3 (2010)  3.6 stars, 1 Season  imdb
3 http://bit.ly/Y9CbVZ http://bit.ly/WXvnLp 100 Below Zero (2013)  2.8 stars, 1hr 28m  imdb
4 http://bit.ly/Y9CbVZ http://bit.ly/WXvnLp 100 Ghost Street: The Return of Richard Speck (2012)  3 stars, 1hr 23m  imdb
5 <NA> http://bit.ly/WXvnLp 100 Million BC (2008)  2.8 stars, 1hr 25m  imdb
6 <NA> http://bit.ly/WXvnLp 100 Years Of Evil (2012)  2.7 stars, 1hr 19m  imdb
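As a possible variation on the approach above (a sketch, not part of the original answer), each row node can be queried with a relative XPath, so the cell position is stated explicitly instead of relying on every row contributing exactly three td nodes. The HTML string below is a trimmed copy of the excerpt from the question with the links and ratings removed, and `cell_img` is a hypothetical helper name introduced only for this illustration.

```r
library(XML)

# Trimmed copy of the excerpt from the question (an assumption for this sketch:
# links and ratings are removed so that only the cell structure remains).
html <- '<span class="listings"><table>
<tr>
<td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td>
<td></td>
<td><b>1000 Ways to Die - Season 3 (2010)</b></td>
</tr>
<tr>
<td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td>
<td><img class="flag" src="http://bit.ly/WXvnLp" /></td>
<td><b>100 Below Zero (2013)</b></td>
</tr>
</table></span>'

doc  <- htmlParse(html, asText = TRUE)
rows <- getNodeSet(doc, "//span[@class='listings']//tr")

# Hypothetical helper: src of the img inside the n-th td of a row, or NA
# when that cell holds no image.
cell_img <- function(row, n) {
  src <- xpathSApply(row, sprintf("./td[%d]/img", n), xmlGetAttr, "src")
  if (length(src) == 0) NA_character_ else src[[1]]
}

res <- data.frame(
  flag1 = vapply(rows, cell_img, character(1), n = 1),
  flag2 = vapply(rows, cell_img, character(1), n = 2),
  movie = vapply(rows, function(r) xpathSApply(r, "./td[3]", xmlValue),
                 character(1)),
  stringsAsFactors = FALSE
)

# Since each column can hold only one specific flag, presence/absence
# may be all that is needed.
res$has_flag1 <- !is.na(res$flag1)
res$has_flag2 <- !is.na(res$flag2)
print(res)
```

Because every value is looked up relative to its own row, a missing image stays as NA in the correct cell, which is the matching the question asks for. On the real page the same idea would presumably need both row selectors used in the answer above.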