Unable to select a specific HTML table using xpathSApply in R

Time: 2016-08-17 21:04:19

Tags: html r xpath web-scraping scrape

I am trying to scrape the second table from the following link: http://cepea.esalq.usp.br/frango/?page=379&Dias=15

I tried the following R code, using the XML package:

    p_frango_resfriado <- htmlTreeParse("http://cepea.esalq.usp.br/frango/?page=379&Dias=15", 
    useInternalNodes = TRUE, 
    encoding = "UTF-8")

    xpathSApply(p_frango_resfriado, "//table[@width = '95%']//tr//td[2]", xmlValue)
    xpathSApply(p_frango_resfriado, "//table[@width = '95%']//tr//td[3]", xmlValue)
    xpathSApply(p_frango_resfriado, "//table[@width = '95%']//tr//td[4]", xmlValue)

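The problem is that this code scrapes both HTML tables on the page at once, and I only want the second one. I also tried the code below, but it did not return anything useful:

    xpathSApply(p_frango_resfriado, 
    "//a[text() = 'Preços do frango resfriado CEPEA/ESALQ - Estado SP']/table[@width = '95%']", 
    xmlValue)

Can someone help me solve this problem? I do not know much about XPath or HTML.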

2 answers:

Answer 0: (score: 1)

Using XML::xmlToDataFrame with an XPath query

    library("httr")
    library("XML")
    URL <- "http://cepea.esalq.usp.br/frango/?page=379&Dias=15"
    temp <- tempfile(fileext = ".html")
    GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))

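The only difference between the two tables is the table name used in the XPath query. Define `xpexpr` for whichever table you want; the rest of this example uses Table 2.

Table 1: Preços do frango congelado CEPEA/ESALQ - Estado SP

    xpexpr <- "//center/a[contains(., 'do frango congelado')]/../table/tr/td/font/tr"

Table 2: Preços do frango resfriado CEPEA/ESALQ - Estado SP

    xpexpr <- "//center/a[contains(., 'do frango resfriado')]/../table/tr/td/font/tr"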

    doc <- htmlParse(temp)
    listofTableNodes <- getNodeSet(doc, xpexpr)
    length_nodes <- length(listofTableNodes)
    include_indices1 <- 1:(length_nodes - 2)

    # create dataframe using xmlvalues of the nodelist. Both `getNodeSet()` 
    # and `xpathSApply` will provide identical results.
    # using `getNodeSet()`
    df <- xmlToDataFrame(listofTableNodes[include_indices1], stringsAsFactors=FALSE)
    # using `xpathSApply`
    df <- xmlToDataFrame(xpathSApply(doc, xpexpr)[include_indices1], stringsAsFactors=FALSE)

    # clean data
    df$td <- as.Date(gsub("[Â ]\\s*", "", df$td), format = "%d/%m/%Y")
    df[, 4] <- gsub("\t$", '', df[, 4])

    # add column names
    xpexpr <- "//center/a[contains(., 'do frango resfriado')]/../table/tr/td/font/text()"
    # for Table-1
    # xpexpr <- "//center/a[contains(., 'do frango congelado')]/../table/tr/td/font/text()"
    listofTableNodes <- getNodeSet(doc, xpexpr)
    colnames(df) <- c('Date', sapply(listofTableNodes, xmlValue))
    df
    #            Date Valor R$ Var./dia Var./mês
    #   1  2016-08-17     4,37    0,46%     8,17%
    #   2  2016-08-16     4,35    0,46%     7,67%
    #   3  2016-08-15     4,33    0,46%     7,18%
    #   4  2016-08-12     4,31    0,00%     6,68%
    #   5  2016-08-11     4,31    0,70%     6,68%
    #   6  2016-08-10     4,28    0,47%     5,94%
    #   7  2016-08-09     4,26   -0,70%     5,45%
    #   8  2016-08-08     4,29    3,87%     6,19%
    #   9  2016-08-05     4,13    0,49%     2,23%
    #   10 2016-08-04     4,11    0,00%     1,73%
    #   11 2016-08-03     4,11    1,73%     1,73%
    #   12 2016-08-02     4,04    0,00%     0,00%
    #   13 2016-08-01     4,04    0,00%     0,00%
    #   14 2016-07-29     4,04    0,00%    -0,49%
    #   15 2016-07-28     4,04   -0,25%    -0,49%

Note: the values on this web page are updated every day; computing the row indices from `length_nodes` takes that into account.
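
The attempt in the question most likely returns nothing because `a/table` looks for a `table` nested inside the `a` element, whereas the queries above first step up to the shared parent with `/..`. A minimal sketch on a toy document shows the difference, with the `following-sibling` axis as an alternative. The markup and the first table's value are hypothetical simplifications, not the real page:

```r
library(XML)

# Hypothetical, simplified markup: each caption link is followed by a sibling
# table inside its own <center> element. The real page is more deeply nested,
# and the cell values here are placeholders.
html <- paste0(
  "<html><body>",
  "<center><a>frango congelado</a>",
  "<table width='95%'><tr><td>01/08/2016</td><td>3,90</td></tr></table></center>",
  "<center><a>frango resfriado</a>",
  "<table width='95%'><tr><td>01/08/2016</td><td>4,04</td></tr></table></center>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Child step: looks for a <table> inside the <a>, so nothing matches.
as_child <- xpathSApply(doc, "//a[contains(., 'resfriado')]/table//td[2]", xmlValue)

# Parent step: go up to <center>, then down into its <table>.
via_parent <- xpathSApply(doc, "//a[contains(., 'resfriado')]/../table//td[2]", xmlValue)

# Sibling axis: the first <table> that follows the <a> under the same parent.
via_sibling <- xpathSApply(
  doc, "//a[contains(., 'resfriado')]/following-sibling::table[1]//td[2]", xmlValue
)

length(as_child)  # 0
via_parent        # "4,04"
via_sibling       # "4,04"
```

Anchoring on the caption text rather than on the table's position means the query does not depend on how many other tables the page contains.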

Using XML::readHTMLTable without an XPath query

    library("httr")
    library("XML")
    URL <- "http://cepea.esalq.usp.br/frango/?page=379&Dias=15"
    temp <- tempfile(fileext = ".html")
    GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))
    # `which = 8` picks the eighth <table> node in the document
    df <- readHTMLTable(temp, stringsAsFactors = FALSE, which = 8)
    # Table 1
    df[4:18,]
    # Table 2
    df[28:42,]
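
The index passed to `which` depends on how many `<table>` elements come before the one you want, so it has to be found by inspecting the page. As a small sketch of how `readHTMLTable` numbers tables in document order, here is a toy document (hypothetical markup and placeholder values, not the real page):

```r
library(XML)

# Toy document with two plain tables; the values are placeholders.
html <- paste0(
  "<html><body>",
  "<table><tr><td>01/08/2016</td><td>3,90</td></tr></table>",
  "<table><tr><td>02/08/2016</td><td>4,04</td></tr>",
  "<tr><td>01/08/2016</td><td>4,04</td></tr></table>",
  "</body></html>"
)
doc <- htmlParse(html, asText = TRUE)

# Without `which`, every <table> is returned, as a list in document order.
all_tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)
length(all_tables)

# With a single `which`, only that table comes back, as one data frame.
second <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE, which = 2)
dim(second)
second[1, 2]
```

Hard-coded indices and row ranges such as `which = 8` or `df[28:42,]` are fragile: they will need adjusting if the page layout or the number of listed days changes.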

Using XML::readHTMLTable with an XPath query

    library("httr")
    library("XML")
    URL <- "http://cepea.esalq.usp.br/frango/?page=379&Dias=15"
    temp <- tempfile(fileext = ".html")
    GET(url = URL, user_agent("Mozilla/5.0"), write_disk(temp))
    doc <- htmlParse(temp)

    # XPath Query
    # Table 1
    xpexpr <- "//center/a[contains(., 'do frango congelado')]/../table/tr/td/font"
    df <- xpathSApply(doc, xpexpr, readHTMLTable)
    include_indices <- 1:(nrow(df[[4]]) -4)
    df <- df[[4]][include_indices,]

    # Table 2
    xpexpr <- "//center/a[contains(., 'do frango resfriado')]/../table/tr/td/font"
    df <- xpathSApply(doc, xpexpr, readHTMLTable)
    include_indices <- 1:(nrow(df[[4]]) -4)
    df <- df[[4]][include_indices,]
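
The scraped cells arrive as text in Brazilian number format (decimal commas, trailing percent signs), so the columns need converting before any arithmetic. A minimal base-R sketch, using two rows from the output shown earlier; `to_number` is a hypothetical helper name introduced here, not part of the XML package:

```r
# Convert Brazilian-formatted text such as "4,37" or "-0,70%" to a number.
# `to_number` is a hypothetical helper, not part of any package.
to_number <- function(x) {
  x <- gsub("[%[:space:]]", "", x)            # drop percent signs and whitespace
  x <- gsub(".", "", x, fixed = TRUE)         # drop thousands separators, if any
  as.numeric(sub(",", ".", x, fixed = TRUE))  # decimal comma -> decimal point
}

# Two rows copied from the printed result above, still as raw text.
raw <- data.frame(
  Date  = c("17/08/2016", "09/08/2016"),
  Valor = c("4,37", "4,26"),
  Var   = c("0,46%", "-0,70%"),
  stringsAsFactors = FALSE
)

clean <- data.frame(
  Date  = as.Date(raw$Date, format = "%d/%m/%Y"),
  Valor = to_number(raw$Valor),
  Var   = to_number(raw$Var)  # still in percent units: 0.46 means 0.46%
)
clean
```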

Answer 1: (score: 1)

It should work now, but I wonder whether it will keep working if you run it every day.
