Posting a form with RCurl

Time: 2014-12-06 14:09:26

Tags: r curl rcurl

I want to post a form and use the data it returns.

The page I want to get data from is: http://www.bigpara.com/analiz/mali-tablolar/

assetscrap <- function(sirket){
  a <- postForm("http://www.bigpara.com/analiz/mali-tablolar/",
    Yil = "2013", Donem = "4", Kur = "TL", Cins = "1", Submit = "Getir",
    HisseKod = sirket);
  a <- htmlParse(a);
  span <- xpathSApply(a, "//div[@class='maliTable']//li//span", xmlValue);
  small <- xpathSApply(a, "//div[@class='maliTable']//li//small", xmlValue);
  small <- gsub("[.]","",small);
  small <- as.numeric(small);
  cikti <- data.table(span, small);
  cikti <- cikti[cikti$span == "AKTİF TOPLAMI" | cikti$span == "A K T İ F T O P L A M I"];
  cikti <- cikti[order(-small)];
  cikti <- cikti[1,]$small;
}

For example, when I run assetscrap("FROTO"), it returns

* About to connect() to www.bigpara.com port 80 (#0)
*   Trying 83.66.15.71... * connected
* Connected to www.bigpara.com (83.66.15.71) port 80 (#0)
> POST /analiz/mali-tablolar/ HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36
Host: www.bigpara.com
Accept: */*
Referer: http://www.bigpara.com/analiz/mali-tablolar/
Content-Length: 627
Expect: 100-continue
Content-Type: multipart/form-data; boundary=----------------------------b1006fa82edf

< HTTP/1.1 100 Continue

< HTTP/1.1 200 OK
< Cache-Control: private
< Content-Length: 182029
< Content-Type: text/html; Charset=UTF-8
< Server: Microsoft-IIS/7.5
< Set-Cookie: ASPSESSIONIDCCTSBQAT=HOOCGCIBDPNEJMFGGFGGHNPM; path=/
< X-Powered-By: ASP.NET
< Date: Sat, 06 Dec 2014 14:00:12 GMT
< Set-Cookie: NSC_cjhqbsb_iuuq_WJQ=ffffffff504a9f5645525d5f4f58455e445a4a42367f;Version=1;path=/;httponly

< 
* Connection #0 to host www.bigpara.com left intact

What am I overlooking? I think I did everything correctly, but the server does not seem to answer my request.

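2 answers:

Answer 0: (score: 1)

Why do you say the server is not responding? You get status 200 (OK) and a response of roughly 182,000 bytes.

The POST request is fine. Your problem is in this line: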

cikti <- cikti[cikti$span == "AKTİF TOPLAMI" | cikti$span == "A K T İ F T O P L A M I"];

which returns 0 rows. There are several errors here:

First, the text in the span column has mixed encodings:

head(Encoding(span),20)
#  [1] "UTF-8"   "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "unknown" "UTF-8"   "UTF-8"  
#  [9] "UTF-8"   "UTF-8"   "UTF-8"   "unknown" "UTF-8"   "UTF-8"   "UTF-8"   "unknown"
# [17] "UTF-8"   "UTF-8"   "UTF-8"   "unknown"

You can fix this by running

span  <- iconv(span,from="UTF-8",to="")

immediately after extracting the span strings.
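
To see what the encoding check and the conversion do without calling the site, here is a minimal sketch on a hypothetical two-element vector (the labels are made up for illustration). It assumes nothing beyond base R; what `iconv()` returns depends on your locale, so no output is shown.

```r
# Hypothetical labels: one contains a non-ASCII character, one is plain ASCII.
labels <- c("AKT\u0130F TOPLAMI", "TOPLAM")   # "\u0130" is the letter İ

# Inspect the declared encoding of each element; pure-ASCII strings
# are always reported as "unknown".
Encoding(labels)

# enc2utf8() re-encodes every element as UTF-8 where needed.
labels_utf8 <- enc2utf8(labels)

# iconv() with to = "" converts to the encoding of the current locale,
# which is what the fix above relies on.
labels_native <- iconv(labels, from = "UTF-8", to = "")
```

If the comparison still fails after conversion, printing `Encoding()` on both sides of the `==` is a quick way to see whether the two strings are declared differently.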

Second, your second condition, cikti$span == "A K T İ F T O P L A M I", matches nothing in cikti. On the page there are 3 spaces between the two words, i.e. "A K T İ F   T O P L A M I".
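
One way to make the match independent of spacing is to strip all whitespace from both sides before comparing. This is only a sketch: `normalize_label()` is a hypothetical helper, not part of the original code, and it assumes the labels differ only in ordinary whitespace.

```r
# Remove every whitespace character so spacing differences no longer matter.
# normalize_label() is a hypothetical helper introduced for this example.
normalize_label <- function(x) gsub("[[:space:]]+", "", x)

spaced  <- "A K T \u0130 F   T O P L A M I"   # "\u0130" is the letter İ
compact <- "AKT\u0130F TOPLAMI"

# The raw strings differ, but their normalized forms compare equal.
same_label <- normalize_label(spaced) == normalize_label(compact)
```

With this, a single condition such as `normalize_label(span) == normalize_label("AKTİF TOPLAMI")` could stand in for both spellings.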

Third, a data.table should not be indexed like a plain data frame. Writing the filter this way is very bad practice:

cikti <- cikti[cikti$span == "AKTİF TOPLAMI" ...]

Use this instead:

cikti <- cikti[span == "AKTİF TOPLAMI" ...]
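
A small offline sketch of the data.table idiom, using made-up rows (the numbers are not from the site) and the same column names as the question:

```r
library(data.table)

# Made-up rows for illustration only.
cikti <- data.table(
  span  = c("AKT\u0130F TOPLAMI", "PAS\u0130F TOPLAMI", "AKT\u0130F TOPLAMI"),
  small = c(100, 250, 300)
)

# Inside [ ], column names are visible directly, so no cikti$ prefix is needed.
aktif <- cikti[span == "AKT\u0130F TOPLAMI"]

# Chain: sort descending by small, then pull the first value as a plain number.
largest <- aktif[order(-small)][1, small]
```

The chained form replaces the three separate reassignments of `cikti` and leaves the original table untouched.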

Putting it all together, this code works (on my system...).

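library(RCurl)       # postForm()
library(XML)         # htmlParse(), xpathSApply()
library(data.table)

sirket <- "FROTO"    # ticker used in the question

a <- postForm("http://www.bigpara.com/analiz/mali-tablolar/",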
              Yil = "2013", Donem = "4", Kur = "TL", Cins = "1", Submit = "Getir",
              HisseKod = sirket)
a <- htmlParse(a)
span  <- xpathSApply(a, "//div[@class='maliTable']//li//span", xmlValue)
span  <- iconv(span,from="UTF-8",to="")  
small <- xpathSApply(a, "//div[@class='maliTable']//li//small", xmlValue)
small <- gsub("[.]","",small)
small <- as.numeric(small)
cikti <- data.table(span, small)
cikti <- cikti[span == "AKTİF TOPLAMI" | span == "A K T İ F   T O P L A M I"] 
cikti <- cikti[order(-small)]                             
cikti <- cikti[1,]$small
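
To try the parsing step without hitting the server, the same XPath expressions can be run against a hand-written HTML fragment. The markup below is an assumption inferred from the XPath in the question, not copied from the site, and the labels are written in plain ASCII to keep encoding out of the picture.

```r
library(XML)

# Hand-written fragment; its structure is an assumption based on the XPath.
page <- paste0(
  "<html><body><div class='maliTable'><ul>",
  "<li><span>AKTIF TOPLAMI</span><small>1.234.567</small></li>",
  "<li><span>PASIF TOPLAMI</span><small>987.654</small></li>",
  "</ul></div></body></html>"
)

doc   <- htmlParse(page, asText = TRUE)
span  <- xpathSApply(doc, "//div[@class='maliTable']//li//span", xmlValue)
small <- xpathSApply(doc, "//div[@class='maliTable']//li//small", xmlValue)

# Dots are thousands separators here, so strip them before converting.
small <- as.numeric(gsub("[.]", "", small))
```

Checking `length(span) == length(small)` at this point is a cheap guard before the two vectors are bound into one table.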

Answer 1: (score: 0)

If you don't want to mess with encodings, httr and rvest handle them for you automatically:

library(httr)    # POST()
library(rvest)   # HTML helpers and the %>% pipe

res <- POST("http://www.bigpara.com/analiz/mali-tablolar/",
  body = list(
    Yil = "2013", 
    Donem = "4", 
    Kur = "TL", 
    Cins = "1", 
    HisseKod = "FENER"
  ),
  encode = "form"
)

# read_html() replaces the older html(), which newer rvest versions no longer provide
mali_table <- read_html(res) %>% html_nodes("div.maliTable li")

span <- mali_table %>% html_nodes("span") %>% html_text()

small <- mali_table %>% 
  html_nodes("small") %>% 
  html_text() %>%
  gsub("\\.", "", .) %>%
  as.numeric()

selected <- span %in% c("AKTİF TOPLAMI", "A K T İ F   T O P L A M I")

data.frame(
  span = span[selected],
  small = small[selected]
)
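
Both answers turn the figures into numbers by deleting dots before calling as.numeric(). A slightly more defensive sketch is shown below. `parse_tr_number()` is a hypothetical helper, and it assumes the dot is a thousands separator and a comma, if present, is the decimal mark. The original code only strips dots.

```r
# Hypothetical helper: convert strings such as "1.234.567" to numbers.
# Assumes "." groups thousands and "," marks decimals.
parse_tr_number <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)   # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)   # decimal comma -> decimal point
  as.numeric(x)
}

parsed <- parse_tr_number(c("1.234.567", "12,5", "-3.400"))
```

Using `fixed = TRUE` avoids having to escape the dot, which is a regular-expression metacharacter.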