I'm doing some web scraping.
Here is the code I'm using.
I've added some notes in the code comments.
library(httr)
library(rvest)
library(stringr)

# Bulletin board URL
List.of.questions.url <- 'http://kin.naver.com/qna/list.nhn?m=noanswer&dirId=70108'

# Vector to store the scraped answer text
answers <- c()

# Get the posts from page 1 to page 2.
for (i in 1:2) {
  url <- modify_url(List.of.questions.url, query = list(page = i))
  # I specify the encoding here, but I still get an error.
  list_page <- read_html(url, encoding = 'utf-8')

  # Get the URL of each post.
  # TLS = title.links, CLS = content.links
  TLS <- html_nodes(list_page, '.basic1 dt a')
  CLS <- html_attr(TLS, 'href')
  CLS <- paste0("http://kin.naver.com", CLS)

  # Get the required content from each post.
  for (link in CLS) {
    h <- read_html(link)
    # answer
    answer <- html_text(html_nodes(h, '#contents_layer_1'))
    # I thought this would fix the encoding, but I still get an error.
    answer <- str_trim(repair_encoding(answer))
    answers <- c(answers, answer)
    print(link)
  }
}
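However, the following error occurs while scraping.
It may be related to encoding.
(But as I noted in the code comments, I believe I handled the encoding correctly.)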
[1] "http://kin.naver.com/qna/detail.nhn?d1id=7&dirId=70111&docId=280474910"
Error: No guess has more than 50% confidence
In addition: There were 43 warnings (use warnings() to see them)
> warnings()
1: In stringi::stri_conv(x, from = from) :
the Unicode codepoint \U000000a0 cannot be converted to destination encoding
2: In stringi::stri_conv(x, from = from) :
the Unicode codepoint \U000000a0 cannot be converted to destination encoding
3: In stringi::stri_conv(x, from = from) :
the Unicode codepoint \U000000a0 cannot be converted to destination encoding
4: In stringi::stri_conv(x, from = from) :
the Unicode codepoint \U000000a0 cannot be converted to destination encoding
5: In stringi::stri_conv(x, from = from) :
# (the remaining warnings are all identical, so they are omitted)
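The code point named in these warnings, U+00A0, is the non-breaking space, which is common in HTML (`&nbsp;`). Below is a minimal sketch of one thing to try, assuming the text returned by `html_text()` is already valid UTF-8 so that no encoding guess is needed: replace the non-breaking spaces with ordinary spaces and trim, instead of calling `repair_encoding()`. The helper name `clean_text` and the sample string are hypothetical and for illustration only; whether this resolves the error on the actual pages is untested.

```r
library(stringr)

# Hypothetical helper (not part of rvest or stringr):
# normalise text that is assumed to already be UTF-8.
clean_text <- function(x) {
  # Replace each non-breaking space (U+00A0) with an ordinary space.
  x <- str_replace_all(x, "\u00a0", " ")
  # Strip leading and trailing whitespace.
  str_trim(x)
}

# Made-up sample input containing non-breaking spaces.
sample_text <- "\u00a0 hello\u00a0world \u00a0"
print(clean_text(sample_text))
```

In the loop above, this would stand in for the `str_trim(repair_encoding(answer))` call.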
How can I fix this?
Thanks in advance for any advice.