I am parsing a web page and ran into the following error:
Error: No guess has more than 50% confidence
In addition: There were 45 warnings (use warnings() to see them)
1: In stringi::stri_conv(x, from = from) :
the Unicode codepoint \U000000a0 cannot be converted to destination encoding
(45 are omitted because they are all the same)
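For reference, \U000000a0 is the Unicode non-breaking space (NBSP), which has no equivalent in some single-byte encodings. A minimal sketch of a workaround, assuming the stringi package is installed (rvest already depends on it), is to replace each NBSP with an ordinary space before any conversion; the variable name txt is just an illustration:

```r
library(stringi)

# Hypothetical example string containing a non-breaking space (\u00a0)
txt <- "price:\u00a0100"

# Replace every NBSP with a regular space so later conversions succeed
txt <- stri_replace_all_fixed(txt, "\u00a0", " ")
```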
I ran "guess_encoding()" and it showed the encodings below.
I expected "UTF-8" to be the only candidate, but several others appear.
I am currently working with "UTF-8" only.
> guess_encoding(title)
      encoding language confidence
1        UTF-8                1.00
2 windows-1255       he       0.13
3 windows-1255       he       0.13
4     UTF-16BE                0.10
5     UTF-16LE                0.10
> guess_encoding(titles)
      encoding language confidence
1        UTF-8                1.00
2     UTF-16BE                0.10
3     UTF-16LE                0.10
4 windows-1255       he       0.05
5 windows-1255       he       0.04
6   IBM420_ltr       ar       0.03
7   IBM420_rtl       ar       0.02
> guess_encoding(content)
      encoding language confidence
1        UTF-8                1.00
2     UTF-16BE                0.10
3     UTF-16LE                0.10
4 windows-1255       he       0.04
5 windows-1255       he       0.04
6   IBM420_ltr       ar       0.02
7   IBM420_rtl       ar       0.01
> guess_encoding(contents)
      encoding language confidence
1        UTF-8                1.00
2     UTF-16BE                0.10
3     UTF-16LE                0.10
4 windows-1255       he       0.04
5 windows-1255       he       0.04
6   IBM420_ltr       ar       0.04
7   IBM420_rtl       ar       0.03
> guess_encoding(ans)
      encoding language confidence
1        UTF-8                0.15
> guess_encoding(anss)
      encoding language confidence
1        UTF-8                1.00
2     UTF-16BE                0.10
3     UTF-16LE                0.10
4 windows-1255       he       0.10
5 windows-1255       he       0.07
6   IBM420_rtl       ar       0.02
7   IBM420_ltr       ar       0.02
Here is the code I wrote. Is my encoding wrong?
library(httr)
library(rvest)
library(stringr)

# Bulletin URL
list.url = 'http://kin.naver.com/qna/list.nhn?m=expertAnswer&dirId=70111'

# Vectors to store the titles, bodies, and answers
titles   = c()  # subjects of the questions
contents = c()  # bodies of the questions
anss     = c()  # answers to the questions

# Crawl bulletin pages 1 to 10
for (i in 1:10) {
  url = modify_url(list.url, query = list(page = i))  # change the page in the bulletin URL
  h.list = read_html(url, encoding = 'UTF-8')         # read and parse the list page

  # Extract the post links
  title.link1 = html_nodes(h.list, '.title')          # nodes with class "title"
  title.links = html_nodes(title.link1, 'a')          # <a> tags inside title.link1
  article.links = html_attr(title.links, 'href')      # extract the href attribute
  article.links = paste0("http://kin.naver.com", article.links)

  for (link in article.links) {
    h = read_html(link)  # fetch the post

    # title
    title = html_text(html_nodes(h, '.end_question._end_wrap_box h3'))
    title = str_trim(repair_encoding(title))
    titles = c(titles, title)

    # content
    content = html_nodes(h, '.end_question .end_content._endContents')
    ## mobile question content
    no.content = html_text(html_nodes(content, '.end_ext2'))
    content = repair_encoding(html_text(content))
    ## strip the mobile question content if present
    if (length(no.content) > 0) {
      content = str_replace(content, repair_encoding(no.content), '')
    }
    content = str_trim(content)
    contents = c(contents, content)

    # answer
    ans = html_text(html_nodes(h, '#contents_layer_1'))
    ans = str_trim(repair_encoding(ans))
    anss = c(anss, ans)

    print(link)
  }
}

# save
result = data.frame(titles, contents, anss)
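Not a definitive answer, but a sketch under one assumption: since guess_encoding() reports UTF-8 with confidence 1.00 for most of the vectors, the pages are probably served as UTF-8 already. In that case, declaring the encoding when reading each post, rather than detecting it afterwards with repair_encoding(), may avoid the "No guess has more than 50% confidence" failure (here link stands for one element of article.links from the loop above):

```r
h <- read_html(link, encoding = 'UTF-8')  # declare the encoding up front

# With the encoding declared, the repair_encoding() calls can be dropped
title <- str_trim(html_text(html_nodes(h, '.end_question._end_wrap_box h3')))
```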