I want my text to be encoded as UTF-8.
However, guess_encoding() returns 7 candidate encodings.
Should I treat all of it as UTF-8?
> guess_encoding(text)
  encoding     language confidence
1 UTF-8                       0.15
> guess_encoding(text)
  encoding     language confidence
1 UTF-8                       1.00
2 UTF-16BE                    0.10
3 UTF-16LE                    0.10
4 windows-1255 he             0.07
5 windows-1255 he             0.06
6 IBM420_ltr   ar             0.04
7 IBM420_rtl   ar             0.02
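One check that does not depend on the guessed confidences is to test directly whether the bytes are valid UTF-8. This is only a minimal sketch in base R using validUTF8(); the two sample strings are made up for illustration and are not taken from the crawled pages.

```r
# Hypothetical sample strings, not taken from the crawled pages.
good <- "\uc548\ub155"   # Korean text stored as UTF-8
bad  <- "caf\xe9"        # contains a lone Latin-1 byte, which is not valid UTF-8

# validUTF8() reports, element by element, whether the bytes form valid UTF-8.
is_utf8 <- validUTF8(c(good, bad))
print(is_utf8)
```

If every element comes back TRUE, the text can already be read as UTF-8 and the low-confidence rows are probably just noise from the detector.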
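Does the output of guess_encoding() describe how the text is actually encoded?
Can repair_encoding() not be used to fix the encoding?
I wrote the code below, but it does not seem to work properly. Should I use iconv() instead?
Do I have to convert the text 6 times, once for each of the other candidates?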
iconv(text, from="UTF-16BE", to="UTF-8")
iconv(text, from="UTF-16LE", to="UTF-8")
iconv(text, from="windows-1255", to="UTF-8")
# (remaining conversions omitted)
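For context on calls like these: iconv() converts from one known source encoding, so it is normally called once with the correct `from` value, not once per candidate. A hedged sketch in base R follows; Latin-1 is only a stand-in here, since the real encoding of the pages is not established in this post.

```r
# Stand-in input: the single byte 0xE9 is "é" in Latin-1 (assumed for this demo only).
raw_text <- "caf\xe9"

# One conversion with the correct source encoding yields valid UTF-8.
fixed <- iconv(raw_text, from = "latin1", to = "UTF-8")

# Declaring the wrong source encoding does not repair anything: a lone 0xE9 is
# not valid UTF-8, so iconv() typically returns NA here (platform-dependent).
wrong <- iconv(raw_text, from = "UTF-8", to = "UTF-8")

print(validUTF8(fixed))
print(wrong)
```

So chaining several conversions with different `from` values would not help; only the one matching the real source encoding gives usable text.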
The question may be hard to follow on its own, so the complete code is posted below for reference.
library(httr)
library(rvest)
library(stringr)

# Bulletin board URL
list.url = 'http://kin.naver.com/qna/list.nhn?m=expertAnswer&dirId=70111'

# Vector to store the answer text of every post
texts = c()

# Crawl pages 1 to 10 of the board
for(i in 1:10){
  url = modify_url(list.url, query=list(page=i)) # set the page number in the board URL
  h.list = read_html(url, encoding = 'UTF-8') # read the HTML of the post list

  # Extract the post links
  title.link1 = html_nodes(h.list, '.title') # nodes with class "title"
  title.links = html_nodes(title.link1, 'a') # <a> elements inside those nodes
  article.links = html_attr(title.links, 'href') # extract the href attribute
  article.links = paste0("http://kin.naver.com",article.links)

  for(link in article.links){
    h = read_html(link) # get the post

    # Answer text
    text = html_text(html_nodes(h, '#contents_layer_1'))
    text = str_trim(repair_encoding(text)) # repair and trim this post's text
    texts = c(texts,text) # append to the accumulated results
    print(link)
  }
}
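To check the accumulation pattern of the inner loop without hitting the network, here is a small sketch with a stand-in for the scraping step. `fetch_answer()` and the example links are hypothetical; they are not part of rvest or of the site.

```r
# Hypothetical stand-in for read_html() + html_text(); returns padded text.
fetch_answer <- function(link) {
  paste0("  answer from ", link, "  ")
}

# Made-up links for illustration only.
links <- c("/qna/a", "/qna/b", "/qna/c")

# The accumulator and the per-post value use different names so they cannot be mixed up.
texts <- character(0)
for (link in links) {
  text  <- trimws(fetch_answer(link))  # clean the single answer
  texts <- c(texts, text)              # append it to the accumulator
}

print(length(texts))
print(texts[1])
```

The point of the sketch is only the naming: the per-post value is cleaned first and then appended to a vector that was initialised before the loop.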