How do I unify multiple encodings into one?

Time: 2017-07-17 10:13:36

Tags: r encoding utf-8 web-scraping utf-16

I want to use UTF-8 encoding.

However, guess_encoding() returns seven candidate encodings.

Should I convert all of them to UTF-8?

>guess_encoding(text)
  encoding language confidence
1    UTF-8                0.15

>guess_encoding(text)
  encoding language confidence
1        UTF-8                1.00
2     UTF-16BE                0.10
3     UTF-16LE                0.10
4 windows-1255       he       0.07
5 windows-1255       he       0.06
6   IBM420_ltr       ar       0.04
7   IBM420_rtl       ar       0.02 

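Does the output of guess_encoding() describe the structure of the encoding?

Can the text not be re-encoded with repair_encoding()?

I wrote this code, but it does not seem to work properly. Should I use iconv() instead?

Do I have to convert the text six times?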

iconv(text, from="UTF-16BE", to="UTF-8")
iconv(text, from="UTF-16LE", to="UTF-8")
iconv(text, from="windows-1255", to="UTF-8")
#Omitted below
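
One possible reading of the table above, offered as a sketch and not as a confirmed answer: the rows are ranked guesses for the same text, so only the highest-confidence row would be used and the text converted once. The `candidates` data frame and `raw_text` below are made-up stand-ins for the guess_encoding() output and a scraped string; only base R functions are used.

```r
# Hypothetical candidate table, shaped like the guess_encoding() output above
candidates <- data.frame(
  encoding   = c("UTF-8", "UTF-16BE", "UTF-16LE", "windows-1255"),
  confidence = c(1.00, 0.10, 0.10, 0.07),
  stringsAsFactors = FALSE
)

# Keep only the single most likely encoding
best <- candidates$encoding[which.max(candidates$confidence)]

# Made-up sample standing in for one scraped string (assumption)
raw_text <- "\uc548\ub155\ud558\uc138\uc694"

# Convert once, and only when the best guess is not already UTF-8
utf8_text <- if (best == "UTF-8") {
  enc2utf8(raw_text)
} else {
  iconv(raw_text, from = best, to = "UTF-8")
}
```

Under this reading, a single conversion from the top guess replaces the six separate iconv() calls.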

The question may be hard to understand, so the entire code is posted below for reference.

library(httr)
library(rvest)
library(stringr)


# Bulletin URL
list.url = 'http://kin.naver.com/qna/list.nhn?m=expertAnswer&dirId=70111'

# Vector to store the answer text of every post
texts = c()

# Crawl bulletin pages 1 to 10
for(i in 1:10){
  url = modify_url(list.url, query=list(page=i))  # Set the page number in the bulletin URL
  h.list = read_html(url, encoding = 'UTF-8')  # Read the HTML of the post list from the URL

  # Post link extraction
  title.link1 = html_nodes(h.list, '.title')  # Nodes with class "title"
  title.links = html_nodes(title.link1, 'a')  # <a> elements inside those nodes

  article.links = html_attr(title.links, 'href')
  article.links = paste0("http://kin.naver.com",article.links)

  # Visit each post and extract its answer
  for(link in article.links){
    h = read_html(link)  # Get the post

    # Answer
    text = html_text(html_nodes(h, '#contents_layer_1'))
    text = str_trim(repair_encoding(text))
    texts = c(texts, text)

    print(link)

  }
}
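
To see whether the collected strings actually need any conversion, one option is to test each of them for UTF-8 validity. The following is a minimal sketch using base R only; the `texts` values here are made-up stand-ins for scraped strings, not output from the crawler above.

```r
# Made-up stand-ins for scraped strings; the third holds bytes that are not valid UTF-8
texts <- c("plain ascii", "\uc9c8\ubb38", "\xff\xfe broken")

# validUTF8() is base R (3.3.0 or later) and returns one logical per string
ok <- validUTF8(texts)

# Positions of the strings that would still need conversion
bad_index <- which(!ok)
```

Only the strings listed in `bad_index` would then be candidates for iconv() or repair_encoding().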

0 answers:

No answers