Is my encoding incorrect? There are various encodings listed. How should I handle the encoding?

Time: 2017-07-16 17:08:13

Tags: r unicode encoding utf-8 web-scraping

I am scraping a web page.

I got the following error:

Error: No guess has more than 50% confidence
In addition: There were 45 warnings (use warnings() to see them)
1: In stringi::stri_conv(x, from = from) :
  the Unicode codepoint \U000000a0 cannot be converted to destination encoding
(45 are omitted because they are all the same)
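In case it helps: \U000000a0 is the non-breaking space. Below is a minimal sketch (my own assumption about the cause, not part of the original code) that replaces it with an ordinary space before any re-encoding, so that converting to a non-UTF-8 native encoding has nothing to reject; x is just a stand-in for one of the scraped character vectors.

library(stringr)

# Assumption: the warnings come from U+00A0 (non-breaking space) being
# converted to a non-UTF-8 native encoding. Replacing it with a plain
# space before conversion avoids the warning. `x` is a stand-in vector.
x <- "example\u00a0text"
x <- str_replace_all(x, "\u00a0", " ")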

I ran guess_encoding() and it showed the encodings below.

I expected "UTF-8" to be the only one, but several are listed.

I am currently using only "UTF-8".

> guess_encoding(title)
       encoding language confidence
1        UTF-8                1.00
2 windows-1255       he       0.13
3 windows-1255       he       0.13
4     UTF-16BE                0.10
5     UTF-16LE                0.10
> guess_encoding(titles)
       encoding language confidence
1        UTF-8                1.00
2     UTF-16BE                0.10
3     UTF-16LE                0.10
4 windows-1255       he       0.05
5 windows-1255       he       0.04
6   IBM420_ltr       ar       0.03
7   IBM420_rtl       ar       0.02
> guess_encoding(content)
      encoding language confidence
1        UTF-8                1.00
2     UTF-16BE                0.10
3     UTF-16LE                0.10
4 windows-1255       he       0.04
5 windows-1255       he       0.04
6   IBM420_ltr       ar       0.02
7   IBM420_rtl       ar       0.01
> guess_encoding(contents)
      encoding language confidence
1        UTF-8                1.00
2     UTF-16BE                0.10
3     UTF-16LE                0.10
4 windows-1255       he       0.04
5 windows-1255       he       0.04
6   IBM420_ltr       ar       0.04
7   IBM420_rtl       ar       0.03
> guess_encoding(ans)
      encoding language confidence
1        UTF-8                0.15

> guess_encoding(anss)
      encoding language confidence
1        UTF-8                1.00
2     UTF-16BE                0.10
3     UTF-16LE                0.10
4 windows-1255       he       0.10
5 windows-1255       he       0.07
6   IBM420_rtl       ar       0.02
7   IBM420_ltr       ar       0.02
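One thing I notice in this output: for ans the best guess (UTF-8) only reaches 0.15 confidence, which would explain the "No guess has more than 50% confidence" error, since repair_encoding() appears to guess the encoding only when from is not supplied. A small sketch of stating the source encoding explicitly (an untested assumption on my part, given that the page is read as UTF-8):

# Sketch: pass the source encoding explicitly so repair_encoding() does
# not have to guess it from the low-confidence `ans` vector.
ans = repair_encoding(ans, from = 'UTF-8')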

Here is the code I wrote.

Is my encoding wrong?

library(httr)
library(rvest)
library(stringr)

# Bulletin URL
list.url = 'http://kin.naver.com/qna/list.nhn?m=expertAnswer&dirId=70111'

# Vectors to store the scraped data
titles = c()    # question titles
contents = c()  # question bodies
anss = c()      # answers

# Crawl pages 1 to 10 of the bulletin board
for(i in 1:10){
  url = modify_url(list.url, query=list(page=i))  # Change the page in the bulletin URL
  h.list = read_html(url, encoding = 'UTF-8')  # Get a list of posts, read and save html files from url

  # Post link extraction
  title.link1 = html_nodes(h.list, '.title')  # nodes with class 'title'
  title.links = html_nodes(title.link1, 'a')  # <a> tags inside those nodes
  article.links = html_attr(title.links, 'href') 
  article.links = paste0("http://kin.naver.com",article.links) 

  # Extract data from each post
  for(link in article.links){
    h = read_html(link)  # Get the post

    # title
    title = html_text(html_nodes(h, '.end_question._end_wrap_box h3'))
    title = str_trim(repair_encoding(title))
    titles = c(titles, title)

    # content
    content = html_nodes(h, '.end_question .end_content._endContents')

    ## Mobile question content
    no.content = html_text(html_nodes(content, '.end_ext2'))
    content = repair_encoding(html_text(content))

    if (length(no.content) > 0)
    {
      content = str_replace(content, repair_encoding(no.content), '')
    }

    content <- str_trim(content)
    contents = c(contents, content)
    # answer    
    ans = html_text(html_nodes(h, '#contents_layer_1'))
    ans = str_trim(repair_encoding(ans))
    anss = c(anss, ans)

    print(link)

  }   

}

# save
result = data.frame(titles, contents, anss)
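
For completeness, a sketch of writing the collected data frame out as UTF-8 regardless of the native locale (the file name result.csv is only a placeholder):

# Hypothetical example: keep the output file in UTF-8 even on a
# non-UTF-8 (e.g. CP949) locale.
write.csv(result, 'result.csv', row.names = FALSE, fileEncoding = 'UTF-8')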

0 Answers:

No answers yet.