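I am scraping some websites.
The extracted links are not correct, so the pages cannot be opened.
So I want to add the site address to the links from the original data.
There may be a better way than the one I have in mind.
If there is a good way to do this, please let me know.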
- Ex -
[[Incorrect address]]
/qna/detail.nhn?d1id=7&dirId=70111&docId=280474152
[[Text I want to add]]
I want to add an address in front of the links in my code (see # Bulletin URL)
Http:// ~naver.com
library(httr)
library(rvest)
library(stringr)

# Bulletin URL
list.url = 'http://kin.naver.com/qna/list.nhn?m=expertAnswer&dirId=70111'

# Vectors to store titles and bodies
titles = c()
contents = c()

# Crawl pages 1 to 10 of the bulletin
for(i in 1:10){
  url = modify_url(list.url, query=list(page=i)) # Change the page in the bulletin URL
  h.list = read_html(url, encoding = 'utf-8') # Read the HTML of the post list from the URL

  # Post link extraction
  title.link1 = html_nodes(h.list, '.title') # Nodes with class "title"
  title.links = html_nodes(title.link1, 'a') # <a> elements inside title.link1
  article.links = html_attr(title.links, 'href') # Extract the href attribute

  for(link in article.links){
    h = read_html(link) # Get the post

    # title
    title = html_text(html_nodes(h, '.end_question._end_wrap_box h3'))
    title = str_trim(repair_encoding(title))
    titles = c(titles, title)

    # content
    content = html_nodes(h, '.end_question .end_content._endContents')

    ## Mobile question content
    no.content = html_text(html_nodes(content, '.end_ext2'))
    content = repair_encoding(html_text(content))

    ## Remove the mobile question content from the body, if present
    ## ex) http://kin.naver.com/qna/detail.nhn?d1id=8&dirId=8&docId=235904020&qb=7Jes65Oc66aE&enc=utf8&section=kin&rank=19&search_sort=0&spq=1
    if (length(no.content) > 0)
    {
      content = str_replace(content, repair_encoding(no.content), '')
    }

    content <- str_trim(content)
    contents = c(contents, content)
    print(link)
  }
}

# save
result = data.frame(titles, contents)
Answer 0: (score: 0)
If you add article.links <- paste0("http://kin.naver.com", article.links) just before the for loop over article.links, this seems to work (it is running).
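A minimal sketch of that fix in isolation, using only base R so it runs without fetching anything. The first link is the example from the question; the second link, the base.url variable, and the is.relative check are illustrative assumptions rather than part of the original code. The check simply avoids prefixing a link that is already absolute.

```r
# Relative hrefs as html_attr() might return them. The first value is the
# example from the question; the second is a made-up placeholder.
article.links <- c(
  "/qna/detail.nhn?d1id=7&dirId=70111&docId=280474152",
  "http://kin.naver.com/qna/detail.nhn?d1id=7&dirId=70111&docId=1"
)

# Assumed site root, taken from the list URL in the question.
base.url <- "http://kin.naver.com"

# Prefix only the links that do not already start with http:// or https://
is.relative <- !grepl("^https?://", article.links)
article.links[is.relative] <- paste0(base.url, article.links[is.relative])

print(article.links)
```

In the question's script this would go right after the html_attr() call, so that read_html(link) receives a full URL. If the page ever mixes relative and absolute links, xml2::url_absolute(article.links, base.url) may be a more general alternative.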