Question

I know a little R, but I am not an expert. I am working on a text-mining project in R.

I searched the Federal Reserve website for a keyword, say 'inflation'. The second page of the search results has this URL: (https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation).

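This page has 10 search results (10 URLs). I want to write R code that "reads" the page behind each of these 10 URLs and extracts the text of each page into a .txt file. My only input is the URL mentioned above.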

Thank you for your help. If there is a similar older post, please point me to it. Thanks.

Answer 1

Here you go. For the main search page you can use a regular expression, because the URLs are easy to identify in the page source.

(With help from https://statistics.berkeley.edu/computing/r-reading-webpages)

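library('RCurl')    # loaded in the original answer; not called directly below
library('stringr')  # str_sub()
library('XML')      # htmlParse(), xpathSApply()

# Read the search-results page. The URL must stay on one line:
# a line break inside the string would become part of the URL.
pageToRead <- readLines('https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation')

# [^"]+ stops at the closing quote, so a match cannot run past the href value.
urlPattern <- 'URL: <a href="([^"]+)">'
urlLines <- grep(urlPattern, pageToRead, value=TRUE)

# Cut out the matched substrings, then keep only the captured URL.
getexpr <- function(s, g) substring(s, g, g + attr(g, 'match.length') - 1)
gg <- gregexpr(urlPattern, urlLines)
matches <- mapply(getexpr, urlLines, gg)
result <- gsub(urlPattern, '\\1', matches)
names(result) <- NULL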


# Visit each result; only .htm pages are parsed here.
for (i in seq_along(result)) {
  subURL <- result[i]

  if (str_sub(subURL, -4, -1) == ".htm") {
    content <- readLines(subURL)
    doc <- htmlParse(content, asText=TRUE)
    # Keep text nodes that are not inside script, style, noscript or form elements.
    doc <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
    writeLines(doc, paste("inflationText_", i, ".txt", sep=""))
  }
}
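
To see what the regular-expression step pulls out without touching the network, here is a minimal sketch on three made-up lines. The sample markup and the example.org addresses are assumptions for illustration, not the real page source, and `regmatches()` is used here in place of the `getexpr` helper above.

```r
# Made-up lines imitating the "URL: <a href=...>" markup (hypothetical sample).
sample_lines <- c(
  '<p>URL: <a href="https://www.example.org/a.htm">https://www.example.org/a.htm</a></p>',
  '<p>No link on this line</p>',
  '<p>URL: <a href="https://www.example.org/b.pdf">https://www.example.org/b.pdf</a></p>'
)

url_pattern <- 'URL: <a href="([^"]+)">'

# regexpr() finds the first match per line; regmatches() returns only the
# lines that matched, so the middle line is dropped.
hits <- regmatches(sample_lines, regexpr(url_pattern, sample_lines))

# Replace each whole match by its captured group, i.e. the bare URL.
urls <- sub(url_pattern, "\\1", hits)
print(urls)
```

If the real page puts more than one link on a line, `gregexpr()` would be needed instead of `regexpr()`.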

Note, however, that the loop above only parses the .htm pages. For the .pdf documents linked in the search results, I suggest having a look here: http://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
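
One way to keep the two cases apart is to sort the extracted URLs by file extension before reading them. The sketch below uses base R only. The helper name `url_extension` and the example.org addresses are hypothetical, and the PDF branch is left as a comment because the choice of PDF reader is not part of this answer.

```r
# Hypothetical helper: lower-case file extension of a URL,
# ignoring any query string or fragment after the path.
url_extension <- function(url) {
  path <- sub("[?#].*$", "", url)
  tolower(tools::file_ext(path))
}

# Made-up URLs standing in for the extracted search results.
urls <- c(
  "https://www.example.org/a.htm",
  "https://www.example.org/b.PDF",
  "https://www.example.org/c.pdf?download=1"
)

ext <- url_extension(urls)
html_urls <- urls[ext %in% c("htm", "html")]  # handled by the HTML loop
pdf_urls  <- urls[ext == "pdf"]               # need a PDF reader instead,
                                              # e.g. pdftools::pdf_text() (an assumption)
print(html_urls)
print(pdf_urls)
```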

Answer 2

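Here is the basic idea of how to scrape this page. It may be slow if there are many pages to scrape. Your question is a bit ambiguous, though: you want the end result to be .txt files, but what about the results that are PDFs? You can still use this code and change the file extension to .pdf for those pages.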

 library(xml2)
 library(rvest)

 urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

 urll %>% read_html() %>% html_nodes("div#results a") %>% html_attr("href") %>%
   .[!duplicated(.)] %>% lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
   Map(function(x, y) write_html(x, tempfile(y, fileext = ".txt"), options = "format"), .,
       c(paste("tmp", 1:length(.))))

Here is a breakdown of the code above. The URL to scrape:

 urll <- "https://search.newyorkfed.org/board_public/search?start=10&Search=&number=10&text=inflation"

Get all the URLs you need:

 allurls <- urll %>% read_html() %>% html_nodes("div#results a") %>% html_attr("href") %>% .[!duplicated(.)]

Where do you want to save your text? Create temporary files:

 tmps <- tempfile(c(paste("tmp", 1:length(allurls))), fileext = ".txt")

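At this point, allurls is of class character. You have to convert each entry to xml in order to scrape it, and then finally write the results into the tmp files created above:

 allurls %>% lapply(function(x) read_html(x) %>% html_nodes("body")) %>%
   Map(function(x, y) write_html(x, y, options = "format"), ., tmps)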
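
The conversion step can be tried offline with literal HTML strings in place of URLs. This is only a sketch: the two sample pages are made up, and it swaps `write_html` for `html_text` plus `writeLines`, on the assumption that plain text rather than markup is what should end up in the .txt files.

```r
library(xml2)
library(rvest)

# Made-up pages; in the real pipeline these would be the URLs in allurls.
html_strings <- c(
  "<html><body><p>First page</p></body></html>",
  "<html><body><p>Second page</p></body></html>"
)
class(html_strings)            # still plain character at this point

# read_html() turns each string into a parsed xml_document.
docs <- lapply(html_strings, read_html)

# Pull the text of the <body> node out of each parsed document.
body_text <- vapply(docs, function(d) html_text(html_nodes(d, "body")), character(1))

# One temporary .txt file per page, written pairwise with Map().
out_files <- tempfile(paste0("tmp", seq_along(body_text), "_"), fileext = ".txt")
invisible(Map(function(txt, path) writeLines(txt, path), body_text, out_files))
```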

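Please do not leave anything out of the code. For example, in the Map() call there is a dot (.) right after ..."format"), which stands for the value piped in from the left, so keep it. Your files have now been written to the temporary directory. To find out where that is, just type tempdir() in the console and it will give you the location. You can also change where the scraped files are saved in the tempfile command.

Hope this helps.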
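
As a small illustration of where the files end up and how to redirect them, here is a base R sketch. The placeholder text and the sub-directory name "scraped" are assumptions made for the example.

```r
n <- 3

# Default: file names are generated inside the session's temporary directory.
tmps <- tempfile(pattern = paste0("tmp", seq_len(n), "_"), fileext = ".txt")
for (i in seq_len(n)) writeLines(paste("placeholder text", i), tmps[i])

# All generated files live directly under tempdir().
same_dir <- normalizePath(dirname(tmps)) == normalizePath(tempdir())

# To save somewhere else, pass a different directory via tmpdir.
out_dir <- file.path(tempdir(), "scraped")   # hypothetical target folder
dir.create(out_dir, showWarnings = FALSE)
custom <- tempfile(pattern = paste0("tmp", seq_len(n), "_"),
                   tmpdir = out_dir, fileext = ".txt")

print(tempdir())
print(basename(tmps))
```

Note that tempfile() only generates names; a file exists only once something is written to it, and the session's temporary directory is removed when R exits.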

Extracting text from search result URLs using R