Calling a column from a CSV file to extract its data

Time: 2018-04-12 16:12:11

Tags: r

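I imported a CSV file that I want to work with in R. Here I am trying to call one column from that file. The column, titled "URL", contains a list of web addresses. I then want code that scrapes data from each of those URLs. In short, I want a more efficient approach than listing every URL inside c(), because I have about 200 links.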

https://www.nytimes.com/2018/04/07/health/health-care-mergers-doctors.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/11/well/move/why-exercise-alone-may-not-be-the-key-to-weight-loss.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/07/health/antidepressants-withdrawal-prozac-cymbalta.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/well/why-you-should-get-the-new-shingles-vaccine.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/health/fda-essure-bayer-contraceptive-implant.html?rref=collection%2Fsectioncollection%2Fhealth
https://www.nytimes.com/2018/04/09/health/hot-pepper-thunderclap-headaches.html?rref=collection%2Fsectioncollection%2Fhealth

I get an error when I run this line: article <- links %>% map(read_html)

It gives me this message:

(Error in UseMethod("read_xml") : 
no applicable method for 'read_xml' applied to an object of class "factor")

Here is the code:

setwd("C:/Users/Majed/Desktop")

d <- read.csv("NYT.csv")

d

links<- d$URLs

article <- links %>% map(read_html)

title <-
  article %>% map_chr(. %>% html_node("title") %>% html_text())

content <-
  article %>% map_chr(. %>% html_nodes(".story-body-text") %>% html_text() %>% paste(., collapse = ""))

article_table <- data.frame("Title" = title, "Content" = content)

1 answer:

Answer 0: (score: 1)

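Note what the error message is telling you: read_html expects a character string, but you are giving it a factor. read.csv converts strings to factors unless you include the argument stringsAsFactors = F. read_csv from readr is a good alternative if, like me, you tend to forget that you do not want strings turned into factors automatically.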
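The difference can be seen without the original NYT.csv. The following is a minimal sketch in base R, assuming a column named URLs as in the question's code; the two example.com addresses are hypothetical placeholders, and stringsAsFactors is set explicitly because its default changed to FALSE in R 4.0.0.

```r
# Hypothetical two-row CSV standing in for NYT.csv (placeholder URLs).
csv_text <- "URLs\nhttps://example.com/a\nhttps://example.com/b"

# Setting stringsAsFactors = TRUE reproduces the old default of read.csv().
d_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class_factor <- class(d_factor$URLs)   # the column is stored as a factor

# With stringsAsFactors = FALSE the column stays a character vector.
d_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class_char <- class(d_char$URLs)

print(class_factor)
print(class_char)
```

A factor column is what triggers the "no applicable method for 'read_xml'" error, since read_html has no method for that class.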

I cannot reproduce the problem without your data, but try converting the URLs to character strings:

links <- as.character(d$URLs)

article <- links %>% map(read_html)
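To show the conversion working end to end without network access, here is a rough sketch that assumes rvest and purrr are installed. The two small HTML pages and their temporary file paths are made-up stand-ins for the real URLs (read_html also accepts a file path), and the data frame mimics what read.csv returns when strings become factors.

```r
library(rvest)  # read_html(), html_node(), html_text()
library(purrr)  # map(), map_chr()

# Hypothetical pages standing in for the real articles.
pages <- c(
  "<html><head><title>First</title></head><body><p>A</p></body></html>",
  "<html><head><title>Second</title></head><body><p>B</p></body></html>"
)

# Write each page to a temporary .html file and collect the paths.
paths <- map_chr(pages, function(html) {
  path <- tempfile(fileext = ".html")
  writeLines(html, path)
  path
})

# Mimic read.csv() output in which the URLs column is a factor.
d <- data.frame(URLs = paths, stringsAsFactors = TRUE)

# Convert the factor to a character vector before parsing.
links <- as.character(d$URLs)
article <- links %>% map(read_html)

# Extract the <title> text from every parsed page.
title <- article %>% map_chr(. %>% html_node("title") %>% html_text())
print(title)
```

With the real file, the same as.character call would be applied to d$URLs before the map(read_html) step from the question.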