How to get all news headlines from a link and store them in R

Time: 2017-11-20 10:38:08

Tags: r web-scraping rvest

I have the following link, and I want to scrape the news headlines from it:

https://timesofindia.indiatimes.com/2017/11/1/archivelist/year-2017,month-11,starttime-43040.cms

I did the following in R:

library(rvest)
url = "https://timesofindia.indiatimes.com/2017/11/1/archivelist/year-2017,month-11,starttime-43040.cms"

results <- url %>%
  read_html() %>%
  html_nodes(xpath='/html/body/div[1]/table[1]')

results contains no data. I want to put these news headlines into an R data frame. How can I do this in R?

2 Answers:

Answer 0: (score: 1)

You can use the CSS selector span a to get those headlines, if you want simpler code, and then work on the resulting character vector.

Code:

library(rvest)
url = "https://timesofindia.indiatimes.com/2017/11/1/archivelist/year-2017,month-11,starttime-43040.cms"

results <- url %>%
  read_html() %>%
  html_nodes('span a') %>% html_text()

results

Output:

    > results
      [1] "Not the same old Kochi anymore"                                                                    
      [2] "Ramu Chellappa’s next to be based in Coimbatore"                                                   
      [3] "Old is gold, cream n’ gold"             
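
The question also asks for a data frame. A minimal sketch of that step, assuming the same span a selector: the html string below is a hypothetical stand-in for the archive page so the example runs offline, and the column names title and link are my own choice, not something the page dictates.

```r
library(rvest)

# Hypothetical stand-in for the archive page (assumption, not the real markup)
html <- '<html><body>
  <span><a href="/a1.cms">Not the same old Kochi anymore</a></span>
  <span><a href="/a2.cms">Meme and troll pages play catalysts</a></span>
</body></html>'

# Select the link nodes once, then pull both text and href from them
links <- read_html(html) %>% html_nodes("span a")

# One row per headline; column names are illustrative
news <- data.frame(
  title = html_text(links),
  link  = html_attr(links, "href"),
  stringsAsFactors = FALSE
)

news
```

With the real page, replacing html with the url variable from the code above should give one row per matched link, though the selector may also pick up links that are not headlines.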

Answer 1: (score: 1)

You can do it like this:

library(rvest)

url = "https://timesofindia.indiatimes.com/2017/11/1/archivelist/year-2017,month-11,starttime-43040.cms"
page <- read_html(url)

titles <- html_text(html_nodes(page,'.cnt div td:nth-child(1) span a'))
titles[1:5]

  > titles[1:5]
[1] "Not the same old Kochi anymore"                                 "Ramu Chellappa’s next to be based in Coimbatore"               
[3] "Old is gold, cream n’ gold"                                     "Meme and troll pages play catalysts in promoting Kannada pride"
[5] "Thallu, Kidu, Oola... Creativity had no bounds in Slangyalam"
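
The longer selector is narrower than a plain span a: it only matches links inside the first cell of a table row under an element with class cnt. A small sketch of the difference, using a made-up HTML fragment (the markup is an assumption for illustration, not the actual structure of the Times of India page):

```r
library(rvest)

# Hypothetical markup: one navigation link outside .cnt, two table cells inside
html <- '<html><body>
<span><a href="/home">Home</a></span>
<div class="cnt"><div><table><tr>
  <td><span><a href="/a1.cms">Headline one</a></span></td>
  <td><span><a href="/b1.cms">Other column</a></span></td>
</tr></table></div></div>
</body></html>'

page <- read_html(html)

# Broad selector: every link wrapped in a span, wherever it sits
broad <- html_text(html_nodes(page, "span a"))

# Narrow selector: only links in the first td of a row inside .cnt
narrow <- html_text(html_nodes(page, ".cnt div td:nth-child(1) span a"))

length(broad)
narrow
```

On this fragment the broad selector matches three links and the narrow one matches only the link in the first cell, which is why the narrower form can help avoid navigation or sidebar links.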