Scrape each linked page and store it in an XML table

Posted: 2018-06-04 08:18:50

Tags: r css-selectors rvest

Hello, I am excited to be using R to scrape data from the internet, but unfortunately I know very little about HTML and XML. I am trying to scrape every story link on the following parent page: https://news.google.com/search?q=NREGA&hl=en-IN&gl=IN&ceid=IN%3Aen. I do not care about any of the other links on the parent page, but I need to create a table with a column for the URLs, a column for the story titles, and then the full text of each page (which can be several paragraphs).

I tried the rvest package and got the URLs, but the real problem is iterating over all the articles, extracting their text, and storing everything in a table.
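
Since the question is about rvest, here is a minimal R sketch of the whole pipeline. The "article > a" selector is taken from the answer below and is an assumption about Google News's markup (which changes often, and parts of the page are rendered with JavaScript, so rvest may not see every node); each story also lives on a different news site, so the generic "p" selector used for the body is only a fallback to adapt per site:

library(rvest)
library(xml2)

# Parent page from the question.
parent_url <- "https://news.google.com/search?q=NREGA&hl=en-IN&gl=IN&ceid=IN%3Aen"
parent <- read_html(parent_url)

# Assumed selectors; inspect the page and adjust if the markup differs.
anchors <- html_nodes(parent, "article > a")
titles  <- html_text(anchors)

# Google News links are usually relative ("./articles/..."), so resolve
# them against the site root before fetching.
urls <- url_absolute(html_attr(anchors, "href"), "https://news.google.com/")

# Fetch one story page and collapse its paragraphs into a single string;
# return NA if the page cannot be read.
get_text <- function(u) {
  tryCatch(
    paste(html_text(html_nodes(read_html(u), "p")), collapse = "\n"),
    error = function(e) NA_character_
  )
}

stories <- data.frame(
  url   = urls,
  title = titles,
  text  = vapply(urls, get_text, character(1)),
  stringsAsFactors = FALSE
)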

1 answer:

Answer 0: (score: 0)

I will provide JavaScript examples, since I am not aware of the library you are using.

1. Getting all of the story URLs:

var anchors = document.querySelectorAll("article > a");
// for...in would also visit NodeList properties such as "length",
// so use an index-based loop (or NodeList.forEach) instead.
for (var i = 0; i < anchors.length; i++)
{
    console.log(anchors[i].getAttribute("href"));
}

2. Getting the header of each story link:

var headers = document.querySelectorAll("article > div:nth-of-type(1)");
for (var i = 0; i < headers.length; i++)
{
    console.log(headers[i].innerText);
}

3. Getting the story once you have navigated to its link:

// Grab the visible text of the story container on the article page.
var story = document.querySelector("div.full-details").innerText;
console.log(story);

This will also fetch some extra details that are visible at the top, such as the number of shares on social media, the by-line, and so on. If you want just the body without these details, you can get all of the paragraph elements using "document.querySelectorAll("div.full-details p")", read the innerText property of each one, and combine them afterwards.
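
For rvest users, the same paragraph-only extraction looks roughly like this, assuming the "div.full-details" selector above actually matches the story page you fetched (the URL here is a placeholder):

library(rvest)

# Placeholder story URL; use one of the links collected earlier.
page <- read_html("https://example.com/story")

# Keep only the paragraph elements inside the story container and join
# their text, dropping the share counts, by-line, and so on.
body <- paste(html_text(html_nodes(page, "div.full-details p")),
              collapse = "\n")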