R: Google News result links

Posted: 2018-08-18 09:49:12

Tags: html r xml web-scraping

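I am new to getting information from the web into R, but I found this nice piece of code, How to get google search results, which shows how to pull the links from a regular Google search into R.

I need to run this approach for a Google NEWS search. I know I have to change the URL by adding something like "&source=lnms&tbm=nws". If I copy the URL I construct in R and paste it into a browser, it takes me to the correct news results page, so far so good.

I looked at the HTML of the news results page and found that the information sits inside h3[@class='r dO0Ag'], but there is another node nested inside it and I don't know how to write that in the XPath expression. Any help would be appreciated! Screenshot of HTML 1st News Result for China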

library(XML)
library(RCurl)



getGoogleURL <- function(search.term, domain = '.de', quotes=TRUE) 
{
  search.term <- gsub(' ', '%20', search.term)
  if(quotes) search.term <- paste('%22', search.term, '%22', sep='') 
  #construct google news url
  getGoogleURL <- paste('http://www.google', domain, '/search?q=',
                        search.term, sep='',"&source=lnms&tbm=nws")
  return(getGoogleURL)
}

getGoogleLinks <- function(google.url) {
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...){})
  #?? Wrong part - gives error evaluating xpath expression ??
  nodes <- getNodeSet(html, "//h3[@class='r dO0Ag']//a[@class='l lLrAF'//")

  dirt_links=sapply(nodes, function(x) x <- xmlAttrs(x)[["href"]])

  links <- gsub('/url\\?q=','',sapply(strsplit(dirt_links[as.vector(grep('url',dirt_links))],split='&'),'[',1))
  return(links)
}

search.term <- "China"
quotes <- "TRUE"
search.url <- getGoogleURL(search.term=search.term, quotes=quotes)

links <- getGoogleLinks(search.url)

1 answer:

Answer 0 (score: 2)

You have several options here.

Both RCurl and RSelenium can be used.

The key is to generate the correct URL:

> library(XML)
> library(RCurl)
> search.term <- "china"
> quotes=FALSE
> start=0
> getGoogleURL <- paste('http://www.google.com',
+                       '/search?hl=en&gl=kr&tbm=nws&authuser=0&q=',
+                       search.term, "&start=",start,sep='')
> getGoogleURL
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=china&start=0"
> 

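At this point you can fetch the URL, build the HTML parse tree and extract the node data. The start parameter controls which results are returned, for example when you want the fourth page (counting from zero). Note that in the output below it behaves as a result offset rather than a page index: start=1 skips only the first result.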
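As a rough illustration of the start parameter, the sketch below builds one URL per offset using only base R. The helper name buildNewsURL is hypothetical (it is not part of the original answer), and the step of 10 assumes about ten results per page, which matches the output shown in this answer but is not guaranteed by Google.

```r
# Hypothetical helper (not from the original answer): build a Google News
# search URL for a given result offset, using only base R.
buildNewsURL <- function(search.term, start = 0) {
  # reserved = TRUE also escapes spaces, quotes and '&' in the search term
  query <- utils::URLencode(search.term, reserved = TRUE)
  paste0("http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=",
         query, "&start=", start)
}

# One URL per offset; the step of 10 is an assumption about the page size.
offsets <- c(0, 10, 20)
urls <- vapply(offsets, function(s) buildNewsURL("china", s), character(1))
print(urls)
```

Nothing is fetched here, so the URLs can be inspected in a browser before any scraping is attempted.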

Working code example:

library(XML)
library(RCurl)

getGoogleURL <- function(search.term, start=0, quotes=FALSE) {
  search.term <- gsub(' ', '%20', search.term)
  if(quotes) search.term <- paste('%22', search.term, '%22', sep='')
  getGoogleURL <- paste('http://www.google.com',
                        '/search?hl=en&gl=kr&tbm=nws&authuser=0&q=',
                        search.term, "&start=",start,sep='')
  getGoogleURL <- URLencode(getGoogleURL)
}


getGoogleNews <- function(search.term="China",
                          start=0,
                          quotes=FALSE ){
  google.url <- getGoogleURL(search.term=search.term,
                             start, quotes=quotes)
  print(google.url)
  doc <- getURL(google.url,
                httpheader = c("User-Agent" = "R(3.0.3)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE,
                        error=function(...){}, asText = TRUE)
  nodes <- getNodeSet(html, "//*/h3/a[@href]")
  title <- sapply(nodes, function(x) x <- xmlValue(x))
  url <- unname(sapply(nodes, function(x) x <- xmlAttrs(x)))
  url <- gsub("\\/url\\?q=", "", url)
  nodes <- getNodeSet(html, "//div[@class='slp']")
  source <- sapply(nodes, function(x) x <- xmlValue(x))
  nodes <- getNodeSet(html, "//div[@class='st']")
  summary <- sapply(nodes, function(x) x <- xmlValue(x))
  data.frame(title=title, source=source, url=url, summary=summary)
}

getGoogleNews("China")
getGoogleNews("China", 1)
getGoogleNews("China", 2)
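The XPath in the question fails because the second predicate is never closed with ']' and the path ends in '//'. The sketch below shows a well-formed version of that expression, run against a made-up HTML fragment so it works offline. The class names are the ones quoted in the question and have most likely changed on Google's side since; the fragment and its example.com links are purely illustrative.

```r
library(XML)

# Made-up HTML fragment imitating the structure described in the question
snippet <- paste0(
  "<html><body>",
  "<h3 class='r dO0Ag'><a class='l lLrAF' href='/url?q=https://example.com/a'>First</a></h3>",
  "<h3 class='r dO0Ag'><a class='l lLrAF' href='/url?q=https://example.com/b'>Second</a></h3>",
  "<h3 class='other'><a href='/ignored'>Not a result</a></h3>",
  "</body></html>")

doc <- htmlParse(snippet, asText = TRUE)

# Every predicate is closed with ']' and the path does not end in '//'
nodes <- getNodeSet(doc, "//h3[@class='r dO0Ag']/a[@class='l lLrAF']")

titles <- sapply(nodes, xmlValue)            # link text
hrefs  <- sapply(nodes, xmlGetAttr, "href")  # only the href attribute
print(data.frame(title = titles, href = hrefs))
```

Matching on exact class strings is brittle, which is presumably why the answer's own code uses the looser "//*/h3/a[@href]" instead.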

Output when run:

> library(XML)

> library(RCurl)

> getGoogleURL <- function(search.term, start=0, quotes=FALSE) {
+   search.term <- gsub(' ', '%20', search.term)
+   if(quotes) search.term <- paste( .... [TRUNCATED] 

> getGoogleNews <- function(search.term="China",
+                           start=0,
+                           quotes=FALSE ){
+   google.url <- ge .... [TRUNCATED] 

> getGoogleNews("China")
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=0"
                                                                         title
1     Taiwan says China is 'out of control' as it loses El Salvador to Beijing
2  China central bank official rebuts Trump's claim it is manipulating the ...
3                                         Airbnb Wants to Find a Home in China
4          China's biggest risk may be its property market — not the trade war
5      Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
6                                     China reaches 800 million internet users
7       China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
8                     7 Signs that China's Military is Becoming More Dangerous
9           Asia markets trade mostly higher as investors look ahead to US ...
10       Can China, the world's biggest pork producer, contain a fatal pig ...
                                               source
1                                 CNBC - 17 hours ago
2                                 CNBC - 10 hours ago
3                                WIRED - 13 hours ago
4                                 CNBC - 23 hours ago
5                     Business Insider - 11 hours ago
6                           TechCrunch - 10 hours ago
7                        Express.co.uk - 12 hours ago
8  The National Interest Online (blog) - 16 hours ago
9                                 CNBC - 17 hours ago
10                     Science Magazine - 5 hours ago
                                                                                                                                                                                                                   url
1                      https://www.cnbc.com/2018/08/21/taiwan-says-china-out-of-control-as-it-loses-el-salvador-to-beijing.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIFCgAMAA&usg=AOvVaw2cSTmS65-6IvKQV9xrl3y3
2                          https://www.cnbc.com/2018/08/21/china-official-refutes-trumps-claim-it-is-manipulating-the-yuan.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIHSgAMAE&usg=AOvVaw2q7yr2oBWHib3bRAVmOna-
3                                                                              https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIJigAMAI&usg=AOvVaw2a2LSkYlosnwTFRCvjmUhm
4                          https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIKSgAMAM&usg=AOvVaw1bUY5Ii7AlWURDifpeozJU
5                       https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIILCgAMAQ&usg=AOvVaw0yGdVilstHZVBBXEuuAbmu
6                                                   https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIINSgAMAU&usg=AOvVaw0VYTngAb-OBUSYkxKs0ZKp
7  https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIOCgAMAY&usg=AOvVaw3W5adCnWdzz71zvpgE1x6D
8                                  https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIPigAMAc&usg=AOvVaw1k05lyvFRrx_FImDKIsZ61
9                                               https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIQSgAMAg&usg=AOvVaw0YqzZPNbH9bawkv8qX8Bdm
10 http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIRCgAMAk&usg=AOvVaw1H0c03l4trLI3cbRRlnKJW
                                                                                                                                                                    summary
1                            Taiwan vowed on Tuesday to fight China's "increasingly out of control" behavior after Taipei lost another ally to Beijing when El Salvador ...
2                   A senior official of China's central bank told a briefing on Tuesday that the yuan's exchange rate is set by the market, rebutting President Donald ...
3                                  China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
4                       China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
5  The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
6                            A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
7                             China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
8                          Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
9                             Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
10                       As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...

> getGoogleNews("China", 1)
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=1"
                                                                         title
1  China central bank official rebuts Trump's claim it is manipulating the ...
2                                         Airbnb Wants to Find a Home in China
3          China's biggest risk may be its property market — not the trade war
4      Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
5                                     China reaches 800 million internet users
6       China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
7                     7 Signs that China's Military is Becoming More Dangerous
8           Asia markets trade mostly higher as investors look ahead to US ...
9        Can China, the world's biggest pork producer, contain a fatal pig ...
10      How China, India and the US use healthcare aid to win influence in ...
                                               source
1                                 CNBC - 10 hours ago
2                                WIRED - 13 hours ago
3                                 CNBC - 23 hours ago
4                     Business Insider - 11 hours ago
5                           TechCrunch - 10 hours ago
6                        Express.co.uk - 12 hours ago
7  The National Interest Online (blog) - 16 hours ago
8                                 CNBC - 17 hours ago
9                      Science Magazine - 5 hours ago
10                             ABC News - 5 hours ago
                                                                                                                                                                                                                      url
1                          https://www.cnbc.com/2018/08/21/china-official-refutes-trumps-claim-it-is-manipulating-the-yuan.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggUKAAwAA&usg=AOvVaw1Muu65XvSSWVKX06-5syLY
2                                                                              https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggdKAAwAQ&usg=AOvVaw0Py7bJDY3tIj4KxgwYot1A
3                          https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgggKAAwAg&usg=AOvVaw2EHMCQvFQV9ubu17ERCZFO
4                       https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggjKAAwAw&usg=AOvVaw1sMhG0tyUnj8j2W02gD3aW
5                                                   https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggsKAAwBA&usg=AOvVaw1ODs1JY8V_ETi24ugz-yNn
6  https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAggvKAAwBQ&usg=AOvVaw0r0HQNfZhEwfbiEocUC74Z
7                                  https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg1KAAwBg&usg=AOvVaw2hpQQXrAm2HW158II7F1kG
8                                               https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg4KAAwBw&usg=AOvVaw2surM3fW-lLJDd9P-r7xJB
9  http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg7KAAwCA&usg=AOvVaw3Lzvks6B0Un4IEgoMh86re
10                               http://www.abc.net.au/news/2018-08-22/china-india-us-medical-diplomacy-in-the-pacific/10147632&sa=U&ved=0ahUKEwjakZ6At__cAhXjllQKHZEQA9E4ARCpAgg-KAAwCQ&usg=AOvVaw1Ogg8I6mUvDSCc9F90Usg4
                                                                                                                                                                    summary
1                   A senior official of China's central bank told a briefing on Tuesday that the yuan's exchange rate is set by the market, rebutting President Donald ...
2                                  China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
3                       China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
4  The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
5                            A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
6                             China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
7                          Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
8                             Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
9                        As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
10                          China's 10,000-ton medical ship, the Peace Ark, has cut a broad arc through the Pacific, stopping off in Papua New Guinea, Vanuatu and Fiji ...

> getGoogleNews("China", 2)
[1] "http://www.google.com/search?hl=en&gl=kr&tbm=nws&authuser=0&q=China&start=2"
                                                                      title
1                                      Airbnb Wants to Find a Home in China
2       China's biggest risk may be its property market — not the trade war
3   Malaysia has axed $22 billion of Chinese-backed projects, in a blow ...
4                                  China reaches 800 million internet users
5    China DEFIES Trump to buy nearly ALL oil imports from Iran despite ...
6                  7 Signs that China's Military is Becoming More Dangerous
7        Asia markets trade mostly higher as investors look ahead to US ...
8     Can China, the world's biggest pork producer, contain a fatal pig ...
9    How China, India and the US use healthcare aid to win influence in ...
10 China Is Leading in Artificial Intelligence--and American Businesses ...
                                               source
1                                WIRED - 13 hours ago
2                                 CNBC - 23 hours ago
3                     Business Insider - 11 hours ago
4                           TechCrunch - 10 hours ago
5                        Express.co.uk - 12 hours ago
6  The National Interest Online (blog) - 16 hours ago
7                                 CNBC - 17 hours ago
8                      Science Magazine - 5 hours ago
9                              ABC News - 5 hours ago
10                             Inc.com - 16 hours ago
                                                                                                                                                                                                                      url
1                                                                              https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggUKAAwAA&usg=AOvVaw3M4FbZ71J-NVKHn3fHvYwZ
2                          https://www.cnbc.com/2018/08/21/china-economy-biggest-risk-may-be-property-market-not-trade-war.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggXKAAwAQ&usg=AOvVaw3vieYvDvTlRzYkWncLgQfu
3                       https://www.businessinsider.com/malaysia-axes-22-billion-of-belt-and-road-projects-blow-to-china-2018-8&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggaKAAwAg&usg=AOvVaw3JGNk2Lraivca0P1lS3CoY
4                                                   https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggjKAAwAw&usg=AOvVaw2j4-NkfK_fNl8McD6WJjPa
5  https://www.express.co.uk/news/world/1006297/Iran-oil-china-donald-trump-oil-prices-oil-price-us-iran-nuclear-deal-sanctions&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggmKAAwBA&usg=AOvVaw0v1Lybg2SxcJoxVkP7sOx_
6                                  https://nationalinterest.org/blog/buzz/7-signs-chinas-military-becoming-more-dangerous-29352&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggsKAAwBQ&usg=AOvVaw1B7Krdzgd3LQEJ4bwWSSFW
7                                               https://www.cnbc.com/2018/08/21/asia-markets-us-china-trade-talks-in-focus.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggvKAAwBg&usg=AOvVaw0v734CDRel2Vpke9XVjLqA
8  http://www.sciencemag.org/news/2018/08/can-china-world-s-biggest-pork-producer-contain-fatal-pig-virus-scientists-fear-worst&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAggyKAAwBw&usg=AOvVaw1j6E7a1jk9JiIahN5pdmi7
9                                http://www.abc.net.au/news/2018-08-22/china-india-us-medical-diplomacy-in-the-pacific/10147632&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAgg1KAAwCA&usg=AOvVaw2E0qGfLhOkKZWhh5-_Is54
10                                              https://www.inc.com/magazine/201809/amy-webb/china-artificial-intelligence.html&sa=U&ved=0ahUKEwi1y7KAt__cAhWpilQKHZQXBi04AhCpAgg4KAAwCQ&usg=AOvVaw1thfiF9hJWhz88BU8znvnD
                                                                                                                                                                    summary
1                                  China is littered with the virtual carcasses of startups that attempted to do business in the country and then gave up or were shut out.
2                       China's hot real estate market remains a challenge for authorities trying to maintain stable economic growth in the face of trade tensions with ...
3  The projects were a $20 billion rail link and two gas pipelines worth $2.3 billion. All three were part of China's Belt and Road Initiative (BRI), a massive project ...
4                            A new report [in Chinese] issued by the China Internet Network Information Center (CNNIC) put the number of people in China with access to ...
5                             China is Iran's biggest oil customer and the shift shows the communist nation wants to keep buying Iranian crude oil despite US sanctions ...
6                          Western media seized on a new Pentagon report that Chinese bombers are training to strike deep into the Western Pacific, including Guam, the ...
7                             Chinese markets led gains on Tuesday in a mostly positive trading session across Asia, extending their upward climb from the previous day ...
8                        As of today, ASF has been reported at sites in four provinces in China's northeast, thousands of kilometers apart. Containing the disease in a ...
9                           China's 10,000-ton medical ship, the Peace Ark, has cut a broad arc through the Pacific, stopping off in Papua New Guinea, Vanuatu and Fiji ...
10                       Living in China in the early 2000s changed my perspective. I saw firsthand that the outside world's view--China was good at copying but bad at ...
> 
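The url column in the output above still carries Google's trailing "&sa=U&ved=...&usg=..." parameters. A minimal sketch of one way to strip them follows. The helper cleanGoogleLink is hypothetical and not part of the original answer, and it assumes the target URL itself never contains a literal "&sa=".

```r
# Hypothetical helper (not from the original answer): reduce a Google
# redirect link to the bare target URL.
cleanGoogleLink <- function(href) {
  href <- sub("^/url\\?q=", "", href)  # drop the redirect prefix, if present
  href <- sub("&sa=.*$", "", href)      # drop '&sa=' and everything after it
  # Decode percent-escapes one element at a time
  vapply(href, utils::URLdecode, character(1), USE.NAMES = FALSE)
}

# Sample input taken from the output above; the "/url?q=" prefix on the
# first entry is re-added here for illustration.
raw <- c(
  "/url?q=https://www.wired.com/story/airbnb-china-market/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIIJigAMAI&usg=AOvVaw2a2LSkYlosnwTFRCvjmUhm",
  "https://techcrunch.com/2018/08/21/china-reaches-800-million-internet-users/&sa=U&ved=0ahUKEwi28IGAt__cAhXCj1QKHb0rDPcQqQIINSgAMAU&usg=AOvVaw0VYTngAb-OBUSYkxKs0ZKp"
)
clean <- cleanGoogleLink(raw)
print(clean)
```

Applied to the url column of the data frame returned by getGoogleNews, this would leave only the article addresses.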

Web page test of the URL:

[Screenshot: the URL opened in a web browser]

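N.B. For logged-in users, the order of the results shown on the web page will differ from user to user.

Reference:

Jinseog Kim, Associate Professor, Department of Applied Statistics, Dongguk University. He received his Ph.D. in statistics from the Department of Statistics at Seoul National University in 2003. His research interests are topics related to data mining, including machine learning, big data analysis and web data analysis.

Presentation link: http://datamining.dongguk.ac.kr/lectures/2016-2/bigdata/google.pdf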