Using R to scrape headlines and dates from Yahoo Finance

Time: 2016-03-29 09:43:29

Tags: html r web-scraping yahoo-finance

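I am trying to build a table with two columns, date and headline, by scraping news from a Yahoo Finance web page with R. Following the instructions here, I correctly created a column containing the headlines; the next step is to get the dates and add them to the table as a second column.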

I think I only need to modify this command


so that it retrieves the dates instead of the headlines, for example this code:

out_dt <- xpathSApply(d, "//ul[contains(@class,'newsheadlines')]/following::ul/li/a", xmlValue)

Any suggestions?

1 answer:

Answer 0: (score: 3)

You can use rvest as follows:

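library(rvest)
doc <- read_html("http://finance.yahoo.com/q/h?s=AAPL+Headlines")
# Each <li> inside #yfncsumtab holds one news item
scope <- doc %>% html_nodes("#yfncsumtab li")
# Build a one-row data frame per item: date text and headline text
res <- lapply(scope, function(li) {
  data.frame(
    date = li %>% html_node("cite span") %>% html_text(),
    headline = li %>% html_node("a") %>% html_text(),
    stringsAsFactors = FALSE
  )
})
# Stack the rows into a single data frame
do.call(rbind, res)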

This gives you:

                date                                                                                  headline
1   (Tue 3:49AM EDT)                                   US hacks iPhone, ends legal battle but questions linger
2   (Tue 1:27AM EDT)                           Amazon Echo turns into a sleeper hit, offsetting Fire's failure
3   (Tue 1:00AM EDT)                                       Why Everyone Loses in Apple’s Fight Against the FBI
4  (Tue 12:36AM EDT) [$$] US drops Apple case, Japan's negative rate bounty and the criminals paid not to kill
5  (Tue 12:25AM EDT)                              U.S. succeeds in cracking Apple's iPhone, drops legal action
6  (Tue 12:00AM EDT)  [$$] Brussels Attacks: Belgium Turns to U.S. for Help in Scouring Seized Laptops, Phones
7      (Mon, Mar 28)                [$$] FBI Opens San Bernardino Shooter’s iPhone; U.S. Drops Demand on Apple
8      (Mon, Mar 28)                                              Wolverton: Encyption debate isn't going away
9      (Mon, Mar 28)                                            [$$] US drops Apple case after cracking iPhone
10     (Mon, Mar 28)         Words of warning — not celebration — in Silicon Valley after FBI ends Apple fight
11     (Mon, Mar 28)                               [$$] FBI Opens Shooter's iPhone; U.S. Drops Demand on Apple
12     (Mon, Mar 28)                                           FBI hacks into terrorist’s iPhone without Apple
13     (Mon, Mar 28)                                  Justice Department cracks iPhone; withdraws legal action
14     (Mon, Mar 28)                                Apple responds: 'This case should have never been brought'
15     (Mon, Mar 28)                           IPhone Security Is the Casualty in Apple's Victory Over the FBI
16     (Mon, Mar 28)                           Cracked Apple iPhone By F.B.I. Puts Spotlight On Apple Security
17     (Mon, Mar 28)                                    DOJ Drops Apple Case: Bloomberg West (Full Show 03/28)
18     (Mon, Mar 28)                                          Apple, Inc.'s New iPhone SE: Off to a Big Start?
19     (Mon, Mar 28)                                               AP Explains: Apple vs. FBI _ What Happened?
20     (Mon, Mar 28)                                                  PRESS DIGEST- Financial Times - March 29

I have left parsing the dates to you.
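As a starting point for that parsing, here is a minimal base-R sketch. It is not part of the original answer: the function name `parse_cite_date` and its `ref_date` argument are hypothetical. It assumes that time-only entries such as "(Tue 3:49AM EDT)" belong to the day the page was scraped, and that entries such as "(Mon, Mar 28)" fall in the same year as that day. Both assumptions can fail, for example around New Year.

```r
# Hypothetical helper: turn the scraped "cite span" strings into Date values.
# ref_date is an assumption: the day on which the page was scraped.
parse_cite_date <- function(x, ref_date) {
  ref_date <- as.Date(ref_date)
  # Remove surrounding whitespace and the enclosing parentheses.
  txt <- gsub("^\\(|\\)$", "", trimws(x))
  out <- rep(as.Date(NA), length(txt))

  # Time-only form, e.g. "Tue 3:49AM EDT": assumed to mean ref_date itself.
  is_time <- grepl("^[A-Za-z]{3} [0-9]{1,2}:[0-9]{2}(AM|PM)", txt)
  out[is_time] <- ref_date

  # Day form, e.g. "Mon, Mar 28": the year is borrowed from ref_date.
  m <- regmatches(txt, regexec("^[A-Za-z]{3}, ([A-Za-z]{3}) ([0-9]{1,2})$", txt))
  for (i in seq_along(m)) {
    p <- m[[i]]
    if (length(p) == 3) {
      # month.abb is the built-in English vector "Jan", "Feb", ...
      month <- match(p[2], month.abb)
      if (!is.na(month)) {
        iso <- sprintf("%s-%02d-%02d", format(ref_date, "%Y"), month, as.integer(p[3]))
        out[i] <- as.Date(iso, format = "%Y-%m-%d")
      }
    }
  }
  out
}

parsed <- parse_cite_date(c("(Tue 3:49AM EDT)", "(Mon, Mar 28)"), ref_date = "2016-03-29")
print(parsed)
```

Strings that match neither pattern come back as NA, so unexpected formats are easy to spot rather than silently misparsed.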

Another option is to take the dates from the h3 headings, as follows:

library(rvest)
doc <- read_html("http://finance.yahoo.com/q/h?s=AAPL+Headlines")
scope <- doc %>% html_nodes("#yfncsumtab")
# One date per h3 heading
dates <- scope %>% html_nodes("h3 span") %>% html_text()
# One character vector of headlines per <ul> that directly follows an h3
headlines <- scope %>% html_nodes("h3 + ul") %>% lapply(. %>% html_nodes("li a") %>% html_text)

# Pair each date with its headlines, then stack everything into one matrix
do.call(rbind, Map(cbind, dates, headlines))

This results in the following matrix:

      [,1]                      [,2]                                                                                       
 [1,] "Tuesday, March 29, 2016" "March 29 Premarket Briefing: 10 Things You Should Know"                                   
 [2,] "Tuesday, March 29, 2016" "You might soon be able to pay for goods in-store using Facebook Messenger"                
 [3,] "Tuesday, March 29, 2016" "FBI unlocks iPhone"                                                                       
 [4,] "Tuesday, March 29, 2016" "US hacks iPhone, ends legal battle but questions linger"                                  
 [5,] "Tuesday, March 29, 2016" "Amazon Echo turns into a sleeper hit, offsetting Fire's failure"                          
 [6,] "Tuesday, March 29, 2016" "Why Everyone Loses in Apple’s Fight Against the FBI"                                      
 [7,] "Tuesday, March 29, 2016" "[$$] US drops Apple case, Japan's negative rate bounty and the criminals paid not to kill"
 [8,] "Tuesday, March 29, 2016" "U.S. succeeds in cracking Apple's iPhone, drops legal action"                             
 [9,] "Tuesday, March 29, 2016" "[$$] Brussels Attacks: Belgium Turns to U.S. for Help in Scouring Seized Laptops, Phones" 
[10,] "Monday, March 28, 2016"  "[$$] FBI Opens San Bernardino Shooter’s iPhone; U.S. Drops Demand on Apple"               
[11,] "Monday, March 28, 2016"  "Wolverton: Encyption debate isn't going away"                                             
[12,] "Monday, March 28, 2016"  "[$$] US drops Apple case after cracking iPhone"                                           
[13,] "Monday, March 28, 2016"  "Words of warning — not celebration — in Silicon Valley after FBI ends Apple fight"        
[14,] "Monday, March 28, 2016"  "[$$] FBI Opens Shooter's iPhone; U.S. Drops Demand on Apple"                              
[15,] "Monday, March 28, 2016"  "FBI hacks into terrorist’s iPhone without Apple"                                          
[16,] "Monday, March 28, 2016"  "Justice Department cracks iPhone; withdraws legal action"                                 
[17,] "Monday, March 28, 2016"  "Apple responds: 'This case should have never been brought'"                               
[18,] "Monday, March 28, 2016"  "IPhone Security Is the Casualty in Apple's Victory Over the FBI"                          
[19,] "Monday, March 28, 2016"  "Cracked Apple iPhone By F.B.I. Puts Spotlight On Apple Security"                          
[20,] "Monday, March 28, 2016"  "DOJ Drops Apple Case: Bloomberg West (Full Show 03/28)"  

In this second case too, I leave parsing the dates to you.
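For the heading format shown above ("Tuesday, March 29, 2016"), one possible approach is sketched below in base R. It is not part of the original answer, and the function name `parse_heading_date` is hypothetical. It maps the month name through the built-in `month.name` vector instead of relying on `%A` and `%B` in `as.Date`, since those format codes depend on the session locale. It assumes the headings are always in English and always include the year.

```r
# Hypothetical helper: parse "Weekday, Month D, YYYY" headings into Dates.
parse_heading_date <- function(x) {
  pattern <- "^[A-Za-z]+, *([A-Za-z]+) +([0-9]{1,2}), *([0-9]{4})$"
  parts <- regmatches(x, regexec(pattern, x))
  iso <- vapply(parts, function(p) {
    # p holds the full match plus three captured groups when the pattern fits.
    if (length(p) != 4) return(NA_character_)
    # month.name is the built-in English vector "January", "February", ...
    month <- match(p[2], month.name)
    if (is.na(month)) return(NA_character_)
    sprintf("%s-%02d-%02d", p[4], month, as.integer(p[3]))
  }, character(1))
  as.Date(iso, format = "%Y-%m-%d")
}

heading_dates <- parse_heading_date(c("Tuesday, March 29, 2016", "Monday, March 28, 2016"))
print(heading_dates)
```

Applied to the first column of the matrix above, this gives a Date vector that can be combined with the headlines in a data frame.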