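Web scraping in R?

Time: 2017-07-26 08:43:10

Tags: html r web-scraping rvest

I want to scrape this website.

Specifically, I want to get the information in this table.

Note that I selected a specific date in the top-right corner.

Following this guide,

I wrote the following code: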

library(rvest)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'

webpage_nba <- read_html(url_nba)

# Use a CSS selector to scrape the standings table
data_nba <- html_nodes(webpage_nba,'#standings-table')

#Converting the ranking data to text
data_nba <- html_text(data_nba)
write.csv(data_nba,"web scraping test.csv")

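As I understand it, the numbers I want (for the Warriors, for example, 94%, 79%, 66%, 59%) are "encoded" in some different way. In other words, what gets written to web scraping test.csv is unreadable.

Is there a way to convert the "encoded numbers" into "regular numbers"?

2 Answers:

Answer 0: (score: 4)

I tried parsing the data with rvest, but the tricky part seems to be clicking the drop-down menu, which is represented by a <select> tag in the HTML structure. So I brought out the heavy artillery: RSelenium, which drives a real browser. With it everything became simple, thanks to this answer on SO.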

library(RSelenium)
library(rvest)

url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'


#initiate RSelenium. If it doesn't work, try other browser engines
rD <- rsDriver(port=4444L,browser="firefox")
remDr <- rD$client

#navigate to main page
remDr$navigate(url_nba)

#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()

# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
rD[["server"]]$stop() 

# Select one of the tables and get it to dataframe
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]

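# Clean up the data frame: row 3 holds the real column headers
names(df) <- df[3,]
df <- tail(df,-3)  # drop the 3 leading header rows
df <- head(df,-4)  # drop the 4 trailing non-team rows
# Drop the empty columns (named "NA"); a logical index is safe even if none match
df <- df[ , names(df) != "NA"]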

df

    ELO Carm-ELO 1-Week Change          Team Conf. Conf. Semis Conf. Finals Finals Win Title
4  1770     1792           -14      Warriors  West         94%          79%    66%       59%
5  1661     1660           -43         Spurs  West         90%          62%    15%       11%
6  1600     1603           +33       Raptors  East         77%          47%    25%        5%
7  1636     1640           +33      Clippers  West         58%          11%     7%        5%
8  1587     1589           -22       Celtics  East         70%          42%    24%        4%
9  1587     1584            -9       Wizards  East         79%          38%    21%        4%
10 1617     1609           +16          Jazz  West         42%           7%     5%        3%
11 1602     1606           -18       Rockets  West         70%          27%     5%        3%
12 1545     1541           -22     Cavaliers  East         59%          27%    11%        2%
13 1519     1523           +25         Bulls  East         30%          15%     7%       <1%
14 1526     1520           +37        Pacers  East         41%          17%     6%       <1%
15 1563     1564            +6 Trail Blazers  West          6%           3%     1%       <1%
16 1543     1537           -20       Thunder  West         30%           8%    <1%       <1%
17 1502     1502            -3         Bucks  East         23%           9%     3%       <1%
18 1479     1469           +46         Hawks  East         21%           6%     2%       <1%
19 1482     1480           -41     Grizzlies  West         10%           3%    <1%       <1%
20 1569     1555           +32          Heat  East           —            —      —         —
21 1552     1533           +27       Nuggets  West           —            —      —         —
22 1482     1489           -12      Pelicans  West           —            —      —         —
23 1463     1472           -18  Timberwolves  West           —            —      —         —
24 1463     1462           -40       Hornets  East           —            —      —         —
25 1441     1436           +22       Pistons  East           —            —      —         —
26 1420     1421           -20     Mavericks  West           —            —      —         —
27 1393     1395            -2         Kings  West           —            —      —         —
28 1374     1379           -13        Knicks  East           —            —      —         —
29 1367     1370           +47        Lakers  West           —            —      —         —
30 1372     1370           -14          Nets  East           —            —      —         —
31 1352     1355            -9         Magic  East           —            —      —         —
32 1338     1348           -29         76ers  East           —            —      —         —
33 1340     1337           +26          Suns  West           —            —      —         —
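
The percentage columns above come back as character strings. If numeric values are needed, a minimal sketch of a conversion could look like the following. The helper name `pct_to_num` is hypothetical, and mapping "<1%" to its upper bound of 0.01 is an assumption made here for illustration, not something the page specifies.

```r
# Hypothetical helper: turn strings such as "94%", "<1%" or an em dash into numbers.
# Assumption: "<1%" is mapped to its upper bound (0.01); anything that is not a
# percentage (for example the dash shown for eliminated teams) becomes NA.
pct_to_num <- function(x) {
  x <- trimws(x)
  is_pct <- grepl("^<?[0-9.]+%$", x)   # optional "<", digits, then "%"
  out <- rep(NA_real_, length(x))
  out[is_pct] <- as.numeric(gsub("[<%]", "", x[is_pct])) / 100
  out
}

# "\u2014" is the dash character used in the table for teams out of the playoffs
win_title <- c("59%", "11%", "<1%", "\u2014")
print(pct_to_num(win_title))
```

Applied to the data frame above, something like `df[pct_cols] <- lapply(df[pct_cols], pct_to_num)` should work, where `pct_cols` is a character vector holding the names of the percentage columns.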

If you want to parse other dates, use your browser's developer tools to inspect the option values in the page HTML.
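
Once an option value is known, it may be more robust to select the option by that value than by its position in the list. The sketch below only builds the XPath string. The helper name `option_xpath` is hypothetical, `"SOME_VALUE"` is a placeholder, and it assumes the `<option>` elements carry a `value` attribute, which should be checked in the developer tools.

```r
# Hypothetical helper: build an XPath that matches an <option> by its value
# attribute instead of by position. Assumes the options have a value attribute.
option_xpath <- function(value, selector_id = "forecast-selector") {
  sprintf("//*[@id='%s']//select/option[@value='%s']", selector_id, value)
}

xp <- option_xpath("SOME_VALUE")  # placeholder; read the real value in dev tools
print(xp)

# With a live RSelenium session (remDr as created in the code above),
# the lookup might then be:
# webElem <- remDr$findElement(using = "xpath", value = xp)
# webElem$clickElement()
```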

Answer 1: (score: 1)

Thanks to @Alexey's answer and this, the following code worked for me:

$pdf = \PDF::loadView('invoices.show_invoice', $data);
return $pdf->download('invoice.pdf');