I want to scrape this website.
Note that I have selected a specific date in the top-right corner.
Following this guide,
I wrote the following code:
library(rvest)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'
webpage_nba <- read_html(url_nba)
#Using CSS selectors to scrape the rankings section
data_nba <- html_nodes(webpage_nba,'#standings-table')
#Converting the ranking data to text
data_nba <- html_text(data_nba)
write.csv(data_nba,"web scraping test.csv")
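As far as I can tell, the numbers I want (for the Warriors, for example, 94%, 79%, 66%, 59%) are "encoded" in some different way. In other words, the contents written to web scraping test.csv are unreadable.
Is there a way to convert the "encoded numbers" into "regular numbers"?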
Answer 0: (score: 4)
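I tried to parse the data with rvest, but the tricky part here seems to be clicking the drop-down menu, which is represented by a <select> tag in the HTML structure. So I brought out the heavy artillery: RSelenium, which automates a real browser. With it everything becomes simple, thanks to this answer on SO: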
library(RSelenium)
library(rvest)
url_nba <- 'https://projects.fivethirtyeight.com/2017-nba-predictions/'
#initiate RSelenium. If it doesn't work, try other browser engines
rD <- rsDriver(port=4444L,browser="firefox")
remDr <- rD$client
#navigate to main page
remDr$navigate(url_nba)
#find the box and click option 10 (April 14 before playoffs)
webElem <- remDr$findElement(using = 'xpath', value = "//*[@id='forecast-selector']/div[2]/select/option[10]")
webElem$clickElement()
# Save html
webpage <- remDr$getPageSource()[[1]]
# Close RSelenium
remDr$close()
rD[["server"]]$stop()
# Select one of the tables and get it to dataframe
webpage_nba <- read_html(webpage) %>% html_table(fill = TRUE)
df <- webpage_nba[[3]]
# Clear the dataframe
names(df) <- df[3,]
df <- tail(df,-3)
df <- head(df,-4)
# Drop columns whose header is "NA"; a logical mask is safe even if none match
df <- df[ , names(df) != "NA", drop = FALSE]
df
ELO Carm-ELO 1-Week Change Team Conf. Conf. Semis Conf. Finals Finals Win Title
4 1770 1792 -14 Warriors West 94% 79% 66% 59%
5 1661 1660 -43 Spurs West 90% 62% 15% 11%
6 1600 1603 +33 Raptors East 77% 47% 25% 5%
7 1636 1640 +33 Clippers West 58% 11% 7% 5%
8 1587 1589 -22 Celtics East 70% 42% 24% 4%
9 1587 1584 -9 Wizards East 79% 38% 21% 4%
10 1617 1609 +16 Jazz West 42% 7% 5% 3%
11 1602 1606 -18 Rockets West 70% 27% 5% 3%
12 1545 1541 -22 Cavaliers East 59% 27% 11% 2%
13 1519 1523 +25 Bulls East 30% 15% 7% <1%
14 1526 1520 +37 Pacers East 41% 17% 6% <1%
15 1563 1564 +6 Trail Blazers West 6% 3% 1% <1%
16 1543 1537 -20 Thunder West 30% 8% <1% <1%
17 1502 1502 -3 Bucks East 23% 9% 3% <1%
18 1479 1469 +46 Hawks East 21% 6% 2% <1%
19 1482 1480 -41 Grizzlies West 10% 3% <1% <1%
20 1569 1555 +32 Heat East — — — —
21 1552 1533 +27 Nuggets West — — — —
22 1482 1489 -12 Pelicans West — — — —
23 1463 1472 -18 Timberwolves West — — — —
24 1463 1462 -40 Hornets East — — — —
25 1441 1436 +22 Pistons East — — — —
26 1420 1421 -20 Mavericks West — — — —
27 1393 1395 -2 Kings West — — — —
28 1374 1379 -13 Knicks East — — — —
29 1367 1370 +47 Lakers West — — — —
30 1372 1370 -14 Nets East — — — —
31 1352 1355 -9 Magic East — — — —
32 1338 1348 -29 76ers East — — — —
33 1340 1337 +26 Suns West — — — —
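The probability columns in this data frame are still character strings such as "94%", "<1%" and a dash for eliminated teams. If numeric values are needed, a minimal sketch along the following lines may help. The helper name `to_percent` and the sample vector are hypothetical, and treating "<1%" as 1 is an assumption (it is really an upper bound), so adjust that to taste.

```r
# Hypothetical sample of the scraped strings; "\u2014" is the dash shown for eliminated teams
win_title <- c("59%", "11%", "<1%", "\u2014")

# Convert strings like "59%" or "<1%" to numbers; anything else becomes NA
to_percent <- function(x) {
  x <- trimws(x)
  out <- rep(NA_real_, length(x))
  is_pct <- grepl("^<?[0-9]+%$", x)
  # Assumption: "<1%" is mapped to 1, i.e. its upper bound
  out[is_pct] <- as.numeric(gsub("[<%]", "", x[is_pct]))
  out
}

to_percent(win_title)
```

The same function could be applied to each probability column of `df`, for example with `lapply`, once the column names have been confirmed.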
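To parse a different time period, use your browser's developer tools to inspect the options of the <select> element in the page HTML, and change the option index in the XPath accordingly.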
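As an alternative to reading the options by hand in the developer tools, they can probably be listed from the page source with rvest. The sketch below runs on a small hypothetical HTML fragment standing in for the output of `remDr$getPageSource()[[1]]`; the option values and labels are made up, and the real page's markup may differ.

```r
library(rvest)

# Hypothetical stand-in for the real page source; values and labels are invented
page_source <- '
<div id="forecast-selector">
  <div></div>
  <div>
    <select>
      <option value="a">Latest</option>
      <option value="b">April 14</option>
    </select>
  </div>
</div>'

# Collect every <option> under the selector
option_nodes <- read_html(page_source) %>%
  html_nodes(xpath = "//*[@id='forecast-selector']//select/option")

# The index column corresponds to N in .../select/option[N]
option_table <- data.frame(
  index = seq_along(option_nodes),
  value = html_attr(option_nodes, "value"),
  label = html_text(option_nodes, trim = TRUE),
  stringsAsFactors = FALSE
)
print(option_table)
```

Running this against the real page source before calling `remDr$close()` should show which index corresponds to the date you want.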
Answer 1: (score: 1)
Thanks to @Alexey's answer and this, the following code worked for me:
$pdf = \PDF::loadView('invoices.show_invoice', $data);
return $pdf->download('invoice.pdf');