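Website: https://www.goodreads.com/book/show/985873.A_Game_of_Thrones
Next to the reviews there is a "Rating details" button. I tried to find a CSS selector that returns the text of the whole table, but without success. Can anyone help? There is no error; I just don't get the text I want.
Code: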
library(rvest)
book_url <- read_html("https://www.goodreads.com/book/show/985873.A_Game_of_Thrones")
book_url %>%
html_node("table#rating_distribution") %>%
html_text()
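Answer 0 (score: 0)
The data sits inside a CDATA-wrapped script tag, so it is not part of the parsed DOM and a CSS selector cannot reach it. You can pull the HTML you need out with a regular expression and then re-parse it with an HTML parser.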
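library(rvest)
library(magrittr)
library(stringr)
library(stringi)
# Read the whole page as text; the table markup is an escaped string inside a script tag
page_text <- read_html('https://www.goodreads.com/book/show/985873.A_Game_of_Thrones') %>% html_text()
# Decode \uXXXX escapes, then replace newlines and runs of whitespace with single spaces
y <- gsub('\n|\\s+',' ',stri_unescape_unicode(page_text[[1]][1]))
# Capture the quoted string that follows rating_details_tip')
z <- str_match_all(y,'rating_details_tip\'\\), "(.*)", \\{')
# Re-parse the captured HTML and convert each table into a data frame
tables <- read_html(z[[1]][,2]) %>% html_nodes("table") %>% html_table(fill=TRUE)
# Drop the second column of the first table
table1 <- data.frame(tables[1]) %>% subset(., select=-c(2))
table2 <- data.frame(tables[2])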
In case it is of interest, I wrote it in Python first:
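import requests, re
from bs4 import BeautifulSoup as bs
import pandas as pd
# Download the raw page source
r = requests.get('https://www.goodreads.com/book/show/985873.A_Game_of_Thrones')
# Capture the escaped HTML string that follows rating_details')
p = re.compile(r'rating_details\'\), "(.*)", \{')
# Decode \uXXXX and other escape sequences in the first match
s = p.findall(r.text)[0].encode().decode('unicode_escape')
# Remove line breaks together with their indentation, and any leftover backslashes
s = re.sub(r'\n+\s+|\\','',s )
soup = bs(s, 'lxml')
# Build one DataFrame per table found in the recovered HTML
dfs = pd.read_html(str(soup.select('table')))
print(dfs)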
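To try the extract-then-reparse idea without hitting the site, here is a minimal offline sketch using only the standard library. SAMPLE_PAGE is a made-up stand-in for the page source (its layout and the numbers in it are invented for illustration, not taken from Goodreads), and extract_embedded_html, TableRows and parse_rows are hypothetical helper names. It assumes the embedded string is ASCII-only.

```python
import re
from html.parser import HTMLParser

# Made-up stand-in for the page source: a table stored as an escaped
# JavaScript string inside a CDATA-wrapped script tag. The values are invented.
SAMPLE_PAGE = r"""<html><body>
<script>
//<![CDATA[
new Tip($('rating_details'), "\u003ctable\u003e\n\u003ctr\u003e\u003cth\u003e5 stars\u003c/th\u003e\u003ctd\u003e10\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003cth\u003e4 stars\u003c/th\u003e\u003ctd\u003e20\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e", { style: 'example' });
//]]>
</script>
</body></html>"""


def extract_embedded_html(page_source):
    """Pull the escaped HTML string out of the script text and decode it."""
    match = re.search(r"rating_details'\), \"(.*)\", \{", page_source)
    if match is None:
        raise ValueError("embedded HTML not found")
    # unicode_escape turns \u003c into '<' and \n into a newline.
    # Assumption: the captured text is ASCII-only; other characters would be mangled.
    return match.group(1).encode().decode("unicode_escape")


class TableRows(HTMLParser):
    """Collect the text of every th/td cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def parse_rows(html):
    """Re-parse the recovered HTML and return its table rows as lists of strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = parse_rows(extract_embedded_html(SAMPLE_PAGE))
print(rows)
```

On the real page the pattern and the cleanup step would need to match whatever the script tag actually contains, so treat the regular expression here as a placeholder.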