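Need help finding a CSS selector

Time: 2019-08-24 16:52:10

Tags: html web-scraping css-selectors rvest

Website: https://www.goodreads.com/book/show/985873.A_Game_of_Thrones

Next to the reviews there is a "Rating details" button. I am trying to find a CSS selector that returns the text of the whole table behind it, but I have not succeeded. Can anyone help? There is no error; I just do not get the text I want.

Code: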

    library(rvest)

    book_url <- read_html("https://www.goodreads.com/book/show/985873.A_Game_of_Thrones")

    book_url %>%
      html_node("table#rating_distribution") %>%
      html_text()

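1 answer:

Answer 0: (score: 0)

The data sits inside a script tag wrapped in CDATA, so it is not part of the parsed document tree and a CSS selector cannot reach it. You can pull the HTML you need out of the script text with a regular expression, then parse that string again with an HTML parser.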

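    library(rvest)
    library(magrittr)
    library(stringr)
    library(stringi)

    # Take the page text (script contents included), unescape it, and collapse whitespace
    page_text <- read_html('https://www.goodreads.com/book/show/985873.A_Game_of_Thrones') %>% html_text()
    y <- gsub('\n|\\s+', ' ', stri_unescape_unicode(page_text[[1]][1]))
    # Capture the HTML string passed to the rating-details tooltip
    z <- str_match_all(y, 'rating_details_tip\'\\), "(.*)", \\{')
    # Parse the captured HTML again and read its tables
    tables <- read_html(z[[1]][, 2]) %>% html_nodes("table") %>% html_table(fill = TRUE)
    table1 <- data.frame(tables[1]) %>% subset(., select = -c(2))  # drop the second column
    table2 <- data.frame(tables[2])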
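
To see why `table#rating_distribution` matches nothing in the question, here is a small offline sketch in Python. It assumes a made-up page fragment (`SAMPLE_PAGE`, not the real Goodreads markup) and uses Python's standard-library parser as a stand-in for the one rvest relies on; `TagCounter` and `count_tags` are illustrative names. The point it shows is that markup inside a `<script>` block is handed to the parser as plain text, so no `<table>` element ever exists for a selector to find.

```python
from html.parser import HTMLParser

# Made-up stand-in for the page: the table exists only as text inside <script>.
SAMPLE_PAGE = """
<html><body>
<a id="rating_details">Rating details</a>
<script>
//<![CDATA[
var tip = new Tip($('rating_details'), "<table id=\\"rating_distribution\\"><tr><td>5 stars</td></tr></table>", {});
//]]>
</script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts the start tags the parser reports as real elements."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    """Return a dict mapping tag name to the number of parsed start tags."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts


counts = count_tags(SAMPLE_PAGE)
print(counts.get("table", 0))                 # 0: no <table> element was parsed
print("rating_distribution" in SAMPLE_PAGE)   # True: the markup is there, but only as text
```

The same reasoning explains why the answer works on the raw page text instead of the parsed tree.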


In case it is of interest, I wrote it in Python first:

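    import re
    from io import StringIO

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup as bs

    r = requests.get('https://www.goodreads.com/book/show/985873.A_Game_of_Thrones')
    # Capture the HTML string passed to the rating-details tooltip
    p = re.compile(r'rating_details\'\), "(.*)", \{')
    s = p.findall(r.text)[0].encode().decode('unicode_escape')
    # Strip leftover backslashes and newline-plus-indentation runs
    s = re.sub(r'\n+\s+|\\', '', s)
    soup = bs(s, 'lxml')
    # Newer pandas versions expect a file-like object rather than a raw HTML string
    dfs = pd.read_html(StringIO(str(soup.select('table'))))
    print(dfs)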
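
The regex-then-parse-again idea can also be tried without a network request. The sketch below is a rough illustration under stated assumptions: `SAMPLE_SCRIPT` is invented data shaped like the script text described above, and `TIP_PATTERN`, `TableRows`, and `extract_rows` are hypothetical names. It also swaps in `json.loads` to unescape the captured JavaScript string literal (instead of the `unicode_escape` codec used above) and the standard-library `html.parser` instead of BeautifulSoup and pandas, so it runs with no third-party packages.

```python
import json
import re
from html.parser import HTMLParser

# Invented sample of the script text; not real Goodreads data.
SAMPLE_SCRIPT = r"""
var tip = new Tip($('rating_details'), "\n<table id=\"rating_distribution\">\n<tr><th>5 stars<\/th><td>60%<\/td><\/tr>\n<tr><th>4 stars<\/th><td>25%<\/td><\/tr>\n<\/table>", { style: 'goodreads' });
"""

# Capture the quoted string literal (quotes included) passed as the second argument.
TIP_PATTERN = re.compile(r'rating_details\'\), (".*?"), \{', re.S)


class TableRows(HTMLParser):
    """Collects the text of <th>/<td> cells, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def extract_rows(script_text):
    """Pull the embedded HTML out of the script text and return its table rows."""
    match = TIP_PATTERN.search(script_text)
    if match is None:
        return []
    # A double-quoted JS literal using only \n, \" and \/ escapes is also valid JSON.
    html = json.loads(match.group(1))
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


rows = extract_rows(SAMPLE_SCRIPT)
print(rows)  # [['5 stars', '60%'], ['4 stars', '25%']]
```

On the live page the literal may contain escapes that JSON does not accept, in which case the `unicode_escape` approach from the answer is the fallback.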