Question

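I am scraping all the text on a website that appears within a specific div class. In the example below, I want to extract the contents of every div with class "a".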

site <- "<div class='a'>Hello, world</div>
  <div class='b'>Good morning, world</div>
  <div class='a'>Good afternoon, world</div>"

My desired output is:

"Hello, world"
"Good afternoon, world"

The code below extracts the text from every div, but I can't figure out how to include only those with class="a".

library(tidyverse)
library(rvest)

site %>% 
  read_html() %>% 
  html_nodes("div") %>% 
  html_text()

# [1] "Hello, world"          "Good morning, world"   "Good afternoon, world"

With Python's BeautifulSoup, it would look something like site.find_all("div", class_="a").

Answer 1

The CSS selector for a div with class "a" is:

div.a

Pass that selector to html_nodes() (an XPath expression would also work):

site %>% 
  read_html() %>% 
  html_nodes("div.a") %>% 
  html_text()

Answer 2

site %>% 
  read_html() %>% 
  html_nodes(xpath = '//*[@class="a"]') %>% 
  html_text()
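
The two answers are not strictly interchangeable, which a small sketch may help show. The markup below is hypothetical (it is not the question's `site`): one div carries two classes and a span also uses class "a". Assuming rvest is installed, `div.a` matches any div whose class list contains "a", whereas `//*[@class="a"]` matches any element whose class attribute is exactly "a". The last expression is a common XPath idiom for matching a class token and should behave like `div.a` here.

```r
library(rvest)

# Hypothetical markup for illustration only: the second div has two classes,
# and a span reuses class "a".
site2 <- "<div class='a'>Hello, world</div>
  <div class='a b'>Good afternoon, world</div>
  <span class='a'>Not a div</span>"

page <- read_html(site2)

# CSS selector: divs whose class list includes "a"
css_text <- page %>%
  html_nodes("div.a") %>%
  html_text()

# XPath from Answer 2: any element whose class attribute is exactly "a"
xpath_exact <- page %>%
  html_nodes(xpath = '//*[@class="a"]') %>%
  html_text()

# XPath that tests for the token "a" inside the class list, restricted to divs
xpath_token <- page %>%
  html_nodes(xpath = '//div[contains(concat(" ", normalize-space(@class), " "), " a ")]') %>%
  html_text()

css_text     # both divs
xpath_exact  # the first div and the span, but not the two-class div
xpath_token  # both divs, same as the CSS selector
```

For the markup in the question, where every element has a single class and only divs are present, all three expressions should return the same two strings.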

Scrape the contents of all div tags with a specific class
