Question

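I have a URL (https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine) that I am scraping posts from. Some of these posts are replies, which begin with the text "Originally posted by ...". I want to scrape all of the data in each post, excluding that initial "Originally posted by" text. For example,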

User  df_text
 A    Hi, how are you ?
 B    This is beautiful!
 C    Heuwi
 D    Originally posted by C Heuwi 
      Hellou
 E    Hello guys
 F    Originally posted by A Hi, how are you ?
      I am doing good
 G    Whats going on ?

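For a reply (user D or F above), the "Originally posted by ..." text is under the div.quote_container class (the child), and the reply itself (e.g. "I am doing good") is under blockquote.postcontent.restore, which is the parent.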

Expected result:

User  df_text
 A    Hi, how are you ?
 B    This is beautiful!
 C    Heuwi
 D    Hellou
 E    Hello guys
 F    I am doing good
 G    Whats going on ?

I tried the following code:

url<-"https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)
threads<- cbind(review %>% html_nodes("blockquote.postcontent.restore:not(.quote_container)") %>% html_text())

I also tried a few others:

threads <- cbind(review %>% html_nodes(xpath = '//div[@class="blockquote.postcontent.restore"]/node()[not(self::div)]') %>% html_text())

or

threads <- review %>% html_nodes(".content")
close_nodes <- threads %>% html_nodes(".quote_container")
chk <- xml_remove(close_nodes)

None of these work. Please help me find a way to scrape all of the post data while excluding the child class. Thanks in advance!

Answer 1

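This is actually a relatively simple problem to solve with the xml_remove function, which is part of the xml2 library (attached automatically by older versions of rvest; with newer versions you may need to load xml2 explicitly).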

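library(rvest)
library(xml2)  # provides xml_remove(); newer rvest versions may not attach xml2

# read the page
url <- "https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)

# find the parent nodes (one per post)
threads <- review %>% html_nodes("blockquote.postcontent.restore")
# find the child nodes to exclude (the quoted "Originally posted by ..." blocks);
# html_nodes (plural) also catches posts that quote more than once
toremove <- threads %>% html_nodes("div.bbcode_container")
# remove them; this modifies the parsed document in place
xml_remove(toremove)

# convert the parent nodes to text
threads %>% html_text(trim = TRUE)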
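
To try the approach without hitting the forum, here is a minimal offline sketch. The HTML string is made up to mimic the structure described in the question (a blockquote.postcontent.restore parent with the quote nested under div.bbcode_container); the real page markup may differ, and the names page, posts, quotes and df_text are only illustrative.

```r
library(rvest)
library(xml2)

# Hypothetical markup imitating two posts: a plain one and a reply with a quote
html <- '
<html><body>
  <blockquote class="postcontent restore">Heuwi</blockquote>
  <blockquote class="postcontent restore">
    <div class="bbcode_container">
      <div class="quote_container">Originally posted by C Heuwi</div>
    </div>
    Hellou
  </blockquote>
</body></html>'

page <- read_html(html)

# parent nodes: one per post
posts <- html_nodes(page, "blockquote.postcontent.restore")

# child nodes holding the quoted text
quotes <- html_nodes(posts, "div.bbcode_container")

# drop the quotes from the parsed document (modifies `page` in place)
xml_remove(quotes)

# the remaining text of each post
df_text <- html_text(posts, trim = TRUE)
print(df_text)
```

Because the nodes in posts point into the same document, they reflect the removal without being selected again.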

From the documentation for xml_remove: "Care needs to be taken when using xml_remove()". Please review it, use it with caution, and save often.
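
Since xml_remove() changes the parsed document itself, a non-destructive variant may sometimes be preferable. The sketch below is one possible alternative, not part of the original answer: it collects only the text nodes that sit directly under each post. It assumes made-up markup and that the reply text is not wrapped in child tags (text inside tags such as <b> would be skipped); own_text is a hypothetical name.

```r
library(rvest)
library(xml2)

# Hypothetical markup imitating two posts: a plain one and a reply with a quote
html <- '
<html><body>
  <blockquote class="postcontent restore">Heuwi</blockquote>
  <blockquote class="postcontent restore">
    <div class="bbcode_container">
      <div class="quote_container">Originally posted by C Heuwi</div>
    </div>
    Hellou
  </blockquote>
</body></html>'

page <- read_html(html)
posts <- html_nodes(page, "blockquote.postcontent.restore")

# For each post, keep only the text nodes that are direct children,
# which skips everything nested inside the quote container
own_text <- vapply(posts, function(p) {
  parts <- xml_text(xml_find_all(p, "./text()"))
  trimws(paste(parts, collapse = " "))
}, character(1))

print(own_text)
```

The document is left untouched here, so the quoted text is still available if it is needed later.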
