Question

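I am trying to find the right html_node to get the reply count for each post on the forum https://d.cosx.org/. The CSS selector tool reports .DiscussionListItem-count, but that selector does not seem to work.

My code: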

library(rvest)
library(tidyverse)
COS_link <- read_html("https://d.cosx.org/")
COS_link %>% 
  # The relevant tag
  html_nodes(css = '.DiscussionListItem-count') %>%      
  html_text()

I want to get the reply counts, for example 1k for the first post and 30k for the second. Am I missing something, or does someone have a better idea?

Answer 1

You can use the API and parse the JSON response for the title and participantCount attributes.

The API endpoint that returns this information is:

https://d.cosx.org/api

Strip the trailing 0 and the leading ac76 from the response, then parse it with the JSON library of your choice.
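A minimal sketch of that clean-up step, assuming the body has already been fetched into a single string. The `raw` value below is a made-up stand-in, not real output from the forum, and the leading marker will probably differ from one response to the next, so the sketch keeps everything between the first `{` and the last `}` instead of hard-coding `ac76`.

```r
library(jsonlite)

# Made-up stand-in for a body with a leading marker and a trailing "0"
raw <- 'ac76\r\n{"data":[{"attributes":{"title":"Hello","participantCount":3}}]}\r\n0\r\n\r\n'

# Keep everything from the first "{" to the last "}"
start <- regexpr("{", raw, fixed = TRUE)
ends  <- gregexpr("}", raw, fixed = TRUE)[[1]]
clean <- substr(raw, start, ends[length(ends)])

# Parse the cleaned string and pull out the counts
parsed <- fromJSON(clean)
parsed$data$attributes$participantCount
```

The `data` key in the sample is an assumption about the response layout; adjust the path to whatever the real response contains.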

The better option is to regex the JSON string out of the page at the original URL:

library(rvest)
library(jsonlite)
library(stringr)

url <- "https://d.cosx.org/"

r <- read_html(url) %>% 
  html_nodes('body') %>% 
  html_text() %>% 
  toString()

x <- str_match_all(r,'flarum\\.core\\.app\\.load\\((.*)\\);')  
json <- jsonlite::fromJSON(x[[1]][,2])
counts <- json$resources$attributes$participantCount
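A likely explanation for why the CSS selector returns nothing is that the discussion list is built in the browser from the payload passed to flarum.core.app.load(), so the .DiscussionListItem-count elements are not in the HTML that read_html() receives. The sketch below illustrates this offline; the `html` string is a hypothetical, heavily simplified page, not a copy of the real one.

```r
library(rvest)
library(stringr)
library(jsonlite)

# Hypothetical simplified page: no list markup, only an embedded payload
html <- '<html><body><div id="app"></div>
<script>flarum.core.app.load({"resources":[{"type":"discussions","attributes":{"title":"First","participantCount":5}}]});</script>
</body></html>'

page <- read_html(html)

# The selector matches nothing in the static markup
n_nodes <- length(html_nodes(page, ".DiscussionListItem-count"))

# The regex captures the JSON argument of flarum.core.app.load()
m <- str_match(html, "flarum\\.core\\.app\\.load\\((.*)\\);")
payload <- fromJSON(m[1, 2])
payload$resources$attributes$participantCount
```

Here `n_nodes` is zero while the regex route still recovers the count, which matches the behaviour described in the question.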

For those who want to pair each title with its count and do not have a Chinese locale set up, a colleague helped me write the following:

library(rvest)
library(jsonlite)
library(stringr)
library(corpus)

url <- "https://d.cosx.org/"
r <- read_html(url) %>%
  html_nodes('body') %>%
  html_text() %>%
  toString()

x <- str_match_all(r,'flarum\\.core\\.app\\.load\\((.*)\\);')
json <- jsonlite::fromJSON(x[[1]][,2])
titles <- json$resources$attributes$title
counts <- json$resources$attributes$participantCount
cf <- corpus_frame(name = titles, text = counts)
names(cf) <- c("titles", "counts")

print(cf[which(!is.na(cf$counts)),], 100)
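If installing corpus is not an option, the same pairing can be sketched with a base data.frame. The payload below is made up to show why the NA filter is needed: records that are not discussions (a user record here) carry no title or participantCount, so jsonlite fills those fields with NA. This sketch does not try to handle display of Chinese text on systems without a Chinese locale, which appears to be why the code above uses corpus.

```r
library(jsonlite)

# Made-up payload: one discussion plus a user record with no title or count
txt <- '{"resources":[
  {"type":"discussions","attributes":{"title":"First","participantCount":5}},
  {"type":"users","attributes":{"username":"someone"}}
]}'
json <- fromJSON(txt)

# fromJSON turns the records into a data frame with a nested "attributes" frame
attrs <- json$resources$attributes
df <- data.frame(titles = attrs$title, counts = attrs$participantCount,
                 stringsAsFactors = FALSE)

# Keep only rows that actually carry a count
df <- df[!is.na(df$counts), ]
df
```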

Web scraping in R with rvest to find the html_node
