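I am trying to scrape an embedded tweet from a website. I believe the tweet is loaded via JSON. Ideally, I would simply be able to scrape the ID of the embedded tweet. As far as I can tell, this data should be available through the CSS selector '#twitter-widget-0', but when I scrape the page with rvest, nothing is returned.
My code is as follows: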
library(rvest)
page <- "https://deutsch.rt.com/amerika/86714-rund-woche-nach-russland-auch-china-schickt-militaer-nach-venezuela/"
read_html(page) %>%
html_nodes('#twitter-widget-0') %>%
html_text()
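Answer 0 (score: 0)
Something like this might help. The '#twitter-widget-0' element is most likely created by Twitter's JavaScript after the page loads, so it is not in the HTML that read_html() receives; the tweet's original markup is still in the page source, though: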
library(dplyr)
library(rvest)
page %>%
read_html() %>%
html_nodes("div.rtcode") %>%
html_text()
#[1] "#Venezuela#China#Russia#Caracas#Chinese army soldiers arrived in
#Venezuela #Chinese People’s Liberation Army soldiers, as part of a
#cooperation program, #arrived, after delivering humanitarian supplies, to one
#of Venezuelan military #facilities. pic.twitter.com/HwZ9Ee67d0— Sukhoi Su-57
#frazor\U0001f1f7\U0001f1fa\U0001f1ee\U0001f1f3 (@I30mki) 1. April 2019"
Or, if you only want the tweet's URL:
page %>%
read_html() %>%
html_nodes("div.rtcode a") %>%
html_attr("href") %>%
grep("status", ., value = TRUE)
#[1] "https://twitter.com/I30mki/status/1112578904835981312?ref_src=twsrc%5Etfw"
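Since the question asks for the tweet ID rather than the full URL, the ID can be pulled out of that URL with a regular expression. The following is a minimal sketch in base R, not part of rvest; `extract_tweet_id` is a hypothetical helper name, and it assumes the ID is the run of digits right after "/status/", as in the URL above.

```r
# Hypothetical helper (not an rvest function): extract the numeric status ID
# from one or more tweet URLs. Assumes the ID follows "/status/".
extract_tweet_id <- function(url) {
  has_id <- grepl("/status/[0-9]+", url)
  # Keep the ID as a character string: it is too large to store exactly as a double.
  ifelse(has_id, sub(".*/status/([0-9]+).*", "\\1", url), NA_character_)
}

# URL taken from the output above
tweet_url <- "https://twitter.com/I30mki/status/1112578904835981312?ref_src=twsrc%5Etfw"
tweet_id <- extract_tweet_id(tweet_url)
print(tweet_id)
#[1] "1112578904835981312"
```

URLs without a "/status/" segment come back as NA, so the helper can be applied directly to the full vector of hrefs instead of filtering with grep() first.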