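The goal is to scrape multiple tweets from Twitter, along with their likes and so on. Scraping a single tweet works perfectly, but I cannot find a way to handle several different tweets.
I have set up the scraping for a single tweet in R; the code is pasted below. However, I cannot get it to work across multiple pages.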
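library(rvest)  # read_html(), html_nodes(), html_text(); also re-exports %>%

site <- "https://twitter.com/btspavedyou/status/1146055736130019334"
page <- read_html(site)

# Author handles: keep only the part after the "@"
handles <- page %>%
  html_nodes(".js-action-profile") %>%
  html_text() %>%
  sub(".*@", "", .) %>%
  print()

# Tweet text
text_new <- page %>%
  html_nodes("p.TweetTextSize") %>%
  html_text() %>%
  print()

# Timestamp as displayed on the page
time <- page %>%
  html_nodes("._timestamp") %>%
  html_text() %>%
  print()

# One row per tweet found on the page
all_data_tweet <- data.frame(
  page = site,
  author = handles,
  text = text_new,
  time = time
)
all_data_tweet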
When I try the same thing on the following ten pages, it does not work (I tried loops and lapply with a function).
multiple_pages <- c(
  "https://twitter.com/Swiftandoned/status/1146494919344717824",
  "https://twitter.com/Swiftandoned/status/1146149790016688128",
  "https://twitter.com/baylee_corbello/status/1146494887875022854",
  "https://twitter.com/angiegon00/status/1146494850486820864",
  "https://twitter.com/gallica_/status/1146494826289999872",
  "https://twitter.com/RomuHDV/status/1146494814604673029",
  "https://twitter.com/mathebula_boity/status/1146494779666178049",
  "https://twitter.com/mathebula_boity/status/1146487751774285825",
  "https://twitter.com/mathebula_boity/status/1146494417697681408",
  "https://twitter.com/mathebula_boity/status/1146494307324575744"
)
The result should look like what I get for a single tweet, but covering all of the tweets:
  page                                                        author          text                                                 time
1 https://twitter.com/btspavedyou/status/1146055736130019334  KPOP_predict18  Sehun and Jisoo together in a drama, 2020.           2. Juli
2 https://twitter.com/btspavedyou/status/1146055736130019334  na1_27          Well i guess there is nothing about iKON AND HANBIN  2. Juli
3 https://twitter.com/btspavedyou/status/1146055736130019334  btspavedyou     I'm sure he is 'okay'                                2. Juli
4 https://twitter.com/btspavedyou/status/1146055736130019334  na1_27          I really hope so, thank you                          2. Juli
Answer 0 (score: 1)
There are several ways to solve this, but I would use bind_rows from dplyr, with a few small modifications:
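library(rvest)  # read_html(), html_nodes(), html_text(); also re-exports %>%
library(dplyr)  # bind_rows()

# Scrape one tweet page and return a data frame with one row per tweet on it
readTweet <- function(url) {
  page <- read_html(url)

  handles <- page %>%
    html_nodes(".js-action-profile") %>%
    html_text() %>%
    sub(".*@", "", .)

  text_new <- page %>%
    html_nodes("p.TweetTextSize") %>%
    html_text()

  time <- page %>%
    html_nodes("._timestamp") %>%
    html_text()

  # Keep strings as character so bind_rows() does not have to reconcile factor levels
  data.frame(
    page = url,
    author = handles,
    text = text_new,
    time = time,
    stringsAsFactors = FALSE
  )
}

# multiple_pages is the character vector of URLs from the question
df <- bind_rows(
  lapply(multiple_pages, readTweet)
)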
You do not need to create an .id column, because the page URL is already included as a column.
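
To see the pattern without hitting Twitter, here is a minimal offline sketch. The function fake_read and the example.com URLs are hypothetical stand-ins for readTweet and the real tweet URLs; only lapply, setNames and bind_rows are real calls. It also shows what the .id argument would add if the URL were not already stored in the page column.

```r
library(dplyr)

# Hypothetical stand-in for readTweet(): returns two rows per URL, no network access
fake_read <- function(url) {
  data.frame(
    page = url,
    author = c("user_a", "user_b"),
    text = c("first reply", "second reply"),
    stringsAsFactors = FALSE
  )
}

# Made-up URLs, used only for illustration
urls <- c("https://example.com/status/1", "https://example.com/status/2")

# Apply the reader to every URL, then stack the resulting data frames
df <- bind_rows(lapply(urls, fake_read))

# Alternative: name the list elements and let .id record where each row came from
with_id <- bind_rows(setNames(lapply(urls, fake_read), urls), .id = "source")
```

Here the source column of with_id simply repeats the page column, which is why the .id argument is redundant in this case.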