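Could you help me (i) tidy up the code so that the two regular-expression lines can be merged into one, and (ii) place the titles on the bars (or at an angle) so that they all fit?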
## Web Scraping DUP
## to plot a barchart of poem reads
## Download the webpage
NoPoemsRead <- function(x){
  poems <- readLines(x)
  ## Extract out the title lines
  poem_title <- poems[grep("<h2", poems)[-1]]
  ## Extract out the number of reads lines
  poem_reads <- poems[grep("reads</small", poems)]
  ## De-clutter work space
  rm(poems)
  ## tidy up the data
  poem_reads <- unlist(lapply(poem_reads, function(x) as.numeric(gsub("[^0-9]+", "", x))))
  ## the two lines below could be done in one
  poem_title <- unlist(lapply(poem_title, function(x) gsub("\t\t\t\t\t\t\t\t\t\t<.*?>", "", x)))
  poem_title <- unlist(lapply(poem_title, function(x) gsub("<.*?>", "", x)))
  names(poem_reads) <- poem_title
  barplot(poem_reads, las = 2, cex.names = 0.75, main = "The Number of Poems Read",
          ylab = "Number of Reads", col = "skyblue")
}
NoPoemsRead("https://deepundergroundpoetry.com/poems-by/AnonymousBystander/")
Answer 0 (score: 1)
This is cleaner if you parse the page with an XML/HTML parser instead of regular expressions:
library(magrittr)
library(xml2)
u <- "https://deepundergroundpoetry.com/poems-by/AnonymousBystander/"
doc <- read_html(u)
titles <- doc %>%
  xml_find_all(".//h2") %>%
  xml_text %>%
  .[-1]
numbers <- doc %>%
  xml_find_all(".//small") %>%
  xml_text %>%
  grep("reads.*", ., value = TRUE) %>%
  sub("reads.*", "", .) %>%
  as.numeric
barplot(numbers, names.arg = titles, las = 2, cex.names = 0.75,
        main = "The Number of Poems Read", ylab = "Number of Reads", col = "skyblue")
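If you would rather stay in base R, the two original requests (one regex instead of two, and angled labels) could be sketched roughly as follows. The helper names `clean_titles` and `angled_barplot` are hypothetical, and the titles and read counts are made-up placeholders rather than data from the site. The sketch assumes the title lines contain only tags plus surrounding whitespace, as in the question.

```r
## Hypothetical helper: remove every HTML tag with one regex, then trim
## the leading tabs and other surrounding whitespace with trimws().
## gsub() is vectorised, so no lapply()/unlist() is needed.
clean_titles <- function(x) {
  trimws(gsub("<[^>]*>", "", x))
}

## Hypothetical helper: draw the bars without axis names, then add the
## names as rotated text below each bar.
angled_barplot <- function(reads, angle = 45) {
  op <- par(mar = c(10, 4, 4, 2) + 0.1)  # extra bottom margin for the labels
  on.exit(par(op))                       # restore the old margins afterwards
  mids <- barplot(reads, axisnames = FALSE, col = "skyblue",
                  main = "The Number of Poems Read", ylab = "Number of Reads")
  usr <- par("usr")                      # plot region limits: x1, x2, y1, y2
  text(x = mids, y = usr[3] - 0.02 * (usr[4] - usr[3]),
       labels = names(reads), srt = angle, adj = 1, xpd = TRUE, cex = 0.75)
  invisible(mids)                        # bar midpoints, as barplot() returns
}

## Made-up example input, shaped like the lines scraped in the question
raw_titles <- c("\t\t\t<h2><a href=\"#\">First Example Title</a></h2>",
                "\t\t\t<h2><a href=\"#\">Second Example Title</a></h2>")
reads <- c(120, 45)
names(reads) <- clean_titles(raw_titles)
angled_barplot(reads)
```

`barplot()` returns the bar midpoints, which is what lets `text()` line the rotated labels up with the bars; `xpd = TRUE` allows drawing in the margin. You may need to adjust the bottom margin for longer titles.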