我是 R 和 Webscraping 的新手。作为练习,我试图从一个假书网站上抓取信息。我已经设法抓取了书名,但现在我想找到书名中每个单词的平均字长。例如,如果有两本书“书籍示例”“随机书籍”,则平均字长将为 22/4 = 5.5。我目前能够找出整本书书名的平均长度,但我需要将它们全部拆分为单独的单词,然后找到平均长度。
代码:
library(rvest)  # also loads the %>% pipe

url <- 'http://books.toscrape.com/catalogue/page-1.html'

url %>%
  read_html() %>%
  html_nodes('h3 a') %>%
  html_attr('title') -> titles
titles

values <- lapply(titles, nchar)
mean(unlist(values))
Output:
[1] "A Light in the Attic"
[2] "Tipping the Velvet"
[3] "Soumission"
[4] "Sharp Objects"
[5] "Sapiens: A Brief History of Humankind"
[6] "The Requiem Red"
[7] "The Dirty Little Secrets of Getting Your Dream Job"
[8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
[9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"
[12] "Shakespeare's Sonnets"
[13] "Set Me Free"
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
[15] "Rip it Up and Start Again"
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
[17] "Olio"
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"
[19] "Libertarianism for Beginners"
[20] "It's Only the Himalayas"
[1] 35.35 # current mean (length of the full titles, but I want the average word length)

Is there a way to focus on each individual word and find the average length of all the words in the titles? Thanks in advance.
Answer (score: 2)
Split titles into words and compute the mean number of characters per word.
mean(nchar(unlist(strsplit(titles, '\\s+'))))
#[1] 5.161017
Note that since we split on whitespace, tokens such as "1981-1991", "(Scott", and "#1)" are counted as words; for a larger sample this should be fine. If you don't want to include them, you'll need to clarify what exactly counts as a word.
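For instance, one possible stricter definition (an illustrative sketch, not from the original answer) is to count only runs of letters, optionally with internal apostrophes, so tokens like "#1)" and "1981-1991" are dropped while "Shakespeare's" stays intact:

```r
# Sketch: extract only letter/apostrophe runs as "words" using a sample
# of the scraped titles, then average their lengths.
titles <- c("Starving Hearts (Triangular Trade Trilogy, #1)",
            "Our Band Could Be Your Life")

# regmatches() with gregexpr() pulls out every match of the pattern;
# "[[:alpha:]']+" matches letters and apostrophes, skipping digits
# and punctuation such as "#1)".
words <- unlist(regmatches(titles, gregexpr("[[:alpha:]']+", titles)))
words          # "Starving" "Hearts" "Triangular" ... "Life" (11 words)
mean(nchar(words))
```

With this definition the two sample titles yield 11 words totalling 58 characters, so the mean is 58/11 ≈ 5.27; whether that definition is right depends on how you want to treat hyphenated ranges and series markers.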