我是 R 和 Webscraping 的新手。作为练习,我试图从一个假书网站上抓取信息。我已经设法抓取了书名,但现在我想找到书名中每个单词的平均字长。例如,如果有两本书“书籍示例”“随机书籍”,则平均字长将为 22/4 = 5.5。我目前能够找出整本书书名的平均长度,但我需要将它们全部拆分为单独的单词,然后找到平均长度。
代码:
library(rvest)  # also loads the %>% pipe

url <- 'http://books.toscrape.com/catalogue/page-1.html'

url %>%
  read_html() %>%
  html_nodes('h3 a') %>%
  html_attr('title') -> titles
titles

values <- lapply(titles, nchar)
mean(unlist(values))
Output:
[1] "A Light in the Attic"
[2] "Tipping the Velvet"
[3] "Soumission"
[4] "Sharp Objects"
[5] "Sapiens: A Brief History of Humankind"
[6] "The Requiem Red"
[7] "The Dirty Little Secrets of Getting Your Dream Job"
[8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
[9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"
[12] "Shakespeare's Sonnets"
[13] "Set Me Free"
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
[15] "Rip it Up and Start Again"
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
[17] "Olio"
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"
[19] "Libertarianism for Beginners"
[20] "It's Only the Himalayas"
[1] 35.35 # current mean (length of the full titles, but I want the average word length)

Is there a way to focus on each individual word and find the average length of all the words in the titles? Thanks in advance.
Answer (score: 2)
Split titles into words and compute the mean number of characters per word.
mean(nchar(unlist(strsplit(titles, '\\s+'))))
#[1] 5.161017
Note that since we split on whitespace, tokens such as "1981-1991", "(Scott", and "#1)" are counted as words; for a larger sample this should be fine. If you don't want to include them, you'll need to clarify what exactly counts as a word.
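For instance, one possible stricter definition (an illustrative sketch, not from the original answer) is to count only runs of letters, optionally with internal apostrophes, so tokens like "#1)" and "1981-1991" are dropped while "Shakespeare's" stays intact:

```r
# Sketch: extract only letter/apostrophe runs as "words" using a sample
# of the scraped titles, then average their lengths.
titles <- c("Starving Hearts (Triangular Trade Trilogy, #1)",
            "Our Band Could Be Your Life")

# regmatches() with gregexpr() pulls out every match of the pattern;
# "[[:alpha:]']+" matches letters and apostrophes, skipping digits
# and punctuation such as "#1)".
words <- unlist(regmatches(titles, gregexpr("[[:alpha:]']+", titles)))
words          # "Starving" "Hearts" "Triangular" ... "Life" (11 words)
mean(nchar(words))
```

With this definition the two sample titles yield 11 words totalling 58 characters, so the mean is 58/11 ≈ 5.27; whether that definition is right depends on how you want to treat hyphenated ranges and series markers.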