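I'm using Nokogiri to output a movie script, and I'd like to be able to run a word count on that output.
I've adapted the answer from "Getting viewable text words via Nokogiri", but when I run it I get an ActionController::RoutingError (undefined method 'frequencies') error on this line: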
puts frequencies(content)
Here is the code I'm running. I'm still new to Rails, but I've done my best to clean the code up so it's easy to read:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'pp'
class NokogiriController < ApplicationController
  page = 'http://www.imsdb.com/scripts/Authors-Anonymous.html'
  doc = Nokogiri::HTML(open(page))
  text = doc.css('b').remove
  text = doc.css('pre')
  content = text.to_s.scan(/\w+/)
  puts content.length, content.uniq.length, content.uniq.sort[0..8]
  def frequencies(content)
    Hash[
      content.group_by(&:downcase).map{ |word, instances|
        [word,instances.length]
      }.sort_by(&:last).reverse
    ]
  end
  puts frequencies(content)
end
Answer 0 (score: 1)
Let's look at what you're doing:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(URI.open('http://www.imsdb.com/scripts/Authors-Anonymous.html'))
doc.css('b').remove
text = doc.css('pre')
text
# => [#<Nokogiri::XML::Element:0x3ff6686df65c name="pre" children=[#<Nokogiri::XML::Text:0x3ff6686df440 "\r\n\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686def7c "\r\n\r\n\r\n Written by\r\n\r\n David Congalton\r\n\r\n\r\n\r\n\r\n July 14 2012\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686deb1c "\r\n\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686de694 "\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686de20c ...
text.to_s
# => "<pre>\r\n\r\n\r\n\r\n\r\n\r\n Written by\r\n\r\n David Congalton\r\n\r\n\r\n\r\n\r\n July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n lined residential street. Note the small apartment complex\r\n set back from the curb.\r\n\r\n\r\n Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n This is where...
text.to_s.scan(/\w+/)
# => ["pre", "Written", "by", "David", "Congalton", "July", "14", "2012", "North", "Hayworth", "Avenue", "off", "Sunset", "Boulevard", "A", "quiet", "tree", "lined", "residential", "street", "Note", "the", "small", "apartment", "complex", "set", "back", "from", "the", "curb", "Our", "narrator", "is", "HENRY", "OBERT", "O", "BURT", "30", "This", "is", "where", "where", "F", "Scott", "Fitzgerald", "died", "on", "December", "21", "1940", "INSERT", "ARCHIVAL", "PHOTOS", "of", "Fitzgerald", "H...
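You're capturing the tags and their parameters along with the embedded text, because text is a NodeSet (in effect, an array of nodes) and to_s serializes it back into markup. That is why "pre" shows up as the first "word" in the scan output. I don't think you want to do that.
Instead, I'd do something like this: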
require 'nokogiri'
require 'open-uri'
def frequencies(content)
  Hash[
    content.group_by(&:downcase).map{ |word, instances|
      [word,instances.length]
    }.sort_by(&:last).reverse
  ]
end
doc = Nokogiri::HTML(URI.open('http://www.imsdb.com/scripts/Authors-Anonymous.html'))
doc.css('b').remove
text = doc.css('pre').map(&:text)
text
# => ["\r\n\r\n\r\n\r\n\r\n\r\n Written by\r\n\r\n David Congalton\r\n\r\n\r\n\r\n\r\n July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n lined residential street. Note the small apartment complex\r\n set back from the curb.\r\n\r\n\r\n Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n This is where whe...
text.join(' ')
# => "\r\n\r\n\r\n\r\n\r\n\r\n Written by\r\n\r\n David Congalton\r\n\r\n\r\n\r\n\r\n July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n lined residential street. Note the small apartment complex\r\n set back from the curb.\r\n\r\n\r\n Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n This is where wher...
content = text.join(' ').scan(/\w+/)
# => ["Written", "by", "David", "Congalton", "July", "14", "2012", "North", "Hayworth", "Avenue", "off", "Sunset", "Boulevard", "A", "quiet", "tree", "lined", "residential", "street", "Note", "the", "small", "apartment", "complex", "set", "back", "from", "the", "curb", "Our", "narrator", "is", "HENRY", "OBERT", "O", "BURT", "30", "This", "is", "where", "where", "F", "Scott", "Fitzgerald", "died", "on", "December", "21", "1940", "INSERT", "ARCHIVAL", "PHOTOS", "of", "Fitzgerald", "His", "w...
frequencies(content)
# => {"the"=>827, "to"=>486, "i"=>398, "a"=>397, "s"=>284, "and"=>279, "in"=>273, "of"=>238, "hannah"=>234, "you"=>232, "henry"=>223, "it"=>214, "on"=>207, "her"=>200, "is"=>192, "his"=>178, "he"=>165, "for"=>162, "t"=>152, "that"=>151, "colette"=>148, "she"=>142, "at"=>137, "john"=>133, "alan"=>118, "this"=>112, "my"=>109, "up"=>105, "all"=>88, "william"=>88, "as"=>85, "what"=>84, "with"=>84, "but"=>83, "be"=>76, "camera"=>76, "not"=>74, "one"=>74, "can"=>73, "out"=>70, "m"=>69, "from"=>...
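I inserted some extra steps so you can more easily see what's being returned. You can ignore those.
The idea is to ignore the tags, except for using them to get at the text content, which is what map(&:text) does.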
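As for the "undefined method 'frequencies'" error itself, the most likely cause is scoping: in the question, frequencies is defined with def inside the controller's class body, which makes it an instance method, but it is then called directly in the class body, where self is the class. The following is a minimal sketch of that situation, not the asker's actual controller; ScriptStats is a hypothetical stand-in class, and no Rails is needed to reproduce the behavior.

```ruby
# Hypothetical stand-in for the controller class; plain Ruby, no Rails.
class ScriptStats
  # `def` inside a class body defines an INSTANCE method.
  def frequencies(words)
    words.group_by(&:downcase)
         .map { |word, instances| [word, instances.length] }
         .sort_by(&:last)
         .reverse
         .to_h
  end
end

# A bare call in the class body is sent to the class itself,
# which has no such method, so Ruby raises NoMethodError.
class_level_error =
  begin
    ScriptStats.frequencies(%w[the cat])
    nil
  rescue NoMethodError => e
    e.message
  end
puts class_level_error

# Calling the method on an instance works as expected.
counts = ScriptStats.new.frequencies(%w[The cat the])
puts counts.inspect
```

In a Rails app the usual fix would be to move the scraping and the call into a controller action (or a separate plain Ruby object), so that frequencies is called on an instance rather than while the class is being loaded.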
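Things to note:
\w doesn't mean [a-zA-Z0-9]; it means [a-zA-Z0-9_], which matches variable names rather than words as we typically think of them.
It would probably be good to reject all numeric entries, since those usually aren't very useful when determining keywords and the like.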
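The two notes above can be sketched as follows. The sample string is made up for illustration; in the answer's code it would be text.join(' '). The variable names raw_words, words, and letters_only are likewise only illustrative, and the letters-only pattern is one possible choice rather than the answer's own.

```ruby
# Made-up sample standing in for the joined script text.
sample = "Written July 14 2012 by henry_obert, aged 30."

# \w+ keeps digits and underscores, so numbers and
# identifier-like tokens come through as "words".
raw_words = sample.scan(/\w+/)

# Drop tokens that consist only of digits.
words = raw_words.reject { |token| token.match?(/\A\d+\z/) }

# Alternatively, scan for runs of ASCII letters only; this
# also splits on underscores and skips digits entirely.
letters_only = sample.scan(/[a-zA-Z]+/)

puts raw_words.inspect
puts words.inspect
puts letters_only.inspect
```

Either filtered array can be passed to frequencies in place of content. Note that a letters-only pattern still splits contractions at the apostrophe, which is why single letters such as "s" and "t" rank so high in the frequency table.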