Question

I am using Nokogiri to output a movie script, and I would like to be able to run a word count on that output.

I adapted the answer from "Getting viewable text words via Nokogiri", but when I run it I get ActionController::RoutingError (undefined method 'frequencies') on this line:

puts frequencies(content)

Here is the code I am running. I am still new to Rails, but I have done my best to clean the code up so it is easy to read:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'pp'

class NokogiriController < ApplicationController
  page = 'http://www.imsdb.com/scripts/Authors-Anonymous.html'
  doc = Nokogiri::HTML(open(page))

  text = doc.css('b').remove
  text = doc.css('pre')

  content = text.to_s.scan(/\w+/)
  puts content.length, content.uniq.length, content.uniq.sort[0..8]

  def frequencies(content)
    Hash[
      content.group_by(&:downcase).map{ |word, instances|
        [word,instances.length]
        }.sort_by(&:last).reverse
      ]
  end

  puts frequencies(content)
end

Answer 1

Let's look at what you're doing:

require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
doc = Nokogiri::HTML(URI.open('http://www.imsdb.com/scripts/Authors-Anonymous.html'))

doc.css('b').remove
text = doc.css('pre')
text 
# => [#<Nokogiri::XML::Element:0x3ff6686df65c name="pre" children=[#<Nokogiri::XML::Text:0x3ff6686df440 "\r\n\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686def7c "\r\n\r\n\r\n                          Written by\r\n\r\n                       David Congalton\r\n\r\n\r\n\r\n\r\n                                                       July 14 2012\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686deb1c "\r\n\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686de694 "\r\n\r\n">, #<Nokogiri::XML::Text:0x3ff6686de20c ...

text.to_s 
# => "<pre>\r\n\r\n\r\n\r\n\r\n\r\n                          Written by\r\n\r\n                       David Congalton\r\n\r\n\r\n\r\n\r\n                                                       July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n    North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n    lined residential street. Note the small apartment complex\r\n    set back from the curb.\r\n\r\n\r\n    Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n               This is where...

text.to_s.scan(/\w+/) 
# => ["pre", "Written", "by", "David", "Congalton", "July", "14", "2012", "North", "Hayworth", "Avenue", "off", "Sunset", "Boulevard", "A", "quiet", "tree", "lined", "residential", "street", "Note", "the", "small", "apartment", "complex", "set", "back", "from", "the", "curb", "Our", "narrator", "is", "HENRY", "OBERT", "O", "BURT", "30", "This", "is", "where", "where", "F", "Scott", "Fitzgerald", "died", "on", "December", "21", "1940", "INSERT", "ARCHIVAL", "PHOTOS", "of", "Fitzgerald", "H...

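You're capturing the tags and their attributes along with the embedded text, because text is a NodeSet, AKA an array of nodes, and to_s serializes it as markup. Note the stray "pre" at the start of the scanned words. I don't think you want that.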
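The leak is easy to reproduce without fetching the page. This is a minimal sketch that uses a made-up string as a stand-in for what to_s returns on the NodeSet; the sample text is an assumption for illustration only:

```ruby
# Made-up stand-in for the serialized NodeSet: the <pre> tags are part of the string.
html = "<pre>\r\n   Written by\r\n   David Congalton\r\n</pre>"

# \w+ has no idea what markup is, so the tag names are scanned as words too.
words = html.scan(/\w+/)
p words
# => ["pre", "Written", "by", "David", "Congalton", "pre"]
```

Both the opening and the closing tag contribute a "pre" token, which would skew any word count taken from this list.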

Instead, I'd do something like this:

require 'nokogiri'
require 'open-uri'

def frequencies(content)
  Hash[
    content.group_by(&:downcase).map{ |word, instances|
      [word,instances.length]
      }.sort_by(&:last).reverse
    ]
end

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
doc = Nokogiri::HTML(URI.open('http://www.imsdb.com/scripts/Authors-Anonymous.html'))

doc.css('b').remove
text = doc.css('pre').map(&:text)
text 
# => ["\r\n\r\n\r\n\r\n\r\n\r\n                          Written by\r\n\r\n                       David Congalton\r\n\r\n\r\n\r\n\r\n                                                       July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n    North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n    lined residential street. Note the small apartment complex\r\n    set back from the curb.\r\n\r\n\r\n    Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n               This is where whe...

text.join(' ')
# => "\r\n\r\n\r\n\r\n\r\n\r\n                          Written by\r\n\r\n                       David Congalton\r\n\r\n\r\n\r\n\r\n                                                       July 14 2012\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n    North Hayworth Avenue, off Sunset Boulevard. A quiet, tree-\r\n    lined residential street. Note the small apartment complex\r\n    set back from the curb.\r\n\r\n\r\n    Our narrator is HENRY OBERT (O-BURT)(30).\r\n\r\n               This is where wher...

content = text.join(' ').scan(/\w+/) 
# => ["Written", "by", "David", "Congalton", "July", "14", "2012", "North", "Hayworth", "Avenue", "off", "Sunset", "Boulevard", "A", "quiet", "tree", "lined", "residential", "street", "Note", "the", "small", "apartment", "complex", "set", "back", "from", "the", "curb", "Our", "narrator", "is", "HENRY", "OBERT", "O", "BURT", "30", "This", "is", "where", "where", "F", "Scott", "Fitzgerald", "died", "on", "December", "21", "1940", "INSERT", "ARCHIVAL", "PHOTOS", "of", "Fitzgerald", "His", "w...

frequencies(content)
# => {"the"=>827, "to"=>486, "i"=>398, "a"=>397, "s"=>284, "and"=>279, "in"=>273, "of"=>238, "hannah"=>234, "you"=>232, "henry"=>223, "it"=>214, "on"=>207, "her"=>200, "is"=>192, "his"=>178, "he"=>165, "for"=>162, "t"=>152, "that"=>151, "colette"=>148, "she"=>142, "at"=>137, "john"=>133, "alan"=>118, "this"=>112, "my"=>109, "up"=>105, "all"=>88, "william"=>88, "as"=>85, "what"=>84, "with"=>84, "but"=>83, "be"=>76, "camera"=>76, "not"=>74, "one"=>74, "can"=>73, "out"=>70, "m"=>69, "from"=>...

I inserted some extra steps so you can more easily see what is being returned. You can ignore those.

The idea is to ignore the markup, except for using it to get at the text content, which is what map(&:text) does.
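Separately from the markup issue, a likely reason for the original "undefined method" error is where the method was defined and called. In Ruby, def inside a class body creates an instance method, while code written directly in the class body runs with the class itself as self. The sketch below uses a hypothetical Demo class, not the asker's controller, to show the same shape of failure:

```ruby
# Hypothetical class that only mirrors the shape of the problem.
class Demo
  # Instance method: available on Demo.new, not on Demo itself.
  def frequencies(words)
    words.group_by(&:downcase).transform_values(&:length)
  end

  # This runs while the class is being defined, with self == Demo,
  # so the bare call cannot find the instance method above.
  ERROR_IN_CLASS_BODY =
    begin
      frequencies(%w[The the cat])
      nil
    rescue NoMethodError => e
      e
    end
end

puts Demo::ERROR_IN_CLASS_BODY.class               # prints NoMethodError
puts Demo.new.frequencies(%w[The the cat])["the"]  # prints 2
```

In a Rails controller the usual way out would be to move the fetching and counting into an action method (or a separate plain Ruby object), so that frequencies is called on an instance.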

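Things to note:

\w doesn't mean [a-zA-Z0-9]; it means [a-zA-Z0-9_], which matches variable names rather than what we'd consider typical words.
Purely numeric values, such as "14" and "2012", needlessly clutter the results. It would probably be good to use reject to remove all numeric entries, since they usually aren't very useful when determining keywords and the like.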
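Both notes can be sketched on a small sample. The sample string and the variable names are made up for illustration, and the letters-only pattern is just one possible alternative to \w, not something the original code used:

```ruby
sample = "July 14 2012 tree_lined street Street"

# \w includes the underscore, so "tree_lined" stays a single token.
tokens = sample.scan(/\w+/)
# => ["July", "14", "2012", "tree_lined", "street", "Street"]

# Drop entries that consist only of digits.
words = tokens.reject { |token| token.match?(/\A\d+\z/) }
# => ["July", "tree_lined", "street", "Street"]

# Alternative: scan for runs of letters only, which skips digits and underscores.
letters_only = sample.scan(/[[:alpha:]]+/)
# => ["July", "tree", "lined", "street", "Street"]
```

Either list can then be passed to frequencies in place of the raw \w+ scan.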
