Question

Suppose we want to count the number of words in a document. I know we can do the following:

text.each_line(){ |line| totalWords = totalWords + line.split.size }

Say I want to add some exceptions, so that the following are not counted as words:

(1) numbers

(2) standalone letters

(3) email addresses

How can we do that?

Thanks.

Answer 1

You can wrap this up pretty neatly:

regex_variable = /\d.|^[a-z]{1}$|\A([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})\Z/i

Roughly speaking, that defines an approximate email address.

Answer 2

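Assuming you can express all the exceptions in a single regular expression regex_variable, you could do:

text.each_line { |line| totalWords += line.split.count { |wrd| wrd !~ regex_variable } }

Your regular expression could look something like:

regex_variable = /\d.|^[a-z]{1}$|\A([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})\Z/i

I don't claim to be a regex expert, so you may want to double-check that, particularly the email validation part.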
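On the double-checking: the `\d.` alternative is unanchored, so it matches any token containing a digit followed by another character (for example "B2B") and misses a single digit such as "4". A minimal sketch of a tightened variant follows, assuming that "number" means an optionally signed integer or decimal. The names EXCLUDED and count_words are made up here for illustration, and the email part is still only an approximation.

```ruby
# Hypothetical tightened variant of regex_variable (EXCLUDED is an illustrative name).
# Every alternative is anchored to the whole token with \A ... \z.
EXCLUDED = /\A(?:-?\d+(?:\.\d+)?|[a-z]|[^@\s]+@(?:[-a-z0-9]+\.)+[a-z]{2,})\z/i

# Sum, over all lines, the tokens that do NOT match an excluded form.
def count_words(text)
  text.each_line.sum { |line| line.split.count { |wrd| wrd !~ EXCLUDED } }
end

total = count_words("I have 2 cats\nmail me at a@b.org today")
puts total
```

Here "I", "2" and "a@b.org" are skipped. Tokens with trailing punctuation (for example "a@b.org.") would still be counted, so stripping punctuation first is a separate step.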

Answer 3

Beyond the other answers, a little gem hunting turned up this:

WordsCounted Gem

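Get the following data from any string or readable file:

Word count

Unique word count

Word density

Character count

Average characters per word

A hash map of words and the number of times they occur

A hash map of words and their lengths

The longest word(s) and its length

The most frequently occurring word(s) and its number of occurrences.

Count individual strings for occurrences.

A flexible way to exclude words (or anything) from the count. You can pass a string, a regexp, an array, or a lambda.

Customisable criteria. Pass your own regexp rules to split strings if you prefer. The default regexp has two features:

Filters special characters but respects hyphens and apostrophes.

Plays nicely with diacritics (UTF and unicode characters): "São Paulo" is treated as ["São", "Paulo"] and not ["S", "", "o", "Paulo"].

Opens and reads files. Pass in a file path or URL instead of a string.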
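To give a feel for a few of the items above without installing anything, here is a rough standard-library sketch. It is not the gem's API or implementation: word_stats, its exclude: parameter and the tokenising regexp are all assumptions made up for illustration.

```ruby
# Hypothetical helper, NOT part of the WordsCounted gem.
# Tokens start with a letter and may contain letters, hyphens and apostrophes;
# \p{Alpha} is Unicode-aware, so accented letters stay inside their word.
def word_stats(text, exclude: nil)
  tokens = text.downcase.scan(/\p{Alpha}[\p{Alpha}\-']*/)
  tokens = tokens.reject { |t| exclude.match?(t) } if exclude

  # Build a word => occurrences hash.
  frequencies = tokens.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }

  {
    count: tokens.size,
    unique: frequencies.size,
    frequencies: frequencies,
    longest: tokens.max_by(&:length)
  }
end

stats = word_stats("Don't count one-letter words like a or I", exclude: /\A.\z/)
p stats[:count]
```

With this tokeniser "São Paulo" splits into two tokens rather than being broken at the accented letter, which is the behaviour the list describes for the gem's default regexp.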

Answer 4

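Have you ever started answering a question and found yourself wandering, exploring interesting but tangential questions, or concepts you didn't fully understand? That's what happened to me here. Perhaps some of the ideas will prove useful in other settings, if not for the problem at hand.

For readability, we might define some helpers in class String, but to avoid contaminating it, I'll use Refinements.

Code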

module StringHelpers
  refine String do
    def count_words
      remove_punctuation.split.count { |w|
        !(w.is_number? || w.size == 1 || w.is_email_address?) }
    end

    def remove_punctuation
      gsub(/[.!?,;:)](?:\s|$)|(?:^|\s)\(|\-|\n/,' ')
    end

    def is_number?
      self =~ /\A-?\d+(?:\.\d+)?\z/
    end

    def is_email_address?
      include?('@') # for testing only
    end
  end
end

module CountWords
   using StringHelpers

   def self.count_words_in_file(fname)
     IO.foreach(fname).reduce(0) { |t,l| t+l.count_words }
   end
end

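Note that using must be in a module (possibly a class). It doesn't work in main, presumably because that would make the methods available in the class self.class #=> Object, which would defeat the purpose of Refinements. (Readers: please correct me if I'm wrong about the reason using must be in a module.)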
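A small sketch of the scoping involved, using made-up names (Shout, Inside) that are not part of the code above. It only shows that a refinement activated with using inside a module body is visible there and nowhere else. As far as I know, current Ruby versions also accept using at the top level of a file, where it applies to the rest of that file, so treat the "must be in a module" remark as possibly version- or irb-specific.

```ruby
# Hypothetical names for illustration only.
module Shout
  refine String do
    def shout
      upcase + "!"
    end
  end
end

module Inside
  using Shout             # active from here to the end of this module body

  def self.call(str)
    str.shout             # the refined method is visible here
  end
end

p Inside.call("hi")

begin
  "hi".shout              # outside Inside, String#shout does not exist
rescue NoMethodError => e
  puts "not visible here: #{e.class}"
end
```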

Example

Let's first informally check that the helpers are working:

module CheckHelpers
  using StringHelpers

  s = "You can reach my dog, a 10-year-old golden, at fido@dogs.org."
  p s = s.remove_punctuation
    #=> "You can reach my dog a 10 year old golden at fido@dogs.org "
  p words = s.split
    #=> ["You", "can", "reach", "my", "dog", "a", "10",
    #    "year", "old", "golden", "at", "fido@dogs.org"]

  p '123'.is_number?                     #=> 0
  p '-123'.is_number?                    #=> 0
  p '1.23'.is_number?                    #=> 0
  p '123.'.is_number?                    #=> nil
  p "fido@dogs.org".is_email_address?    #=> true
  p "fido(at)dogs.org".is_email_address? #=> false

  p s.count_words  #=> 9 (`'a'`, `'10'` and "fido@dogs.org" excluded)

  s = "My cat, who has 4 lives remaining, is at abbie(at)felines.org."
  p s = s.remove_punctuation
  p s.count_words
end

All looks good. Next, I'll put some text in a file:

FName = "pets"
text =<<_
My cat, who has 4 lives remaining, is at abbie(at)felines.org.
You can reach my dog, a 10-year-old golden, at fido@dogs.org.
_
File.write(FName, text)
  #=> 125

and confirm the file contents:

File.read(FName)
  #=> "My cat, who has 4 lives remaining, is at abbie(at)felines.org.\n
  #    You can reach my dog, a 10-year-old golden, at fido@dogs.org.\n"

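Now count the words:

CountWords.count_words_in_file(FName)
  #=> 18 (9 in each line)

Note that there is at least one problem with the removal of punctuation. It has to do with the hyphen. Any idea what that might be?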

Answer 5

Something like...?

def is_countable(word)
  return false if word.size < 2
  return false if word =~ /^[0-9]+$/
  return false if is_an_email_address(word) # you need a gem for this...
  return true
end

wordCount = text.split.count { |word| is_countable(word) }

Or, since I'm jumping to the conclusion that you can just split the entire text into an array with split(), you might need:

wordCount = 0
text.each_line do |line|
  line.split.each{|word| wordCount += 1 if is_countable(word) }
end
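For a version that runs without an extra gem, here is a sketch that fills in is_an_email_address with URI::MailTo::EMAIL_REGEXP from Ruby's standard library. That choice is my assumption, not something the answer prescribes: the pattern is a pragmatic check rather than a full validator. Stripping trailing punctuation before testing is also an addition made here.

```ruby
require "uri"

# Assumption: the stdlib pattern is "good enough" for spotting email-like tokens.
def is_an_email_address(word)
  URI::MailTo::EMAIL_REGEXP.match?(word)
end

def is_countable(word)
  token = word.sub(/[.,;:!?]+\z/, "")         # drop trailing punctuation first
  return false if token.size < 2              # standalone letters (and empty tokens)
  return false if token =~ /\A[0-9]+\z/       # plain numbers
  return false if is_an_email_address(token)  # email addresses
  true
end

text = "You can reach my dog, a 10-year-old golden, at fido@dogs.org."
word_count = text.split.count { |word| is_countable(word) }
puts word_count
```

Note that "10-year-old" counts as one word here because the text is split on whitespace only.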
