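Counting the frequency of a given word in a text file in Ruby

Time: 2017-04-28 00:59:19

Tags: ruby word find-occurrences

I want to be able to count the occurrences of a given word (e.g. input) in a text file. I have this code, which gives me the occurrences of every word in the file: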

word_count = {}
my_word = id
File.open("texte.txt", "r") do |f|
  f.each_line do |line|
    words = line.split(' ').each do |word|
      word_count[word] += 1 if word_count.has_key? my_word
      word_count[word] = 1 if not word_count.has_key? my_word
    end
  end
end

puts "\n"+ word_count.to_s

Thanks

2 answers:

Answer 0: (score: 3)

Create a test file

Let's first create a file we can work with.

text =<<-BITTER_END
It was the best of times, it was the worst of times, it was the age of wisdom,
it was the age of foolishness, it was the epoch of belief, it was the epoch of
incredulity, it was the season of Light, it was the season of Darkness, it was
the spring of hope, it was the winter of despair, we had everything before us,
we had nothing before us...
BITTER_END

FName = 'texte.txt'
File.write(FName, text)
  #=> 344

Specify the word to count

target = 'the'

Construct the regular expression

r = /\b#{target}\b/i
  #=> /\bthe\b/i

The word breaks (\b) ensure that, for example, 'anthem' is not counted as 'the'.
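If the target word comes from user input, it may contain characters that have a special meaning in a regular expression. A minimal sketch of one way to guard against that, assuming a hypothetical helper named word_regex (not part of the original answer) that wraps the target in Regexp.escape:

```ruby
# Hypothetical helper (an assumption, not from the original answer):
# builds a case-insensitive, whole-word pattern for an arbitrary string.
def word_regex(target)
  /\b#{Regexp.escape(target)}\b/i
end

sample = "The anthem of the theatre, the end."
matches = sample.scan(word_regex('the'))
p matches        #=> ["The", "the", "the"]
p matches.count  #=> 3
```

Note that \b only marks a boundary next to a word character, so a target that begins or ends with punctuation (such as 'c++') would likely need different anchoring.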

Gulp small files

If, as here, the file is not very large, you can gulp it:

File.read("texte.txt").scan(r).count
  #=> 10

Read large files line by line

If the file is so large that we would rather read it line by line, do either of the following.

File.foreach(FName).reduce(0) { |cnt, line| cnt + line.scan(r).count }
  #=> 10

File.foreach(FName).sum { |line| line.scan(r).count }
  #=> 10

Note that Enumerable#sum made its debut in Ruby v2.4.

See IO::read and IO::foreach. (IO.methodx... is commonly written File.methodx.... This is permitted because File is a subclass of IO; that is, File < IO #=> true.)
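The line-by-line approach can be wrapped in a small reusable method. The following is only a sketch: the method name count_matches and the throwaway demo file are assumptions made for illustration. It uses reduce rather than sum so that it also runs on Ruby versions older than 2.4.

```ruby
require 'tempfile'

# Hypothetical helper: sums the matches of `pattern` over the file at `path`,
# holding only one line in memory at a time.
def count_matches(path, pattern)
  File.foreach(path).reduce(0) { |total, line| total + line.scan(pattern).count }
end

# Demonstration on a temporary file so the sketch runs on its own.
demo_count = nil
Tempfile.create('demo') do |f|
  f.write("It was the best of times,\nit was the worst of times.\n")
  f.flush
  demo_count = count_matches(f.path, /\bthe\b/i)
end
p demo_count  #=> 2
```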

Use gsub to avoid creating a temporary array

The first method (gulping the file) creates a temporary array:

["the", "the", "the", "the", "the", "the", "the", "the", "the", "the"]

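to which count (aka size) is applied. One way to avoid creating this array is to use String#gsub rather than String#scan, as the former returns an enumerator when used without a block or replacement argument: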

File.read("texte.txt").gsub(r).count
  #=> 10

This can also be applied to each line of the file.

This is an unconventional, but sometimes useful, use of gsub.
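A short sketch of the difference between the two calls, using a made-up sample string (the variable names here are chosen only for illustration):

```ruby
sample  = "the cat and The dog and the bird"
pattern = /\bthe\b/i

# With neither a replacement nor a block, gsub returns an Enumerator
# that yields each match instead of building a new string.
enum = sample.gsub(pattern)
p enum.class            #=> Enumerator
p enum.count            #=> 3

# scan, by contrast, materializes every match in an array first.
p sample.scan(pattern)  #=> ["the", "The", "the"]
```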

Answer 1: (score: 0)

If you only want the count of one specific word, you don't need to use a Hash, for example:

word_count = 0
my_word = "input"

File.open("texte.txt", "r") do |f|
  f.each_line do |line|
    line.split(' ').each do |word|
      word_count += 1 if word == my_word
    end
  end
end

puts "\n" + word_count.to_s

word_count will contain the total number of occurrences of my_word.

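On the other hand, if you want to keep counts for all words and then print only the count for a specific word, you can use a Hash, but try something like this:

word_count = Hash.new(0)
my_word = "input"

File.open("texte.txt", "r") do |f|
  f.each_line do |line|
    line.split(' ').each do |word|
      word_count[word] += 1
    end
  end
end

puts "\n" + word_count[my_word].to_s

word_count will contain every word together with its total number of occurrences (the word is the key and the number of occurrences is its value); to print the occurrences of my_word, you just need to fetch the hash value using my_word as the key.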
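One caveat with splitting on spaces is that punctuation stays attached to the words and the comparison is case-sensitive. The sketch below illustrates this on a sample line and shows one possible normalization, assuming that lowercasing and extracting runs of word characters with \w+ is acceptable for your text (it would, for instance, split "don't" into two tokens):

```ruby
line = "It was the best of times, it was the worst of times,"

# Splitting on spaces keeps "times," (with the comma) and "It" as-is.
naive = Hash.new(0)
line.split(' ').each { |word| naive[word] += 1 }
p naive['times']  #=> 0  (only "times," was counted)
p naive['it']     #=> 1  ("It" is stored under a different key)

# Assumed normalization: lowercase first, then take runs of word characters.
normalized = Hash.new(0)
line.downcase.scan(/\w+/) { |word| normalized[word] += 1 }
p normalized['times']  #=> 2
p normalized['it']     #=> 2
```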