How do I save the text between two index numbers?

Asked: 2013-12-17 15:16:04

Tags: ruby loops hash indexing

I am iterating over a number of text files, trying to find all the carriage returns and save the text between the carriage returns separately. I have the index numbers of all the carriage returns, but I have no clue how to actually save the text.

Basically, I want to save every string between two carriage returns into a separate variable. The next step is to save all the words in each string into a single hash.

Here is my code so far (edited with help from the Tin Man and Screenmutt), which puts every paragraph of a file into an array:

# script start

# output file
output_text = File.open("output.txt", 'w')

# directory with files
Dir.chdir("nkamp")

# count lines
lines = File.readlines("first.txt")
line_count = lines.size
text = lines.join
paragraph_count = text.split("\.\r").length
puts "#{paragraph_count} paragraphs."

# array of paragraphs
paragraphs = Array.new
contents = []
File.foreach("first.txt", "\.\r") do |paragraph|
  puts paragraph.chomp
  puts '-' * 40
  contents << paragraph.chomp
  paragraphs << paragraph.chomp
end

puts paragraphs[10]

This code gives me an array containing all the paragraphs. I use "\.\r" instead of "\n\n" because the text was copied from PDF files and has lost its normal page-layout structure.

The next step is to save an array of the words in each paragraph, rather than just one string of text:

words_in_each_paragraph = Array.new

File.foreach("Ann Reg Sci (2).txt", "\.\r") do |paragraph|
    word_hash = {}
    paragraph.split(/\W+/).each_with_object(word_hash) { |w, h|
        h[w] = []
    }
    words_in_each_paragraph << word_hash
end

puts words_in_each_paragraph[8]

Which gives the following output:

{""=>[], "The"=>[], "above"=>[], "contributions"=>[], "highlight"=>[], "the"=>[], "importance"=>[], "of"=>[], "sophisticated"=>[], "modeling"=>[], "work"=>[], "for"=>[], "a"=>[], "better"=>[], "understanding"=>[], "complexity"=>[], "entrepreneurial"=>[], "space"=>[], "economy"=>[]}

Now the next step is to loop over each file and build a dynamic hash that gives me:

a. a number for the article;
b. the number of the paragraph;
c. the list of words shown above (roughly the shape sketched below).
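Roughly the structure I have in mind, sketched with made-up placeholder keys (:article, :paragraph and :words are just names I'm using for illustration):

# hypothetical target structure -- one entry per paragraph
results = [
  { :article => 1, :paragraph => 0, :words => { "The" => [], "above" => [] } },
  { :article => 1, :paragraph => 1, :words => { "contributions" => [] } }
]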

For that I need to learn how to create hashes dynamically, and this is where things go wrong:

lines = File.readlines("test.txt")
line_count = lines.size
text = lines.join
paragraph_count = text.split("\.\r").length
puts "#{paragraph_count} paragraphs."

testArray = Array.new(paragraph_count.to_i, Hash.new)
for i in 0...paragraph_count.to_i do
    testArray[i] = Hash.new 
    puts "testArray #{i} has been made"
end
words_in_each_paragraph = Array.new

File.foreach("test.txt", "\.\r") do |paragraph|
    word_hash = {}
    paragraph.split(/\W+/).each_with_object(word_hash) { |w, h|
        h[w] = []
    }
    words_in_each_paragraph << word_hash
    testArray[i][:value] = word_hash
    puts testArray[i] # IT WORKS HERE #
end

puts testArray[1] # AND IT DOESN'T WORK HERE #

This code works inside the loop, but not outside it. Outside the loop testArray comes back empty, except for the last index, in this case testArray[11].
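My best guess at what's happening: inside the foreach block, i is just the leftover value from the earlier for loop (a for loop's variable leaks out and keeps its final value, 11 here), so every paragraph gets written into the same slot. A sketch of what I suspect would work instead, using each_with_index on the same test.txt (untested):

testArray = []

File.foreach("test.txt", ".\r").each_with_index do |paragraph, i|
    word_hash = {}
    paragraph.split(/\W+/).each { |w| word_hash[w] = [] }
    # i is now the index of the current paragraph, not a stale loop variable
    testArray[i] = { :value => word_hash }
end

puts testArray[1]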

2 Answers:

Answer 0 (score: 1)

You don't have to hunt for index numbers to pull out the content; you can just use the each_line method.

INPUT

This is a test
of putting different lines
into different variables.

CODE

text = File.open("input.txt", 'r')

contents = []
counter = Hash.new(0)

text.each_line do |line|
  contents << line

  line.split(/\s/).each do |word|
    counter[word] += 1
  end
end

puts contents.inspect
# => ["This is a test\n", "of putting different lines\n", "into different variables.\n"]

puts counter.inspect
# => {"This"=>1, "is"=>1, "a"=>1, "test"=>1, "of"=>1, "putting"=>1, "different"=>2, "lines"=>1, "into"=>1, "variables."=>1}

Answer 1 (score: 1)

Ruby has some features that make this easy.

I have a sample text file that looks like this:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod

tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,

quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse

cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Running this:

File.foreach('data.txt', "\n\n") do |paragraph|
  puts paragraph.chomp
  puts '-' * 40
end

Results in:

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod
----------------------------------------
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam,
----------------------------------------
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
----------------------------------------
cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
----------------------------------------

So Ruby is looking at the file and handing the blocks of text back to me as paragraphs.

The really important thing about using foreach is that it returns the text from the input file in chunks. Normally it goes line by line, but, as I did above, it can return blocks of lines, a.k.a. paragraphs, which is very efficient and quite fast.

Sometimes we need to "slurp" a whole file at once; both read and readlines do that, but slurping a file does not scale. During development and testing you are probably reading a small sample file, but in production you may be hit with multi-gigabyte files that can bring the machine to its knees when you try to pull them entirely into memory, so you need to know your host's resources very well before going down that road. People often reach for read or readlines on the mistaken assumption that pulling everything into memory is faster, without realizing that modern operating systems and hardware have already buffered the file before the application ever sees it, so the line-by-line IO that foreach does is nearly indistinguishable in processing speed. In short, be careful about slurping your data.
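As a minimal illustration of the difference (assuming a data.txt in the current directory):

# Slurping: the entire file is pulled into memory in one go.
whole_text = File.read('data.txt')
puts whole_text.split(/\W+/).size   # total word count from the slurped text

# Streaming: foreach yields one paragraph at a time, so memory use stays
# flat no matter how large data.txt grows.
File.foreach('data.txt', "\n\n") do |paragraph|
  puts paragraph.chomp
end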

If I want to break a line into its component words, I have to remember to strip out any embedded carriage returns and punctuation first; then I'm free to split it into words. An easy way is to tell Ruby to split the paragraph on anything that isn't a word character:

word_hash = {}
File.foreach('data.txt', "\n\n") do |paragraph|
  paragraph.split(/\W+/).each_with_object(word_hash) { |w, h|
    h[w] = []
  }
end

puts word_hash

Running that results in:

{"Lorem"=>[], "ipsum"=>[], "dolor"=>[], "sit"=>[], "amet"=>[], "consectetur"=>[], "adipisicing"=>[], "elit"=>[], "sed"=>[], "do"=>[], "eiusmod"=>[], "tempor"=>[], "incididunt"=>[], "ut"=>[], "labore"=>[], "et"=>[], "dolore"=>[], "magna"=>[], "aliqua"=>[], "Ut"=>[], "enim"=>[], "ad"=>[], "minim"=>[], "veniam"=>[], "quis"=>[], "nostrud"=>[], "exercitation"=>[], "ullamco"=>[], "laboris"=>[], "nisi"=>[], "aliquip"=>[], "ex"=>[], "ea"=>[], "commodo"=>[], "consequat"=>[], "Duis"=>[], "aute"=>[], "irure"=>[], "in"=>[], "reprehenderit"=>[], "voluptate"=>[], "velit"=>[], "esse"=>[], "cillum"=>[], "eu"=>[], "fugiat"=>[], "nulla"=>[], "pariatur"=>[], "Excepteur"=>[], "sint"=>[], "occaecat"=>[], "cupidatat"=>[], "non"=>[], "proident"=>[], "sunt"=>[], "culpa"=>[], "qui"=>[], "officia"=>[], "deserunt"=>[], "mollit"=>[], "anim"=>[], "id"=>[], "est"=>[], "laborum"=>[]}

But wait, there's more! Often, when we grab the list of component words, we also want to count their occurrences, or do something along those lines. We can use another Ruby trick, group_by:

words = File.foreach('data.txt', "\n\n").flat_map{ |paragraph|
  paragraph.split(/\W+/)
}

puts words.group_by{ |w| w }

Results:

{"Lorem"=>["Lorem"], "ipsum"=>["ipsum"], "dolor"=>["dolor", "dolor"], "sit"=>["sit"], "amet"=>["amet"], "consectetur"=>["consectetur"], "adipisicing"=>["adipisicing"], "elit"=>["elit"], "sed"=>["sed"], "do"=>["do"], "eiusmod"=>["eiusmod"], "tempor"=>["tempor"], "incididunt"=>["incididunt"], "ut"=>["ut", "ut"], "labore"=>["labore"], "et"=>["et"], "dolore"=>["dolore", "dolore"], "magna"=>["magna"], "aliqua"=>["aliqua"], "Ut"=>["Ut"], "enim"=>["enim"], "ad"=>["ad"], "minim"=>["minim"], "veniam"=>["veniam"], "quis"=>["quis"], "nostrud"=>["nostrud"], "exercitation"=>["exercitation"], "ullamco"=>["ullamco"], "laboris"=>["laboris"], "nisi"=>["nisi"], "aliquip"=>["aliquip"], "ex"=>["ex"], "ea"=>["ea"], "commodo"=>["commodo"], "consequat"=>["consequat"], "Duis"=>["Duis"], "aute"=>["aute"], "irure"=>["irure"], "in"=>["in", "in", "in"], "reprehenderit"=>["reprehenderit"], "voluptate"=>["voluptate"], "velit"=>["velit"], "esse"=>["esse"], "cillum"=>["cillum"], "eu"=>["eu"], "fugiat"=>["fugiat"], "nulla"=>["nulla"], "pariatur"=>["pariatur"], "Excepteur"=>["Excepteur"], "sint"=>["sint"], "occaecat"=>["occaecat"], "cupidatat"=>["cupidatat"], "non"=>["non"], "proident"=>["proident"], "sunt"=>["sunt"], "culpa"=>["culpa"], "qui"=>["qui"], "officia"=>["officia"], "deserunt"=>["deserunt"], "mollit"=>["mollit"], "anim"=>["anim"], "id"=>["id"], "est"=>["est"], "laborum"=>["laborum"]}

That's a long list, but for every unique word found in the text there is now an array of that word's occurrences. A simple manipulation of those arrays gives the word counts, sorted in descending order:

words = File.foreach('data.txt', "\n\n").flat_map{ |paragraph|
  paragraph.split(/\W+/)
}

puts Hash[words.group_by{ |w| w }.map{ |k, v| [k, v.size] }.sort_by{ |k,v| v }.reverse]

Which looks like:

{"in"=>3, "ut"=>2, "dolore"=>2, "dolor"=>2, "Excepteur"=>1, "deserunt"=>1, "officia"=>1, "qui"=>1, "culpa"=>1, "sunt"=>1, "proident"=>1, "non"=>1, "cupidatat"=>1, "occaecat"=>1, "sint"=>1, "mollit"=>1, "pariatur"=>1, "nulla"=>1, "fugiat"=>1, "eu"=>1, "cillum"=>1, "esse"=>1, "velit"=>1, "voluptate"=>1, "reprehenderit"=>1, "anim"=>1, "irure"=>1, "aute"=>1, "Duis"=>1, "consequat"=>1, "commodo"=>1, "ea"=>1, "ex"=>1, "aliquip"=>1, "nisi"=>1, "laboris"=>1, "ullamco"=>1, "exercitation"=>1, "nostrud"=>1, "quis"=>1, "veniam"=>1, "minim"=>1, "ad"=>1, "enim"=>1, "Ut"=>1, "aliqua"=>1, "magna"=>1, "id"=>1, "et"=>1, "labore"=>1, "est"=>1, "incididunt"=>1, "tempor"=>1, "eiusmod"=>1, "do"=>1, "sed"=>1, "elit"=>1, "adipisicing"=>1, "consectetur"=>1, "amet"=>1, "sit"=>1, "laborum"=>1, "ipsum"=>1, "Lorem"=>1}

I deliberately skipped showing how to do this for each individual paragraph, but you can figure that out by dissecting these pieces and putting them back together. With a few other small tweaks you should be able to do whatever analysis of the paragraph contents you need.
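For instance, one rough sketch (certainly not the only way) of how those pieces might be combined to get word counts keyed by paragraph number, again assuming the same data.txt:

per_paragraph = {}

File.foreach('data.txt', "\n\n").each_with_index do |paragraph, i|
  words = paragraph.split(/\W+/)
  # one word => count hash per paragraph, keyed by the paragraph's index
  per_paragraph[i] = Hash[words.group_by { |w| w }.map { |k, v| [k, v.size] }]
end

puts per_paragraph[0]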


In your updated code, the escapes in:

"\."

"\.\r"

aren't necessary. A string doesn't need '.' escaped, because '.' has no special meaning inside a string. Use this instead:

".\r"