Parse all text files in a folder and save the text surrounding regex matches

Time: 2013-03-11 12:07:35

Tags: ruby

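I am trying to write code that iterates over all the text files in a directory, parses them while searching for occurrences of certain regular expressions, and saves the 20 or so words before and after each match.

I am using Dir.glob to select all the .txt files, and then want to loop over them (one file at a time), using a regular expression to search for occurrences of a word (line.match? File.find_all?), and then print the word and its surrounding text to a base file.

I have tried to piece this together, but I don't believe I have gotten very far, and I am stuck. Any help is greatly appreciated.

Here is what I have: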

    Dir::mkdir("summaries") unless File.exists?("summaries")
    Dir.chdir("summaries")
    all_text_files = Dir.glob("*.txt")

    all_text_files.each do |textfile|
        puts "currently summarizing " + textfile + "..."
        File.readlines(#{textfile}, "r").each do |line|
            if line.match /trail/ #does line.match work?
            if line =~ /trail/ #would this work?
                return true
                #save line to base textfile while referencing name of searchfile
            end
        end
    end

2 answers:

Answer 0 (score: 2)

Your code looks very sloppy and is full of errors. Here are some of them (there may be more):

You are missing a +:

puts "currently summarizing " textfile + "..."

should be:

puts "currently summarizing " + textfile + "..."

You can only use #{} inside double quotes. Instead of:

File.open(#{textfile}, "r")

just do this:

File.open(textfile, "r")
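
To illustrate the point about #{}, here is a minimal sketch. The method names are made up for this example. It shows that interpolation is only evaluated inside a double-quoted string, that single quotes leave #{} as literal text, and that concatenation with + builds the same string as interpolation.

```ruby
# Hypothetical helpers, used only to illustrate string interpolation.

# Double quotes: #{textfile} is replaced by the value of the variable.
def progress_message(textfile)
  "currently summarizing #{textfile}..."
end

# Concatenation with + produces the same result.
def progress_message_concat(textfile)
  "currently summarizing " + textfile + "..."
end

# Single quotes: no interpolation happens, #{textfile} stays as written.
def literal_message
  'currently summarizing #{textfile}...'
end

puts progress_message("summary.txt")
puts literal_message
```

Outside of a string, as in a method argument, the variable is simply passed by name, which is why File.open(textfile, "r") needs no #{} at all.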

This makes no sense at all:

File.open(#{textfile}, "r")
textfile.each do line

should be:

File.open(textfile, "r").each do |line|
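
A possible alternative, sketched under the assumption that the file should also be closed once it has been read: File.foreach yields one line at a time and closes the file when it is done, whereas File.open without a block leaves the handle open until it is closed or garbage collected. The helper name lines_matching is hypothetical.

```ruby
# Hypothetical helper: returns every line of the file at `path`
# that matches `pattern`, with the trailing newline removed.
def lines_matching(path, pattern)
  matches = []
  # File.foreach reads line by line and closes the file afterwards.
  File.foreach(path) do |line|
    matches << line.chomp if line =~ pattern
  end
  matches
end
```

It could be called as lines_matching(textfile, /trail/) inside the loop over the globbed files.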

This makes no sense either:

return true
print line

line will never be printed, because the print comes right after return true.
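
To make the last point concrete, here is a small sketch (the method name first_matching_line is made up for illustration). Inside a method, a return within the each block leaves the whole method at once, so any statement placed after it in the same branch never runs.

```ruby
# Hypothetical helper: returns the first line that matches `pattern`,
# or nil when no line matches.
def first_matching_line(lines, pattern)
  lines.each do |line|
    if line =~ pattern
      return line # exits first_matching_line immediately
      # a statement written here would never be executed
    end
  end
  nil # reached only when no line matched
end
```

If the goal is to record every matching line rather than stop at the first one, the return should be dropped and the line printed or saved at that spot instead.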

Edit:

As for your new question: either one works, but match and =~ have different return values. It depends on what you want to do.

foo = "foo trail bar"
foo.match /trail/ # => #<MatchData "trail">
foo =~ /trail/ # => 4
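
As a slightly fuller sketch of the difference (the helper name compare_matchers is hypothetical): match returns a MatchData object that also exposes the text before and after the match, =~ returns the integer offset, and both return nil when nothing matches. match? (Ruby 2.4 and later) returns a plain true or false.

```ruby
# Hypothetical helper that gathers what each call returns for one string.
def compare_matchers(text, pattern)
  match_data = text.match(pattern) # MatchData, or nil when nothing matches
  {
    index: text =~ pattern,                       # Integer offset, or nil
    matched: match_data && match_data[0],         # the matched text
    before: match_data && match_data.pre_match,   # text left of the match
    after: match_data && match_data.post_match,   # text right of the match
    found: text.match?(pattern)                   # true or false
  }
end

p compare_matchers("foo trail bar", /trail/)
p compare_matchers("foo bar", /trail/)
```

Since the goal here is to keep the surrounding text, pre_match and post_match on the MatchData are probably the more useful pieces.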

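Answer 1 (score: 2)

The code below iterates over every .txt file in the directory and prints all occurrences of whatever regular expression you decide on, along with the name of the file each was found in, to a base.txt file. I chose to use the scan method, which is another regex method available that returns an array of the matches. See here for the Ruby documentation on scan. You can also change the code if you only want one occurrence per file.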

##
# This method takes a string, an int and a string as arguments.
# The method will return the indices that are padded on either side
# of the passed in index by 20 (in our case) but not padded by more
# than the size of the passed in text. The word parameter is used to
# decide the top index as we do not want to include the word in our
# padding calculation.
#
# = Example
#
#  indices("hello bob how are you?", 5, "bob")
#      # => [0, 22] since the text length is less than 40
#
#  indices("this is a string of text that is long enough for a good example", 31, "is")
#      # => [11, 53] The extra 2 account for the length of the word 'is'.
#    
def indices(text, index, word)
    #Here's where you get the text from around the word you are interested in.
    #I have set the padding to 20 but you can change that as you see fit.
    padding = 20
    #Here we are getting the lowest point at which we can retrieve a substring.
    #We don't want to try and get an index before the beginning of our string.
    bottom_i = index - padding < 0 ? 0 : index - padding

    #Same concept as bottom except at the top end of the string.
    top_i = index + word.length + padding > text.length ? text.length : index + word.length + padding
    return bottom_i, top_i
end

#Script start.
base_text = File.open("base.txt", 'w')
Dir.mkdir("summaries") unless Dir.exist?("summaries")
Dir.chdir("summaries")

Dir.glob("*.txt").each do |textfile|
    #File.read opens the file, reads it all, and closes it again.
    whole_file = File.read(textfile)
    puts "Currently summarizing " + textfile + "..."
    #This is a placeholder for the 'current' index we are looking at.
    curr_i = 0
    str = nil
    #This will go through the entire file and find each occurrence of the specified regex.
    whole_file.scan(/trail/).each do |match|
      #This is the index of the matching string looking from the curr_i index onward.
      #We do this so that we don't find and report things twice.
      if i_match = whole_file.index(match, curr_i)
        top_bottom = indices(whole_file, i_match, match)
        base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
        #We move our current index to just past the match we found so when
        #we ask for the matching index from curr_i onward, we don't get the same index
        #again.
        curr_i = i_match + match.length
        #If you only want one occurrence, break here
      end
    end
    puts "Done summarizing " + textfile + "."
end
base_text.close
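
The question asked for roughly 20 words on each side of a match, while the code above pads by characters. The following is a rough sketch of a word-based variant and is not part of the original answer; the method name word_context and its parameters are assumptions. It uses the block form of scan together with Regexp.last_match to read the text before and after each match directly, so no separate index bookkeeping is needed. It assumes the pattern matches whole words: a match inside a longer word (such as "trail" in "trailhead") would be separated from the rest of that word in the output.

```ruby
# Hypothetical helper: for every match of `pattern` in `text`, returns a
# string holding up to `padding` words before the match, the match itself,
# and up to `padding` words after it.
def word_context(text, pattern, padding = 20)
  contexts = []
  text.scan(pattern) do
    m = Regexp.last_match # MatchData for the match currently being visited
    before = m.pre_match.split.last(padding)   # words left of the match
    after  = m.post_match.split.first(padding) # words right of the match
    contexts << (before + [m[0]] + after).join(" ")
  end
  contexts
end
```

Inside the loop over the files, each returned string could then be written with something like base_text.puts("#{context} : #{textfile}").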