Question

I'm writing code that searches every txt file in a directory for a set of strings. The code works correctly on 2 of the 3 files.

search = ['first', 'second', ...] 

Dir["directory/*.txt"].each do |txt|
  file = File.read(txt, encoding: "ISO8859-1:utf-8") 
  search.each do |se|
    puts se if file.include? se  #added to see if it finds a record - not working
    file.each_line do |li|
      if li.include? se
        puts li # I removed everything else to see if this works - not working
      end
    end
  end
end

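As I said, it works for 2 of the 3 files (80 MB, 88 MB, 224 MB). I left only the 224 MB file (the one that doesn't work) in the directory, and it still finds nothing.

I've been searching all day and can't find anything that helps. Why would it fail on the 224 MB file when it has the same txt format and comes from the same source?

EDIT: By "not working" I mean that it doesn't find strings that I know are in the file. This only happens with the third file.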

EDIT2：

I did li.split("\t") and found that the element at index 2 is the column that holds the string I'm searching for.

Then I changed the code to:

file.each_line.with_index do |li, line|
  data = li.split("\t")
  if line == 3
    puts data[2] #I got in console the string that i'm looking for
  end
  # but then when i try to use it I cant
  if data[2] == search #this is false i tried change both .to_s or .to_i
    puts li
  end
end

I ran another test, like this:

puts data[2].to_i + 1 #result is 1 when data[2] is just numbers

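I downloaded the file again and retried, but nothing seems to work. It's as if it can return the string in data[2] but doesn't recognize it or can't do anything with it. As I said, this happens with only 1 of the 3 files.

[Edit] The problem was that the txt file was corrupted at the source. A few months later I tried this code again with a newly generated txt file and it worked without any problems. Thanks for all the comments and answers.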
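
For anyone who hits the same symptom (the field prints correctly, but == fails and to_i returns 0): one possible cause is invisible bytes in the field, for example NUL bytes or a byte-order mark. That is an assumption, not something confirmed for this file. Note also that in the snippet above search is an array, so data[2] == search can never be true; search.include?(data[2]) is the comparison that was probably intended. The sketch below shows one way to inspect and clean a field. Both describe_field and normalize_field are hypothetical helper names, and the code assumes the field is a UTF-8 string.

```ruby
# Hypothetical helper (not from the original post): reports what a field
# really contains, including characters that `puts` does not show.
def describe_field(value)
  {
    inspected: value.inspect,            # escapes non-printable characters
    encoding:  value.encoding.name,
    valid:     value.valid_encoding?,
    bytes:     value.bytes.first(16)     # first raw bytes of the field
  }
end

# Hypothetical helper: removes byte-order marks and NUL characters, then
# strips surrounding whitespace. Assumes `value` is a UTF-8 string.
def normalize_field(value)
  value.delete("\uFEFF\u0000").strip
end
```

Inside the each_line loop this could be used as p describe_field(data[2]) to see what is really there, and search.include?(normalize_field(data[2])) for the comparison.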

Answer 1

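I've run into similar problems when working with strings that exceed some memory-limit threshold.

I would try splitting the large file into smaller chunks, like this: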

FILE_SIZE_LIMIT_IN_MB = 80

search = ['first', 'second', ...]

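# Yields the file's contents in chunks of at most FILE_SIZE_LIMIT_IN_MB megabytes.
# f.read(n) returns raw bytes, so a line may be cut at a chunk boundary.
def read_file(path)
  File.open(path, 'r') do |f|
    until f.eof? do
      yield f.read(FILE_SIZE_LIMIT_IN_MB * 1024 * 1024)
    end
  end
end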

Dir["directory/*.txt"].each do |txt|
  read_file(txt) do |file|
    search.each do |se|
      puts se if file.include? se  # report each search term found in this chunk
      file.each_line do |li|
        if li.include? se
          puts li # print every line that contains the term
        end
      end
    end
  end
end
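
One caveat with fixed-size chunks is that a line (or a search string) can straddle a chunk boundary and be missed. A minimal sketch of one way around that, assuming lines end in "\n": keep the incomplete tail of each chunk and prepend it to the next one. each_line_chunk is a hypothetical helper name, not part of the code above.

```ruby
# Hypothetical helper: yields pieces of the file that always end on a line
# boundary (except possibly the very last piece), reading roughly
# chunk_size bytes at a time.
def each_line_chunk(path, chunk_size = 80 * 1024 * 1024)
  File.open(path, 'rb') do |f|
    buffer = String.new                   # binary buffer for the unfinished tail
    while (piece = f.read(chunk_size))    # read returns nil at end of file
      buffer << piece
      cut = buffer.rindex("\n")           # last complete line in the buffer
      next unless cut                     # no newline yet: keep reading
      yield buffer[0..cut]
      buffer = buffer[(cut + 1)..-1]      # carry the partial line forward
    end
    yield buffer unless buffer.empty?     # final line without a trailing newline
  end
end
```

The yielded strings are raw bytes (ASCII-8BIT), so this works as-is for ASCII search terms; for anything else the chunk would need force_encoding or encode before comparing.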

Answer 2

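It looks like you're searching line by line. If so, you can save a lot of memory overhead by reading the file line by line and checking each line against the search array. To do that, move the search.each loop inside the loop that reads the file. Here's my attempt: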

search = ['first', 'second', ...] 

Dir["directory/*.txt"].each do |txt|
  File.foreach(txt, encoding: "ISO8859-1:utf-8") do |li|
    search.each do |se|
      puts se if li.include? se
    end
  end
end

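The foreach method does not slurp the whole file into memory.

This won't work if a search string spans a line break. If you have another, better delimiter, you can override the default: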

File.foreach(txt, "\t", encoding: "ISO8859-1:utf-8") do |r| # Tab-separated records
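
To make the line-by-line approach easier to check, here is a minimal sketch that collects each match together with its file and line number instead of printing it. search_lines is a hypothetical helper name, and combining the terms with Regexp.union is my own choice rather than something from the code above. The default encoding is the one used in the question.

```ruby
# Hypothetical helper: returns [path, line_number, line] for every line
# that contains at least one of the search terms.
def search_lines(glob, terms, encoding: "ISO8859-1:utf-8")
  pattern = Regexp.union(terms)          # matches any of the literal terms
  matches = []
  Dir[glob].sort.each do |path|
    # File.foreach streams the file one line at a time
    File.foreach(path, encoding: encoding).with_index(1) do |line, lineno|
      matches << [path, lineno, line.chomp] if line.match?(pattern)
    end
  end
  matches
end
```

For example, search_lines("directory/*.txt", search).each { |path, n, line| puts "#{path}:#{n}: #{line}" } prints one line per match.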

Ruby - Unable to find a string in a txt file